monarch-initiative / omim

Data ingest pipeline for OMIM.

Gene association representation in omim.ttl file #156

Open twhetzel opened 4 days ago

twhetzel commented 4 days ago

I think this is a bug; if not, please explain the design decision. For some OMIM disease entries, e.g. https://omim.org/entry/613659, the omim.ttl file contains only 1 disease-to-gene association ('has material basis in germline mutation in' IL1B), whereas the OMIM entry page lists 11. I do see all associations in the various data files that are downloaded when creating the omim.ttl file using python3 -m omim2obo.

This issue is not limited to OMIM disease/phenotype entries that contain an INCLUDED entry, since it also happens with https://omim.org/entry/605074, where the omim.ttl file contains only 1 disease-to-gene association ('has material basis in germline mutation in' PRCC).

I believe this is part of the issues that were reported in the PR for the OMIM g2d pipeline in Mondo, specifically point (3) Genes should not be added if the OMIM record is associated with multiple genes.

UPDATE: For https://omim.org/entry/613659, I looked in the ttl file directly vs. from Protege and do see 11 entries in this format where the values like _:Nbcf6a815046747ee9fe5bc8f3891b1c5 look to point back to a gene:

_:Nbcf6a815046747ee9fe5bc8f3891b1c5 a owl:Restriction ;
    owl:onProperty RO:0004013 ; 
    owl:someValuesFrom OMIM:613659 .

10 of the 11 entries have owl:onProperty RO:0003302 instead. None have RO:0004003, which is what Protege displays.
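For anyone who wants to reproduce this tally without opening the file by hand, here is a minimal sketch using only the Python standard library. It assumes the restrictions are serialized as standalone blank-node blocks shaped exactly like the one quoted above; inline `[ ... ]` blank nodes or a different predicate order would need a real Turtle parser. The function name `tally_properties` and the second blank-node ID in the sample are hypothetical and not taken from omim2obo.

```python
import re
from collections import Counter

# Matches blank-node restrictions shaped like the block quoted in this comment.
# Assumption: predicates appear in this order, one restriction per block.
RESTRICTION = re.compile(
    r"(_:\w+)\s+a\s+owl:Restriction\s*;\s*"
    r"owl:onProperty\s+(\S+)\s*;\s*"
    r"owl:someValuesFrom\s+(\S+)\s*\."
)

def tally_properties(ttl_text, target):
    """Count restrictions whose filler is `target`, grouped by owl:onProperty."""
    return Counter(
        prop
        for _node, prop, filler in RESTRICTION.findall(ttl_text)
        if filler == target
    )

# The first block is copied from the comment; the second is a made-up example.
sample = """
_:Nbcf6a815046747ee9fe5bc8f3891b1c5 a owl:Restriction ;
    owl:onProperty RO:0004013 ;
    owl:someValuesFrom OMIM:613659 .

_:Nexample1 a owl:Restriction ;
    owl:onProperty RO:0003302 ;
    owl:someValuesFrom OMIM:613659 .
"""

counts = tally_properties(sample, "OMIM:613659")
for prop, n in sorted(counts.items()):
    print(prop, n)
```

Run against the full text of omim.ttl, the same function should report how many restrictions per property point at a given OMIM ID, provided the serialization matches the assumption above.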

UPDATE 2: I do see that the OMIM gene entry with RO:0004013 is for IL1B, and there is some code that flips RO:0004013 to RO:0004003, so those further transformations are clearer now.

LATEST QUESTION --> However, it's not clear why only this one gene has RO:0004013 to start with and the others listed for 'gastric cancer' have a different RO property.


Also, did anyone have a chance to document the earlier design decisions? See https://github.com/monarch-initiative/omim/issues/75#issuecomment-1320976773

matentzn commented 4 days ago

@joeflack4 let me know if I can help with anything. Since you updated that code recently, it's probably best you take care of this.

joeflack4 commented 4 days ago

I know that if there’s more than 1 association, we don’t call it causal. But I don’t know why we would not list all associations otherwise. Will look into it.
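To make the rule above concrete, here is a minimal sketch of "causal only when there is exactly one association". This is not the omim2obo implementation. The function name `assign_properties` and the gene symbols are hypothetical, and using RO:0003302 as the non-causal property is an assumption based on what was observed in the ttl file earlier in this thread.

```python
CAUSAL = "RO:0004013"      # property seen on the IL1B restriction in this thread
NON_CAUSAL = "RO:0003302"  # assumption: property seen on the other 10 restrictions

def assign_properties(gene_symbols):
    """Return {gene: property}; causal only if the disease has exactly one gene."""
    genes = sorted(set(gene_symbols))  # de-duplicate before counting
    prop = CAUSAL if len(genes) == 1 else NON_CAUSAL
    return {gene: prop for gene in genes}

# Hypothetical gene symbols, for illustration only.
print(assign_properties(["GENE_A"]))
print(assign_properties(["GENE_A", "GENE_B", "GENE_C"]))
```

Under this rule, every association for a multi-gene disease is still listed, and none of them carries the causal property.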

joeflack4 commented 3 days ago

@sabrinatoro @matentzn Just want to confirm how this is supposed to work (Trish edit: for modeling OMIM in the first file created to model the OMIM content, which is omim.ttl)

Gastric Cancer (OMIM:613659) has 11 Phenotype-Gene Relationships.

In this case, we should declare the following property on all 11:

But neither of the following properties should be used at all:

sabrinatoro commented 3 days ago

@joeflack4 We are talking about Mondo, right? (i.e. NOT the Monarch KG; I need to mention this in case I am confusing myself.) In the case of Mondo: because Mondo is an ontology and all axioms have to be correct 100% of the time, the only gene annotations that we bring in are the ones where the genes are part of defining the disease. The only gene-related property we allow in Mondo coming from OMIM is "has material basis in germline mutation in".

Therefore, we allow only 1 gene per disease (because we know that in OMIM, the disease is defined based on variation in that gene). If a disease is associated with more than one gene, then the genes are not defining the disease, and therefore we do not bring this gene annotation into Mondo.

We documented this in multiple places, but I don't have time to look for the links, sorry.

Note: The 11 Phenotype-Gene Relationships for Gastric Cancer (OMIM:613659) would get into the Monarch KG, but NOT into Mondo

twhetzel commented 3 days ago

@sabrinatoro Joe's question is related to how OMIM should initially be modeled as an ontology, e.g. omim.ttl, as the content exists in OMIM itself. What we do with it from there, ie processing of omim.ttl to bring into Mondo, involves further steps that are out of scope for this question currently.

The initial modeling of OMIM in the omim.ttl file currently looks like this: even an entry like https://omim.org/entry/613659 for 'gastric cancer' has only 1 gene association viewable in Protege (the other 10 are viewable in the ttl file when viewing it in a text editor), while OMIM itself has 11 associations. Here is a screenshot of 'gastric cancer' in the omim.ttl file. While the difference between what is viewable in Protege and in the ttl file itself is not that important, it's not clear why only 1 of the 11 genes listed in OMIM for the 'gastric cancer' entry has the association RO:0004013, which is then later converted to RO:0004003. Is there a flag in the OMIM entry/files used to create this association that determines that IL1B is the causal gene, as opposed to the other 10 genes listed, or is this representation in the omim.ttl file incorrect?

My concern is that if the initial modeling of OMIM content in the omim.ttl file is not correct, the further transformations that bring this content into Mondo will also not be correct, since the starting content is wrong. This is related to your comments, Sabrina, about issues with the omim pipeline/gene2disease pipeline for Mondo.

[Image: Screenshot 2024-10-24 at 3 23 34 PM]

twhetzel commented 3 days ago

FYI - there is now a thread in Slack in mondo-ingest about this too.

twhetzel commented 2 days ago

Joe and I reviewed this further, and my suspicion is that there is an issue in how associations are counted, which leads to the RO property being applied incorrectly in the omim.ttl file. More to come soon.
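While that investigation is ongoing, one way to see how widespread the problem is would be a consistency check along these lines. This is only a sketch under assumptions: the input is taken to be a flat list of (disease, gene, property) tuples extracted from omim.ttl by some other means, `find_suspect_diseases` is a hypothetical helper, and every row in the example other than the IL1B one is made up.

```python
def find_suspect_diseases(associations, causal="RO:0004013"):
    """Flag diseases that have several genes yet still carry the causal property.

    associations: iterable of (disease_id, gene, property) tuples.
    """
    by_disease = {}
    for disease, gene, prop in associations:
        by_disease.setdefault(disease, []).append((gene, prop))
    return sorted(
        disease
        for disease, rows in by_disease.items()
        if len({gene for gene, _ in rows}) > 1
        and any(prop == causal for _, prop in rows)
    )

# Only the IL1B row reflects this thread; the other rows are hypothetical.
rows = [
    ("OMIM:613659", "IL1B", "RO:0004013"),
    ("OMIM:613659", "GENE_B", "RO:0003302"),
    ("OMIM:000000", "GENE_C", "RO:0004013"),
]
print(find_suspect_diseases(rows))
```

Any ID returned here would be a disease where the count used when choosing the property apparently did not match the number of genes actually attached to it.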