monarch-initiative / omim

Data ingest pipeline for OMIM.

OMIM gene equivalencies cause cycles/cliques #45

Closed matentzn closed 1 year ago

matentzn commented 2 years ago

HGNC-OMIM mappings (where the OMIM IDs pertain to genes) are represented as owl:equivalentClass axioms in the omim ingest. We should probably change that to skos:exactMatch (or drop them entirely, see below, and produce an SSSOM file for HGNC-OMIM instead), as the OMIM-HGNC mappings are not entirely safe: they cause proxy-merges on the HGNC side, like in this case:

image

@cmungall I noticed this when running robot reason:

ERROR Only equivalent classes that have been asserted are allowed. Inferred equivalencies are forbidden.
ERROR Equivalence: <http://purl.obolibrary.org/obo/NCIT_C179199> == <http://purl.obolibrary.org/obo/NCIT_C1127>
ERROR Equivalence: <http://omim.org/entry/306250> == <http://omim.org/entry/425000>
ERROR Equivalence: <http://omim.org/entry/430000> == <http://omim.org/entry/308385>
ERROR Equivalence: <http://omim.org/entry/312095> == <http://omim.org/entry/465000>
ERROR Equivalence: <http://omim.org/entry/312865> == <http://omim.org/entry/400020>
ERROR Equivalence: <http://omim.org/entry/400023> == <http://omim.org/entry/300357>
ERROR Equivalence: <http://omim.org/entry/300162> == <http://omim.org/entry/400011>
ERROR Equivalence: <http://omim.org/entry/147070> == <http://omim.org/entry/147010>
ERROR Equivalence: <http://omim.org/entry/146910> == <http://omim.org/entry/147070>
ERROR Equivalence: <http://omim.org/entry/609517> == <http://omim.org/entry/610067>
ERROR Equivalence: <http://omim.org/entry/300015> == <http://omim.org/entry/402500>
ERROR Equivalence: <http://omim.org/entry/403000> == <http://omim.org/entry/300151>
ERROR Equivalence: <http://omim.org/entry/146910> == <http://omim.org/entry/147010>
ERROR Equivalence: <http://omim.org/entry/313470> == <http://omim.org/entry/450000>
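
To see which entries collapse into one group rather than just a pair, the ERROR lines can be grouped into connected components. The following is a minimal sketch, assuming the log lines have been saved as text; the regular expression and the `equivalence_cliques` helper are illustrative names and are not part of robot or the omim ingest. It uses four of the lines reported above.

```python
import re
from collections import defaultdict

# Four of the ERROR lines reported above, copied verbatim.
LOG = """\
ERROR Equivalence: <http://omim.org/entry/147070> == <http://omim.org/entry/147010>
ERROR Equivalence: <http://omim.org/entry/146910> == <http://omim.org/entry/147070>
ERROR Equivalence: <http://omim.org/entry/146910> == <http://omim.org/entry/147010>
ERROR Equivalence: <http://omim.org/entry/306250> == <http://omim.org/entry/425000>
"""

# Assumed line shape: "Equivalence: <IRI> == <IRI>"
PAIR = re.compile(r"Equivalence: <([^>]+)> == <([^>]+)>")


def equivalence_cliques(log_text):
    """Group IRIs linked by reported equivalences (simple union-find)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right in PAIR.findall(log_text):
        parent[find(left)] = find(right)

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


cliques = equivalence_cliques(LOG)
for group in cliques:
    print(group)
```

On these four lines, 146910, 147010 and 147070 end up in a single group of three, while 306250 and 425000 form a pair.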

Our policy should probably be:

cmungall commented 2 years ago

This is fine. Just merge the OMIM entries, they are talking about the same gene in all cases I bet

I looked at the first pair

  1. https://omim.org/entry/306250 COLONY-STIMULATING FACTOR 2 RECEPTOR, ALPHA; CSF2RA
  2. https://omim.org/entry/425000 GRANULOCYTE-MACROPHAGE COLONY-STIMULATING FACTOR RECEPTOR, ALPHA SUBUNIT, Y-CHROMOSOMAL; CSF2RY

The first is an actual useful entry linked to a Mendelian disease

the second is old:


REFERENCES: Gough, N. M., Gearing, D. P., Nicola, N. A., Baker, E., Pritchard, M., Callen, D. F., Sutherland, G. R. Localization of the human GM-CSF receptor gene to the X-Y pseudoautosomal region. Nature 345: 734-736, 1990. [PubMed: 1972780]

Creation Date: Victor A. McKusick, 9/14/1992. Edit History: mimadm, 3/11/1994.


HGNC doesn't acknowledge the existence of a CSF2RY

I suggest curators spend 20 mins looking at all 14 pairs and compile some gene merge requests for omim

if you need to progress in the interim, just turn off robot equivalence failing and merge these, ideally take the most recent one as primary

or simply drop OMIM gene IDs that don't connect to diseases, we don't care about these

cmungall commented 2 years ago

or make a gene exclusion list, just like we do for diseases

matentzn commented 2 years ago
joeflack4 commented 2 years ago

Roger that. Just made a new release: https://github.com/monarch-initiative/omim/releases/tag/latest

@matentzn Just wanted to let you know that my GitHub action will generate omim.ttl, but not omim.sssom.tsv.

My make all command includes this sssom command below. But I had to change the GitHub action to do make build (which doesn't include the sssom step) instead of make all, because this depends on robot (I should really have it do sh run.sh robot).

sssom:
    robot convert -i omim.ttl -o omim.json
    sssom parse omim.json -I obographs-json -m data/metadata.sssom.yml -o omim.sssom.tsv

For now, will run this command and add this file manually.

matentzn commented 2 years ago

Don't worry about creating the SSSOM file here. If it is really just a parse with sssom-py, we will do it over at https://github.com/monarch-initiative/mondo-ingest

Thanks!

joeflack4 commented 2 years ago

Alrighty, sounds good to me~

matentzn commented 2 years ago

@joeflack4 In the latest release I still see these equivalencies:

OMIM:146910 OMIM:147010
OMIM:147070 OMIM:146910
OMIM:147070 OMIM:147010
OMIM:300015 OMIM:402500
OMIM:300151 OMIM:403000
OMIM:300162 OMIM:400011
OMIM:300357 OMIM:400023
OMIM:312095 OMIM:465000
OMIM:400020 OMIM:312865
OMIM:425000 OMIM:306250
OMIM:430000 OMIM:308385
OMIM:450000 OMIM:313470
OMIM:609517 OMIM:610067

You can simply check these by running

robot reason -i omim.ttl --equivalent-classes-allowed asserted-only

Can you ensure that these are reflected by the raw data, so we can tell Nicole to file an inquiry with OMIM?

Thank you! Nico

joeflack4 commented 2 years ago

@matentzn wow, my bad. I saw the subtask "to run new OMIM release" and just did that without reading back up to the rest of the issue.

Was a quick and easy change. I ran that robot command before the change, and saw those classes. Then I made the skos change, and now that command produces no output, as expected. New omim.ttl is in the root. Running the command to create a new release now.

edit: new release here: https://github.com/monarch-initiative/omim/releases/tag/2022-06-06

matentzn commented 2 years ago

Thank you @joeflack4: would you say the issue is permanently resolved (i.e. short of a bug in the OMIM data, it won't come back)?

joeflack4 commented 2 years ago

@matentzn Yep, it's fully resolved. It was just a one line change I needed to make. My script was explicitly labeling these as owl:equivalentClass. I just changed that to good old skos:exactMatch, so this should not come up again in the future.
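
For readers who want to picture the change, here is a minimal sketch of swapping the predicate on MIM-to-HGNC triples. It is not the actual ingest script: triples are modelled as plain tuples, `demote_gene_equivalences` is a made-up helper, and `HGNC_PREFIX` is a hypothetical IRI prefix chosen only for illustration. The two predicate IRIs are the standard OWL and SKOS ones.

```python
OWL_EQUIVALENT_CLASS = "http://www.w3.org/2002/07/owl#equivalentClass"
SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"

# Hypothetical IRI prefix, used only for this illustration.
HGNC_PREFIX = "http://example.org/hgnc/"


def demote_gene_equivalences(triples):
    """Replace owl:equivalentClass with skos:exactMatch on triples whose object is an HGNC IRI."""
    rewritten = []
    for subject, predicate, obj in triples:
        if predicate == OWL_EQUIVALENT_CLASS and obj.startswith(HGNC_PREFIX):
            predicate = SKOS_EXACT_MATCH
        rewritten.append((subject, predicate, obj))
    return rewritten


# The gene symbol stands in for a real HGNC identifier here.
triples = [
    ("http://omim.org/entry/306250", OWL_EQUIVALENT_CLASS, HGNC_PREFIX + "CSF2RA"),
    ("http://omim.org/entry/425000", OWL_EQUIVALENT_CLASS, HGNC_PREFIX + "CSF2RA"),
]
result = demote_gene_equivalences(triples)
for triple in result:
    print(triple)
```

After the rewrite no owl:equivalentClass triples remain for the reasoner to chain together, although both MIM entries still point at the same HGNC term.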

matentzn commented 2 years ago

OK. However, this just hides the issue from the reasoner: the cycles will still exist in the SSSOM files. Can you remind me about the outcome of your discussion with OMIM: should these cycles exist at all? Did they claim that such cycles can occasionally happen, and if so, why?

joeflack4 commented 2 years ago

@matentzn Now I think I'm understanding more what this issue is about.

In my opinion, we are not making the correct inference here. I was looking at my code comments, and back in November I wrote a comment where I was confused why we / Dazhi had chosen owl:equivalentClass for these (OMIM entry)::(HGNC Gene entry) relationships. I agree that the same kind of logical issue exists if we use skos:exactMatch as well.

I think that we should use some other kind of semantic relationship. This is sort of at the edge of my knowledge/expertise, though. @cmungall Can you recommend something? I'm looking at biolink. Does it make sense for us to use biolink:gene?


@matentzn If I show you the originating OMIM data structures, you will see that it doesn't seem right for us to say that these MIM terms are equivalent based on a proxy relationship to an HGNC term.

Looking at the first pair from the robot report: ERROR Equivalence: <http://omim.org/entry/306250> == <http://omim.org/entry/425000>

mim2gene.txt: We can see that they both do have a relationship to a common gene.

| MIM Number | MIM Entry Type | Entrez Gene ID (NCBI) | Approved Gene Symbol (HGNC) | Ensembl Gene ID (Ensembl) |
|---|---|---|---|---|
| 306250 | gene | 1438 | CSF2RA | ENSG00000198223 |
| 425000 | gene | 1438 | CSF2RA | ENSG00000198223 |

mimTitles.txt: But we can see that these MIM terms represent different things by looking at their titles:

| Prefix | MIM Number | Preferred Title; symbol | Alternative Title(s); symbol(s) | Included Title(s); symbols |
|---|---|---|---|---|
| Asterisk | 306250 | COLONY-STIMULATING FACTOR 2 RECEPTOR, ALPHA; CSF2RA | GRANULOCYTE-MACROPHAGE COLONY-STIMULATING FACTOR RECEPTOR, LOW AFFINITY, ALPHA SUBUNIT; GMCSFR | |
| Asterisk | 425000 | GRANULOCYTE-MACROPHAGE COLONY-STIMULATING FACTOR RECEPTOR, ALPHA SUBUNIT, Y-CHROMOSOMAL; CSF2RY | | |

So I think this is not an OMIM data quality issue. This is just an issue with how we are representing these relationships. @matentzn If I'm correct about that, we might want to rename this issue accordingly.

genemap2.txt: There's even more information in genemap2. Someone more versed in biology might be able to better understand the differences between these MIM terms than I can.

| Chromosome | Genomic Position Start | Genomic Position End | Cyto Location | Computed Cyto Location | MIM Number | Gene Symbols | Gene Name | Approved Gene Symbol | Entrez Gene ID | Ensembl Gene ID | Comments | Phenotypes | Mouse Gene Symbol/ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chrX | 1268813 | 1325217 | Xp22.32 | Xp22.33 | 306250 | CSF2RA, SMDP4 | Colony-stimulating factor-2 receptor, alpha, low-affinity, granulocyte-macrophage | CSF2RA | 1438 | ENSG00000198223 | order in PAR: pter-CSF2RA-IL3RA-ANT3-ASMT-MIC2-cen | Surfactant metabolism dysfunction, pulmonary, 4, 300770 (3) | |
| chrY | 1268813 | 1325217 | Yp11 | Yp11.2 | 425000 | CSF2RY | Granulocyte-macrophage colony-stimulating factor receptor, alpha subunit (Y chromosome) | CSF2RA | 1438 | ENSG00000198223 | 306250 = X homolog; distal to MIC2Y | | Csf2ra (MGI:1339754) |
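
One way to make the pattern in these two rows easier to spot across the whole file is to group MIM numbers by gene and list the chromosomes involved. This is a rough sketch under stated assumptions: only four of the genemap2 columns are used, the rows are typed in by hand rather than read from the file, and `mims_by_gene` is an illustrative name.

```python
from collections import defaultdict

# (chromosome, MIM number, approved gene symbol, Entrez gene ID),
# copied from the two genemap2 rows shown above.
ROWS = [
    ("chrX", "306250", "CSF2RA", "1438"),
    ("chrY", "425000", "CSF2RA", "1438"),
]


def mims_by_gene(rows):
    """Return genes that are referenced by more than one MIM entry."""
    by_gene = defaultdict(list)
    for chromosome, mim_number, symbol, entrez_id in rows:
        by_gene[(entrez_id, symbol)].append((mim_number, chromosome))
    return {gene: entries for gene, entries in by_gene.items() if len(entries) > 1}


shared = mims_by_gene(ROWS)
for (entrez_id, symbol), entries in shared.items():
    chromosomes = sorted({chromosome for _, chromosome in entries})
    print(symbol, entrez_id, entries, "chromosomes:", chromosomes)
```

For this pair the output shows one gene (CSF2RA, Entrez 1438) with one MIM entry on chrX and one on chrY, which matches the "306250 = X homolog" comment in the genemap2 row for 425000.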

Btw, I/we never asked the OMIM people about this (yet). The HGNC thing that we had asked them about was on a different topic; it was about inconsistent mappings between MIM terms and HGNC symbols. The conclusion of that discussion was that, when possible, we should use MIM term to HGNC ID mappings, because symbols change (Sabrina knows more about this stuff).

matentzn commented 2 years ago

Hmm, nice analysis, but... not sure. @cmungall seems to disagree if you read his comment above. I think these OMIM IDs are in fact genes, and we should ask OMIM to look at them, especially since there are so few.

joeflack4 commented 2 years ago

Ohh, I'm sorry. It's been a while since I looked at his comment. He already largely addressed this and looked into the same example. He evidently has a better understanding of this than I do.

I agree with Chris' suggestions as to how to handle this, i.e. one thing we can do is have curators look at these and potentially make merge requests. If you want me to generate a report / table w/ information about all of these MIM entries to help w/ that process, just let me know what fields you would like in that. And I can provide contact info for the OMIM folks if needed.

matentzn commented 2 years ago

Excellent! Just a table with the current conflicts would be great: basically all cases where two HGNC IDs point to one OMIM ID, or two OMIM IDs point to the same HGNC ID.
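
A conflict table of this kind could be produced along the following lines. This is only a sketch, not the report that was eventually generated: `conflict_rows` and the column names are made up, the gene symbol stands in for the numeric HGNC ID (which is not given in this thread), and the `OMIM:EXAMPLE` mappings are invented purely to exercise the second direction.

```python
import csv
import io
from collections import defaultdict


def conflict_rows(mappings):
    """Return (kind, shared_id, conflicting_ids) rows for mappings that are not one-to-one."""
    omim_to_hgnc = defaultdict(set)
    hgnc_to_omim = defaultdict(set)
    for omim_id, hgnc_id in mappings:
        omim_to_hgnc[omim_id].add(hgnc_id)
        hgnc_to_omim[hgnc_id].add(omim_id)

    rows = []
    for hgnc_id, omim_ids in sorted(hgnc_to_omim.items()):
        if len(omim_ids) > 1:
            rows.append(("many_omim_to_one_hgnc", hgnc_id, "|".join(sorted(omim_ids))))
    for omim_id, hgnc_ids in sorted(omim_to_hgnc.items()):
        if len(hgnc_ids) > 1:
            rows.append(("many_hgnc_to_one_omim", omim_id, "|".join(sorted(hgnc_ids))))
    return rows


MAPPINGS = [
    # From this thread (symbol used in place of the HGNC ID).
    ("OMIM:306250", "HGNC:CSF2RA"),
    ("OMIM:425000", "HGNC:CSF2RA"),
    # Invented placeholders, not real identifiers.
    ("OMIM:EXAMPLE", "HGNC:EXAMPLE_A"),
    ("OMIM:EXAMPLE", "HGNC:EXAMPLE_B"),
]

rows = conflict_rows(MAPPINGS)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["conflict_kind", "shared_id", "conflicting_ids"])
writer.writerows(rows)
print(buffer.getvalue())
```

Reporting both directions in one table keeps the two kinds of conflict visibly separate, so curators can triage gene merge requests independently of ambiguous HGNC assignments.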

Thanks!

joeflack4 commented 2 years ago

Link to the initial CSV and discussion is in pull request: https://github.com/monarch-initiative/omim/pull/64#discussion_r899300524