Closed matentzn closed 1 year ago
This is fine. Just merge the OMIM entries, they are talking about the same gene in all cases I bet
I looked at the first pair
The first is an actually useful entry linked to a Mendelian disease
the second is old:
▼ REFERENCES Gough, N. M., Gearing, D. P., Nicola, N. A., Baker, E., Pritchard, M., Callen, D. F., Sutherland, G. R. Localization of the human GM-CSF receptor gene to the X-Y pseudoautosomal region. Nature 345: 734-736, 1990. [PubMed: 1972780, related citations] [Full Text]
Creation Date:Victor A. McKusick : 9/14/1992 Edit History:mimadm : 3/11/1994
HGNC doesn't acknowledge the existence of a CSF2RY
I suggest curators spend 20 mins looking at all 14 pairs and compile some gene merge requests for omim
if you need to progress in the interim, just turn off robot equivalence failing and merge these, ideally take the most recent one as primary
or simply drop OMIM gene IDs that don't connect to diseases, we don't care about these
or make a gene exclusion list, just like we do for diseases
Roger that. Just made a new release: https://github.com/monarch-initiative/omim/releases/tag/latest
@matentzn Just wanted to let you know that my GitHub action will generate `omim.ttl`, but not `omim.sssom.tsv`.
My `make all` command includes this `sssom` command below. But I had to change the GitHub action to do `make build` (which doesn't include the `sssom` step) instead of `make all`, because this depends on `robot` (I should really have it do `sh run.sh robot`).
sssom:
	robot convert -i omim.ttl -o omim.json
	sssom parse omim.json -I obographs-json -m data/metadata.sssom.yml -o omim.sssom.tsv
For now, will run this command and add this file manually.
Don't worry about creating the sssom file here - if it is really just a parse with sssom-py, we will do it over at https://github.com/monarch-initiative/mondo-ingest
Thanks!
Alrighty, sounds good to me~
@joeflack4 In the latest release I still see these equivalencies:
OMIM:146910 OMIM:147010
OMIM:147070 OMIM:146910
OMIM:147070 OMIM:147010
OMIM:300015 OMIM:402500
OMIM:300151 OMIM:403000
OMIM:300162 OMIM:400011
OMIM:300357 OMIM:400023
OMIM:312095 OMIM:465000
OMIM:400020 OMIM:312865
OMIM:425000 OMIM:306250
OMIM:430000 OMIM:308385
OMIM:450000 OMIM:313470
OMIM:609517 OMIM:610067
You can simply check these by running
robot reason -i omim.ttl --equivalent-classes-allowed asserted-only
Can you ensure that these are reflected by the raw data, so we can tell Nicole to file an inquiry with OMIM?
Thank you! Nico
@matentzn wow, my bad. I saw the subtask "to run new OMIM release" and just did that without reading back up to the rest of the issue.
Was a quick and easy change. I ran that `robot` command before the change, and saw those classes. Then I made the `skos` change, and now that command produces no output, as expected. New `omim.ttl` is in the root. Running the command to create a new release now.
edit: new release here: https://github.com/monarch-initiative/omim/releases/tag/2022-06-06
Thank you @joeflack4: would you say the issue is permanently resolved (i.e. short of a bug in the OMIM data, it won't come back)?
@matentzn Yep, it's fully resolved. It was just a one-line change I needed to make. My script was explicitly labeling these as `owl:equivalentClass`. I just changed that to good old `skos:exactMatch`, so this should not come up again in the future.
Ok. However, this just hides the issue from the reasoner: the mappings will still exist in the SSSOM files. Can you remind me about the outcome of your discussion with OMIM: should these cycles exist at all? Did they claim that such cycles can occasionally happen, and if so, why?
@matentzn Now I think I'm understanding more what this issue is about.
In my opinion, we are not making the correct inference here. I was looking at my code comments, and back in November I wrote a comment where I was confused why we / Dazhi had chosen `owl:equivalentClass` for these (OMIM entry)::(HGNC Gene entry) relationships. I agree that the same kind of logical issue exists if we use `skos:exactMatch` as well.
I think that we should use some other kind of semantic relationship. This is sort of at the edge of my knowledge/expertise, though. @cmungall Can you recommend something? I'm looking at biolink. Does it make sense for us to use `biolink:gene`?
@matentzn If I show you the originating OMIM data structures, you will see that it doesn't seem right for us to say that these MIM terms are equivalent based on a proxy relationship to an HGNC term.
Looking at the first pair from the `robot` report:
ERROR Equivalence: <http://omim.org/entry/306250> == <http://omim.org/entry/425000>
mim2gene.txt
We can see that they both do have a relationship to a common gene.
| MIM Number | MIM Entry Type | Entrez Gene ID (NCBI) | Approved Gene Symbol (HGNC) | Ensembl Gene ID (Ensembl) |
| --- | --- | --- | --- | --- |
| 306250 | gene | 1438 | CSF2RA | ENSG00000198223 |
| 425000 | gene | 1438 | CSF2RA | ENSG00000198223 |
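To make the proxy effect concrete, here is a minimal sketch (not the ingest's actual code) of how asserting each MIM entry as equivalent to its gene lets a reasoner conclude that the two MIM entries are equivalent to each other. The rows are the two mim2gene.txt lines above; the `find`/`union` helpers and the `GENE:` prefix are made up for illustration.

```python
from itertools import combinations

# (MIM number, approved gene symbol) pairs taken from the mim2gene.txt rows above.
ROWS = [("306250", "CSF2RA"), ("425000", "CSF2RA")]

parent = {}

def find(node):
    """Return the representative of node's equivalence class."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]  # path halving
        node = parent[node]
    return node

def union(a, b):
    """Merge the classes of a and b (equivalence is symmetric and transitive)."""
    parent[find(a)] = find(b)

# Assert MIM == gene for every row, as the owl:equivalentClass axioms did.
for mim, symbol in ROWS:
    union("OMIM:" + mim, "GENE:" + symbol)  # "GENE:" is a hypothetical prefix

# Collect the MIM entries that ended up in the same class.
classes = {}
for node in list(parent):
    if node.startswith("OMIM:"):
        classes.setdefault(find(node), []).append(node)

inferred = sorted(
    pair
    for members in classes.values()
    for pair in combinations(sorted(members), 2)
)
print(inferred)  # [('OMIM:306250', 'OMIM:425000')]
```

Nothing in the input says 306250 == 425000 directly; the pair only falls out of the shared gene, which matches what the `robot` report flags.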
mimTitles.txt
But we can see that these MIM terms represent different things by looking at their titles:
| Prefix | MIM Number | Preferred Title; symbol | Alternative Title(s); symbol(s) | Included Title(s); symbols |
| --- | --- | --- | --- | --- |
| Asterisk | 306250 | COLONY-STIMULATING FACTOR 2 RECEPTOR, ALPHA; CSF2RA | GRANULOCYTE-MACROPHAGE COLONY-STIMULATING FACTOR RECEPTOR, LOW AFFINITY, ALPHA SUBUNIT; GMCSFR | |
| Asterisk | 425000 | GRANULOCYTE-MACROPHAGE COLONY-STIMULATING FACTOR RECEPTOR, ALPHA SUBUNIT, Y-CHROMOSOMAL; CSF2RY | | |
So I think this is not an OMIM data quality issue. This is just an issue with how we are representing these relationships. @matentzn If I'm correct about that, we might want to rename this issue accordingly.
genemap2.txt
There's even more information in `genemap2`. Maybe someone more versed in bio might be able to better understand the differences between these MIM terms than I.
| Chromosome | Genomic Position Start | Genomic Position End | Cyto Location | Computed Cyto Location | MIM Number | Gene Symbols | Gene Name | Approved Gene Symbol | Entrez Gene ID | Ensembl Gene ID | Comments | Phenotypes | Mouse Gene Symbol/ID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| chrX | 1268813 | 1325217 | Xp22.32 | Xp22.33 | 306250 | CSF2RA, SMDP4 | Colony-stimulating factor-2 receptor, alpha, low-affinity, granulocyte-macrophage | CSF2RA | 1438 | ENSG00000198223 | order in PAR: pter-CSF2RA-IL3RA-ANT3-ASMT-MIC2-cen | Surfactant metabolism dysfunction, pulmonary, 4, 300770 (3) | |
| chrY | 1268813 | 1325217 | Yp11 | Yp11.2 | 425000 | CSF2RY | Granulocyte-macrophage colony-stimulating factor receptor, alpha subunit (Y chromosome) | CSF2RA | 1438 | ENSG00000198223 | 306250 = X homolog; distal to MIC2Y | | Csf2ra (MGI:1339754) |
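One of the options floated earlier in the thread was to drop OMIM gene entries that have no disease link. A rough sketch of that filter, using only the MIM Number and Phenotypes columns from the two genemap2 rows above (the function name and the dict layout are assumptions, not the ingest's code):

```python
# MIM number -> Phenotypes column, copied from the two genemap2 rows above.
PHENOTYPES = {
    "306250": "Surfactant metabolism dysfunction, pulmonary, 4, 300770 (3)",
    "425000": "",  # no phenotype listed for the Y-chromosomal entry
}

def split_by_disease_link(phenotypes):
    """Return (kept, dropped) MIM numbers; hypothetical helper."""
    kept = sorted(mim for mim, pheno in phenotypes.items() if pheno.strip())
    dropped = sorted(mim for mim, pheno in phenotypes.items() if not pheno.strip())
    return kept, dropped

kept, dropped = split_by_disease_link(PHENOTYPES)
print("keep:", kept)
print("drop:", dropped)
```

For this pair the filter would keep 306250 and drop 425000. Whether an empty Phenotypes cell is a good enough signal for the other pairs is something a curator would still need to confirm.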
Btw, I/we never asked the OMIM people about this (yet). The HGNC thing that we had asked them about was on a different topic; it was about inconsistent mappings between MIM terms and HGNC symbols. The conclusion of that discussion is that when possible, we should use MIM term to HGNC ID mappings, because symbols change (Sabrina knows more about this stuff).
Hmm, nice analysis, but... not sure. @cmungall seems to disagree if you read his comment above. I think these OMIM ids are in fact genes, and we should ask OMIM to look at them. I mean, there are only a few of them.
Ohh, I'm sorry. It's been a while since I looked at his comment. He already largely addressed this / looked into my same example. He evidently has a better understanding of this than I.
I agree with Chris' suggestions as to how to handle this, i.e. one thing we can do is have curators look at these and potentially make merge requests. If you want me to generate a report / table w/ information about all of these MIM entries to help w/ that process, just let me know what fields you would like in that. And I can provide contact info for the OMIM folks if needed.
Excellent! Just a table with the current conflicts would be great, basically all cases where two HGNC IDs point to one OMIM ID or two OMIM IDs point to the same HGNC ID.
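A minimal sketch of such a report, assuming the mappings are available as (OMIM ID, HGNC symbol) pairs. Only the first two pairs come from this thread; the `OMIM:000001` row and the `GENE_A`/`GENE_B` symbols are invented to exercise the other direction.

```python
from collections import defaultdict

# First two pairs are from the thread; the last two are hypothetical.
MAPPINGS = [
    ("OMIM:306250", "CSF2RA"),
    ("OMIM:425000", "CSF2RA"),
    ("OMIM:000001", "GENE_A"),
    ("OMIM:000001", "GENE_B"),
]

def find_conflicts(mappings):
    """Return rows (direction, shared ID, conflicting IDs) for non 1:1 mappings."""
    omim_by_hgnc = defaultdict(set)
    hgnc_by_omim = defaultdict(set)
    for omim, hgnc in mappings:
        omim_by_hgnc[hgnc].add(omim)
        hgnc_by_omim[omim].add(hgnc)
    rows = []
    for hgnc, omims in sorted(omim_by_hgnc.items()):
        if len(omims) > 1:
            rows.append(("many OMIM -> one HGNC", hgnc, sorted(omims)))
    for omim, hgncs in sorted(hgnc_by_omim.items()):
        if len(hgncs) > 1:
            rows.append(("one OMIM -> many HGNC", omim, sorted(hgncs)))
    return rows

conflicts = find_conflicts(MAPPINGS)
for direction, shared, others in conflicts:
    print(direction, shared, "|".join(others), sep="\t")
```

Symbols are used here only because they are what the thread shows; keying on HGNC IDs instead would follow the earlier point that symbols change.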
Thanks!
Link to the initial CSV and discussion is in pull request: https://github.com/monarch-initiative/omim/pull/64#discussion_r899300524
HGNC-OMIM mappings (where OMIM ids pertain to genes) are represented as owl:equivalentClass axioms in omim ingest. We should probably change that to using skos:exactMatch instead (or drop them entirely, see below and instead produce an SSSOM file for HGNC-OMIM), as the OMIM-HGNC mappings are not entirely safe, or rather, they cause proxy-merges on the HGNC side, like in this case:
@cmungall I noticed this when running robot reason:
Our policy should probably be: