Closed matentzn closed 10 months ago
medium-low priority.
By the way, I know that Chris' pipeline used `MedGenIDMappings.txt`. But there were some issues with the `medgen.sssom.tsv` that came out the other side. I can detail what those issues are if you like.
I was originally going to use his pipeline and that `medgen.sssom.tsv` to create the robot template, but instead I went directly to the template from `MedGenIDMappings.txt`.
If the SSSOM file is still useful and you need nothing more from the MedGen team other than for them to keep `MedGenIDMappings.txt` updated, I could easily transform it into an SSSOM TSV.
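For illustration, here is a minimal sketch of what that transform could look like. It assumes the pipe-delimited layout seen in the examples in this thread (CUI, label, source ID, source) and that header lines start with `#`. The `SOURCE_TO_PREFIX` table and the function names are hypothetical, and the predicate and justification values are simply copied from the existing `medgen.sssom.tsv` rows rather than being a considered choice.

```python
import csv
import io

# Assumed mapping from the "source" column to CURIE prefixes. This is only a
# guess based on the prefixes that show up elsewhere in this thread.
SOURCE_TO_PREFIX = {
    "OMIM": "OMIM",
    "MedGen": "MEDGEN",
    "SNOMEDCT_US": "SCTID",
    "MeSH": "MESH",
}


def to_curie(source_id, source):
    """Turn a (source_id, source) pair into a CURIE."""
    if ":" in source_id:  # e.g. HP:0001370 and MONDO:0008383 are already CURIEs
        return source_id
    if source == "Orphanet":  # e.g. Orphanet_284130
        return source_id.replace("_", ":", 1)
    return f"{SOURCE_TO_PREFIX.get(source, source)}:{source_id}"


def mappings_to_sssom(lines):
    """Convert MedGenIDMappings.txt-style lines into SSSOM TSV text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(
        ["subject_id", "subject_label", "predicate_id", "object_id", "mapping_justification"]
    )
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # assumed: header lines start with '#'
            continue
        cui, label, source_id, source = line.split("|")[:4]
        writer.writerow(
            [
                f"MEDGENCUI:{cui}",
                label,
                "oboInOwl:hasDbXref",  # copied from the existing file, not a recommendation
                to_curie(source_id, source),
                "semapv:UnspecifiedMatching",
            ]
        )
    return out.getvalue()
```

A real SSSOM file would also carry a metadata header with a `curie_map`; that part is left out of this sketch.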
But there were some issues with the medgen.sssom.tsv that came out the other side. I can detail what those issues are if you like.
Would like to hear em!
Thanks, the rest sounds reasonable. Both sssom and ROBOT template are derived from the same file, right?
medgen.sssom.tsv
Both sssom and ROBOT template are derived from the same file, right?
Sort of. `MedGenIDMappings.txt` is an input to both.
ROBOT template is created using `MedGenIDMappings.txt` as a direct input, and that is its only input.
`medgen.sssom.tsv` is created at the end of Chris' pipeline, involving steps like: inputs --> `medgen.obo` --> `medgen-disease-extract.owl` --> obographs format --> `medgen.sssom.tsv`, w/ some more intermediate steps as well. There might be other inputs though which provide more mappings (proxy or direct, I'm not sure). See 'problem 2' below.
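To make the ROBOT template route concrete, here is a rough sketch of going straight from `MedGenIDMappings.txt` lines to a template that keeps only the Mondo rows. The column names (`mondo_id`, `xref`), the function name, and the `MEDGENCUI:` prefix on the xref are assumptions for illustration; the actual template used here may be laid out differently.

```python
import csv
import io


def mondo_robot_template(lines):
    """Build ROBOT template text from MedGenIDMappings.txt-style lines,
    keeping only rows whose source column is MONDO."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["mondo_id", "xref"])  # human-readable header (hypothetical names)
    writer.writerow(["ID", "A oboInOwl:hasDbXref"])  # ROBOT template strings
    for line in lines:
        fields = line.strip().split("|")
        if len(fields) < 4 or fields[3] != "MONDO":
            continue  # skips headers, blank lines, and non-Mondo sources
        cui, mondo_id = fields[0], fields[2]
        writer.writerow([mondo_id, f"MEDGENCUI:{cui}"])  # prefix is an assumption
    return out.getvalue()
```

Since the file is the only input, there is nothing between it and the template where extra or rewritten IDs could sneak in.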
`medgen.sssom.tsv` problems
Would like to hear em! (some issues with the medgen.sssom.tsv)
Looks like I did take some good notes about this.
Example:
`MedGenIDMappings.txt`:
C0003886|Arthrogryposis multiplex congenita|MONDO:0015168|MONDO|
`medgen.sssom.tsv`:
MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref MONDO:MONDO_0008383 semapv:UnspecifiedMatching
Possible causes: Gonna guess it's some bug in the Perl.
2. Has more mappings than what's in `MedGenIDMappings.txt`
I guess this is not a problem for us unless it has more Mondo mappings, which I don't think it does (but I don't know if I checked). But it does have more of other kinds of mappings.
Possible causes:
I just investigated (1), but I'm really not sure. There's `MedGen_HPO_Mapping.txt` and `MedGen_HPO_OMIM_Mapping.txt`, and I haven't analyzed whether these are (a) subsets of `MedGenIDMappings.txt`, or (b) disjoint. Further, it seems that most of the extra mappings in `medgen.sssom.tsv` are not HPO or OMIM. Maybe they come from parsing one of the RRF files? Or maybe Chris has introduced some kind of proxy mappings using Mondo as a proxy? I'm just still not sure where they're coming from.
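The subset-vs-disjoint question could be settled with a small set comparison, sketched below. I haven't checked the column layout of the HPO files, so this assumes each file has already been parsed into a set of `(cui, xref)` pairs; the function name is made up.

```python
def relation(candidate, reference):
    """Classify how one set of (cui, xref) pairs relates to another."""
    if candidate <= reference:
        return "subset"
    if candidate.isdisjoint(reference):
        return "disjoint"
    return "overlapping"


# Pairs taken from the C0003873 examples in this thread.
id_mappings = {("C0003873", "HP:0001370"), ("C0003873", "OMIM:180300")}
print(relation({("C0003873", "HP:0001370")}, id_mappings))    # subset
print(relation({("C0003873", "GTR:AN1449912")}, id_mappings))  # disjoint
```

Anything that comes back as "overlapping" would need a closer look at the pairs outside the intersection.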
Example using randomly chosen `C0003873`:
`medgen.sssom.tsv`:
```tsv
MEDGENCUI:C0003873	Rheumatoid arthritis	oboInOwl:hasDbXref	GTR:AN1449912	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	oboInOwl:hasDbXref	GTR:AN1449922	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	oboInOwl:hasDbXref	GTR:AN1449923	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	oboInOwl:hasDbXref	HP:0001370	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	oboInOwl:hasDbXref	MEDGEN:2078	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	oboInOwl:hasDbXref	MESH:D001172	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	oboInOwl:hasDbXref	MONDO:AN1608580	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	oboInOwl:hasDbXref	MONDO:AN1608581	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	oboInOwl:hasDbXref	MONDO:AN1608582	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	oboInOwl:hasDbXref	MONDO:AN1616314	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	oboInOwl:hasDbXref	MONDO:AN1616315	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	oboInOwl:hasDbXref	MONDO:AN1631304	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	oboInOwl:hasDbXref	MONDO:MONDO_0008383	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	oboInOwl:hasDbXref	NCIT:C2884	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	oboInOwl:hasDbXref	OMIM:180300	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	oboInOwl:hasDbXref	OMIM:607218	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	oboInOwl:hasDbXref	Orphanet:284130	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	oboInOwl:hasDbXref	SCTID:69896004	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	owl:equivalentClass	MEDGENCUI:C0085574	Palindromic rheumatism	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	owl:equivalentClass	MEDGENCUI:C2931281	Sjögren-Mikulicz syndrome	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	owl:equivalentClass	MEDGENCUI:C5544815	rhupus syndrome	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	owl:equivalentClass	UMLS:C0085574	Palindromic rheumatism	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	owl:equivalentClass	UMLS:C2931281	Sjögren-Mikulicz syndrome	semapv:UnspecifiedMatching
MEDGENCUI:C0003873	Rheumatoid arthritis	owl:equivalentClass	UMLS:C5544815	rhupus syndrome	semapv:UnspecifiedMatching
```
`MedGenIDMappings.txt`:
```
C0003873|Rheumatoid arthritis|180300|OMIM|
C0003873|Rheumatoid arthritis|2078|MedGen|
C0003873|Rheumatoid arthritis|69896004|SNOMEDCT_US|
C0003873|Rheumatoid arthritis|D001172|MeSH|
C0003873|Rheumatoid arthritis|HP:0001370|HPO|
C0003873|Rheumatoid arthritis|MONDO:0008383|MONDO|
C0003873|Rheumatoid arthritis|Orphanet_284130|Orphanet|
```
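One quick way to see where the extra mappings cluster is to tally the xref rows by object prefix. The sketch below assumes tab-separated rows with the predicate in the third column and the object ID in the fourth, as in the rows quoted in this thread; the function name is hypothetical.

```python
from collections import Counter


def count_object_prefixes(sssom_rows):
    """Count oboInOwl:hasDbXref rows per object CURIE prefix."""
    counts = Counter()
    for row in sssom_rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # not a data row
        predicate, object_id = fields[2], fields[3]
        if predicate != "oboInOwl:hasDbXref":
            continue  # e.g. skip owl:equivalentClass rows
        counts[object_id.split(":", 1)[0]] += 1
    return counts
```

Running this over the SSSOM rows for a CUI and comparing against the sources listed for the same CUI in `MedGenIDMappings.txt` should make prefixes like `GTR` stand out.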
3. IDs in CURIEs starting with `AN`
This occurs for multiple ontologies. Seen it with MONDO, and HP I think. Probably more.
Example: `MONDO:AN1713620`
How it appears in `medgen.sssom.tsv`:
MEDGENCUI:C0003886 Arthrogryposis oboInOwl:hasDbXref MONDO:AN1713620 semapv:UnspecifiedMatching
Possible matches in `MedGenIDMappings.txt`:
No exact match on `AN1713620`.
If I remove `AN` from it I locate only one line:
C5397243|Perimembranous outlet ventricular septal defect with anteriorly malaligned outlet septum|1713620|MedGen|
However, problems:
- It's a MedGen ID in `MedGenIDMappings.txt`, but a MONDO ID in `medgen.sssom.tsv`
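The manual lookup above can be scripted roughly as follows: detect a CURIE whose local ID starts with `AN`, strip the `AN`, and search the source ID column of `MedGenIDMappings.txt` for the remaining number. The regex and function name are my own; that `AN` IDs are always `AN` followed by digits is an assumption based on the examples seen so far.

```python
import re

# Assumed shape of the odd IDs: PREFIX:AN<digits>, e.g. MONDO:AN1713620
AN_CURIE = re.compile(r"^(?P<prefix>[^:]+):AN(?P<number>\d+)$")


def find_an_candidates(curie, mapping_lines):
    """Return MedGenIDMappings.txt rows whose source ID equals the CURIE's
    local ID with the leading 'AN' removed."""
    match = AN_CURIE.match(curie)
    if match is None:
        return []  # not an AN-style CURIE
    number = match.group("number")
    hits = []
    for line in mapping_lines:
        fields = line.strip().split("|")
        if len(fields) >= 4 and fields[2] == number:
            hits.append((fields[0], fields[1], fields[2], fields[3]))
    return hits
```

Comparing the CUI and source of each hit with the subject and prefix of the SSSOM row shows whether the candidate is a plausible match at all.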
Gonna guess it's some bug in the Perl.
did you report this to MedGen? :)
- Has more mappings than what's in MedGenIDMappings.txt
Is the stuff in the sssom file a strict subset?
MedGenIDMappings.txt seems to contain some irrelevant stuff we don't care about like GTR, MONDO:AN..
- IDs in CURIEs starting with AN
Definitely remove these and get clarity from medgen how to deal with!
Just to summarize, my very long comment above is about 2 ways we can get mappings: (a) robot template, or (b) the `medgen.sssom.tsv` generated from Chris's pipeline. Currently we're doing 'a'. The reason for that is that 'b' has many problems.
Joe: Gonna guess it's some bug in the Perl.
Nico: did you report this to MedGen? :)
This is about Chris' pipeline; it includes the Perl he wrote and results in outputs including `medgen.sssom.tsv`.
Joe: Has more mappings than what's in `MedGenIDMappings.txt`
Nico: Is the stuff in the sssom file a strict subset?
It's generated from `medgen-disease-extract.owl`. So yes, a subset.
Which is why it confuses me that it has more than `MedGenIDMappings.txt`. I'd have to look deeper if we want to see why. Either he has some very strange bugs, or maybe he's parsing out several of these compressed RRF files that have more mappings?
If we intend to continue with this ingest, maybe the easiest thing for us is to simply double check with Megan: "Should the only mappings we care about be located in `MedGenIDMappings.txt`?"
Nico: MedGenIDMappings.txt seems to contain some irrelevant stuff we don't care about like GTR, MONDO:AN..
No need to worry about non-Mondo stuff in that file; the robot template I've generated for us extracts only the Mondo mappings.
`MedGenIDMappings.txt` has neither GTR terms, nor these terms that start with `AN`. That's only coming from Chris' `medgen.sssom.tsv` for some reason.
Joe: IDs in CURIEs starting with AN
Nico: Definitely remove these and get clarity from medgen how to deal with!
See above.
This can be closed if you agree @joeflack4
Alright. I know we're still doing some related work with Megan. We can keep on top of that in that issue instead, and we can close this one.
https://ftp.ncbi.nlm.nih.gov/pub/medgen/ seems to have a bunch of interesting mapping files, let's ingest them as SSSOM:
E.g. https://ftp.ncbi.nlm.nih.gov/pub/medgen/MedGenIDMappings.txt.gz, HPO mappings, ORDO mappings, etc.
These are super useful for our pipelines later on.
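As a starting point for the ingest, a gzipped mapping file can be streamed without unpacking it first. This sketch assumes the file has already been downloaded locally, that it is pipe-delimited with CUI, label, source ID, and source columns, and that header lines start with `#`. The function name and dict keys are hypothetical.

```python
import gzip


def read_medgen_mappings(path):
    """Yield one dict per data row of a gzipped MedGenIDMappings-style file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):  # assumed header convention
                continue
            cui, label, source_id, source = line.split("|")[:4]
            yield {
                "cui": cui,
                "label": label,
                "source_id": source_id,
                "source": source,
            }
```

The other mapping files in that directory may have different columns, so each one would need its own check before reusing this reader.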