monarch-initiative / medgen

MedGen ingest.

Add mappings to medgen #6

Closed matentzn closed 10 months ago

matentzn commented 1 year ago

https://ftp.ncbi.nlm.nih.gov/pub/medgen/ seems to have a bunch of interesting mapping files. Let's ingest them as SSSOM.

E.g. https://ftp.ncbi.nlm.nih.gov/pub/medgen/MedGenIDMappings.txt.gz contains HPO mappings, ORDO mappings, etc.

These will be super useful for our pipelines later on.

matentzn commented 1 year ago

Medium-low priority.

joeflack4 commented 1 year ago

By the way, I know that Chris' pipeline used MedGenIDMappings.txt. But there were some issues with the medgen.sssom.tsv that came out the other side. I can detail what those issues are if you like.

I was originally going to use his pipeline and that medgen.sssom.tsv to create the robot template, but instead I went directly to the template from MedGenIDMappings.txt.

If the SSSOM file is still useful and you need nothing more from the MedGen team other than for them to keep MedGenIDMappings.txt updated, I could easily transform it into an SSSOM TSV.
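
A minimal sketch of what that transform could look like, assuming the pipe-delimited layout `CUI|name|source_id|source|` seen in the examples in this thread. The prefix map, the function names, and the SSSOM column names (`subject_id`, `mapping_justification`, etc.) are assumptions for illustration, not the actual pipeline code.

```python
import re

PREDICATE = "oboInOwl:hasDbXref"
JUSTIFICATION = "semapv:UnspecifiedMatching"

# Assumed map from the file's "source" column to CURIE prefixes.
# Sources not listed here are skipped.
PREFIXES = {
    "OMIM": "OMIM",
    "MeSH": "MESH",
    "SNOMEDCT_US": "SCTID",
    "MedGen": "MEDGEN",
    "HPO": "HP",
    "MONDO": "MONDO",
    "Orphanet": "Orphanet",
}


def to_curie(source, source_id):
    """Build a CURIE, dropping any prefix already embedded in the ID
    (e.g. 'HP:0001370', 'Orphanet_284130')."""
    prefix = PREFIXES.get(source)
    if prefix is None:
        return None
    local_id = re.split(r"[:_]", source_id)[-1]
    return f"{prefix}:{local_id}"


def medgen_lines_to_sssom(lines):
    """Yield one SSSOM-style row (as a dict) per usable input line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header/comment line
        fields = line.split("|")
        if len(fields) < 4:
            continue  # malformed row
        cui, label, source_id, source = fields[:4]
        object_id = to_curie(source, source_id)
        if object_id is None:
            continue
        yield {
            "subject_id": f"MEDGENCUI:{cui}",
            "subject_label": label,
            "predicate_id": PREDICATE,
            "object_id": object_id,
            "mapping_justification": JUSTIFICATION,
        }
```

Writing the rows out is then a matter of passing them to `csv.DictWriter` with `delimiter="\t"`.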

matentzn commented 1 year ago

But there were some issues with the medgen.sssom.tsv that came out the other side. I can detail what those issues are if you like.

Would like to hear em!

Thanks, the rest sounds reasonable. Both the SSSOM file and the ROBOT template are derived from the same file, right?

joeflack4 commented 1 year ago

Inputs: robot template vs medgen.sssom.tsv

Both sssom and ROBOT template are derived from the same file, right?

Sort of. MedGenIDMappings.txt is an input to both.

ROBOT template is created using MedGenIDMappings.txt as a direct input, and that is its only input.
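
Roughly, that direct route can be sketched as below. This is only an illustration under assumptions: the column names `mondo_id` and `xref_medgen`, the `MEDGENCUI:` prefix on the xref value, and the function name are hypothetical, and the real template may differ. The second row holds ROBOT template strings (`ID` for the term, `A <property>` for an annotation).

```python
def build_robot_template(lines):
    """Return ROBOT template TSV text holding only the Mondo rows
    from MedGenIDMappings.txt-style lines (CUI|name|source_id|source|)."""
    header = ["mondo_id", "xref_medgen"]        # hypothetical column names
    template = ["ID", "A oboInOwl:hasDbXref"]   # ROBOT template strings
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header/comment line
        fields = line.split("|")
        if len(fields) < 4:
            continue  # malformed row
        cui, _name, source_id, source = fields[:4]
        if source != "MONDO":
            continue  # keep Mondo mappings only
        rows.append([source_id, f"MEDGENCUI:{cui}"])
    return "\n".join("\t".join(r) for r in [header, template, *rows]) + "\n"
```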

medgen.sssom.tsv is created at the end of Chris' pipeline, involving steps like: inputs --> medgen.obo --> medgen-disease-extract.owl --> obographs format --> medgen.sssom.tsv, with some more intermediate steps as well. There might be other inputs, though, which provide more mappings (proxy or direct, I'm not sure). See 'problem 2' below.

medgen.sssom.tsv problems

Would like to hear em! (some issues with the medgen.sssom.tsv)

Looks like I did take some good notes about this.

1. MONDO:MONDO_ instances

Example: MedGenIDMappings.txt: C0003886|Arthrogryposis multiplex congenita|MONDO:0015168|MONDO|

medgen.sssom.tsv: MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref MONDO:MONDO_0008383 semapv:UnspecifiedMatching

Possible causes: Gonna guess it's some bug in the Perl.
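
If we ever did want to use that file, a post-processing workaround is simple enough. A small sketch, assuming the only malformation is the prefix being repeated with an underscore (as in `MONDO:MONDO_0008383`); the function name is made up for illustration.

```python
import re

# Matches CURIEs whose local part repeats the prefix, e.g. "MONDO:MONDO_0008383".
DOUBLED_PREFIX = re.compile(r"^(?P<prefix>[A-Za-z][\w.]*):(?P=prefix)_(?P<local>.+)$")


def fix_doubled_prefix(curie):
    """Return the CURIE with a repeated prefix collapsed; others pass through."""
    match = DOUBLED_PREFIX.match(curie)
    if match is None:
        return curie
    return f"{match['prefix']}:{match['local']}"
```

This only patches the symptom; it doesn't explain where the doubled prefix comes from.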

2. Has more mappings than what's in MedGenIDMappings.txt

I guess this is not a problem for us unless it has more Mondo mappings, which I don't think it does (but I don't know if I checked). But it does have more of other kinds of mappings.

Possible causes:

  1. Maybe these mappings are coming from another input file
  2. Some bug?

I just investigated (1), but I'm really not sure. There's MedGen_HPO_Mapping.txt and MedGen_HPO_OMIM_Mapping.txt, and I haven't analyzed whether these are (a) subsets of MedGenIDMappings.txt, or (b) disjoint. Further, it seems that most of the extra mappings in medgen.sssom.tsv are not HPO or OMIM. Maybe they come from parsing one of the RRF files? Or maybe Chris has introduced some kind of proxy mappings using Mondo as a proxy? I'm just still not sure where they're coming from.
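
The subset-vs-disjoint question could be settled with a set comparison along these lines. It's a sketch that assumes each file has already been reduced to `(CUI, object ID)` pairs; the column layouts of the HPO files differ, so that parsing step is left out, and the function name is hypothetical.

```python
def compare_mappings(candidate, reference):
    """Classify how `candidate` pairs relate to `reference` pairs.

    Returns (relation, extras), where relation is "subset", "disjoint",
    or "partial overlap", and extras are pairs found only in `candidate`.
    """
    candidate, reference = set(candidate), set(reference)
    extras = candidate - reference
    if not extras:
        relation = "subset"
    elif not (candidate & reference):
        relation = "disjoint"
    else:
        relation = "partial overlap"
    return relation, extras
```

The same function would work for comparing medgen.sssom.tsv against MedGenIDMappings.txt, and `extras` would list exactly the mappings whose origin is unclear.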

Example using randomly chosen C0003873:

MedGenIDMappings.txt

```
C0003873|Rheumatoid arthritis|180300|OMIM|
C0003873|Rheumatoid arthritis|2078|MedGen|
C0003873|Rheumatoid arthritis|69896004|SNOMEDCT_US|
C0003873|Rheumatoid arthritis|D001172|MeSH|
C0003873|Rheumatoid arthritis|HP:0001370|HPO|
C0003873|Rheumatoid arthritis|MONDO:0008383|MONDO|
C0003873|Rheumatoid arthritis|Orphanet_284130|Orphanet|
```

medgen.sssom.tsv

```tsv
MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref GTR:AN1449912 semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref GTR:AN1449922 semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref GTR:AN1449923 semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref HP:0001370 semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref MEDGEN:2078 semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref MESH:D001172 semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref MONDO:AN1608580 semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref MONDO:AN1608581 semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref MONDO:AN1608582 semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref MONDO:AN1616314 semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref MONDO:AN1616315 semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref MONDO:AN1631304 semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref MONDO:MONDO_0008383 semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref NCIT:C2884 semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref OMIM:180300 semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref OMIM:607218 semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref Orphanet:284130 semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis oboInOwl:hasDbXref SCTID:69896004 semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis owl:equivalentClass MEDGENCUI:C0085574 Palindromic rheumatism semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis owl:equivalentClass MEDGENCUI:C2931281 Sjögren-Mikulicz syndrome semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis owl:equivalentClass MEDGENCUI:C5544815 rhupus syndrome semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis owl:equivalentClass UMLS:C0085574 Palindromic rheumatism semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis owl:equivalentClass UMLS:C2931281 Sjögren-Mikulicz syndrome semapv:UnspecifiedMatching
MEDGENCUI:C0003873 Rheumatoid arthritis owl:equivalentClass UMLS:C5544815 rhupus syndrome semapv:UnspecifiedMatching
```

3. IDs in CURIEs starting with AN

This occurs for multiple ontologies. I've seen it with MONDO, and I think HP as well. Probably more.

Example: MONDO:AN1713620

How it appears in medgen.sssom.tsv: MEDGENCUI:C0003886 Arthrogryposis oboInOwl:hasDbXref MONDO:AN1713620 semapv:UnspecifiedMatching

Possible matches in MedGenIDMappings.txt: No exact match on AN1713620.

If I remove the AN from it, I find only one line: C5397243|Perimembranous outlet ventricular septal defect with anteriorly malaligned outlet septum|1713620|MedGen|

However, problems:

  1. This is a MedGen ID in MedGenIDMappings.txt, but a MONDO ID in medgen.sssom.tsv
  2. It is a different match. Different MedGenCUI and disease name between the 2 files.
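
Until the origin of these IDs is understood, they could simply be set aside. A sketch, assuming the suspect pattern is any CURIE whose local ID is `AN` followed by digits (e.g. `MONDO:AN1713620`, `GTR:AN1449912`); the function name is hypothetical, and the pattern could in principle catch a legitimate ID in some other namespace.

```python
import re

# Local ID of the form "AN" + digits, under any prefix.
AN_CURIE = re.compile(r"^[^:]+:AN\d+$")


def split_an_ids(curies):
    """Split CURIEs into (kept, set_aside) based on the AN pattern."""
    kept, set_aside = [], []
    for curie in curies:
        (set_aside if AN_CURIE.match(curie) else kept).append(curie)
    return kept, set_aside
```

Keeping the set-aside list, rather than silently dropping it, makes it easier to hand concrete examples to whoever can explain them.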

matentzn commented 1 year ago

Gonna guess it's some bug in the Perl.

Did you report this to MedGen? :)

  2. Has more mappings than what's in MedGenIDMappings.txt

Is the stuff in the sssom file a strict subset?

MedGenIDMappings.txt seems to contain some irrelevant stuff we don't care about like GTR, MONDO:AN..

  3. IDs in CURIEs starting with AN

Definitely remove these and get clarity from MedGen on how to deal with them!

joeflack4 commented 1 year ago

Just to summarize, my very long comment above is about the 2 ways we can get mappings: (a) the robot template, or (b) the medgen.sssom.tsv generated from Chris' pipeline. Currently we're doing 'a'. The reason is that 'b' has many problems.

Joe: Gonna guess it's some bug in the Perl.

Nico: did you report this to MedGen? :)

This is about Chris' pipeline; it includes the Perl he wrote and results in outputs including medgen.sssom.tsv.

Joe: Has more mappings than what's in MedGenIDMappings.txt

Nico: Is the stuff in the sssom file a strict subset?

It's generated from medgen-disease-extract.owl. So yes, it should be a subset, which is why it confuses me that it has more mappings than MedGenIDMappings.txt. I'd have to look deeper if we want to see why. Either he has some very strange bugs, or maybe he's parsing several of these compressed RRF files that have more mappings? If we intend to continue with this ingest, maybe the easiest thing for us is to simply double check with Megan: "Should the only mappings we care about be located in MedGenIDMappings.txt?"

Nico: MedGenIDMappings.txt seems to contain some irrelevant stuff we don't care about like GTR, MONDO:AN..

No need to worry about non-Mondo stuff in that file; the robot template I've generated for us extracts only the Mondo mappings. MedGenIDMappings.txt has neither GTR terms nor these terms that start with AN. Those only come from Chris' medgen.sssom.tsv, for some reason.

Joe: IDs in CURIEs starting with AN

Nico: Definitely remove these and get clarity from medgen how to deal with!

See above.

matentzn commented 10 months ago

This can be closed if you agree, @joeflack4.

joeflack4 commented 10 months ago

Alright. I know we're still doing some related work with Megan. We can keep on top of that in that issue instead, and we can close this one.