monarch-initiative / mondo-ingest

Coordinating the mondo-ingest with external sources
https://monarch-initiative.github.io/mondo-ingest/
6 stars 3 forks source link

GARD : Mondo mapping #283

Open sabrinatoro opened 1 year ago

sabrinatoro commented 1 year ago

I reviewed https://docs.google.com/spreadsheets/d/1w5Xnzr5uNFcPrQqCT8mGBFHnGhwXDfQVzHKofw6kB7c/edit#gid=244617092

1) the created mapping (when there is a MONDO:id) are correct for all the one I checked examples:

subject_id in_gard in_mondo status GARD->Orphanet GARD->Orphanet->OMIM GARD->Orphanet->Mondo
GARD:10001 TRUE FALSE gard Orphanet:53689 OMIM:214700 MONDO:0008964
GARD:10045 TRUE FALSE gard Orphanet:79303 OMIM:235555 MONDO:0009339
GARD:10061 TRUE FALSE gard Orphanet:50809 OMIM:609655 MONDO:0012330

2) in the cases where multiple OMIM are associated for 1 GARD and/or 1 Orphanet: These OMIMs id represents children terms of the Mondo term represented by the corresponding GARD/ORDO. (Note that the Mondo term is mapped to GARD/Orphanet correctly for the terms I checked). Therefore, if one needs to make a mapping between omim and gard, these omim ids would represent more specific concept than GARD. (omim = skos:narrowMatch of GARD ??I can never remember if it is the correct way to look at it??) Examples:

subject_id in_gard in_mondo status GARD->Orphanet GARD->Orphanet->OMIM GARD->Orphanet->Mondo
GARD:10413 TRUE FALSE gard Orphanet:730 OMIM:600666|OMIM:613095|OMIM:173900 MONDO:0004691
GARD:156 TRUE FALSE gard Orphanet:588 OMIM:613153|OMIM:613150|OMIM:613154|OMIM:615350|OMIM:253800|OMIM:615181|OMIM:236670|OMIM:253280 MONDO:0018939

3) I noticed that some terms do not have a Mondo mapping. Most (maybe all?) of these have the same orphanet and omim ID (the number is the same for the orphanet and omim id). I think the orphanet IDs are incorrect because I couldn't find them in orphanet. The OMIM ID do match correctly to the Mondo term for which this omim id is an exact match Examples:

subject_id in_gard in_mondo status GARD->Orphanet GARD->Orphanet->OMIM GARD->Orphanet->Mondo
GARD:15246 TRUE FALSE gard Orphanet:277600 OMIM:277600  
GARD:15247 TRUE FALSE gard Orphanet:278150 OMIM:278150  
GARD:15248 TRUE FALSE gard Orphanet:615511 OMIM:615511  
GARD:15240 TRUE FALSE gard Orphanet:273800 OMIM:273800  

(the orphanet ID is incorrect and does not exist; the omim ids are correctly matched to the gard ids and could be used for mapping to Mondo)

4) I haven't seen an example of the following, but it might be a good "sanity check" to ensure that the Mondo corresponding to the orphanet ID and to the OMIM ID (when there is only one) are identical. If they are, I think we can adopt the mappings safely. If they are not, it would require review (and would probably due to change in parentage in OMIM or splitting of omim terms in mondo that are not represented in orphanet)

Please let me know if you have any questions. @matentzn @joeflack4

matentzn commented 1 year ago

@sabrinatoro this is an excellent analysis, Thanks a ton. @joeflack4 the most pressing issue is to figure out 3 - this must be a mistake in the data send to other, but double check if it really is a mistake or not.

joeflack4 commented 1 year ago

Thank you @sabrinatoro! I sent some questions about some confusions I was having about their data source, but I should have spotted this. I followed up on Figure 3, and I found out that there are actually a ton of OMIM mappings that I mistakenly thought were ORDO mappings. I think we will get up to ~2,600 new mappings via GARD::OMIM::Mondo, which could bring our total mappings to about ~10,000/12,410. I will re-run this analysis soon!

joeflack4 commented 1 year ago

@sabrinatoro I think the main concerns in this issue have been addressed, though perhaps we want to keep it open just until we're sure that mappings are good? I have not updated anything that would require the spreadsheet to change, but I did create a proper GARD SSSOM and put it in releases: https://github.com/monarch-initiative/gard/releases

matentzn commented 1 year ago

Great work Joe. The subject and Object categories in your sssom files are not well formed - they should be proper pipe separated lists, e.g. Disease|Morphol..

matentzn commented 1 year ago

@sabrinatoro I think you can close this issue when

sabrinatoro commented 1 year ago

@joeflack4 @matentzn Thank you! I finally have more time and will review this now. I have on my list that I should review the 160 ids that are missing and check whether I can map them. Is that correct? (in addition to general check)

matentzn commented 1 year ago

I would be interested to see why they are not mapped, since we are supposedly on par with Orphanet / OMIM. So when you do the 160, try to map the corresponding OMIM / Orphanet IDs at the same time (or provide us with feedback why we had excluded them)

joeflack4 commented 1 year ago

@matentzn wrote:

Great work Joe. The subject and Object categories in your sssom files are not well formed - they should be proper pipe separated lists, e.g. Disease|Morphol..

Roger that, will fix that!

sabrinatoro commented 1 year ago

There are 175 gard terms that are unmapped in Mondo

@matentzn @nicolevasilevsky @joeflack4 working spreadsheet : https://docs.google.com/spreadsheets/d/1DTedQy0B19XtDWa9O0WePL5SztHL_R_HzphoAwywwwE/edit#gid=0

joeflack4 commented 1 year ago

Great progress Sabrina!

QUESTION: I am assuming that these will be x-ref to Mondo via omim/orphanet automatically when we re-run the pipeline. Is that correct?

I think so. If I run the full pipeline, it will download a new mondo.sssom.tsv from http://purl.obolibrary.org/obo/mondo/mappings/mondo.sssom.tsv, so as long as that gets updated, we're good.

matentzn commented 1 year ago

Figuring whats up with the 83 is highest priority! These must be traced through the entire pipeline - make sure that you do not only look at your exclusion list, but the extensional exclusion list:

https://github.com/monarch-initiative/mondo-ingest/blob/main/src/ontology/reports/ordo_exclusion_reasons.robot.template.tsv

sabrinatoro commented 1 year ago

They are not in the extensional exclusion list either.

extensional exclusion list: https://github.com/monarch-initiative/mondo-ingest/blob/main/src/ontology/reports/ordo_exclusion_reasons.robot.template.tsv

Is this the "full" list (e.g. including all the children terms to also be excluded)?

matentzn commented 1 year ago

Spooky!

sabrinatoro commented 1 year ago

Spooky!

I KNOW!!! :-D I cannot find these anywhere... I am reviewing with @nicolevasilevsky as maybe she has them in any of her lists... but I am very surprised that they are not in any of the exclusion lists or the slurp/mapping lists.

nicolevasilevsky commented 1 year ago

👻

I'll look into it and report back

nicolevasilevsky commented 1 year ago

I am unsure why these aren't on the ORDO exclusion list but I think all of them can be.

matentzn commented 1 year ago

We have to make sure. @joeflack4 can you please identify 3 terms that are not excluded an not in Mondo and trace them through the system so we know what they are not suggested by the migration code?

joeflack4 commented 1 year ago

Sabrina:

Is this the "full" list (e.g. including all the children terms to also be excluded)?

Yep!

@matentzn I'll do that this week.

joeflack4 commented 1 year ago

@sabrinatoro @matentzn (copy/pasting my PR comment here)

Take a look at the ordo_mapping_status-with_slurp_info.tsv that I put in GoogleSheets.

I have a codebook worksheet that explains the columns and makes it clearer what's considered migratable/slurpable:

Field Meaning
possible_slurp_candidate This means that the slurp script was initially considering these as candidates to check for slurping. It means that they are not mapped, not excluded, and not deprecated. The next step after this is checking its parents.
parent_obsolete | delimited list of True/False as to whether or not the parent has been obsoleted in Mondo.
qualified_parent_mondo_ids | delimited list of qualified parents. For it to be "qualified", that means it must be (i) mapped to Mondo as a skos:exactMatch or skos:narrowMatch, and (ii) not obsoleted in Mondo.
actual_slurp_candidate This means that the term is migratable/slurpable and will appear in slurp/*.tsv. For a term to be migratable, it must:(i) meet the conditions of 'possible_slurp_candidate', but also (ii) Either (a) it has no parents, or (b) - ALL its parents must be mapped, and - must have 1+ qualified_parent_mondo_ids among those parents (meaning at least one must be an skos:exactMatch or skos:NarrowMatch and not obsoleted in Mondo).

Findings Based on this analysis, it looks like the reasons we have so many ORDO terms that are not excluded and not mapped but yet are not showing up as slurpable is that 100% of them (n=117) have parents that are all marked obsolete in Mondo.

joeflack4 commented 1 year ago

@matentzn @sabrinatoro Questions

  1. Docs: Given the intricate logic involved with determining what it means to be an "actual_slurp_candidate", would it help if we updated the docs with what I've written here? All we've got written there is how to run the workflow and what it means at a high level.
  2. *_mapping_status-with_slurp_info.tsv: If this would be helpful in the future, I could make this a regular output.
  3. Further updates if requested i. Unmapped parents and missing info: If a term has any unmapped parents, it is designated as not migratable/slurpable; not an "actual_slurp_candidate". In my outputs, I gave a lot of information about parents, but for these unmapped parents I was unable to fetch their label or obsolete status within Mondo. If you'd like I can fetch the label / obsolete status from the corresponding ontology. ii. Parent status for is_deprecated,is_excluded: I included whether or not the parent was mapped, but if its helpful I can add columns for these as well.
matentzn commented 1 year ago

Ahhh, ok! This is clear. Having a parent that is obsolete is not a reason to be excluded in the slurp. A mapped parent that is obsolete can be ignored by the has parent check. If there are no other parents, then the migration should proceed in the same way as if there are no parents at all.

matentzn commented 1 year ago

I absolutely think documenting this better would be a good idea! Thanks!

joeflack4 commented 1 year ago

NP! I addressed the obsolete parents issue as well as some other problems I noticed, and added the documentation in #316.

Gonna re-assign this to Sabrina now. I think Nico may need to run the pipeline first, but I think we are just needing to verify that all of these GARD Mondo mappings are good.

matentzn commented 1 year ago

This is a long issue and important - @sabrinatoro can you move this forward and let me know what else is needed? The goal is to:

  1. Delete all gard mappings currently in Mondo
  2. replace them by the new GARD mappings
joeflack4 commented 1 year ago

Nico asked me to close this issue if it was done. I looked up and I saw that Nico has a comment w/ some checkboxes by which we can determine if we're done here.

  • all GARD ids are mapped (160 or so missing)
  • you have convinced yourself that Joes SSSOM file looks reasonable!

I think by SSSOM file he means the gard.sssom.tsv, or maybe he means the special spreadsheet I made.

So @sabrinatoro I will leave this open for now. But if you want to close this and make 1 or more, more specific issues in regards to anything else that needs to be done, we can do that too.

joeflack4 commented 6 months ago

@matentzn Don't mean to distract you but do you think this issue is still relevant?

matentzn commented 6 months ago

I assigned to myself, you can drop it off your radar.