Open sabrinatoro opened 1 year ago
@sabrinatoro this is an excellent analysis, Thanks a ton. @joeflack4 the most pressing issue is to figure out 3 - this must be a mistake in the data send to other, but double check if it really is a mistake or not.
Thank you @sabrinatoro! I sent some questions about some confusions I was having about their data source, but I should have spotted this. I followed up on Figure 3, and I found out that there are actually a ton of OMIM mappings that I mistakenly thought were ORDO mappings. I think we will get up to ~2,600 new mappings via GARD::OMIM::Mondo, which could bring our total mappings to about ~10,000/12,410. I will re-run this analysis soon!
@sabrinatoro I think the main concerns in this issue have been addressed, though perhaps we want to keep it open just until we're sure that mappings are good? I have not updated anything that would require the spreadsheet to change, but I did create a proper GARD SSSOM and put it in releases: https://github.com/monarch-initiative/gard/releases
Great work Joe. The subject and Object categories in your sssom files are not well formed - they should be proper pipe separated lists, e.g. Disease|Morphol.
.
@sabrinatoro I think you can close this issue when
@joeflack4 @matentzn Thank you! I finally have more time and will review this now. I have on my list that I should review the 160 ids that are missing and check whether I can map them. Is that correct? (in addition to general check)
I would be interested to see why they are not mapped, since we are supposedly on par with Orphanet / OMIM. So when you do the 160, try to map the corresponding OMIM / Orphanet IDs at the same time (or provide us with feedback why we had excluded them)
@matentzn wrote:
Great work Joe. The subject and Object categories in your sssom files are not well formed - they should be proper pipe separated lists, e.g.
Disease|Morphol.
.
Roger that, will fix that!
There are 175 gard terms that are unmapped in Mondo
3 are mapped to an OMIM (1) or orphanet (2) term that has been recently added to Mondo QUESTION: I am assuming that these will be x-ref to Mondo via omim/orphanet automatically when we re-run the pipeline. Is that correct?
65 are mapped to an orphanet term that is excluded from Mondo
4 are mapped to OMIM which are in Mondo as "Mondo:relatedTo" and excluded in Mondo
2 are mapped to OMIM term that is excluded from Mondo QUESTION: should we also exclude these GARD terms?
17 are mapped to OMIM (1) or Orphanet (16) which are in Mondo as "MONDO:mondoIsBroaderThanSource" QUESTION: Should these orphanet/OMIM terms be excluded from Mondo? What about the corresponding GARD terms?
83 are mapped to orphanet (83) which are not in Mondo (as equivalent or other), not in the exclusion list, not in the new alignment files (for what I can see). Looking at a handful of these terms, it looks like they might refer to grouping terms (ie we might have decided to not add them. @nicolevasilevsky is it possible that they are on your "do not include" list but haven't made it to the official exclusion list yet?) QUESTION: how should we proceed with these? I think we should first figure out why they are not in Mondo... should we update the alignment pipeline for omim and orphanet and go from there?
@matentzn @nicolevasilevsky @joeflack4 working spreadsheet : https://docs.google.com/spreadsheets/d/1DTedQy0B19XtDWa9O0WePL5SztHL_R_HzphoAwywwwE/edit#gid=0
Great progress Sabrina!
QUESTION: I am assuming that these will be x-ref to Mondo via omim/orphanet automatically when we re-run the pipeline. Is that correct?
I think so. If I run the full pipeline, it will download a new mondo.sssom.tsv
from http://purl.obolibrary.org/obo/mondo/mappings/mondo.sssom.tsv, so as long as that gets updated, we're good.
Figuring whats up with the 83 is highest priority! These must be traced through the entire pipeline - make sure that you do not only look at your exclusion list, but the extensional exclusion list:
They are not in the extensional exclusion list either.
extensional exclusion list: https://github.com/monarch-initiative/mondo-ingest/blob/main/src/ontology/reports/ordo_exclusion_reasons.robot.template.tsv
Is this the "full" list (e.g. including all the children terms to also be excluded)?
Spooky!
Spooky!
I KNOW!!! :-D I cannot find these anywhere... I am reviewing with @nicolevasilevsky as maybe she has them in any of her lists... but I am very surprised that they are not in any of the exclusion lists or the slurp/mapping lists.
👻
I'll look into it and report back
I am unsure why these aren't on the ORDO exclusion list but I think all of them can be.
We have to make sure. @joeflack4 can you please identify 3 terms that are not excluded an not in Mondo and trace them through the system so we know what they are not suggested by the migration code?
Sabrina:
Is this the "full" list (e.g. including all the children terms to also be excluded)?
Yep!
@matentzn I'll do that this week.
@sabrinatoro @matentzn (copy/pasting my PR comment here)
Take a look at the ordo_mapping_status-with_slurp_info.tsv
that I put in GoogleSheets.
I have a codebook worksheet that explains the columns and makes it clearer what's considered migratable/slurpable:
Field | Meaning |
---|---|
possible_slurp_candidate | This means that the slurp script was initially considering these as candidates to check for slurping. It means that they are not mapped, not excluded, and not deprecated. The next step after this is checking its parents. |
parent_obsolete | | delimited list of True/False as to whether or not the parent has been obsoleted in Mondo. |
qualified_parent_mondo_ids | | delimited list of qualified parents. For it to be "qualified", that means it must be (i) mapped to Mondo as a skos:exactMatch or skos:narrowMatch, and (ii) not obsoleted in Mondo. |
actual_slurp_candidate | This means that the term is migratable/slurpable and will appear in slurp/*.tsv. For a term to be migratable, it must:(i) meet the conditions of 'possible_slurp_candidate', but also (ii) Either (a) it has no parents, or (b) - ALL its parents must be mapped, and - must have 1+ qualified_parent_mondo_ids among those parents (meaning at least one must be an skos:exactMatch or skos:NarrowMatch and not obsoleted in Mondo). |
Findings Based on this analysis, it looks like the reasons we have so many ORDO terms that are not excluded and not mapped but yet are not showing up as slurpable is that 100% of them (n=117) have parents that are all marked obsolete in Mondo.
@matentzn @sabrinatoro Questions
*_mapping_status-with_slurp_info.tsv
: If this would be helpful in the future, I could make this a regular output.is_deprecated
,is_excluded
: I included whether or not the parent was mapped, but if its helpful I can add columns for these as well.Ahhh, ok! This is clear. Having a parent that is obsolete is not a reason to be excluded in the slurp. A mapped parent that is obsolete can be ignored by the has parent check. If there are no other parents, then the migration should proceed in the same way as if there are no parents at all.
I absolutely think documenting this better would be a good idea! Thanks!
NP! I addressed the obsolete parents issue as well as some other problems I noticed, and added the documentation in #316.
Gonna re-assign this to Sabrina now. I think Nico may need to run the pipeline first, but I think we are just needing to verify that all of these GARD Mondo mappings are good.
This is a long issue and important - @sabrinatoro can you move this forward and let me know what else is needed? The goal is to:
Nico asked me to close this issue if it was done. I looked up and I saw that Nico has a comment w/ some checkboxes by which we can determine if we're done here.
- all GARD ids are mapped (160 or so missing)
- you have convinced yourself that Joes SSSOM file looks reasonable!
I think by SSSOM file he means the gard.sssom.tsv, or maybe he means the special spreadsheet I made.
So @sabrinatoro I will leave this open for now. But if you want to close this and make 1 or more, more specific issues in regards to anything else that needs to be done, we can do that too.
@matentzn Don't mean to distract you but do you think this issue is still relevant?
I assigned to myself, you can drop it off your radar.
I reviewed https://docs.google.com/spreadsheets/d/1w5Xnzr5uNFcPrQqCT8mGBFHnGhwXDfQVzHKofw6kB7c/edit#gid=244617092
1) the created mapping (when there is a MONDO:id) are correct for all the one I checked examples:
2) in the cases where multiple OMIM are associated for 1 GARD and/or 1 Orphanet: These OMIMs id represents children terms of the Mondo term represented by the corresponding GARD/ORDO. (Note that the Mondo term is mapped to GARD/Orphanet correctly for the terms I checked). Therefore, if one needs to make a mapping between omim and gard, these omim ids would represent more specific concept than GARD. (omim = skos:narrowMatch of GARD ??I can never remember if it is the correct way to look at it??) Examples:
3) I noticed that some terms do not have a Mondo mapping. Most (maybe all?) of these have the same orphanet and omim ID (the number is the same for the orphanet and omim id). I think the orphanet IDs are incorrect because I couldn't find them in orphanet. The OMIM ID do match correctly to the Mondo term for which this omim id is an exact match Examples:
(the orphanet ID is incorrect and does not exist; the omim ids are correctly matched to the gard ids and could be used for mapping to Mondo)
4) I haven't seen an example of the following, but it might be a good "sanity check" to ensure that the Mondo corresponding to the orphanet ID and to the OMIM ID (when there is only one) are identical. If they are, I think we can adopt the mappings safely. If they are not, it would require review (and would probably due to change in parentage in OMIM or splitting of omim terms in mondo that are not represented in orphanet)
Please let me know if you have any questions. @matentzn @joeflack4