monarch-initiative / mondo

Mondo Disease Ontology
http://obofoundry.org/ontology/mondo
Creative Commons Attribution 4.0 International

Remove all UMLS:CN* cross references from Mondo #7035

Closed matentzn closed 4 months ago

matentzn commented 9 months ago

I advocate removing all UMLS:CN cross references from Mondo, as they falsely suggest that they are UMLS cross references. They are not! Nor will they ever resolve anywhere. UMLS:CN cross-references (correct me if I am wrong, @kanems) are temporary UMLS CUIs created by the MedGen team to facilitate curation efforts in cases where UMLS is missing a disease. These terms may later be replaced by proper UMLS CUIs.

See question by @twhetzel in https://github.com/monarch-initiative/mondo/issues/6560#issuecomment-1856065251

What I would like to do instead is:

Related

kanems commented 9 months ago

Discuss if there is value at all in curating MEDGEN_UMLS_CUI:CN references from Medgen in Mondo. @kanems, what do you think? Is there a reason why the general public may like MEDGEN_UMLS_CUI:CN cross references?

The majority of Mondo's terms will eventually get a UMLS CUI (rationale: many shared sources between Mondo and UMLS, like NCI, Orphanet, SNOMED CT, MeSH...; UMLS' full list is here: https://www.nlm.nih.gov/research/umls/sourcereleasedocs/index.html). Based on UMLS release timing, it could be anywhere from 6-12 months before a CN could be replaced by a new CUI in MedGen.

A handful may never get a CUI, however. Example cases: A) Mondo terms to support ClinGen, since ClinGen is not a specific source for UMLS vocabulary (unless SNOMED or another UMLS source adopts ClinGen's naming convention); B) cases where UMLS lumps strings that match OMIM genetic subtype "1" and the OMIM Phenotypic Series for a disease (the PS terms are generally not imported into UMLS).

We'd appreciate it if Mondo pointed to MedGen for these cases, as we aim to offer the clinical community resources beyond what Mondo has (clinical practice guidelines, links to GTR tests or ClinVar submissions for those disease concepts, etc.), but I think adding the current MedGen CNs in Mondo can be addressed at a later time. As Joe pointed out to us, we were actually missing a ton of CN ID to Mondo mappings in the report from our FTP site. We've figured out why, but with limited development capacity we won't be able to fix that till next year. So this update would miss most of those CN IDs anyway. IMO, it's best to skip adding CNs entirely until we fix things on our end; then Mondo can get the whole data set in one ingest (IFF we agree there's utility to link to MedGen for the Mondo ID). -Megan

twhetzel commented 9 months ago

I am a little confused by the proposed workflow. In a previous PR linked to issue #6560, xrefs to UMLS that contained "CN" were removed. In hindsight I am not sure if those should have been removed vs. marked as obsolete. However, in that previous PR I think only CNs with equivalentTo were removed and not those with equivalentObsolete, which for the sake of consistency seemed odd to me. Should these other equivalentObsolete xrefs also be removed ... or should those other CNs that were removed have been converted to equivalentObsolete?

matentzn commented 9 months ago

@kanems thank you for your explanations; we would like to point to MedGen, 100%, for any disease that is covered by MedGen. Let's consider this example: https://www.ncbi.nlm.nih.gov/medgen/?term=CN229293

Since all terms in MedGen seem to have an associated Medgen UID, my thinking was that we should simply provide linkouts to all of these, say, in this case, MONDO:0014753 to MEDGEN:833442. We can easily resolve this page:

https://www.ncbi.nlm.nih.gov/medgen/?term=833442

If we map all MONDO IDs to MEDGEN UIDs, do you believe that there is any value in also providing cross-references to the CN IDs? Aren't these 100% redundant with the MEDGEN UIDs?

@twhetzel

Sorry I didn't explain my workflow at all here - my suggestion is to just get rid of all CN cross references, regardless of whether they are obsolete, exact, or anything else, and replace them by MEDGEN UIDs.
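
A minimal sketch of that suggestion, assuming xrefs are held as plain CURIE strings per Mondo term. The function names and the `EXAMPLE:0001` xref are hypothetical and only for illustration; the predicate on each xref (obsolete, exact, ...) is deliberately ignored, matching the "regardless" above.

```python
def replace_cn_xrefs(xrefs, medgen_uid=None):
    """Drop every UMLS:CN* xref and, if a MedGen UID is known, add MEDGEN:<uid>.

    `xrefs` is a list of CURIE strings (an assumption about how xrefs are stored).
    """
    kept = [x for x in xrefs if not x.startswith("UMLS:CN")]
    if medgen_uid is not None:
        medgen_xref = f"MEDGEN:{medgen_uid}"
        if medgen_xref not in kept:
            kept.append(medgen_xref)
    return kept


def medgen_url(uid):
    """Build the MedGen search URL used in the example above."""
    return f"https://www.ncbi.nlm.nih.gov/medgen/?term={uid}"


# MONDO:0014753 from the example; EXAMPLE:0001 is a made-up, unrelated xref.
new_xrefs = replace_cn_xrefs(["UMLS:CN229293", "EXAMPLE:0001"], "833442")
print(new_xrefs)
print(medgen_url("833442"))
```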

kanems commented 9 months ago

Since all terms in MedGen seem to have an associated Medgen UID, my thinking was that we should simply provide linkouts to all of these, say, in this case, MONDO:0014753 to MEDGEN:833442

Yes, all records in MedGen have a UID, regardless of the format of the concept unique identifier type assigned, so a CN# and a UID can alternatively be used to point to the same active MedGen record. It's when we merge records during curation that we potentially run into an issue. MedGen is able to track CN/CUI history (CN#1 was replaced by CUI-1; CN#2 merged into CN#3; etc.) and redirect to the active MedGen page (based on CN/CUI history). I do not think we can do the same based on UIDs in our current framework. When a UID is no longer active, it returns a 404 error page (at least it did today for a record I recently curated to merge with a redundant record in our system). So I would need to see if we could even track UID history and redirect from those... I do not think that would be an easy lift for us.

How often are you all thinking of updating the MedGen xrefs in Mondo? If Mondo is able to update the UIDs from MedGen somewhat regularly (every 1-2 months?), then I don't think we'll end up with a lot of dead links from Mondo to MedGen at any particular point in time, and we could just press on with the UIDs. We can also implement curation approaches that limit the movement of Mondo IDs to new UIDs to try to minimize these dead xrefs.
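
As a rough illustration of the dead-link concern, a regular update could flag Mondo terms whose stored UID no longer appears among the UIDs in the latest MedGen data. This is only a sketch: the function name, the input shapes, and the IDs `MONDO:9999999` and `999999` are hypothetical.

```python
def find_stale_uids(mondo_to_uid, active_uids):
    """Return the Mondo IDs whose stored MedGen UID is not in `active_uids`.

    mondo_to_uid: dict of Mondo ID -> MedGen UID currently recorded in Mondo.
    active_uids:  set of UIDs present in the latest MedGen data (assumed input).
    """
    return sorted(
        mondo_id
        for mondo_id, uid in mondo_to_uid.items()
        if uid not in active_uids
    )


stored = {"MONDO:0014753": "833442", "MONDO:9999999": "999999"}  # second pair is made up
print(find_stale_uids(stored, {"833442"}))
```

Terms flagged this way would be the ones that currently lead to a 404 page and need a fresh UID from the next mapping update.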

matentzn commented 8 months ago

I think we can update MedGen xrefs basically weekly, as in my view, this process can be completely automated.

@kanems

  1. can you provide us (again, just to be sure) with a static URL to a file that contains the authoritative, fully up-to-date mapping between Mondo IDs and MedGen UIDs and
  2. can you discuss with your team if you would be willing to maintain a TSV file with a format that we provide in addition to the one you already have? This would reduce the maintenance burden considerably and make it feasible to basically stay 100% in sync.

Thanks for your amazing support!

kanems commented 8 months ago

@matentzn sorry for the delay, I tried to reply by email last week and just saw it bounced as undeliverable ...

  1. https://ftp.ncbi.nlm.nih.gov/pub/medgen/MedGenIDMappings.txt.gz
  2. I will ask about the TSV format, but we do have a secondary directory for CSV versions of many (but not all) of our reports. Would a CSV file be acceptable? Here is the CSV directory path: https://ftp.ncbi.nlm.nih.gov/pub/medgen/csv/ (We don’t yet make a CSV version of this ID mappings file, but if you have a different structure of interest, then we can see about making a new Mondo_MedGenMapping report.)
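
For readers who want to try the mappings file, here is a hedged sketch of reading it. It assumes a pipe-delimited layout with a `#`-prefixed header and four leading columns (MedGen concept ID, preferred name, source ID, source); that layout is an assumption and should be checked against the README on the FTP site. The function names and the sample rows are illustrative, not MedGen's or Mondo's actual tooling.

```python
import gzip


def read_mondo_rows(lines):
    """Yield (mondo_id, concept_id) for rows whose source column is MONDO.

    Assumed columns: concept_id | pref_name | source_id | source | (trailing pipe).
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header/comment line
        fields = line.rstrip("\n").split("|")
        if len(fields) < 4:
            continue  # not a row in the assumed layout
        concept_id, _name, source_id, source = fields[:4]
        if source.strip().upper() == "MONDO":
            yield source_id.strip(), concept_id.strip()


def load_mapping(path):
    """Read a local copy of the gzipped mappings file into a dict."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return dict(read_mondo_rows(handle))


# Made-up sample rows in the assumed layout (names and the EXAMPLE row are invented).
sample = [
    "#CUI|pref_name|source_id|source|\n",
    "CN229293|Example disease|MONDO:0014753|MONDO|\n",
    "CN229293|Example disease|0001|EXAMPLE|\n",
]
print(list(read_mondo_rows(sample)))
```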

matentzn commented 8 months ago

That is fantastic. We will send you an example soon! Thank you @kanems.

kanems commented 8 months ago

@matentzn Sorry, can you clarify: Would CSV be acceptable instead of TSV?

matentzn commented 8 months ago

Yes, no difference for us!

matentzn commented 8 months ago

@joeflack4 are you using https://ftp.ncbi.nlm.nih.gov/pub/medgen/MedGenIDMappings.txt.gz in the medgen ingest?

joeflack4 commented 8 months ago

@matentzn Yes: https://github.com/monarch-initiative/medgen/blob/e70aa96a23d61161428afade432ff834c9dcafa9/makefile#L113-L114

matentzn commented 8 months ago

Thank you @joeflack4!

@kanems another question: are there Mondo IDs that are "out of scope" for Medgen? i.e., if we were to

  1. Remove all UMLS cross references in Mondo
  2. Replace them by the Mondo->UMLS xrefs curated by Medgen

what would we lose?

kanems commented 8 months ago

@matentzn excellent point! MedGen does not include (non-human) animal diseases. And within human diseases we are not displaying injury, poisoning, and 'disease characteristic' branches from Mondo (mostly... our system isn't structured to handle/display hierarchical relationships well in our curation interface, so that's got a little more wiggle room for error).
There may also be lags/gaps in our processing vs. Mondo releases in any given month, so I would recommend this slightly more conservative approach:

  1. Remove CUIs in Mondo IFF the Mondo ID exists in the MedGen FTP file.
  2. Add the CUI from the MedGen mapping file.

I think this will ensure Mondo gets only GOOD/reviewed mappings as replacements for human disease concepts. We're not in a position to review the non-human disease CUI mappings.
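
The two conservative steps above can be sketched roughly as follows. The input shapes (dicts keyed by Mondo ID), the function name, and all IDs in the usage example are hypothetical. Skipping `CN` identifiers is an extra assumption, based on the point made earlier in the thread that CNs are not real UMLS CUIs.

```python
def sync_umls_xrefs(mondo_xrefs, medgen_mapping):
    """Replace UMLS xrefs only for Mondo IDs that appear in the MedGen mapping.

    mondo_xrefs:    dict of Mondo ID -> list of xref CURIEs currently in Mondo.
    medgen_mapping: dict of Mondo ID -> list of concept IDs from the MedGen file.
    """
    updated = {}
    for mondo_id, xrefs in mondo_xrefs.items():
        if mondo_id not in medgen_mapping:
            # Not covered by MedGen (e.g. non-human disease): leave untouched.
            updated[mondo_id] = list(xrefs)
            continue
        # Step 1: remove existing UMLS xrefs for terms MedGen has reviewed.
        kept = [x for x in xrefs if not x.startswith("UMLS:")]
        # Step 2: add the CUIs from the MedGen mapping file.
        for cui in medgen_mapping[mondo_id]:
            if cui.startswith("CN"):
                continue  # assumption: CN stand-ins are not added as UMLS xrefs
            curie = f"UMLS:{cui}"
            if curie not in kept:
                kept.append(curie)
        updated[mondo_id] = kept
    return updated


# All identifiers below are invented for illustration.
mondo = {
    "MONDO:0000001": ["UMLS:C0000001", "EXAMPLE:1"],
    "MONDO:0000002": ["UMLS:C0000003"],
}
medgen = {"MONDO:0000001": ["C0000002", "CN000001"]}
print(sync_umls_xrefs(mondo, medgen))
```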

matentzn commented 8 months ago

Final action items for this issue

@kanems thank you very much for your details. In summary, this is what we should do, in my opinion. Feel free to contradict me where necessary:

@twhetzel, @kanems and @joeflack4 please confirm that we are all agreed on this strategy and I will set it in motion.

kanems commented 8 months ago

I have no objections; I think this looks like a solid plan. Item #4 (all UMLS:CNs removed) is especially important.

joeflack4 commented 8 months ago

I think this sounds good. If you can clarify, @kanems, why should we delete CNs again? It sounds like Nico is saying to delete them because they're redundant with UIDs. But the emphasis you put on their deletion indicates to me that there is some other reason.

kanems commented 8 months ago

@joeflack4 It's especially important because UMLS does not use CN# identifiers; those are entirely from MedGen (so even if they were to stay, they would be MedGen:CN####). But the decision to use UIDs is already set and, I think, a better solution to get a pointer to MedGen's records/resources for the appropriate Mondo concept. And not to re-open the now closed ingest issue, but we did figure out why there were missing mappings in our source file, and they were primarily to the CN# style 'stand in CUIs', which we will correct. But since CNs are not going to be used in Mondo and you all can run the UID/CUI mapping update regularly, we need to table that fix and prioritize some infrastructure upgrades that NCBI has given us a hard deadline for implementing.