monarch-initiative / mondo-ingest

Coordinating the mondo-ingest with external sources
https://monarch-initiative.github.io/mondo-ingest/

MedGen: MESH proxy merges #491

Open matentzn opened 2 months ago

matentzn commented 2 months ago

@kanems we were hoping to directly import the MESH mappings from MedGen into Mondo, but it seems that they are not exact in the strict sense.

Here are some examples provided by @twhetzel. Cases where two different Mondo IDs are mapped to the same external term are strictly prohibited in Mondo.

Can you help us interpret these cases and advise if there is a way to filter out the non-exact ones?

Mondo ID       MESH ID       UMLS ID        Source        Precision
MONDO:0005145  MESH:C531617  UMLS:C1862941  MONDO:MEDGEN  MONDO:equivalentTo
MONDO:0007103  MESH:C531617  UMLS:C1862939  MONDO:MEDGEN  MONDO:equivalentTo
MONDO:0010924  MESH:C535306  UMLS:C1833429  MONDO:MEDGEN  MONDO:equivalentTo
MONDO:0014072  MESH:C535306  UMLS:C5574940  MONDO:MEDGEN  MONDO:equivalentTo
MONDO:0016001  MESH:C535306  UMLS:C2746066  MONDO:MEDGEN  MONDO:equivalentTo
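
One way to surface these cases before import would be to group the MedGen mappings by MeSH ID and flag any MeSH ID that reaches more than one Mondo ID. A minimal Python sketch, using only the rows above; the function name `find_proxy_merges` is made up for illustration and is not part of mondo-ingest:

```python
from collections import defaultdict

# Rows copied from the table above: (Mondo ID, MESH ID, UMLS ID).
ROWS = [
    ("MONDO:0005145", "MESH:C531617", "UMLS:C1862941"),
    ("MONDO:0007103", "MESH:C531617", "UMLS:C1862939"),
    ("MONDO:0010924", "MESH:C535306", "UMLS:C1833429"),
    ("MONDO:0014072", "MESH:C535306", "UMLS:C5574940"),
    ("MONDO:0016001", "MESH:C535306", "UMLS:C2746066"),
]


def find_proxy_merges(rows):
    """Return {MESH ID: sorted Mondo IDs} for MeSH IDs mapped to more than one Mondo ID."""
    by_mesh = defaultdict(set)
    for mondo_id, mesh_id, _umls_id in rows:
        by_mesh[mesh_id].add(mondo_id)
    # Keep only the MeSH IDs that would act as a proxy merge of several Mondo terms.
    return {mesh: sorted(mondos) for mesh, mondos in by_mesh.items() if len(mondos) > 1}


conflicts = find_proxy_merges(ROWS)
for mesh_id, mondo_ids in conflicts.items():
    print(mesh_id, "->", ", ".join(mondo_ids))
```

On the five rows above this flags both MESH:C531617 (two Mondo IDs) and MESH:C535306 (three Mondo IDs).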

kanems commented 2 months ago

We have frequently encountered this issue with MeSH as well: the definition/scope of some MeSH terms is not precise. For example, at https://www.ncbi.nlm.nih.gov/mesh/?term=C531617 MeSH's primary name is "Amyotrophic lateral sclerosis 1", but this is a supplementary concept that is "mapped to" the broader concept of 'ALS' and carries synonyms of both familial and sporadic ALS.
My inclination would be to map the MeSH term based on the preferred term string. https://id.nlm.nih.gov/mesh/C531617.json-ld says the preferred term label is ALS 1, so I would match it to MONDO:0007103:
"preferredTerm" : "http://id.nlm.nih.gov/mesh/T727034",
"label" : { "@language" : "en", "@value" : "Amyotrophic lateral sclerosis 1"

Within the UMLS data structure (which is how 90-99% of MedGen's MeSH mappings are retrieved and reported), "term type" codes differentiate between name/string types (preferred, main, or other name types), but there seem to be multiple ways to infer the primary/preferred name in UMLS's representation of MeSH. So we will have to look again at how we are pulling the data for our report. (I do see we are reporting C531617 matching to multiple CUIs, which is what UMLS also says, because the strings from MeSH represent multiple concepts.)

As a short-term solution, it might be most true to the source to say the MeSH ID is non-exact for all the reported matches from UMLS (via MedGen). We can look into solutions to report matches for only 'main' or primary term matches in our reporting, which is what we have had to do with HPO's concepts (we say the HPO ID can match 1 and only 1 record in our system, but the HPO preferred names are cleaner and more consistently 'typed' in UMLS, so it was a lot easier to code this logic).

twhetzel commented 2 months ago

@kanems I was wondering if mapping to the MeSH Concept Unique Identifier vs. the Supplementary Record Unique Identifier is an option?

kanems commented 2 months ago

@twhetzel it looks like the MeSH Concept Unique Identifier is reported by both MeSH and UMLS, so that could be used in lieu of string matching. https://id.nlm.nih.gov/mesh/C535306.json-ld specifies the 'preferred concept' as "preferredConcept" : "http://id.nlm.nih.gov/mesh/M0525108", and that code is then included in UMLS as the SCUI (source CUI). We report those for MedGen terms in our MGCONSO.RRF.gz file:

$ zgrep "M0525108" MGCONSO.RRF.gz
C1862939|P|VC|Y|A18448842||M0525108|C531617|MSH|NM|C531617|Amyotrophic lateral sclerosis 1|N|
C1862939|S|PF|Y|A20902222||M0525108|C531617|MSH|CE|C531617|Amyotrophic Lateral Sclerosis, Autosomal Dominant|N|

This also checks out for MESH:C531617.

Currently, we only pull MeSH data via UMLS, and I don't think it's likely we'll institute a separate workflow for MeSH to extract these additional data. But there is a way to use the numeric IDs only to disambiguate an exact match from MeSH to a single CUI/concept in UMLS.
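
The numeric-ID approach could look roughly like the Python sketch below: keep only the MSH rows whose SCUI equals the MeSH preferred concept for that MeSH ID, and accept the match only if exactly one CUI remains. The column names are an assumption inferred from the two sample rows above and the usual UMLS MRCONSO naming, the helper `exact_cui` is hypothetical, and the preferred concept ID (here M0525108) would have to be supplied from the MeSH side.

```python
# Assumed column order for MGCONSO.RRF, inferred from the sample rows above.
FIELDS = ["CUI", "TS", "STT", "ISPREF", "AUI", "SAUI", "SCUI",
          "SDUI", "SAB", "TTY", "CODE", "STR", "SUPPRESS"]

# The two rows returned by the zgrep command above.
SAMPLE = [
    "C1862939|P|VC|Y|A18448842||M0525108|C531617|MSH|NM|C531617|Amyotrophic lateral sclerosis 1|N|",
    "C1862939|S|PF|Y|A20902222||M0525108|C531617|MSH|CE|C531617|Amyotrophic Lateral Sclerosis, Autosomal Dominant|N|",
]


def parse_row(line):
    """Split one pipe-delimited RRF line into a dict keyed by the assumed column names."""
    return dict(zip(FIELDS, line.rstrip("\n").split("|")))


def exact_cui(lines, mesh_id, preferred_concept):
    """Return the single CUI whose MSH rows carry the preferred concept as SCUI, else None."""
    cuis = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and a possible header line
        row = parse_row(line)
        if (row.get("SAB") == "MSH"
                and row.get("SDUI") == mesh_id
                and row.get("SCUI") == preferred_concept):
            cuis.add(row["CUI"])
    # Zero or several CUIs means the MeSH ID cannot be treated as an exact match.
    return next(iter(cuis)) if len(cuis) == 1 else None


match = exact_cui(SAMPLE, "C531617", "M0525108")
print(match)
```

Against the full file, the same function could be fed from `gzip.open("MGCONSO.RRF.gz", "rt")` instead of `SAMPLE`. Returning None for anything other than a single CUI keeps ambiguous MeSH IDs out of the exact-match set rather than picking one arbitrarily.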