Open nicolevasilevsky opened 2 years ago
Discussed on curation call: We should split the terms if the source (including MedGen) is splitting them
Asked Adriana via email: Would you mind sharing the MedGen IDs that you assigned to the more specific subtypes of these classes, so that I can split them into more specific classes in Mondo?
From Adriana/MedGen (email: MedGen reports on CUI conflicts with Mondo data)
MedGen serves the clinical community by enabling discovery of phenotype information, and it also serves as the phenotype backbone for submitters to ClinVar (variant interpretation) and GTR (test descriptions). The specificity of phenotypes is vital for accurate variant interpretations and test descriptions.
OMIM describes phenotypes in different categories on the same OMIM record: the preferred name, included titles, allelic variant names, etc. are sometimes distinct phenotypes. Therefore, we process data from OMIM that is often not represented elsewhere or that can be tricky to map relative to the primary concept on an OMIM record.
Some examples from OMIM:
The primary disease record for OMIM 301044 is “Developmental and epileptic encephalopathy 85, with or without midline brain defects” (see the OMIM record at https://www.omim.org/entry/301044). Thankfully, the OMIM preferred name, the alternative titles/symbols, and the string in the “Phenotype-Gene relationship” table (which is pulled from the GeneMap2.txt file) are all in alignment. But when OMIM spells out the specific alleles for the causative gene SMC1A, it uses narrower strings to describe the phenotypes associated with specific variants: allele #9 on https://www.omim.org/entry/300040#allelicVariants is called “DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 85 WITH MIDLINE BRAIN DEFECTS”, but allele #10 is “DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 85 WITHOUT MIDLINE BRAIN DEFECTS”. These two allele entries are very specific, whereas the MIM primary phenotype record is broad, so we would treat these as 3 distinct records in MedGen: the broad primary record plus the WITH and WITHOUT records. Hopefully this demonstrates the degree of granularity that we have to achieve to faithfully represent the source data from OMIM and capture our submitters’ intent for their data. Mondo has the primary OMIM record (MONDO:0026771), but the individual WITH / WITHOUT strings above are not represented in Mondo at this point.
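To make the "3 distinct records" idea concrete, here is a minimal Python sketch of the split described above. It is only an illustration under stated assumptions, not MedGen's or Mondo's actual pipeline: the `PhenotypeRecord` class, the `split_phenotypes` function, and the reference strings such as `"OMIM:300040 allele #9"` are hypothetical names invented for this example. It assumes that phenotype strings differing only in letter case count as the same phenotype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhenotypeRecord:
    """Hypothetical container for one distinct phenotype concept."""
    label: str     # phenotype string as given by the source
    origin: str    # which part of the OMIM record supplied the string
    omim_ref: str  # illustrative reference back to the OMIM entry


def split_phenotypes(primary, allelic_variants):
    """Return one record for the primary title plus one record for each
    allelic-variant phenotype string not already seen (case-insensitive)."""
    primary_ref, primary_label = primary
    records = [PhenotypeRecord(primary_label, "preferred title", primary_ref)]
    seen = {primary_label.casefold()}
    for ref, label in allelic_variants:
        key = label.casefold()
        if key not in seen:
            seen.add(key)
            records.append(PhenotypeRecord(label, "allelic variant", ref))
    return records


# Strings taken from the example above; the reference format is made up.
records = split_phenotypes(
    ("OMIM:301044",
     "Developmental and epileptic encephalopathy 85, "
     "with or without midline brain defects"),
    [
        ("OMIM:300040 allele #9",
         "DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 85 "
         "WITH MIDLINE BRAIN DEFECTS"),
        ("OMIM:300040 allele #10",
         "DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 85 "
         "WITHOUT MIDLINE BRAIN DEFECTS"),
    ],
)

for record in records:
    print(f"{record.omim_ref}\t{record.origin}\t{record.label}")
```

With this input the function yields three records: the broad primary record and the two narrower WITH / WITHOUT records, each of which could then be given its own identifier and mapped separately.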
And there is still the open issue of cases where the GeneMap2.txt strings are not exact matches to the Preferred or Alternate names, which Donna brought up on https://github.com/monarch-initiative/mondo/issues/4521
Another example:
OMIM 256550 (https://www.omim.org/entry/256550) contains 3 separate phenotypes: the parent term NEURAMINIDASE DEFICIENCY and the 2 phenotypes it causes, Sialidosis type I, which is the milder type, and Sialidosis type II, which is a severe form and is further classified into congenital, infantile, and juvenile forms.
I hope these cases illustrate the need for specificity in the phenotype descriptions because when interpreting a variant it does matter what phenotype type/form you are evaluating. Would you be able/willing to process the different phenotypes described in an OMIM record and curate them into separate records if the phenotypes are distinct? We could then rely on your OMIM dataset.
Related to #4691