mapping-commons / disease-mappings

Repo to host disease ontology mappings
Creative Commons Zero v1.0 Universal

Ingest OMOP2OBO mappings #7

Open matentzn opened 2 years ago

matentzn commented 2 years ago

Let's focus on the MONDO/ICD10-related ones for now.

cc @callahantiff

callahantiff commented 2 years ago

Sounds great!

Just so it's recorded here, and since the way we import might be impacted by this: the mappings I sent this morning are the most "confident" ones, i.e., those that are an exact match to a string in a label, definition, or synonym, or those that were obtained from an existing dbxref from one of the ontologies or a supporting resource. There are other ways to get mappings (e.g., hierarchical search/traversal for parents or children, and some fancy new recursive search that we can also leverage), and we can explore those in the future if you think they would be useful.
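For readers who want something concrete, here is a minimal sketch of what the dbxref and exact-string steps described above could look like. It is not OMOP2OBO's actual code: the dictionary layout, the function names, the case-insensitive comparison, and the `EXACT_STRING` tag are all assumptions made for illustration, and the identifiers are toy values.

```python
def normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def confident_mappings(source_concepts, ontology_terms):
    """Return (source_id, ontology_id, map_type) tuples.

    source_concepts: {source_id: label}                      (hypothetical layout)
    ontology_terms:  {term_id: {"label", "synonyms", "definition", "dbxrefs"}}
    """
    results = []
    for term_id, term in ontology_terms.items():
        # Every string attached to the ontology term: label, synonyms, definition.
        strings = {normalize(term["label"])}
        strings.update(normalize(s) for s in term.get("synonyms", []))
        if term.get("definition"):
            strings.add(normalize(term["definition"]))

        for source_id, source_label in source_concepts.items():
            # An existing cross-reference is recorded as its own piece of evidence.
            if source_id in term.get("dbxrefs", []):
                results.append((source_id, term_id, "DBXREF"))
            # Exact match of the whole source label against any term string.
            if normalize(source_label) in strings:
                results.append((source_id, term_id, "EXACT_STRING"))
    return results


if __name__ == "__main__":
    # Toy data only; these are not real ontology or vocabulary identifiers.
    terms = {"ONT:1": {"label": "Alopecia", "synonyms": ["hair loss"], "dbxrefs": ["SRC:9"]}}
    sources = {"SRC:1": "Hair loss", "SRC:9": "baldness", "SRC:2": "fever"}
    for row in confident_mappings(sources, terms):
        print(row)
```

Keeping one tuple per piece of evidence, rather than one per concept pair, means a pair supported by both a dbxref and a string match shows up twice.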

I also want to get your feedback on what I have included in the file since I opted to include a lot of information that makes the file sizes larger and that might not actually be helpful.

callahantiff commented 2 years ago

Last thing. In case it is helpful, here are all of the sources for which the first version includes mappings to Mondo. The number is the count of unique Mondo concepts mapped to each source. There are duplicates here because I am reporting the original way a source has named each vocabulary (when I process these, they are normalized on the backend).

Summary tables

Source | Unique Mondo concepts
-- | --
2021AA - UMLS Metathesaurus | 7442
AI/RHEUM, 1993 | 81
Alcohol and Other Drug Thesaurus, 2000 | 662
Alternative Billing Concepts, 2009 | 1
American College of Cardiology/American Heart Association Clinical Data Terminology, 2009D | 55
Anatomical Therapeutic Chemical Classification System, ATC_2021 | 1
Authorized Osteopathic Thesaurus, 2003 | 4
Beth Israel Vocabulary, 1.0 | 430
BioCarta online maps of molecular pathways, adapted for NCI use, 2009D | 1
Biomedical Research Integrated Domain Group Model, 3.0.3, 2009D | 5
CDISC Glossary Terminology, 2009D | 1
COSTAR, 1989-1995 | 588
COSTART, 1995 | 662
CRISP Thesaurus, 2006 | 1031
Cancer Data Standards Registry and Repository, 2009D | 393
Cancer Research Center of Hawaii Nutrition Terminology, 2009D | 5
Cancer Therapy Evaluation Program - Simple Disease Classification, 2009D | 150
Canonical Clinical Problem Statement System, 1999 | 800
Cellosaurus, 2009D | 760
Clinical Care Classification, 2_5_2018 | 10
Clinical Classifications Software Refined for ICD-10-CM, 2021 | 66
Clinical Classifications Software, 2005 | 150
Clinical Data Interchange Standards Consortium, 2009D | 463
Clinical Terms Version 3 (CTV3) (Read Codes), 1999 | 1734
Clinical Trial Data Commons, 2009D | 5
Clinical Trials Reporting Program, 2009D | 675
Common Terminology Criteria for Adverse Events 3.0, 2009D | 71
Common Terminology Criteria for Adverse Events 5.0, 2009D | 249
Common Terminology Criteria for Adverse Events, 2009D | 236
Consumer Health Vocabulary, 2011_02 | 1905
Content Archive Resource Exchange Lexicon, 2009D | 8
Current Procedural Terminology, 2021 | 1
DXplain, 1994 | 706
Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5), 2015 | 49
Diseases Database, 2000 | 18
DrugBank, 5.0_2016_06_22, 5.0_2021_01_29 | 74
European Directorate for the Quality of Medicines & Healthcare, 2009D | 4
FDB MedKnowledge (formerly NDDF Plus), 2021_02_10 | 5
Foundational Model of Anatomy Ontology, 4_15 | 128
Gene Ontology, 2020_05_02 | 85
Geopolitical Entities, Names, and Codes (GENC) Standard Edition 1, 2009D | 89
Global Alignment of Immunization Safety Assessment in pregnancy, 2009D | 6
HCPCS Version of Current Dental Terminology (CDT), 2021 | 1
HL7 Vocabulary Version 2.5, 2003_08_30 | 66
HL7 Vocabulary Version 3.0, 2020_11 | 93
HUGO Gene Nomenclature Committee, 2020_05 | 895
Healthcare Common Procedure Coding System, 2021 | 1
Human Phenotype Ontology, 2020_10_12 | 1159
ICD10, 1998 | 789
ICD10, 2016 | 7991
ICD10, American English Equivalents, 1998 | 93
ICPC-2 PLUS | 842
ICPC2 - ICD10 Thesaurus, 200412 | 797
ICPC2 - ICD10 Thesaurus, American English Equivalents, 0412 | 1
International Classification for Nursing Practice, 2019 | 29
International Classification of Diseases, 10th Edition, Clinical Modification, 2021 | 2057
International Classification of Diseases, Ninth Revision, Clinical Modification, 2014 | 1028
International Classification of Diseases, Ninth Revision, Clinical Modification, Metathesaurus additional entry terms, 2014 | 752
International Classification of Functioning, Disability and Health for Children and Youth, 2008 | 6
International Classification of Functioning, Disability and Health, 2008_12_19 | 6
International Classification of Primary Care 2nd Edition, Electronic, 2E, 200203 | 105
International Classification of Primary Care 2nd Edition, Electronic, 2E, American English Equivalents, 200203 | 11
International Classification of Primary Care, 1993 | 87
International Conference on Harmonization, 2009D | 11
International Neonatal Consortium, 2009D | 7
International Statistical Classification of Diseases and Related Health Problems, 10th Revision, Australian Modification, January 2000 Release | 866
International Statistical Classification of Diseases and Related Health Problems, Australian Modification, Americanized English Equivalents, 2000 | 148
Jackson Laboratories Mouse Terminology, adapted for NCI use, 2009D | 3
KEGG Pathway Database, 2009D | 33
LOINC, 269 | 523
Library of Congress Subject Headings, 1990 | 508
Library of Congress Subject Headings, Northwestern University subset, 2013 | 692
MEDCIN, 3_2020_12_15 | 1698
Medical Dictionary for Regulatory Activities Terminology (MedDRA), 23.1 | 1925
Medical Entities Dictionary, 2003 | 6
Medical Subject Headings, 2021_2021_01_25 | 2217
Medication Reference Terminology, 2021_03_01 | 3
MedlinePlus Health Topics, 20201125 | 628
Metathesaurus FDA Structured Product Labels, 2021_02_19 | 10
Metathesaurus Source Terminology Names | 12
Metathesaurus Version of Minimal Standard Terminology Digestive Endoscopy, 2001 | 56
Multum MediSource Lexicon, 2021_02_01 | 31
NANDA-I Taxonomy II, 2018-2020 | 164
NCBI Taxonomy, 2020_05_21 | 90
NCI Dictionary of Cancer Terms, 2009D | 588
NCI Genomic Data Commons Terms, 2009D | 685
NCI HUGO Gene Nomenclature, 2009D | 93
NCI Health Level 7, 2009D | 6
NCI Integrated Canine Data Commons Terms, 2009D | 3
NCI Thesaurus, 2020_09D | 2334
National Council for Prescription Drug Programs, 2009D | 2
National Institute of Child Health and Human Development, 2009D | 1106
Neuronames Brain Hierarchy, 2020_05_28 | 143
Nursing Outcomes Classification (NOC), 6 | 82
Omaha System, 2005 | 13
Online Congenital Multiple Anomaly/Mental Retardation Syndromes, 1999 | 391
Online Mendelian Inheritance in Man, 2021_02_08 | 2119
Patient Care Data Set, 1997 | 12
Pediatric Cancer Data Commons, 2009D | 33
Perioperative Nursing Data Set, 4_2018 | 2
Physician Data Query, 2018_10_27 | 739
QMR clinically related terms from Randolph A. Miller, 1999 | 10
Quick Medical Reference (QMR), 1996 | 236
Read thesaurus Americanized Synthesized Terms, 1999 | 20
Read thesaurus, American English Equivalents, 1999 | 628
Read thesaurus, Synthesized Terms, 1999 | 24
RxNorm Vocabulary, 20AA_210301F | 6
SNOMED International, 1998 | 1471
SNOMED-2, 2 | 1246
Source of Payment Typology, 9.2 | 2
Thesaurus of Psychological Index Terms, 2004 | 294
U.S. Centers for Disease Control and Prevention, 2009D | 1
U.S. Food and Drug Administration, 2009D | 166
UMDNS: product category thesaurus, 2021 | 6
UMLS Metathesaurus | 270
US Edition of SNOMED CT, 2021_03_01 | 8363
USP Compendial Nomenclature, 2021_02_15 | 1
USP Medicare Model Guidelines, 2020 | 1
UltraSTAR, 1993 | 7
Unified Code for Units of Measure, 2009D | 23
University of Washington Digital Anatomist, 1.7.3 | 27
Vaccines Administered, 2017_02_08, 2021_01_29 | 21
Veterans Health Administration National Drug File, 2021_01_29 | 11
WHO Adverse Reaction Terminology, 1997 | 567
csp | 35
dermo | 1
doid | 9115
efo | 2639
gard | 5326
gtr | 33
hgnc | 43
hp | 517
icd-10 | 1
icd10 | 8849
icd10cm | 9
icd11 | 1
icd9 | 4110
icd9cm | 1
icdo | 655
ido | 1
kegg | 33
loinc | 1
meddra | 1316
medgen | 26
mesh | 7555
mfomd | 3
mondo | 107
mp | 3
mpath | 1
mth | 1
ncit | 6647
ndfrt | 1
nifstd | 18
obi | 1
ogms | 1
omim | 9619
omimps | 493
omop | 5
oncotree | 517
orphanet | 10292
pato | 1
pmid | 26
reactome | 1
scdo | 2
scitd | 1
sctid | 8413
sctid_2010_1_31 | 4
snomedct | 1
umls | 14440
umls_cui | 3
wikidata | 2
wikipedia | 82

joeflack4 commented 2 years ago

Just documenting here per Nico's request.

Tiffany recently produced and explained these ICD10::Mondo mappings:

My basic understanding is that OMOP2OBO was used to generate ICD10/ICD10CM::Mondo mappings. I think an input file (perhaps Mondo itself) was used, because there are some DBXREFs in there, which I imagine were obtained from Mondo. In the absence of direct cross references, exact string matches were used.

In addition to direct mappings (first tab in the file) there were also mappings done between Mondo terms and ICD term ancestors (first tab), and children (second tab). Sometimes ICD terms were mapped to Mondo children (second tab). I assume that in mapping to ancestors or children, there needed to be a starting place, so I imagine that came from the original set of mappings (from Mondo?) used as an input to this process.

@callahantiff If you can correct any of my misunderstanding, that would be great.


Here's the raw text from Tiffany's explanation:

The file has two tabs. Note that the first tab (i.e., “OMOP2OBO_ICD10_ICD10CM_ExactMap”) contains the primary mappings (19,139 Mondo concepts, 6,588 ICD10/CM concepts). These mappings were created using the tested and most confident parts of the new functionality that will become available with the next release. Note that I have only included the exact string matches (to labels, synonyms, and definitions) and dbXRefs. Whenever possible, mappings were created at the concept level, but if a mapping could not be established at this level, then a mapping was attempted at the ancestor level. Currently, this works by traversing the hierarchy, where all parent concepts are searched until a match is achieved. As an improvement over the initial release, when a concept is mapped at the ancestor level, it will include an integer that specifies how many levels (i.e., parent, grandparent, etc.) above the concept the mapping was made. For example, the Mondo concept alopecia, isolated (MONDO_0000005) was mapped to the ICD10 concept nonscarring hair loss (L65.9) via its grandparent concept alopecia (MONDO_0004907). The evidence string provided for this mapping is: “OBO Ancestor: MONDO_0004907 - 2 level(s) above MONDO_0000005 on icd10:L65.9”. I’d love to know if you find this helpful. In the future, I think it could provide useful context for helping to generate a confidence score for the mapping (not something that I have yet, but I would love to implement it).
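The ancestor-level search described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the OMOP2OBO implementation: the function names, the `parents` and `direct_mappings` dictionaries, and the intermediate parent ID `MONDO_PARENT_PLACEHOLDER` are hypothetical (the source names only the concept and its grandparent). Only the evidence string format is taken from the example in the text.

```python
def map_via_ancestors(concept_id, parents, direct_mappings):
    """Walk up the hierarchy one level at a time until a mapped concept is found.

    parents:         {concept_id: [parent_id, ...]}   (hypothetical layout)
    direct_mappings: {concept_id: target_code}        (hypothetical layout)
    Returns (matched_id, levels_above, target_code), or None if nothing matches.
    Level 0 means the concept itself was mapped (a concept-level mapping).
    """
    frontier = [concept_id]
    seen = {concept_id}
    level = 0
    while frontier:
        # Check every concept at the current level before going higher.
        for node in frontier:
            if node in direct_mappings:
                return node, level, direct_mappings[node]
        # Move one level up, skipping concepts already visited.
        next_frontier = []
        for node in frontier:
            for parent in parents.get(node, []):
                if parent not in seen:
                    seen.add(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
        level += 1
    return None


def ancestor_evidence(concept_id, ancestor_id, levels, target_code):
    """Format the evidence string in the style quoted in the explanation."""
    return f"OBO Ancestor: {ancestor_id} - {levels} level(s) above {concept_id} on {target_code}"


if __name__ == "__main__":
    # The middle parent is a placeholder; its real ID is not given in the text.
    parents = {
        "MONDO_0000005": ["MONDO_PARENT_PLACEHOLDER"],
        "MONDO_PARENT_PLACEHOLDER": ["MONDO_0004907"],
    }
    direct = {"MONDO_0004907": "icd10:L65.9"}
    ancestor, levels, target = map_via_ancestors("MONDO_0000005", parents, direct)
    print(ancestor_evidence("MONDO_0000005", ancestor, levels, target))
    # prints: OBO Ancestor: MONDO_0004907 - 2 level(s) above MONDO_0000005 on icd10:L65.9
```

Searching level by level means the nearest mapped ancestor wins, so the level count is a natural input for the confidence score mentioned above.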

A few important things to note:

  • I am still working on the best phrasing for the mapping evidence. Hopefully it makes sense; I tried to make the mappings as transparent as possible.
  • The file contains duplicate rows. This is intentional and was done to keep separate the evidence pieces for the different ways a mapping can be created between an ICD and a Mondo concept. You can totally collapse the rows by combining the mappings; I just thought you might prefer to have them separate for now, as you might prefer certain types of mappings over others (although this should not have an impact on the resulting mapping), and this ensures that the file can be easily filtered. If you need help aggregating the file in this way, just let me know.
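If collapsing the duplicate rows turns out to be useful, one possible approach is sketched below using only the Python standard library. The `map_type` and `map_evidence` column names appear in the explanation; the `icd_code` and `mondo_id` column names, the function name, and the `" | "` separator are assumptions, since the file's actual headers are not shown here.

```python
def collapse_rows(rows, key_cols=("icd_code", "mondo_id"),
                  merge_cols=("map_type", "map_evidence"), sep=" | "):
    """Merge rows that share the same ICD/Mondo pair, joining their evidence.

    rows is a list of dicts (for example, from csv.DictReader).
    Column names are hypothetical and should be adjusted to the real file.
    """
    merged = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        # First time this pair is seen: start an entry with empty evidence lists.
        entry = merged.setdefault(key, {c: [] for c in merge_cols})
        for col in merge_cols:
            # Keep each distinct value once, in first-seen order.
            if row[col] not in entry[col]:
                entry[col].append(row[col])

    collapsed = []
    for key, entry in merged.items():
        out = dict(zip(key_cols, key))
        out.update({col: sep.join(values) for col, values in entry.items()})
        collapsed.append(out)
    return collapsed


if __name__ == "__main__":
    # Toy rows for illustration only; not taken from the real file.
    rows = [
        {"icd_code": "X1", "mondo_id": "M1", "map_type": "DBXREF", "map_evidence": "ev-a"},
        {"icd_code": "X1", "mondo_id": "M1", "map_type": "LABEL", "map_evidence": "ev-b"},
        {"icd_code": "X2", "mondo_id": "M2", "map_type": "DBXREF", "map_evidence": "ev-c"},
    ]
    for row in collapse_rows(rows):
        print(row)
```

Filtering on `map_type` before collapsing keeps the option of preferring certain kinds of mappings over others.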

The second tab (i.e., “OMOP2OBO_ICD10_ICD10CM_ChildMap”) contains mappings from a beta feature that I have been working on, and I included it just in case it might be helpful to you. These mappings are meant to help address the issue that ICD10 tends to be more granular than Mondo. Thus, these mappings take advantage of the ontology's descendant hierarchy. See examples in screenshot below.

[screenshot]

In contrast to the approach used when mapping a concept at the ancestor level, here we are searching for more specific mappings in an effort to try to capture the loss of granularity between ICD and Mondo. So, you can see from above that we are able to extend the Mondo concept inflammatory diarrhea by mapping it to several more specific, but related, ICD10 concepts. The string in the map_evidence column provides an explanation. Take the first row: the mapping evidence states that ICD10 A03 was mapped to MONDO_0000252 via its descendant concept MONDO_0019345, which is two levels below MONDO_0000252. I included in this figure one additional example: Piedra.

Please note that I have not manually verified all of these mappings. I did perform a spot-check to remove many of the obviously incorrect mappings. There is still a chance that some errors may exist, but many of the mappings also look pretty good. You can be most confident of the mappings with map_type “DBXREF”. Let me know if you have any questions about these, and please don’t feel like you have to use them; I included them because I thought they might potentially be useful to you.
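As a rough illustration of the descendant-based (ChildMap) idea, the sketch below collects every mapped descendant of a concept together with its depth. Unlike an ancestor search, it does not stop at the first hit. It is not the beta feature's actual code: the function name, the `children` and `direct_mappings` dictionaries, and the intermediate concept ID `MONDO_CHILD_PLACEHOLDER` are hypothetical; only the A03 / MONDO_0000252 / MONDO_0019345 example comes from the text.

```python
from collections import deque


def map_via_descendants(concept_id, children, direct_mappings):
    """Collect (target_code, descendant_id, levels_below) for all mapped descendants.

    children:        {concept_id: [child_id, ...]}   (hypothetical layout)
    direct_mappings: {concept_id: target_code}       (hypothetical layout)
    """
    found = []
    queue = deque([(concept_id, 0)])
    seen = {concept_id}
    while queue:
        node, depth = queue.popleft()
        # depth > 0 excludes the starting concept itself.
        if depth > 0 and node in direct_mappings:
            found.append((direct_mappings[node], node, depth))
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return found


if __name__ == "__main__":
    # The middle concept is a placeholder; its real ID is not given in the text.
    children = {
        "MONDO_0000252": ["MONDO_CHILD_PLACEHOLDER"],
        "MONDO_CHILD_PLACEHOLDER": ["MONDO_0019345"],
    }
    direct = {"MONDO_0019345": "A03"}
    print(map_via_descendants("MONDO_0000252", children, direct))
```

Because each result carries its depth, rows found many levels below the starting concept can be reviewed or filtered separately from near matches.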