Open matentzn opened 2 years ago
Sounds great!
Just so it's recorded here, and since the way we import might be impacted by this: the mappings I sent this morning are the most "confident", i.e., those that are an exact match to a string in a label, definition, or synonym, or those that were obtained from an existing dbxref in one of the ontologies or a supporting resource. There are other ways to get mappings (e.g., hierarchical search/traversal for parents or children, and some fancy new recursive search that we can also leverage), and we can explore those in the future if you think they would be useful.
I also want to get your feedback on what I have included in the file since I opted to include a lot of information that makes the file sizes larger and that might not actually be helpful.
Last thing. In case it is helpful, here are all of the sources that the first version includes mappings from, each mapped to Mondo. The number is the count of unique Mondo concepts mapped to each source. There are duplicates here because I am reporting the original way each source names its vocabulary (when I process these, they are normalized on the backend).
Just documenting here per Nico's request.
Tiffany recently produced and explained these ICD10::Mondo mappings:
My basic understanding is that OMOP2OBO was used to generate ICD10/ICD10CM::Mondo mappings. I think an input file (perhaps Mondo itself) was used, because there are some DBXREFs in there, which I imagine were obtained from Mondo. In the absence of direct cross references, exact string matches were used.
In addition to direct mappings (first tab in the file) there were also mappings done between Mondo terms and ICD term ancestors (first tab), and children (second tab). Sometimes ICD terms were mapped to Mondo children (second tab). I assume that in mapping to ancestors or children, there needed to be a starting place, so I imagine that came from the original set of mappings (from Mondo?) used as an input to this process.
@callahantiff If you can correct any of my misunderstanding, that would be great.
Here's the raw text from Tiffany's explanation:
The file has two tabs. Note that the first tab (i.e., “OMOP2OBO_ICD10_ICD10CM_ExactMap”) contains the primary mappings (19,139 Mondo concepts and 6,588 ICD10/CM concepts). These mappings were created using the tested and most confident parts of the new functionality that will become available with the next release. Note that I have only included the exact string matches (to labels, synonyms, and definitions) and dbXRefs. Whenever possible, mappings were created at the concept level, but if a mapping could not be established at this level, then a mapping was attempted at the ancestor level. Currently, this works by traversing the hierarchy, where all parent concepts are searched until a match is achieved. As an improvement over the initial release, when a concept is mapped at the ancestor level, the mapping includes an integer that tells you how many levels (i.e., parent, grandparent, etc.) above the concept the mapping was made. For example, the Mondo concept alopecia, isolated (MONDO_0000005) was mapped to the ICD10 concept nonscarring hair loss (L65.9) via its grandparent concept alopecia (MONDO_0004907). The evidence string provided for this mapping is: “OBO Ancestor: MONDO_0004907 - 2 level(s) above MONDO_0000005 on icd10:L65.9”. I’d love to know if you find this helpful. In the future, I think it could provide useful context for helping to generate a confidence score for the mapping (not something that I have yet, but I would love to implement this in the future).
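To make the ancestor-level search concrete, here is a minimal sketch of how such a traversal could work. This is not OMOP2OBO's actual code: the function names, the `parents` and `exact_matches` dictionaries, and the intermediate parent ID `MONDO_HYPOTHETICAL_PARENT` are all assumptions made for illustration. Only the two Mondo IDs, the ICD10 code, and the evidence string format come from the explanation above.

```python
from collections import deque


def map_via_ancestors(concept, parents, exact_matches):
    """Search upward from `concept`, level by level, until a term with a match is found.

    Returns (matched_term, target_code, levels_above), or None if nothing matches.
    """
    queue = deque([(concept, 0)])
    seen = {concept}
    while queue:
        term, level = queue.popleft()
        if term in exact_matches:
            return term, exact_matches[term], level
        for parent in parents.get(term, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, level + 1))
    return None


def ancestor_evidence(concept, matched_term, target_code, level):
    """Format the evidence string in the style quoted in this thread."""
    return f"OBO Ancestor: {matched_term} - {level} level(s) above {concept} on {target_code}"


# Toy hierarchy: the intermediate parent ID is a made-up placeholder.
parents = {
    "MONDO_0000005": ["MONDO_HYPOTHETICAL_PARENT"],
    "MONDO_HYPOTHETICAL_PARENT": ["MONDO_0004907"],
}
# Assumed lookup of terms that already have a direct (exact or dbxref) match.
exact_matches = {"MONDO_0004907": "icd10:L65.9"}

result = map_via_ancestors("MONDO_0000005", parents, exact_matches)
evidence = ancestor_evidence("MONDO_0000005", *result)
print(evidence)
```

A breadth-first search is used here so that the nearest matching ancestor wins when a term has several parents; whether the real implementation resolves ties that way is an assumption.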
A few important things to note:
- I am still working on the best phrasing for the mapping evidence. Hopefully it makes sense; I tried to make the mappings as transparent as possible.
- The file contains duplicate rows. This is intentional and was done to keep separate the evidence for each of the different ways a mapping can be created between an ICD and a Mondo concept. You can collapse the rows by combining the mappings; I just thought you might like to have them separate for now, since you might prefer certain types of mappings over others (although this should not have an impact on the resulting mapping), and this ensures that the file can be easily filtered. If you need help aggregating the file in this way, just let me know.
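If collapsing the duplicate rows is ever wanted, something along these lines might do it. This is only a sketch: `map_type` and `map_evidence` are column names mentioned in this thread, but `mondo_id`, `icd_code`, the `EXACT_LABEL` type value, and the placeholder evidence strings are assumptions, since the actual file headers are not shown here.

```python
from collections import defaultdict


def collapse_mappings(rows, sep=" | "):
    """Merge rows sharing the same (Mondo, ICD) pair, joining their types and evidence."""
    grouped = defaultdict(lambda: {"map_type": [], "map_evidence": []})
    for row in rows:
        key = (row["mondo_id"], row["icd_code"])
        for col in ("map_type", "map_evidence"):
            # Keep first-seen order and skip repeated values.
            if row[col] not in grouped[key][col]:
                grouped[key][col].append(row[col])
    return [
        {
            "mondo_id": mondo_id,
            "icd_code": icd_code,
            "map_type": sep.join(values["map_type"]),
            "map_evidence": sep.join(values["map_evidence"]),
        }
        for (mondo_id, icd_code), values in grouped.items()
    ]


# Made-up rows for illustration only; the evidence text is a placeholder.
rows = [
    {"mondo_id": "MONDO_0004907", "icd_code": "icd10:L65.9",
     "map_type": "DBXREF", "map_evidence": "placeholder evidence A"},
    {"mondo_id": "MONDO_0004907", "icd_code": "icd10:L65.9",
     "map_type": "EXACT_LABEL", "map_evidence": "placeholder evidence B"},
]

collapsed = collapse_mappings(rows)
print(collapsed)
```

Keeping the rows separate, as in the shared file, makes it easier to filter by `map_type` first (for example, keeping only `DBXREF` rows) and collapse afterwards.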
The second tab (i.e., “OMOP2OBO_ICD10_ICD10CM_ChildMap”) contains mappings from a beta feature that I have been working on, and I included it just in case it might be helpful to you. These mappings are meant to help address the issue that ICD10 tends to be more granular than Mondo. Thus, these mappings take advantage of the ontology's descendant hierarchy. See examples in screenshot below.
In contrast to the approach used when mapping a concept at the ancestor level, here we are searching for more specific mappings in an effort to capture the loss of granularity between ICD and Mondo. So, you can see from above that we are able to extend the Mondo concept inflammatory diarrhea by mapping it to several more specific, but related, ICD10 concepts. The string in the map_evidence column provides an explanation. Take the first row: the mapping evidence states that ICD10 A03 was mapped to MONDO_0000252 via its descendant concept MONDO_0019345, which is two levels below MONDO_0000252. I included in this figure one additional example – Piedra. Please note that I have not manually verified all of these mappings. I did perform a spot-check to remove many of the obviously incorrect mappings. There is still a chance that some errors may exist, but many of the mappings also look pretty good. You can be most confident of the mappings with map_type “DBXREF”. Let me know if you have any questions about these, and please don’t feel like you have to use them; I included them because I thought they might potentially be useful to you.
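For comparison with the ancestor search, a descendant-level search could be sketched as below. Again, this is an illustration under assumptions rather than the beta feature's real code: the function name, the `children` and `exact_matches` dictionaries, and the intermediate child ID `MONDO_HYPOTHETICAL_CHILD` are invented, and collecting every matching descendant (rather than stopping at the first) is a guess based on one Mondo concept being mapped to several ICD10 concepts.

```python
from collections import deque


def map_via_descendants(concept, children, exact_matches):
    """Walk down from `concept` and collect every descendant that has a match.

    Returns a list of (target_code, descendant_term, levels_below) tuples.
    """
    results = []
    seen = {concept}
    queue = deque((child, 1) for child in children.get(concept, []))
    while queue:
        term, level = queue.popleft()
        if term in seen:
            continue
        seen.add(term)
        for code in exact_matches.get(term, []):
            results.append((code, term, level))
        queue.extend((child, level + 1) for child in children.get(term, []))
    return results


# Toy hierarchy: the intermediate child ID is a made-up placeholder.
children = {
    "MONDO_0000252": ["MONDO_HYPOTHETICAL_CHILD"],
    "MONDO_HYPOTHETICAL_CHILD": ["MONDO_0019345"],
}
# Assumed lookup of descendants that already have a direct ICD10 match.
exact_matches = {"MONDO_0019345": ["icd10:A03"]}

descendant_maps = map_via_descendants("MONDO_0000252", children, exact_matches)
print(descendant_maps)
```

The depth recorded with each hit corresponds to the "levels below" figure in the evidence string, which could later feed a confidence score.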
Let's focus on the MONDO/ICD10-related ones for now.
cc @callahantiff