monarch-initiative / mondo-ingest

Coordinating the mondo-ingest with external sources
https://monarch-initiative.github.io/mondo-ingest/
5 stars 3 forks source link

MedGen reports on CUI conflicts with Mondo data #273

Open nicolevasilevsky opened 2 years ago

nicolevasilevsky commented 2 years ago

Related to: https://github.com/monarch-initiative/mondo/issues/2841

From @kanems cc @matentzn

Here is a batch of reports for various CUI/CN-type identifiers that seem to be in conflict between Mondo’s reporting and MedGen’s mapping of the CUIs/Mondo IDs.

A quick note about those CNs: these are generated by MedGen so that we have some external identifier to support terms that may not have been mapped to a concept in UMLS. These often arise due to submissions to our database from ClinVar or GTR users that we determined should be public in MedGen, but for many of the cases in these reports, we generated CN-identifiers to support a Mondo concept that UMLS lumps with other Mondo concept(s) on a single CUI. Our policy is that each external source for MedGen (such as Mondo, OMIM, or HPO) will exist once and only once in MedGen and, whenever possible, each external source ID will be the only one of its kind on a MedGen term (i.e. OMIM ID:CUI should be a 1:1 relationship) with very rare exceptions. Thus, when Mondo has 2 concepts where UMLS has one, one Mondo ID gets mapped to a term with a CN identifier.

The first type of conflict is where Mondo is reporting a CUI from UMLS or a CN-identifier as a current X-ref in Mondo but these are not found as current identifiers in our database. Presumably these could be retired entirely from Mondo’s mappings/x-refs.

The file ‘CUI_CN_notCurrentinMedGen.csv’ lists the MondoID, the type of X-ref we are getting from Mondo and the numeric identifier for that reference. I am noting right at the top of this report there are two “MEDGEN” ref types and the numbers seem to be MedGen UIDs, but the more informative identifier would probably be the UMLS CUI (these are more meaningful to folks who aren’t frequent MedGen users).

The second type of conflict is where there are active, valid CUIs/CNs, but Mondo and UMLS do not agree on the Mondo-CUI mapping. I am including an excel doc with multiple reports that may or may not be overlapping, but I have tried to group these based on the type of conflict. File ‘CUI_mismatch_MedGen_vs_Mondo-MultipleScenarios.xlsx’

Tab 1: Mondo-CUI mismatch, Single Mondo ID: One Mondo ID is mapped to One CUI in Mondo, but MedGen team has mapped to a different, single CUI in our data. Tab 2: Mondo-CUI mismatch, Single CUI: Mondo links Mondo ID(1) to a CUI, but UMLS maps Mondo ID(2) to that CUI (these may be concepts that Mondo wants to consider merging) Tab 3: Mondo reports >1 CUI per ID, but MedGen maps 1 MondoID to 1 CUI; We report the single Mondo ID, a link to view how we represent this data in MedGen, the Mondo ID and CUIs used for the MedGen query, and lastly which CUI which we use to map the Mondo ID. These may be scenarios where UMLS has too many CUIs for a single disease concept or Mondo has mixed X-refs to more than one concept from UMLS. This list may be overlapping with the first report of ‘notCurrent’ CUI/CN IDs. This tab is specific to CUIs, does not include any CN identifiers. Tab 4: Mondo reports >1 CUI or CN per ID, but MedGen maps to only 1 of those. This is limited to MondoIDs with reported (or actual) mapping to a CN identifier (which were filtered out in the above list). These also may be partially covered/cleaned up form the first ‘notCurrent’ report.

CUI_mismatch_MedGen_vs_Mondo-MultipleScenarios.xlsx

CUI_CN_notCurrentinMedGen

joeflack4 commented 1 year ago

Hmm, @nicolevasilevsky @matentzn. I tried my best to understands like there are two groups of cases (1) mappings that are no longer valid (probably because obsolete/removed from Medgen) and need to be removed from Mondo, and (2) various forms of mappings that need changing.

I'm not sure if these tasks are within my scope, at least not yet. I will stay posted!

matentzn commented 1 year ago

Lets deal with this issue when we deal with the entirety of Medgen - after GARD and ICD11.

joeflack4 commented 1 year ago

@kanems I'm just getting to this now. If you have any updates to the table of conflicts, this would be a good time to submit them!

kanems commented 1 year ago

Joe- can it wait ~5 days? We will run our processing to bring in the July 2023 Mondo release on the 10th and then I can provide the most up to date version of our reports. Data conflicts that we reported before that I will generate again:

  1. Mondo reports a CN-type CUI or UMLS CUI that MedGen does not have as an active ID.
  2. Conflicts in CUI mapping to a Mondo ID between MedGen's mappings and Mondo's reported X-refs. (The conflicts could represent a hand of cases, and I'll sort them out based on the specific conflict: Scenario A: Per Mondo, Mondo:X maps to CUI 1, but MedGen maps Mondo:X to CUI 2. Scenario B: Mondo maps Mondo:X to CUI 1, MedGen maps Mondo:Y to CUI 1. Scenario C: Mondo:X maps to CUI 3 and CUI 4, MedGen can only have 1 Mondo ID: 1 CUI, some of the Mondo CUIs may be obsolete. Scenario D: Mondo:X maps to CUI 5 and Cui CN1; one of these is incorrect.) If you want something sooner to get started on, I can get you the list for report #1 before the 10th, as I think that will have sufficient data that requires review.

-Megan

From: Joe Flack @.> Sent: Thursday, July 6, 2023 11:46 AM To: monarch-initiative/mondo-ingest @.> Cc: Kane, Megan (NIH/NLM/NCBI) [C] @.>; Mention @.> Subject: [EXTERNAL] Re: [monarch-initiative/mondo-ingest] MedGen reports on CUI conflicts with Mondo data (Issue #273)

@kanemshttps://github.com/kanems I'm just getting to this now. If you have any updates to the table of conflicts, this would be a good time to submit them!

- Reply to this email directly, view it on GitHubhttps://github.com/monarch-initiative/mondo-ingest/issues/273#issuecomment-1623906271, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AQ6KOUSZSPAUX5DMFUYFJ23XO3MUBANCNFSM6AAAAAAW3Y7D2U. You are receiving this because you were mentioned.Message ID: @.**@.>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and are confident the content is safe.

matentzn commented 1 year ago

@kanems Awesome THANK you!

@joeflack4 -> lets focus on ingest while @kanems is regenerating the reports!

joeflack4 commented 1 year ago

@matentzn Sounds good; thanks @kanems !

kanems commented 1 year ago

README_July2023_CUIReports_FromMedGentoMondo.txt July2023_CUIReports_FromMedGentoMondo.xlsx Here are the reports again along with a description file (README) for the different tabs/reports in the excel file.

These reports should be up to date with the July 2023 Mondo release. We're still waiting on a handful of Mondo obsoletions/mappings to be sorted out on open tickets, but this batch report should be comprehensive otherwise of issues we would like to resolve regarding CUI mapping to Mondo records. If you have any questions specific to these reports, please let me know.

joeflack4 commented 1 year ago

@kanems Thanks! I will be taking a look at this next week and will get back to you.

joeflack4 commented 1 year ago

@kanems Alright, I finally got to analyzing this more thoroughly. So far I have probably an 80% understanding of the 5 reports/sheets. The tab names and descriptions in the new XLSX are a little different from before, so I will just check (unless you can tell me) if the nature of these reports are exactly the same as before in the original post, or if they are a little different.

I'm going to meet with Nico tomorrow and hopefully go through my observations / questions with him first, and then I hope to come back to you with some questions and an action plan!

kanems commented 1 year ago

I ran the same basic queries on our database, but I did update the headers for the reports to try and be internally consistent and a little more clear (or so I thought) vs. what we sent over last year. If the specific data conflict in each report isn't clear from the readme, we'll be happy to hop on a call with you and Nico to go through one or two cases from each list so you can get a more concrete example.

joeflack4 commented 1 year ago

Thanks for the clarification, @kanems . Nico and I discussed and some things are clearer. I'm going to look at the Excel this weekend, and come back to you if I have any questions.

For now, I do have some immediate questions.

Question 1

Nico said:

Should we replace ALL of our current Mondo->Medgen mappings with the up-to-date Medgen->Mondo mappings that you have created?

Also a question from me to you in light of Nico's question; I didn't have a chance to ask him. I'm wondering what he meant by "mappings that you have created"? I only see a list of conflicts from you here. Have you created a list of mappings elsewhere that he is talking about? If so, can you link/inform me? If not, then maybe I transcribed his question incorrectly.

Question 2

Additionally, the approach in question 1 (using mappings you may have created) is potentially in conflict with another instruction I got from Nico (@Nico sorry for the confusion; feel free to hop in here to clarify if you see this). Originally, the approach we discussed was that I would download medgen, run an ingest to convert to .owl, consider mappings that Medgen makes directly to Mondo, and also find 'proxy mappings' between Medgen and Mondo. For example, if Medgen:123 is mapped to Orphanet:456 is mapped to Mondo:789, we would accept the mapping of Medgen:123 --> Mondo:789. So basically what I'm asking you is, if there are no such mappings that you've specifically created as described in 'question 1', do you have any objections to this mapping strategy I just described here?

kanems commented 1 year ago

@joeflack4 @matentzn There are two ways to address these discrepancies between Mondo and MedGen's mappings (specifically we're hoping to get CUIs and CNs updated in Mondo, but MedGen also has 'UIDs' which are just numeric identifiers that are used for individual entries on our web interface.)

  1. Resolve only the current data conflicts, which are covered in the reports I sent that focus on specific Mondo IDs and the UMLS/MedGen x-refs that seem incorrect or out of alignment with our mappings. I thought this was the plan on your end, since a whole new ingest pipeline might be more than you have bandwidth to handle?

  2. Mondo can take in and try to accommodate all of MedGen's x-ref mappings (CUI: Mondo : HPO : OMIM:Orpha:... etc) from our FTP report (root FTP page is https://ftp.ncbi.nlm.nih.gov/pub/medgen/ ) MedGenIDMappings.txt.gz . Based on your questions above, it sounds like this is what you are considering now instead?

RE; Question 1: In our latest download of Mondo, I am looking at the X-refs we imported and stored but I do not see "MedGen" as a class of x-refs ( what I can find are: UMLS , GARD, en#, obo, MEDDRA, ICD10CM, OMIM, Orphanet, SNOMEDCT, MESH, OMIMPS)
So I don't know what Nico considers as Mondo->MedGen mappings, because it's not reported explicitly as such (at least not as we have converted the data for storage in our SQL databases...) .
The identifiers that should be reported as coming from MedGen are the CUIs that start with "CN" since those are not UMLS-generated CUIs. And a great many of those reported in Mondo right now are incorrect, so I could see some utility in just removing all of those UMLS:CN####### styled x-refs, then bringing in the current CN:Mondo mappings from MedGen's FTP file mentioned above, attributing them to MedGen instead of UMLS. (From the FTP file, you could select out just the rows where CUI like 'CN' and source_id like 'MONDO:' and that should give you the CN:Mondo ID mappings that are current)

RE: Question 2: MedGen should have available (in the FTP file above) our mappings to all Mondo IDs that are in-scope for MedGen (we don't display some categories like poisoning, injury, non-human diseases, and radiation-induced diseases, since they are not specifically genetic in their nature or not relevant to humans) It's possible that your suggested proxy mapping strategy will catch some Mondo IDs that we haven't processed correctly, in which case we'll be glad to fix that on our end and make the Mondo ID available in MedGen.

How MedGen maps Mondo IDs and other sources will be covered in our presentation next week, but if there are still questions after that, we can meet (I probably would need to loop in a developer who understands our scripts that handle the automated mappings and Mondo data processing; I only understand the big principle concepts for how those pipelines work).

joeflack4 commented 1 year ago

Thanks a lot @kanems for your answers/info!

Approaches and which to take

  1. Approach 1: Fix existing Mondo->MedGen mappings and utilize your detailed conflict analysis.
  2. Approach 2: New Mondo->MedGen mappings based on explicit and proxy mappings taken from MedGen.

I took care to look through and understand the nature of all of the conflicts you sent me. But after analysis/thought/discussion, I think it is likely that we're going to go with approach 2. The work will be done / released in our MedGen ingest repo, which uses the FTP source. Reasons not to do approach 1: (i) I get the impression that the MedGen's mappings to Mondo are higher quality than Mondo's MedGen mappings (as can be seen by the conflicts), and (ii) perhaps counterintuitively, it is actually easier to go with approach 2. I don't think we have a lot of curator bandwidth for addressing the conflicts, but I do have the bandwidth to do this new mapping / ingest strategy. And fortunately for me, Chris Mungall did a ton of the heavy lifting a couple years ago for this.

A few FYIs

FTP source files used

MedGenIDMappings looks very convenient. I checked to see if Chris utilized this in the ingest, but he did not. However, I think that he was able to get the same result using several RRF files. I'll do a QC check to see if that's the case.

UMLS vs MedGen prefixes

I saw your comment about not finding mappings labeled MedGen in Mondo. We think that some of these mappings should continue to use the UMLS prefix, but others should use MedGen. Here's our strategy. If you think there's a mistake here or have any other input, let me know!

i. If ID starts with CN, use MedGen prefix ii. Else if ID starts with C, add 2 duplicate entries: 1 with UMLS prefix, another with MedGen prefix Nico and I only discussed this briefly, so I wouldn't be surprised if we drop one of these duplciates. Any thoughts? iii. Else, it is a MedGen UID and we can ignore / not import it _I'm not 100% sure if we'll continue to do that though. We could also give them a prefix MedGen or MedGen_UID. Any thoughts?_

MedGen ID overlap/difference analysis

I'm not sure how useful this is for you, but this is mainly for @matentzn's purposes. This file looks at all the referenced MedGen IDs in both Mondo and MedGen, and shows if they exist in one, the other, or both. I think this overlaps with / is a different way of looking at some part(s) of your conflict analysis.

medgen_terms_mapping_status.tsv.zip

My next steps

I've begun work on "approach 2". I expect within the next 2 weeks to have new proposed mappings. I'll share them with you; I'm glad to hear that you might find the proxy mappings useful!

More questions

  1. I think that after I finish this ingest work, it could be good to rerun your conflicts analysis. Do you have that analysis set up in an automated way by any chance? If so, it could be something that I could add to the test suite for our MedGen ingest.
  2. @kanems (cc @matentzn) Any chance you could share the time / link for the presentation that you mentioned? About how "How MedGen maps Mondo IDs and other sources".
matentzn commented 1 year ago

@joeflack4 this all sounds reasonable.

@kanems:

I am wondering about the difference between what @joeflack4 calls Medgen UIDs and UMLS:CN terms. Both identifier schemes seem to be generated by Medgen - why are they distinct? Which do you consider the "offical external MEDGEN ID"?. Its entirely possible I am misunderstanding something.

@joeflack4 -> I am pro redundancy and would like all the MedGen sourced IDs separately from UMLS. But I dont really understand the above, and I dont really want MEDGEN: and MEDGEN_UID: as seperate spaces.

kanems commented 1 year ago

@joeflack4 questions first... Regarding the reports we sent of specific conflicts: If the goal is to bring in UMLS CUIs separately, then can you at a minimum remove the CUI/CN IDs from "MondoXrefs_NotinMedGen" since we are confident that these are not current/correct IDs to attribute to UMLS? Hopefully this would be straightforward and help lots of users who also use UMLS CUIs. (from the readme file I sent: Mondo ('mondo_id') reports a reference of type ..."UMLS" ... and the errant reference ID ('ref_target') is provided. There are 3 scenarios here: A) the ref target provided is a numeric only string but it is attributed to UMLS ref_type; these numeric IDs are actually MedGen UID (unique Identifiers) (example: 576470 This is a MedGen UID, link built to MedGen's entry like this: https://www.ncbi.nlm.nih.gov/medgen/?term=576470) B) a UMLS CUI is used but is not a current CUI. (example; C0011127 C0011127 last appeared in the 2022AB version of the UMLS. It was merged into C4554531.) C) a CUI of 'CN' is used AND attributed to UMLS; this is problematic because UMLS does not use CN-type CUIs, those are Unique to MedGen, BUT the CNs on this report are all obsolete. (Example: UMLS CN034131; if you search this on MedGen (i.e. https://www.ncbi.nlm.nih.gov/medgen/?term=CN034131) you see a small message the top of the page that says "CN034131 has been replaced by C4551509, showing C4551509" ) Not all the obsolete CNs will cleanly redirect to a current MedGen entry, a known issue with our system, so it will be best to remove these CNs entirely.

  1. ... Do you have that analysis set up in an automated way by any chance? If so, it could be something that I could add to the test suite for our MedGen ingest.

We have these queries saved, yes, but they are currently written to run on our SQL server database internally. These queries cannot be run on the FTP tables/files. I tried to explain the logic for each report in the readme, let me know if you would need more information to tailor these data mapping checks to run on your side.

  1. Any chance you could share the time / link for the presentation that you mentioned? About how "How MedGen maps Mondo IDs and other sources".

https://mondo.monarchinitiative.org/pages/workshop/#outreach MedGen is presenting this Friday for the Mondo outreach call.

@matentzn

am wondering about the difference between what @joeflack4 calls Medgen UIDs and UMLS:CN terms. Both identifier schemes seem to be generated by Medgen - why are they distinct?

All records in MedGen are assigned a UID; that is an NCBI-wide identifier scheme that is used by our Entrez indexing system.
All records in MedGen also must have a CUI. THe CUIs primarily come from UMLS, but not all concepts in MedGen also exist in UMLS. So for the records without a UMLS CUI, we generate a CUI that starts with 'CN' to distinguish it from UMLS-generated CUIs.

Which do you consider the "offical external MEDGEN ID"?

The CUI, all records in MedGen will have either a unique C# or a unique CN#, but never both at the same time.

These questions are helpful as I finalize the slides for Friday, hopefully anything that isn't clear at this point will be clarified then. See you on the outreach call! :-)

joeflack4 commented 1 year ago

@matentzn

I don't really want MEDGEN: and MEDGEN_UID: as separate spaces

Roger that; I suppose there is no useful reason they need to be separated.


@kanems Thanks for your answers! I'll be at that meeting. I think I have a few simple questions I can ask there.

Megan: can you at a minimum remove the CUI/CN IDs from "MondoXrefs_NotinMedGen"

Definitely! As part of our approach ('approach 2'), we'll be discarding all prior UMLS/MedGen mappings in Mondo and replacing w/ new ones. I think this process will take several weeks. But if you think that there is any urgency, we can remove these specific mappings specifically without delay.

Regarding the tests / QC checks / conflict automation: I see; they run on your DB. It's likely the case that I will not have the bandwidth to replicate some or all of these tests on my end in the near term, but I would like to, especially since I've set up a GitHub action to fetch from FTP and run the ingest weekly, and it would be good to catch issues early. I'll make a GitHub issue for myself to try and stay on top of that.

matentzn commented 1 year ago

Since the CUI and UID ids have a 1:1 correspondence, we can work out a step by step plan for this.

I think UMLS:CN# entries are misleading in general. The fact that they are "sort of like" UMLS:C# mapping I think is not strong enough to publish them as UMLS mappings. So we may need a separate identifier space after all, e.g. MEDGEN_CUI:. This contradicts what I said earlier, but I now understand the situation a bit better. So concretely, I would to the following steps:

@kanems, to manage this process more easily in the future, would you and your team be willing to publish a specific, standardised table that we can use to make this process easier in the future? Your tables would have to be slightly reformatted only..

joeflack4 commented 1 year ago

@kanems Hey Megan~. Just wanted to give you a quick update. I've processed all of your conflicts and recommendations in this PR. It's now in the review / merge stage.

I was wondering if you could take a look at this comment. Basically, some of the new mappings you recommended were not found in MedGenIDMappings.txt as MedGen terms mapped directly to Mondo. Wondering if maybe that file needs updating.

kanems commented 1 year ago

@joeflack4 I can look through the IDs that weren't explicit matches in the MedGen report and see if there is still some clean up needed on our end. From the other comment you linked to, you are correct that our ad hoc analyses to generate the list of conflicts is separate from the MedGenIDs file, so I am not surprised that there were some gaps. I'll see what I can learn from the gaps to inform the curation and see what maybe missing from our FTP file. I know some of our FTP files have issues, but I thought we were properly generating the IDs file; having explicit examples of missing data may help our developers with the data issue, but we are quite backed up with developer tasks so I don't know how long it'll be before we could look into it. But overall, it's less than 1% that seems to be a difference between what you could process from the FTP data and the specific conflicts we reported, so that's fantastic.

joeflack4 commented 1 year ago

Sounds good. I'm not terribly worried about it at the moment. Whenever you have time! If you guys have a GitHub repo where you want me to open a ticket about this, I'd be happy to.

Regarding that list of IDs, I actually linked the wrong comment earlier. Here's the correct comment. And here's the list of IDs extracted from the .zip in that comment:

medgen-mondo matches in megans conflicts file but not in MedGenIDMappings.txt

Of the 278 entries, only 4 are CUIs and the rest are CNs.