monarch-initiative / mondo

Mondo Disease Ontology
http://obofoundry.org/ontology/mondo
Creative Commons Attribution 4.0 International

MedDRA to MONDO cross-references #3122

Open ireneisdoomed opened 3 years ago

ireneisdoomed commented 3 years ago

Hi,

from an analysis of FAERS data, we have obtained a list of 1,770 MedDRA to MONDO mappings that we believe have a very high level of confidence: meddra-mondo-xref-0995-1.json.zip

This is part of an emerging pipeline we are working on to integrate the different methods of automatic disease mapping into a single tool.

The algorithm performs very well as long as the term is already captured in the ontology. However, we are aware of a systematic error when the MedDRA label refers to a less common/more severe disorder in a set of diseases and we map it to the closest EFO term, e.g.:

{"traitId":"10032168","traitName":"Other insomnia","score":1,"efoId":"EFO_0004698","efoName":"insomnia"}
{"traitId":"10031712","traitName":"Other cardiovascular syphilis","score":0.999,"efoId":"EFO_1001206","efoName":"syphilitic aortitis"}
{"traitId":"10032033","traitName":"Other endocarditis","score":1,"efoId":"EFO_0000465","efoName":"endocarditis"}

Could you please examine the mappings and consider how best to include this annotation in the ontology? The SPOT team has already imported the MedDRA xrefs whose labels match an EFO/Orphanet term and are now in the process of curating the ones of less confidence. Both EFO and Open Targets would benefit a lot if you could include them similarly.

Any feedback or comments are much appreciated.

Thanks!

ireneisdoomed commented 3 years ago

Hi! Could you please provide some follow-up on this issue? I would love to know if you have this work on the scope for the coming months. Please let me know whether you have any comments or questions. Thanks!

matentzn commented 3 years ago

Hey @ireneisdoomed this is great work - but human review is extremely costly. I am afraid it is difficult to verify all these mappings in one go. To get this going would you be able to supply the mappings as an SSSOM tsv? The JSON file cannot really be handled by our curator team, and we use Google Sheets for all review activities.

Here are some examples: https://github.com/mapping-commons/mh_mapping_initiative/tree/master/mappings

This will enable me to share the mappings on google sheets and have some of them reviewed.

Also, it would be hugely informative to know the mapping justification: why is a match considered a match? Also, give us a bit more metadata so we can accurately attribute the match to you and your team. What is the tool you used, which version? Who is the creator/team that published the mapping?
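As an illustration of the request above, here is a minimal sketch of turning JSON records shaped like the ones in the first comment into an SSSOM-style TSV. The column names are SSSOM slots; the `MEDDRA:` prefix, the default `skos:exactMatch` predicate, the `semapv:LexicalMatching` justification, and the helper names `to_sssom_rows` and `write_tsv` are assumptions made for this example, not what Open Targets actually produced.

```python
import csv
import io
import json

# Records in the shape shown in the first comment (one JSON object per line).
JSON_LINES = """\
{"traitId":"10032168","traitName":"Other insomnia","score":1,"efoId":"EFO_0004698","efoName":"insomnia"}
{"traitId":"10032033","traitName":"Other endocarditis","score":1,"efoId":"EFO_0000465","efoName":"endocarditis"}
"""

COLUMNS = ["subject_id", "subject_label", "predicate_id",
           "object_id", "object_label", "mapping_justification", "confidence"]


def to_sssom_rows(lines, predicate="skos:exactMatch",
                  justification="semapv:LexicalMatching"):
    """Convert JSON-lines records to dicts keyed by SSSOM column names."""
    rows = []
    for line in lines.splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        rows.append({
            "subject_id": "MEDDRA:" + rec["traitId"],        # assumed prefix
            "subject_label": rec["traitName"],
            "predicate_id": predicate,                        # placeholder predicate
            "object_id": rec["efoId"].replace("_", ":", 1),  # EFO_0004698 -> EFO:0004698
            "object_label": rec["efoName"],
            "mapping_justification": justification,           # assumed value
            "confidence": rec["score"],
        })
    return rows


def write_tsv(rows):
    """Serialise rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(write_tsv(to_sssom_rows(JSON_LINES)))
```

A TSV like this can be opened directly in Google Sheets. Creator and tool metadata would still need to be supplied separately.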

I tried to reach you via email as well, but now I lost access to that (thanks to the new EBI regulations with VPN) - I would still be interested to talk to you about your efforts to sync up a bit and avoid redoing too much work on our end (if you do all this work, we don't have to :P).

nicolevasilevsky commented 3 years ago

Update - Irene will provide an SSSOM file with mappings.

ireneisdoomed commented 2 years ago

Hello!

As @matentzn suggested in our conversation in August, we want to be able to accurately establish what are the levels of correspondence between terms so I've mapped our values of confidence to the corresponding predicates as follows:

| OT confidence | EFO relation | SSSOM predicate |
|---|---|---|
| 1.0 | exactName | owl:equivalentClass |
| 0.999 | exactSynonyms | skos:exactMatch |
| 0.998 | narrowSynonyms | skos:narrowMatch |
| 0.997 | broadSynonyms | skos:broadMatch |
| 0.996 | relatedSynonyms | skos:relatedMatch |
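
The confidence-to-predicate correspondence above can be sketched as a simple lookup. This is only an illustration of the table as posted: the function name `predicate_for` and the choice to key on a three-decimal string (to avoid float comparison surprises) are assumptions, not part of the Open Targets pipeline.

```python
# Lookup copied from the table in this comment: OT confidence -> predicate.
# Keys are three-decimal strings so that 1, 1.0 and "1" all resolve the same way.
CONFIDENCE_TO_PREDICATE = {
    "1.000": "owl:equivalentClass",
    "0.999": "skos:exactMatch",
    "0.998": "skos:narrowMatch",
    "0.997": "skos:broadMatch",
    "0.996": "skos:relatedMatch",
}


def predicate_for(score):
    """Return the predicate for a confidence score, or raise ValueError."""
    key = f"{float(score):.3f}"
    if key not in CONFIDENCE_TO_PREDICATE:
        raise ValueError(f"no predicate defined for confidence {score!r}")
    return CONFIDENCE_TO_PREDICATE[key]


if __name__ == "__main__":
    for s in (1, 0.999, 0.996):
        print(s, predicate_for(s))
```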

To give you an idea of the quality of these MedDRA to MONDO mappings, here is an extract of how the predicates are distributed across the ≈ 2 000 xrefs:

| predicate_id | count |
|---|---|
| skos:exactMatch | 949 |
| owl:equivalentClass | 931 |
| skos:relatedMatch | 129 |
| skos:narrowMatch | 32 |
| skos:broadMatch | 9 |

Please let me know if you think the mappings are not equivalent. Here is the table in the SSSOM format: meddra-mondo-xref.csv.zip

Attribution

  • Author: Miguel Carmona
  • ORCID iD: 0000-0002-7582-4771
  • Tool: opentargets-disease-meddra-word2vec v1
  • Organisation: Open Targets

matentzn commented 2 years ago

Wow, supplying this file in sssom format is so super cool. We are also about to publish the first stable release of the spec, for example: https://mapping-commons.github.io/sssom/Mapping/

The file looks good, but just to double check on your curation rules:

Some questions about that.

  1. You don't do anything if two exact synonyms match? Or is it just the case that for MedDRA you do not have synonyms?
  2. Does this list include only mappings that are not already in mondo, or are these all matches you found?
  3. What is the preprocessing going on on the labels/synonym fields? Do you do:
    • lower casing
    • stemming
    • anything else?

Very excited to see this, thank you!

ireneisdoomed commented 2 years ago

Hi @matentzn! Yes, that is exactly the rationale the curation follows.

  1. We haven't explored that. Correct me if I am wrong, but we don't have access to such synonyms since the MedDRA ontology is licensed.
  2. These are all the matches we have found, i.e. a MedDRA term that our algorithm maps to an existing MONDO term.
  3. There is a normalisation process being done. As you say: lower casing, stemming, lemmatization. You can see the code here: https://github.com/opentargets/platform-etl-backend/blob/master/scripts/opentargets-disease-meddra-word2vec.sc
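
To make the normalisation step in point 3 concrete, here is a rough sketch of label normalisation. It is not a port of the linked Scala script: `crude_stem` is a deliberately naive plural stripper standing in for real stemming and lemmatization, and both function names are hypothetical.

```python
import re

# Anything that is not a lowercase letter or digit is treated as a separator.
_NON_ALNUM = re.compile(r"[^a-z0-9]+")


def crude_stem(token):
    """Naive plural stripping; a stand-in for a real stemmer or lemmatizer."""
    if (len(token) > 3 and token.endswith("s")
            and not token.endswith(("ss", "is", "us"))):
        return token[:-1]
    return token


def normalise_label(label):
    """Lowercase, drop punctuation, collapse whitespace, then stem each token."""
    tokens = _NON_ALNUM.sub(" ", label.lower()).split()
    return " ".join(crude_stem(t) for t in tokens)


if __name__ == "__main__":
    for raw in ("Other  Insomnia", "Sleep disorders, NOS", "Endocarditis"):
        print(repr(raw), "->", repr(normalise_label(raw)))
```

Two labels would then be compared on their normalised forms rather than on the raw strings.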

I hope that helps!

matentzn commented 2 years ago

Thank you @ireneisdoomed I have this on my list of things to deal with now, but it may take a few weeks before I get to it.

Very excited about all that and thank you for this contribution!

Just a few questions: 1) Would you be willing to keep this mapping "up to date", i.e. deploy the sssom file somewhere public and run the mapping code periodically to catch new and changed terms? 2) If I were to assist, would you be willing to provide additional metadata to the table (columns, etc)? 3) completely independent of the above, would you be able to provide a list of mappings that are not already in Mondo?

Thank you!

ireneisdoomed commented 2 years ago

Thank you @matentzn, I will be very happy to know your comments.

  1. Sure, this is something that we can do. These MedDRA terms are picked up from our FAERS pipeline in the first place. We could implement a systematic cross-reference lookup if that is beneficial for all of us.
  2. Yes.
  3. I am not sure I understand what you mean. Do you want a list of the MedDRA terms that we have not been able to accurately enough map to any ontology?

matentzn commented 2 years ago

Re 3: I meant if there is a mapping MONDO:001 --[skos:exactMatch]--> Meddra:001 that is already in Mondo itself, would it appear in your mapping table? Or did you already remove all the mappings that are already in Mondo? Anyway, do not worry :) Just wondering.

ireneisdoomed commented 2 years ago

Thank you for the clarification. I don't think any checks of whether the Xref is already present were done.

matentzn commented 2 years ago

@ireneisdoomed thank you!

@joeflack4 this is a very important ticket to me, but I don't want you to get too sidetracked. Could you do a quick Python script taking @ireneisdoomed's spreadsheet, and comparing it to

INPUT: Irene's table, and the three tables above. OUTPUT: Irene's table with only the rows that are not in any of these three files.

For now, the key should just be subject_id, object_id, ignoring predicate id.
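
A minimal sketch of the filter described above, assuming all inputs are TSV text with `subject_id` and `object_id` columns and that any `#`-prefixed metadata lines can be skipped. The function names and the sample rows (placeholder IDs in the style used elsewhere in this thread) are made up for illustration.

```python
import csv


def _reader(tsv_text):
    """DictReader over TSV text, skipping '#' metadata/comment lines."""
    lines = [ln for ln in tsv_text.splitlines() if ln and not ln.startswith("#")]
    return csv.DictReader(lines, delimiter="\t")


def read_pairs(tsv_text):
    """Set of (subject_id, object_id) keys; predicate_id is ignored on purpose."""
    return {(row["subject_id"], row["object_id"]) for row in _reader(tsv_text)}


def rows_not_in(candidate_text, reference_texts):
    """Rows of the candidate table whose key appears in none of the references."""
    known = set()
    for text in reference_texts:
        known |= read_pairs(text)
    return [row for row in _reader(candidate_text)
            if (row["subject_id"], row["object_id"]) not in known]


if __name__ == "__main__":
    # Placeholder data, not real mappings.
    candidate = (
        "subject_id\tpredicate_id\tobject_id\n"
        "MONDO:001\tskos:exactMatch\tMEDDRA:001\n"
        "MONDO:002\tskos:exactMatch\tMEDDRA:002\n"
    )
    reference = (
        "subject_id\tpredicate_id\tobject_id\n"
        "MONDO:001\tskos:broadMatch\tMEDDRA:001\n"
    )
    for row in rows_not_in(candidate, [reference]):
        print(row["subject_id"], row["object_id"])
```

Because the key ignores `predicate_id`, a row is dropped even when the existing mapping uses a different predicate. Note that this only works if both sides write `object_id` in the same form (same prefix and casing).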

matentzn commented 2 years ago

This ticket is not urgent, just important, there is no tag for that :)

nicolevasilevsky commented 2 years ago

Feel free to create a new label (we have 'high priority' and 'low priority'). Should we have a mid priority? Or we can assign it to a milestone, like Feb 2022?

matentzn commented 2 years ago

Sounds good!

joeflack4 commented 2 years ago

@matentzn Understood. Also thanks, those are very concise instructions; exactly what I'm looking for.

There is one issue that I see, though. The formatting for object_id differs between meddra-mondo-xref.csv and the 3 TSV files.

Example IDs: meddra-mondo-xref.csv

10004906
10007113
10011000
10011501

One of the TSV files (all of them have same formatting)

MESH:D006319
MESH:C536366
MESH:C567574
MESH:C567575

How to compare objects?

a. object_id

I'm assuming these are entirely different ID systems? If I remove all but the 6 characters on the right, they're both numeric, but looking at a few samples, I think they're just different ID systems. So I don't know if this is possible.

b. formatted object_label, exact match

If that's the case, should I then instead match on the specially formatted object_label (i.e. lowercase and remove special chars)?

c. formatted object_label, high string similarity

If there are likely to be some differences in the labels even if I lowercase both of them, I could compare based on string similarities between formatted object_label in both sets, considering them to be matched only if they meet a very high threshold, something like 95%~ similarity.
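
Options b and c could be sketched as follows, using the standard library's `difflib.SequenceMatcher`. Its `ratio()` is only one possible similarity measure, and the 0.95 threshold simply mirrors the figure floated above. The names `format_label` and `labels_match` are hypothetical.

```python
import re
from difflib import SequenceMatcher


def format_label(label):
    """Option b formatting: lowercase and remove special characters."""
    return re.sub(r"[^a-z0-9 ]+", "", label.lower()).strip()


def labels_match(a, b, threshold=0.95):
    """Option c: match when the formatted labels are at least `threshold` similar.

    With threshold=1.0 this reduces to option b (exact match after formatting).
    """
    ratio = SequenceMatcher(None, format_label(a), format_label(b)).ratio()
    return ratio >= threshold


if __name__ == "__main__":
    print(labels_match("Insomnia", "insomnia!"))
    print(labels_match("Other insomnia", "insomnia"))
```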


What do you think?


_Edit 2021/10/22:_ Nico updated his comment; these are MedDRA IDs, not MESH.

ireneisdoomed commented 2 years ago

Hi @matentzn @joeflack4,

how is this work going? I'm interested to know if you have any initial thoughts on the quality of the data.

Thanks!

matentzn commented 2 years ago

Hey @ireneisdoomed - we are at the moment all in on getting the Mondo paper done, so there is a bit of a delay in addressing the mapping issues. I am sorry. Please indicate if there is a rush for something that you or Open Targets specifically need urgently, else we will resume working on mappings after Christmas!

Apologies!

matentzn commented 2 years ago

This is how we will conceptually work with mappings like meddra:

  1. We add them as source mappings into disease mapping commons (https://github.com/mapping-commons/disease-mappings/issues/13)
  2. Mondo slurps harmonised mappings up and adds them back into Mondo.owl

We are getting closer to working in this cycle. Meddra is not yet a priority, but it will be later in the year, after:

  1. ICD10CM
  2. ORDO/OMIM (not much to do)
  3. DO
  4. NCIT
  5. (possibly SNOMED)
  6. Meddra (this ticket)

Sorry @ireneisdoomed this is taking so long, we are doing what we can :)

ireneisdoomed commented 2 years ago

Hi @matentzn. Thanks for the update. Could you tell me if the plan is still to review these manually?

Please let me know when you start working on this so that I can rerun the pipeline and provide a more up-to-date table.

matentzn commented 2 years ago

The way this works is like this:

  1. We cycle all mapping candidates through an ontology merging pipeline (KBoom).
  2. This will reveal all mappings that cause contradictions.
  3. These mappings will be manually reviewed.
  4. Mappings are accepted if no contradictions are left.

So no matter what, not all mappings are reviewed 1:1 - this is not possible with the more than 150K mappings we need to manage.

sabrinatoro commented 3 months ago

@matentzn can we close this issue or is there more to be done? Thanks!

matentzn commented 3 months ago

This ticket is not done but can be moved to mondo ingest repo.

It needs to be included in any concerted efforts integrating Meddra again.