monarch-initiative / mondo-ingest

Coordinating the mondo-ingest with external sources
https://monarch-initiative.github.io/mondo-ingest/

Improvement of lexical matches #78

Open matentzn opened 2 years ago

matentzn commented 2 years ago

Issues with the current matching to deal with

nicolevasilevsky commented 2 years ago

I have a list of mappings I didn't agree with here.

NV to do:

For ICD10 mappings

nicolevasilevsky commented 2 years ago

For orphanet mappings

hrshdhgd commented 8 months ago
  • [ ] missed synonyms (bag of words matching could be better than relying on synonyms?): neuroleptic malignant syndrome -> Malignant neuroleptic syndrome not caught

This is untrue: doid.sssom.tsv line 24163

Here's a table representing the data you've provided:

| subject_id | subject_label | predicate_id | object_id |
|---|---|---|---|
| DOID:14464 | Neuroleptic Malignant Syndrome | oboInOwl:hasDbXref | ICD10CM:G21.0 |

Also icd10cm_mapping_status.tsv line 95175

| subject_id | subject_label | is_mapped | is_excluded | is_deprecated |
|---|---|---|---|---|
| ICD10CM:G21.0 | Malignant neuroleptic syndrome | True | False | False |
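For reference, the "bag of words" idea from the checklist item can be sketched as below. This is only an illustration of order-insensitive label comparison, not how the current matching pipeline works; the function name `bag_of_words_key` is hypothetical.

```python
import re


def bag_of_words_key(label: str) -> tuple:
    """Return a case-insensitive, order-insensitive key for a label.

    The label is lowercased, split into word tokens, and the tokens are
    sorted, so two labels with the same words in a different order get
    the same key.
    """
    return tuple(sorted(re.findall(r"\w+", label.lower())))


doid_label = "Neuroleptic Malignant Syndrome"
icd_label = "Malignant neuroleptic syndrome"

# Both labels contain the same three words, so their keys are equal.
labels_match = bag_of_words_key(doid_label) == bag_of_words_key(icd_label)
print(labels_match)
```

A key like this would also equate labels whose word order changes the meaning, so it is probably better suited to proposing candidates for review than to asserting exact matches.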

The semapv:UnspecifiedMatching

That was the placeholder we decided to put there at the time. If some other class makes more sense, please suggest it.

Broken encodings Waldenström macroglobulinemia - any smart idea how to handle with no access to the source? Here we need a smart way. What comes to mind is replacing broken chars with regex wildcards like Waldenstr.m macroglobulinemia.

This could be a solution. The question is where this encoding fix should live.

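>>> # Mojibake: the UTF-8 bytes of "ö" (0xC3 0xB6) were decoded as Latin-1, giving "Ã¶"
>>> incorrect_string = "WaldenstrÃ¶m macroglobulinemia"
>>> bytes_string = incorrect_string.encode('latin1')  # recover the original bytes
>>> correct_string = bytes_string.decode('utf-8')  # decode them as UTF-8, as intended
>>> print(correct_string)
Waldenström macroglobulinemia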
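A slightly more defensive sketch that combines both ideas from this item: try the Latin-1/UTF-8 round trip first, and fall back to the regex-wildcard approach when the round trip fails. The function names `repair_mojibake` and `wildcard_pattern` are hypothetical, and the assumption that the damage is always UTF-8 read as Latin-1 may not hold for every source.

```python
import re


def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mis-decoded as Latin-1.

    If the round trip is not possible, the text is returned unchanged.
    """
    try:
        return text.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def wildcard_pattern(text: str) -> re.Pattern:
    """Build a regex in which each run of n non-ASCII characters
    matches any 1..n characters; ASCII parts must match literally."""
    pieces = re.split(r"([^\x00-\x7f]+)", text)
    regex = "".join(
        ".{1,%d}" % len(piece) if index % 2 else re.escape(piece)
        for index, piece in enumerate(pieces)  # odd indices are the non-ASCII runs
    )
    return re.compile(regex, re.IGNORECASE)


# "\u00c3\u00b6" is the two-character mojibake for "\u00f6" (o with diaeresis).
broken = "Waldenstr\u00c3\u00b6m macroglobulinemia"
clean = "Waldenstr\u00f6m macroglobulinemia"

repaired = repair_mojibake(broken)
fallback_hit = wildcard_pattern(broken).fullmatch(clean) is not None
```

The wildcard fallback is deliberately loose, so anything it finds would be a candidate for review rather than a confirmed match.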

obsolete X -[skos:exactMatch]-> X we should match these despite their obsoletion

I think it already does for a few. E.g.:

mondo_exactmatch_icd10cm.tsv

subject_id subject_label predicate_id object_id object_label mapping_justification mapping_tool confidence subject_match_field object_match_field match_string comment
MONDO:0024297 obsolete nutritional or metabolic disease skos:exactMatch ICD10CM:E00-E90 semapv:UnspecifiedMatching MONDO_MAPPINGS

disorder vs disorders (plural wordforms - do not manually implement, use some kind of NLP packages)

An example would help. I tried looking this up but didn't come across unmapped ones that had to be mapped.

other specified X, other unspecified X --[skos:broadMatch]->X

Again, an example would help.

@hrshdhgd preprocessing step in synonymizer: in ICD, if the label is "other X", add broad synonym "X"

This has been done.
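For readers unfamiliar with the rule, a minimal sketch of the idea follows. It is not the actual synonymizer configuration used in this repository; the function name `broad_synonym`, the regex, and the example labels are illustrative assumptions.

```python
import re
from typing import Optional

# Matches "other X", "other specified X" and "other unspecified X".
_OTHER_PREFIX = re.compile(
    r"^other\s+(?:(?:specified|unspecified)\s+)?(?P<rest>.+)$",
    re.IGNORECASE,
)


def broad_synonym(label: str) -> Optional[str]:
    """Return X for labels of the form 'other [specified|unspecified] X',
    or None when the label does not follow that pattern."""
    match = _OTHER_PREFIX.match(label.strip())
    return match.group("rest") if match else None


# Illustrative labels only.
examples = ["Other specified anemias", "Other anemias", "Anemia"]
synonyms = [broad_synonym(label) for label in examples]
```

The returned string would be added as a broad synonym, so that a lexical match on it yields skos:broadMatch rather than skos:exactMatch.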

matentzn commented 8 months ago

@hrshdhgd can you move the entire content of this issue into a well-structured Google Doc (with a headline for each of the lexical-matching optimisations)? I think all the different lexical optimisations need some discussion, and GitHub is terrible for this. Just post the link to the doc here, and I will answer all your questions in there.