monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

ClinVar Disease from submitter's text #281

Open TomConlin opened 8 years ago

TomConlin commented 8 years ago

For each 'RCV', ClinVar provides a disease/trait database and identifier. An RCV is a collection of one or more 'SCV' records, each of which includes a free-text field holding what the submitter indicated as the disease/trait for their variant. I would like to find the disease/trait ontology term Monarch would have preferred they had used. I do not expect all SCVs to remain together in the same RCV groupings.

Some submitter traits are usable as is; others vary a bit. To get the strings that vary a bit into the same pile as a canonical version, I have tried clustering them by computing a threshold for triglyph similarity scores between all pairs of traits.
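A minimal sketch of that kind of clustering, for illustration only. It assumes "triglyph" means character trigram, uses Jaccard overlap as the similarity score, and joins any pair at or above a placeholder threshold of 0.5 (single linkage). None of those choices, nor the function names, come from the actual run; they are stand-ins.

```python
from itertools import combinations


def trigrams(text):
    """Return the set of character trigrams of a lower-cased, space-padded string."""
    s = "  " + " ".join(text.lower().split()) + " "
    return {s[i:i + 3] for i in range(len(s) - 2)}


def similarity(a, b):
    """Jaccard overlap of the two trigram sets (an assumed scoring choice)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def cluster(traits, threshold=0.5):
    """Single-linkage clusters: union every pair scoring >= threshold.

    The default threshold is an arbitrary placeholder, not a tuned value.
    """
    traits = list(dict.fromkeys(traits))  # drop exact duplicates, keep order
    parent = {t: t for t in traits}

    def find(t):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(traits, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for t in traits:
        groups.setdefault(find(t), []).append(t)
    # Skip degenerate clusters of one trait.
    return [sorted(g) for g in groups.values() if len(g) > 1]
```

Comparing all pairs is quadratic in the number of distinct trait strings, so a larger run would likely need blocking or an index. A cluster that comes out too diverse could be passed back through `cluster` with a higher threshold.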

Preliminary clusters may be found here: https://gist.github.com/TomConlin/57a57ea2da96af745c62bd923cd5dca7

Notes: I did not attempt to filter anything out. In particular, I did not check whether ClinVar indicated that a record was for a disease, so some terms are more phenotype than disease.

I also do not include degenerate clusters of one trait (about 2,500 with this threshold). Finally, clusters that are too diverse may be recursively reclustered.

@cmungall

mellybelly commented 8 years ago

We need a MONDO mapping pipeline. @cmungall's code for running the ClinGen reports might be modified for this purpose: https://github.com/monarch-initiative/monarch-disease-ontology/tree/master/src

Would the ideal behavior be a rule that takes into account both the text-mapping confidence of the SCV submitter diseases and the ClinVar xrefs of the parent RCV?

mellybelly commented 8 years ago

MedGen is available here: ftp://ftp.ncbi.nlm.nih.gov/pub/medgen/

mellybelly commented 8 years ago

Just to be clear: if we map MedGen to MONDO, we'd be mapping SCVs to a MONDO term via the RCV ClinVar curation, rather than via the SCV directly.

TomConlin commented 8 years ago

The attached file lists the unique MedGen identifiers associated with RCV traits in the ClinVar XML file, sorted by frequency.

medgen_count.txt
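For anyone wanting to regenerate a count like this, here is a rough sketch; it is not the script that produced the file. It assumes the RCV record is a `ReferenceClinVarAssertion` element inside a `ClinVarSet`, and that MedGen identifiers appear as `TraitSet/Trait/XRef` elements with `DB="MedGen"`. The function name is hypothetical, and each identifier is counted once per RCV.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_medgen_ids(source):
    """Count MedGen IDs on RCV traits; return (id, count) pairs, most frequent first.

    `source` is a filename or a binary file object holding ClinVar XML.
    The element and attribute names below are assumptions about the layout.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "ReferenceClinVarAssertion":
            # Collect into a set so an ID counts once per RCV.
            ids = {
                xref.get("ID")
                for xref in elem.iterfind("TraitSet/Trait/XRef")
                if xref.get("DB") == "MedGen"
            }
            counts.update(ids)
        elif elem.tag == "ClinVarSet":
            # Release the finished record so memory stays bounded on a large file.
            elem.clear()
    return counts.most_common()
```

Restricting the search to `ReferenceClinVarAssertion` keeps xrefs on the SCV records out of the tally, which matches the "associated with RCV traits" wording above.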

cmungall commented 8 years ago

Here's the breakdown by category. The 70 classes in Finding should be phenotypes.

752     Disease-or-Syndrome
74      Neoplastic-Process
70      Finding
41      Congenital-Abnormality
13      Pathologic-Function
9       Mental-or-Behavioral-Dysfunction
4       Sign-or-Symptom
4       Anatomical-Abnormality
3       Gene-or-Genome