Open TomConlin opened 8 years ago
We need a MONDO mapping pipeline. @cmungall's code for running the reports for ClinGen might be adapted for this purpose. https://github.com/monarch-initiative/monarch-disease-ontology/tree/master/src
Would the ideal behavior be a rule that takes into account both the text-mapping confidence of the SCV submitter diseases and the ClinVar xrefs on the parent RCV?
MedGen is available here: ftp://ftp.ncbi.nlm.nih.gov/pub/medgen/
Just to be clear, if we map MedGen to MONDO, we'd be mapping SCVs to a MONDO term via the RCV ClinVar curation, rather than mapping the SCV directly.
The file has the unique MedGen identifiers found associated with RCV traits in the ClinVar XML file, sorted by frequency.
Here's the breakdown by category. The 70 classes in Finding should be phenotypes.
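To illustrate the indirect route described above, here is a minimal sketch. It assumes three lookup tables (SCV to RCV, RCV to MedGen, MedGen to MONDO) have already been built; the table names, the function name, and every identifier are made-up placeholders, not real records or an existing API.

```python
# Hypothetical lookup tables; all identifiers are placeholders, not real records.
SCV_TO_RCV = {"SCV_A": "RCV_1", "SCV_B": "RCV_1", "SCV_C": "RCV_2"}
RCV_TO_MEDGEN = {"RCV_1": "MEDGEN_X", "RCV_2": "MEDGEN_Y"}
MEDGEN_TO_MONDO = {"MEDGEN_X": "MONDO_P"}  # MEDGEN_Y deliberately left unmapped


def mondo_for_scv(scv):
    """Resolve an SCV to a MONDO term through its parent RCV.

    Returns None when any hop in the chain is missing.
    """
    rcv = SCV_TO_RCV.get(scv)
    medgen = RCV_TO_MEDGEN.get(rcv)
    return MEDGEN_TO_MONDO.get(medgen)


if __name__ == "__main__":
    for scv in ("SCV_A", "SCV_B", "SCV_C"):
        print(scv, mondo_for_scv(scv))
```

Note that every SCV under the same RCV gets the same MONDO term this way, regardless of what the submitter actually wrote.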
752 Disease-or-Syndrome
74 Neoplastic-Process
70 Finding
41 Congenital-Abnormality
13 Pathologic-Function
9 Mental-or-Behavioral-Dysfunction
4 Sign-or-Symptom
4 Anatomical-Abnormality
3 Gene-or-Genome
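A breakdown like the one above could be tallied roughly as follows. This is only a sketch: it assumes a dictionary from MedGen identifier to category is already available, and the function name and the identifiers in the example are hypothetical placeholders.

```python
from collections import Counter


def category_breakdown(medgen_ids, category_of):
    """Count unique MedGen ids per category, most frequent first.

    medgen_ids  -- iterable of MedGen identifiers (duplicates are ignored)
    category_of -- dict mapping identifier to category name (assumed input)
    """
    unique_ids = sorted(set(medgen_ids))
    counts = Counter(category_of.get(m, "Unknown") for m in unique_ids)
    return counts.most_common()


if __name__ == "__main__":
    # Placeholder identifiers, not real MedGen records.
    category_of = {
        "ID1": "Disease-or-Syndrome",
        "ID2": "Disease-or-Syndrome",
        "ID3": "Finding",
    }
    for name, n in category_breakdown(["ID1", "ID2", "ID2", "ID3"], category_of):
        print(n, name)
```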
ClinVar provides a disease/trait database and identifier for each 'RCV', which is a collection of one or more 'SCV's, each of which includes a free-text field with what the submitter indicated as the disease/trait for their variant. I would like to find the disease/trait ontology term Monarch would have preferred they had used. The expectation is that not all SCVs will remain together in the same RCV groupings.
Some submitter traits are usable as is; others vary a bit. To get the strings that vary a bit into the same pile as a canonical version, I have tried clustering by applying a threshold to triglyph similarity scores between all pairs of traits.
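For readers unfamiliar with the approach, here is a rough sketch of threshold clustering on three-character similarity. It is not the code used to produce the clusters: the Jaccard score, the single-linkage grouping, the 0.5 threshold, and all function names are assumptions chosen for illustration.

```python
from itertools import combinations


def trigrams(text):
    """Return the set of 3-character substrings of a normalized, padded string."""
    s = " ".join(text.lower().split())
    padded = f"  {s} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Jaccard similarity of the two trigram sets (an assumed scoring choice)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def cluster(traits, threshold=0.5):
    """Group traits whose pairwise similarity meets the threshold.

    Uses single linkage via union-find; clusters of one trait are dropped.
    """
    parent = list(range(len(traits)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(traits)), 2):
        if similarity(traits[i], traits[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, trait in enumerate(traits):
        groups.setdefault(find(i), []).append(trait)
    return [sorted(g) for g in groups.values() if len(g) > 1]


if __name__ == "__main__":
    traits = ["Cystic fibrosis", "cystic fibrosis", "Cystic Fibrosis (CF)",
              "Marfan syndrome", "Breast cancer"]
    print(cluster(traits))
```

Comparing all pairs is quadratic in the number of traits, so a larger trait list would likely need some form of blocking or indexing first.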
Preliminary clusters may be found here: https://gist.github.com/TomConlin/57a57ea2da96af745c62bd923cd5dca7
Notes: I did not attempt to filter anything out. In particular, I did not filter on whether ClinVar indicated that the record was for a disease, so there are terms which are more phenotype than disease.
I also do not include degenerate clusters of one trait (about 2,500 with this threshold). Finally, clusters which are too diverse may be recursively reclustered.
@cmungall