Filter diseases from non-pathogenic variants in ClinVar Tab

monarch-initiative / dipper

Data Ingestion Pipeline for Monarch

https://dipper.readthedocs.io/en/latest/

BSD 3-Clause "New" or "Revised" License


Filter diseases from non-pathogenic variants in ClinVar Tab #331

Closed TomConlin closed 8 years ago

TomConlin commented 8 years ago

We are leaving the existing ClinVar ingest for the GSA release. @kshefchek & @mbrush noted that we are attaching diseases to variants with benign clinical significance. We should only associate diseases with variants designated 'pathogenic' or 'likely pathogenic'.
The first task is to decide which of their free-text strings correspond to these conditions. This file has the strings roughly sorted, with those more likely to be associated with disease towards the beginning:
CVTab_clin_sig.txt

@mbrush, please formalize this list into the strings that will and the strings that won't be associated with a disease.
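
Once the list is formalized, the filter itself could be as small as the sketch below. It assumes each clinical significance value arrives as a single free-text string and that only the two designations named above qualify. The function name `is_disease_associated` and the sample strings are hypothetical, for illustration only; they are not taken from CVTab_clin_sig.txt or from dipper.

```python
# Designations named in this issue as warranting a disease association.
# The final allow-list is an assumption until the strings are formalized.
DISEASE_ASSOCIATED = {"pathogenic", "likely pathogenic"}


def is_disease_associated(clin_sig: str) -> bool:
    """Return True only on an exact, case-insensitive match.

    Exact matching is deliberate: a substring test for "pathogenic"
    would also accept strings such as
    "conflicting interpretations of pathogenicity".
    """
    return clin_sig.strip().lower() in DISEASE_ASSOCIATED


# Hypothetical sample strings.
samples = [
    "Pathogenic",
    "Likely pathogenic",
    "Benign",
    "Uncertain significance",
    "conflicting interpretations of pathogenicity",
]
kept = [s for s in samples if is_disease_associated(s)]
```

An allow-list keeps any unreviewed string out by default, which seems the safer failure mode when the goal is to stop attaching diseases to benign variants.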

kshefchek commented 8 years ago

Looking at the TSV file, there's no way to disentangle variant-pathogenicity-disease relationships in cases where there are many-to-many relationships. I think our best bet is to get the XML parser up and running.

mbrush commented 8 years ago

Yes, the TSV munges all clinsigs and diseases associated with a given variant into single cells. Kent and I tried various ways to map each clinsig term for a given variant to the relevant disease, but concluded this was not possible. As odd as it sounds, the TSV dump just won't allow us to unambiguously map a variant to its pathogenicity call for a specific disease.
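
To illustrate the ambiguity, here is a toy sketch. It assumes the dump collapses a variant's per-disease calls into one clinsig cell and one disease cell, joined by a separator. The `;` separator, the `flatten` helper, and the disease names are all hypothetical and are not taken from the actual ClinVar file.

```python
def flatten(assertions, sep=";"):
    """Collapse (clinsig, disease) pairs into two cells, as a tab dump might.

    The pairing between each clinsig and its disease is discarded.
    """
    clinsigs = sorted({clinsig for clinsig, _ in assertions})
    diseases = sorted({disease for _, disease in assertions})
    return sep.join(clinsigs), sep.join(diseases)


# Two different sets of per-disease calls for the same variant.
case_1 = [("Pathogenic", "Disease A"), ("Benign", "Disease B")]
case_2 = [("Benign", "Disease A"), ("Pathogenic", "Disease B")]

row_1 = flatten(case_1)
row_2 = flatten(case_2)

# Both cases yield the same row, so the row alone cannot say
# which disease the "Pathogenic" call belongs to.
indistinguishable = row_1 == row_2
```

Because both cases flatten to an identical row, no filter applied to the row can recover which disease carries the pathogenic call.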

As a result, we will need to get the XML ingest into SciGraph, as Kent suggests above. I will add modeling to the cmap for any new data elements we need to add to the XML ingest to replicate TSV coverage, and @TomConlin can update the dipper script accordingly.
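
A rough sketch of why a per-assertion format helps, assuming each assertion carries its own clinsig and disease together. The element names (`Variant`, `Assertion`, `ClinSig`, `Disease`) are simplified placeholders and are not the real ClinVar XML schema; `disease_links` is likewise a hypothetical helper, not part of dipper.

```python
import xml.etree.ElementTree as ET

# Toy document with placeholder element names (NOT the ClinVar schema).
TOY_XML = """
<Variant id="v1">
  <Assertion>
    <ClinSig>Pathogenic</ClinSig>
    <Disease>Disease A</Disease>
  </Assertion>
  <Assertion>
    <ClinSig>Benign</ClinSig>
    <Disease>Disease B</Disease>
  </Assertion>
</Variant>
"""

KEEP = {"pathogenic", "likely pathogenic"}


def disease_links(xml_text):
    """Return (variant id, clinsig, disease) for qualifying assertions only."""
    root = ET.fromstring(xml_text)
    links = []
    for assertion in root.findall("Assertion"):
        clinsig = assertion.findtext("ClinSig", default="").strip()
        disease = assertion.findtext("Disease", default="").strip()
        # Each call stays paired with its own disease, so filtering is unambiguous.
        if clinsig.lower() in KEEP:
            links.append((root.get("id"), clinsig, disease))
    return links


links = disease_links(TOY_XML)
```

Here only the pathogenic assertion yields a disease link, and the benign one is dropped without affecting it.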

TomConlin commented 8 years ago

Glad I found it difficult as well. Closing, as we will not update the tab file ingest.