Closed TomConlin closed 8 years ago
Looking at the TSV file, there's no way to disentangle variant - pathogenicity - disease relationships in cases where there are many-to-many relationships. I think our best bet is to get the XML parser up and running.
Yes, the TSV munges all clinsigs and diseases associated with a given variant into single cells. Kent and I tried various ways to map each clinsig term for a given variant to the relevant disease, but concluded this was not possible. As odd as it sounds, the TSV dump just won't allow us to unambiguously map a variant to its pathogenicity call for a specific disease.
As a result, we will need to get the XML ingest into SciGraph, as Kent suggests above. I will add modeling to the cmap for any new data elements we need to add to the XML ingest to replicate TSV coverage, and @TomConlin can update the dipper script accordingly.
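To make the ambiguity concrete, here is a minimal sketch of what happens when one variant row carries several clinsig terms and several diseases in single cells. The column names, the `;` separator, and the row values are invented for illustration and are not the actual ClinVar TSV layout.

```python
from itertools import product


def candidate_pairings(clinsig_cell, disease_cell, sep=";"):
    """Return every (clinsig, disease) pairing a flattened row is consistent with.

    The separator is an assumption; the point is only that nothing in the
    two cells says which clinsig belongs to which disease.
    """
    clinsigs = [s.strip() for s in clinsig_cell.split(sep) if s.strip()]
    diseases = [d.strip() for d in disease_cell.split(sep) if d.strip()]
    return list(product(clinsigs, diseases))


# Hypothetical row: two clinsig terms, three diseases, all for one variant.
row = {
    "clinsig": "Pathogenic;Benign",
    "diseases": "Disease A;Disease B;Disease C",
}

pairs = candidate_pairings(row["clinsig"], row["diseases"])
for clinsig, disease in pairs:
    print(clinsig, "<->", disease)
```

Every one of the printed pairings is compatible with the row, so the flattened cells alone cannot tell us which call applies to which disease.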
Glad I found it difficult as well. Closing, as we will not update the tab file ingest.
We are leaving the existing ClinVar ingest for the GSA release. @kshefchek & @mbrush noted we are attaching diseases to variants with benign clinical significance. We should only associate diseases with variants that have 'pathogenic' or 'likely pathogenic' designations.
The first task is to decide which of their free-text strings correspond to these designations. This file has the strings roughly sorted, with those more likely to be associated with disease towards the beginning:
CVTab_clin_sig.txt
@mbrush please formalize this list into those strings which will and won't be associated with a disease.
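Once the list is formalized, the check could look roughly like the sketch below. The allow-list here contains only the two designations named above; the function name and the case/whitespace normalization are assumptions, and the real set of accepted strings would come from the formalized list.

```python
# Placeholder allow-list: only the two designations named in this thread.
# The formalized list may add or reword entries.
DISEASE_ASSOCIATED = {"pathogenic", "likely pathogenic"}


def should_attach_disease(clinsig):
    """Return True if this clinsig string should yield a variant-disease association."""
    # Normalizing case and surrounding whitespace is an assumption about
    # how the free-text strings vary.
    return clinsig.strip().lower() in DISEASE_ASSOCIATED


for term in ["Pathogenic", "Likely pathogenic", "Benign", "Uncertain significance"]:
    print(term, "->", should_attach_disease(term))
```

Using an explicit allow-list rather than a substring match means a string that is not on the list never produces a disease association by accident.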