macarthur-lab / clinvar

This repo provides tools to convert ClinVar data into a tab-delimited flat file, and also provides that resulting tab-delimited flat file.

List Order Mismatch between Clinical_Sig & All_Submitters #25

Closed raymond301 closed 7 years ago

raymond301 commented 7 years ago

Example: Variant=chr1:976059_C>T ID=RCV000195231

The result in your clinvar_alleles.tsv: clinical_significance="Likely benign;Uncertain significance" all_submitters="Genetic Services Laboratory, University of Chicago;PreventionGenetics"

If you read the two lists in order (which would be useful), "Likely benign" appears to have been submitted by the University of Chicago lab. But that is not the case: https://www.ncbi.nlm.nih.gov/clinvar/RCV000195231/

That's just one example; there are many, many more.

I can see where this comes from: the regex and the XML structure. In the script, at parse_clinvar_xml.py:104:

current_row['all_submitters'] = ';'.join([
    submitter_node.attrib['submitter'].replace(';', ',')
    for submitter_node in elem.findall('.//ClinVarSubmissionID')
    if submitter_node.attrib is not None and submitter_node.attrib.has_key('submitter')
])

The submitters are obtained from a separate node, with no attempt to match them against the nested clinical significance description.

clinical_significance = elem.find('.//ReferenceClinVarAssertion/ClinicalSignificance')
if clinical_significance.find('.//ReviewStatus') is not None:
    current_row['review_status'] = clinical_significance.find('.//ReviewStatus').text
if clinical_significance.find('.//Description') is not None:
    current_row['clinical_significance'] = clinical_significance.find('.//Description').text
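To make the mismatch concrete, here is a small reproduction that runs the same two lookups on a made-up record. The XML is a hypothetical, heavily trimmed stand-in for a ClinVar record, and "Lab A" and "Lab B" are placeholder submitters; I am assuming each submission sits in its own ClinVarAssertion node next to the aggregate ReferenceClinVarAssertion.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed record; real ClinVar XML carries many more fields.
SAMPLE = """
<ClinVarSet>
  <ReferenceClinVarAssertion>
    <ClinicalSignificance>
      <ReviewStatus>criteria provided, conflicting interpretations</ReviewStatus>
      <Description>Likely benign;Uncertain significance</Description>
    </ClinicalSignificance>
  </ReferenceClinVarAssertion>
  <ClinVarAssertion>
    <ClinVarSubmissionID submitter="Lab A"/>
    <ClinicalSignificance>
      <Description>Uncertain significance</Description>
    </ClinicalSignificance>
  </ClinVarAssertion>
  <ClinVarAssertion>
    <ClinVarSubmissionID submitter="Lab B"/>
    <ClinicalSignificance>
      <Description>Likely benign</Description>
    </ClinicalSignificance>
  </ClinVarAssertion>
</ClinVarSet>
"""

elem = ET.fromstring(SAMPLE)

# Submitters are collected from every ClinVarSubmissionID, in document order.
all_submitters = ';'.join(
    node.attrib['submitter'].replace(';', ',')
    for node in elem.findall('.//ClinVarSubmissionID')
    if 'submitter' in node.attrib
)

# Significance is read from the single aggregate node, which has its own order.
clinical_significance = elem.find(
    './/ReferenceClinVarAssertion/ClinicalSignificance/Description'
).text

print(all_submitters)
print(clinical_significance)
```

In this sample the first submitter is "Lab A" and the first significance is "Likely benign", even though "Lab A" submitted "Uncertain significance". Nothing ties the two orderings together.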

If I had a solution worked out, I would make a pull request. But it appears too tricky so far.

raymond301 commented 7 years ago

I created a pull request for this: #28. It does not replace your columns; it simply adds two new columns that report the submitter-specific clinical significance and review status.
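For anyone reading along, the idea of submitter-aligned columns can be sketched roughly as follows. This is not the code from the pull request: the function name `submitter_columns`, the helper `_text`, and the sample XML are all hypothetical, and I am assuming each ClinVarAssertion node holds one ClinVarSubmissionID plus its own ClinicalSignificance.

```python
import xml.etree.ElementTree as ET


def _text(node, path):
    """Text of the first match for path, or '' if it is missing."""
    found = node.find(path)
    if found is None or found.text is None:
        return ''
    # Keep ';' free for use as the list separator.
    return found.text.strip().replace(';', ',')


def submitter_columns(elem):
    """Return three ';'-joined strings whose entries line up by position."""
    submitters, significances, statuses = [], [], []
    # Walk one submission at a time so every value comes from the same node.
    for assertion in elem.findall('.//ClinVarAssertion'):
        sub_id = assertion.find('ClinVarSubmissionID')
        name = sub_id.get('submitter', '') if sub_id is not None else ''
        submitters.append(name.replace(';', ','))
        significances.append(_text(assertion, 'ClinicalSignificance/Description'))
        statuses.append(_text(assertion, 'ClinicalSignificance/ReviewStatus'))
    return ';'.join(submitters), ';'.join(significances), ';'.join(statuses)


# Hypothetical record with placeholder submitters.
record = ET.fromstring("""
<ClinVarSet>
  <ClinVarAssertion>
    <ClinVarSubmissionID submitter="Lab A"/>
    <ClinicalSignificance>
      <ReviewStatus>criteria provided, single submitter</ReviewStatus>
      <Description>Uncertain significance</Description>
    </ClinicalSignificance>
  </ClinVarAssertion>
  <ClinVarAssertion>
    <ClinVarSubmissionID submitter="Lab B"/>
    <ClinicalSignificance>
      <Description>Likely benign</Description>
    </ClinicalSignificance>
  </ClinVarAssertion>
</ClinVarSet>
""")

subs, sigs, stats = submitter_columns(record)
print(subs)
print(sigs)
print(stats)
```

A missing value is written as an empty entry rather than skipped, so position N in each column always refers to the same submission.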