macarthur-lab / clinvar

This repo provides tools to convert ClinVar data into a tab-delimited flat file, and also provides that resulting tab-delimited flat file.

Fix_Aug2017 #41

Closed XiaoleiZ closed 7 years ago

XiaoleiZ commented 7 years ago

Fixes the bugs reported in the Issues and in the referees' report.

The following key changes are added:

In parse_clinvar_xml.py

  1. Adding columns start, stop and strand for variant representation. Fixes https://github.com/macarthur-lab/clinvar/issues/36
  2. Adding columns pathogenic, likely_pathogenic, uncertain_significance, likely_benign and benign (the standard terms used by the ACMG guidelines) to record the number of individual submissions that reported the variant as "Pathogenic", "Likely pathogenic", "Uncertain significance", "Likely benign" and "Benign" respectively (case-insensitive). Note that these replace the previous pathogenic and benign columns, which encoded binary information. Fixes https://github.com/macarthur-lab/clinvar/issues/40
  3. Adding column scv to list the SCV accession numbers of all individual submissions.
  4. Changing column names: every column name with the prefix measureset now uses variation instead, since the latter is more familiar to ClinVar users.
  5. Changing how the gene symbol is extracted: it now uses the symbol that appears in the variant name/title. Fixes https://github.com/macarthur-lab/clinvar/issues/37 and https://github.com/macarthur-lab/clinvar/issues/31
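
For reviewers, here is a minimal sketch of the per-term counting described in item 2. It is an illustration under assumptions, not the code in parse_clinvar_xml.py: the function name count_significance_terms and the TERM_TO_COLUMN mapping are hypothetical, and only the five column names come from the description above.

```python
# Hypothetical mapping from a lower-cased ClinVar term to its output column.
# Only the column names are taken from the PR description.
TERM_TO_COLUMN = {
    "pathogenic": "pathogenic",
    "likely pathogenic": "likely_pathogenic",
    "uncertain significance": "uncertain_significance",
    "likely benign": "likely_benign",
    "benign": "benign",
}


def count_significance_terms(submitted_terms):
    """Count submissions per term, ignoring case and surrounding whitespace."""
    counts = {column: 0 for column in TERM_TO_COLUMN.values()}
    for term in submitted_terms:
        column = TERM_TO_COLUMN.get(term.strip().lower())
        if column is not None:
            # Terms outside the five categories (e.g. "drug response") are skipped.
            counts[column] += 1
    return counts
```

Calling count_significance_terms(["Pathogenic", "pathogenic", "Likely benign"]) would give a count of 2 for pathogenic and 1 for likely_benign, with the other columns left at 0.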

In group_by_allele.py:

  1. Adding the counts for each term in pathogenic, likely_pathogenic, uncertain_significance, likely_benign and benign
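
If "adding the counts" is read as summing the per-record counts for records that share an allele, the idea could be sketched as below. This is an assumption-laden illustration, not the code in group_by_allele.py: the grouping key (chrom, pos, ref, alt), the record layout as dictionaries, and the function name sum_counts_by_allele are all hypothetical.

```python
from collections import defaultdict

COUNT_COLUMNS = [
    "pathogenic",
    "likely_pathogenic",
    "uncertain_significance",
    "likely_benign",
    "benign",
]


def sum_counts_by_allele(records):
    """Sum the five count columns over records that share the same allele key."""
    totals = defaultdict(lambda: dict.fromkeys(COUNT_COLUMNS, 0))
    for record in records:
        # Assumed allele key; the real script may group on different fields.
        key = (record["chrom"], record["pos"], record["ref"], record["alt"])
        for column in COUNT_COLUMNS:
            totals[key][column] += int(record[column])
    return dict(totals)
```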

When joining the variant_summary.txt file:

  1. Replacing the R script with a Python equivalent. Fixes https://github.com/macarthur-lab/clinvar/issues/35
  2. Changing how the conflicted column is encoded: following the updated terms used in ClinVar aggregated variation reports, conflicted now indicates whether the variation's aggregate report is Conflicting interpretations of pathogenicity. Fixes https://github.com/macarthur-lab/clinvar/issues/40
  3. Propagating columns such as last_evaluated, submitters_ordered, etc. Fixes https://github.com/macarthur-lab/clinvar/issues/38
  4. Removing duplicated records in variant_summary before joining: the variant_summary file is not allele_id-specific. Variants with alternative loci (e.g. in the PAR, the pseudoautosomal region) or complex variations such as translocations have more than one set of genomic coordinates but the same allele_id, and each alternative locus is recorded as a separate entry in the variant_summary file. I simply remove the duplicated records after extracting the columns of interest from variant_summary. Currently, only one of the sequence locations of these variants is kept after parsing the XML file. The current pipeline still has problems handling these types of variants: e.g. variants in the PAR are represented on the Y chromosome, so their variant info cannot be found in ExAC and gnomAD, and for complex variations like translocations, only one allele is represented in the final output files. Since these are rare cases, I am not sure how to deal with them uniformly. For variants with alternative loci, a separate VCF file is available for download on the ClinVar FTP. Fixes https://github.com/macarthur-lab/clinvar/issues/39
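
Items 2 and 4 can be illustrated with a short sketch. It is not the join script from this PR: the helper names drop_duplicate_rows and is_conflicted, the dictionary-per-row layout, and the 0/1 encoding of the flag are assumptions made for illustration.

```python
def drop_duplicate_rows(rows, columns):
    """Keep only the columns of interest and drop rows that repeat after that.

    Alternative loci differ only in coordinates, so once the coordinate
    columns are left out the extra entries collapse into one row.
    """
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row[column] for column in columns)
        if key not in seen:
            seen.add(key)
            unique_rows.append({column: row[column] for column in columns})
    return unique_rows


def is_conflicted(clinical_significance):
    """Return 1 if the aggregate report is a conflicting interpretation, else 0."""
    # Assumed encoding: integer flag, case-insensitive substring match.
    target = "conflicting interpretations of pathogenicity"
    return int(target in clinical_significance.lower())
```

The first matching row wins in drop_duplicate_rows, which mirrors the "just remove the duplicates" approach described above and does nothing special for PAR variants or translocations.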

In add_gnomad_field.py and add_exac_field.py:

  1. Adding DP (approximate read depth) so that users can query coverage information
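
As a rough illustration of what pulling DP out of a VCF record involves, the sketch below reads one key from a semicolon-separated INFO string. It is not the code in add_gnomad_field.py or add_exac_field.py; the helper name get_info_value is hypothetical and the scripts may read the field in a different way.

```python
def get_info_value(info_field, key):
    """Return the value of `key` in a VCF INFO string, or None if it is absent."""
    for entry in info_field.split(";"):
        name, separator, value = entry.partition("=")
        if name == key and separator:
            return value
    # Missing keys, and flag-style entries without a value, both give None.
    return None
```

For example, get_info_value("AC=3;DP=1520;AN=2000", "DP") returns the string "1520", which a caller would then convert to an integer.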

konradjk commented 7 years ago

FWIW we just ran this code as-is and it worked totally fine! Might want to merge since master definitely does not work on the current clinvar xml

bw2 commented 7 years ago

@XiaoleiZ should we merge this into master?