Fix the bugs reported in Issues and the referees' report

The following key changes are added:

In `parse_clinvar_xml.py`:

- Adding columns `start`, `stop` and `strand` for variant representation. Fix https://github.com/macarthur-lab/clinvar/issues/36
- Adding columns `pathogenic`, `likely_pathogenic`, `uncertain_significance`, `likely_benign` and `benign` (the standard terms used by the ACMG guidelines) to record the number of individual submissions that reported the variant as "Pathogenic", "Likely pathogenic", "Uncertain significance", "Likely benign" and "Benign" (case-insensitive), respectively. Note that these replace the previous `pathogenic` and `benign` columns, which encoded binary information. Fix https://github.com/macarthur-lab/clinvar/issues/40
- Adding column `scv` to list the SCV accession numbers of all individual submissions.
- Changing column names: all column names with the prefix `measureset` are replaced with `variation`, since the latter is more familiar to ClinVar users.
- `symbol`: using the symbol used in the variant name/title. Fix https://github.com/macarthur-lab/clinvar/issues/37 and https://github.com/macarthur-lab/clinvar/issues/31

In `group_by_allele.py`:

- `pathogenic`, `likely_pathogenic`, `uncertain_significance`, `likely_benign` and `benign` when joining variant_summary.txt file:
- `conflicted`: according to the updated terms used in ClinVar aggregated variation reports, `conflicted` is changed to indicate whether the variation is reported in aggregate as `Conflicting interpretations of pathogenicity`. Fix https://github.com/macarthur-lab/clinvar/issues/40
- `last_evaluated`, `submitters_ordered`, etc. Fix https://github.com/macarthur-lab/clinvar/issues/38
- Remove the duplicated records in variant_summary before joining: the variant_summary file is not `allele_id`-specific. Variants with alternative loci (e.g. in the PAR) or complex variations such as translocations have more than one genomic coordinate but the same `allele_id`, and each alternative locus is recorded as a separate entry in the variant_summary file. I simply remove the duplicated records after extracting the columns of interest from variant_summary. Currently, only one of the sequence locations of these variants is kept after parsing the XML file. There are still problems in handling these types of variants with the current pipeline: e.g. variants in the PAR are represented on the Y chromosome, so their variant info cannot be found in ExAC and gnomAD, and for complex variations such as translocations only one allele is represented in the final output files. Since these are rare cases, I am not sure how to deal with them uniformly. For variants with alternative loci, a separate VCF file is available for download on the ClinVar FTP. Fix https://github.com/macarthur-lab/clinvar/issues/39

In `add_gnomad_field.py` and `add_exac_field.py`:

- Adding `DP` (approximate read depth) so that users can query coverage information.
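For reviewers, here is a rough sketch of how a `DP` value could be pulled out of a VCF INFO string. It assumes that `DP` is read from the INFO column of the ExAC/gnomAD sites VCF; the helper names `parse_info` and `get_dp` and the example INFO string are hypothetical and are not taken from `add_gnomad_field.py` or `add_exac_field.py`.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict; flag entries without '=' map to True."""
    fields = {}
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


def get_dp(info):
    """Return DP as an int, or None if DP is missing or not a plain integer."""
    dp = parse_info(info).get("DP")
    if isinstance(dp, str) and dp.isdigit():
        return int(dp)
    return None


# Made-up INFO string for illustration only.
example_info = "AC=3;AN=121412;DP=1537923;DB"
example_dp = get_dp(example_info)
missing_dp = get_dp("AC=3;AN=121412")
```

Returning `None` when `DP` is absent lets the caller decide how to represent missing coverage in the output table.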
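To illustrate the per-term submission counts described for `parse_clinvar_xml.py`, here is a minimal sketch of case-insensitive counting. The function name `count_significance` and the input format (a plain list of clinical significance strings, one per submission) are assumptions for illustration; the actual script may structure this differently.

```python
# Lower-cased ClinVar terms mapped to the output column names listed in this PR.
SIGNIFICANCE_COLUMNS = {
    "pathogenic": "pathogenic",
    "likely pathogenic": "likely_pathogenic",
    "uncertain significance": "uncertain_significance",
    "likely benign": "likely_benign",
    "benign": "benign",
}


def count_significance(submissions):
    """Count submissions per term, ignoring case and surrounding whitespace."""
    counts = {column: 0 for column in SIGNIFICANCE_COLUMNS.values()}
    for term in submissions:
        column = SIGNIFICANCE_COLUMNS.get(term.strip().lower())
        if column is not None:  # terms outside the five columns are not counted
            counts[column] += 1
    return counts


# Hypothetical submissions for a single variant.
example_counts = count_significance(
    ["Pathogenic", "likely pathogenic", "PATHOGENIC", "drug response"]
)
```

With counts instead of binary flags, a downstream user can tell a variant with one "Pathogenic" submission apart from one with many.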
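The duplicate removal in variant_summary can be sketched roughly as follows: once only the columns of interest are kept, rows that differ only in their genomic location collapse into identical records, and the repeats can be dropped. The function name `drop_duplicate_rows`, the column names and the example rows are all hypothetical; this is not the code used in the pipeline.

```python
def drop_duplicate_rows(rows, columns):
    """Project each row onto `columns` and keep the first copy of each combination."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[column] for column in columns)
        if key not in seen:
            seen.add(key)
            unique.append({column: row[column] for column in columns})
    return unique


# Made-up rows: allele 1001 has two loci (X and Y), as a PAR variant would.
rows = [
    {"allele_id": "1001", "gene_symbol": "SHOX", "chromosome": "X"},
    {"allele_id": "1001", "gene_symbol": "SHOX", "chromosome": "Y"},
    {"allele_id": "1002", "gene_symbol": "BRCA1", "chromosome": "17"},
]
deduped = drop_duplicate_rows(rows, ["allele_id", "gene_symbol"])
```

Note that this only works because the location columns are excluded from the projection; if they were kept, the two rows for allele 1001 would still be distinct.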