This PR fixes a bug in add_exac_fields.py, where currently the ExAC VCF is assumed to have a particular set of fields in a particular order (given by NEEDED_EXAC_FIELDS). This is not the case in reality, and the currently generated clinvar_with_exac.tsv.gz file is incorrect.
Looking at some of the ExAC-derived columns in the currently posted dataset:
      chrom       pos  ref  alt  mut      AC    AN      AF
2         1    949739    G    T  ALT     1.0   1.0     1.0
3         1    955597    G    T  ALT  3786.0  17.0  1654.0
4         1    955619    G    C  ALT   314.0   1.0   262.0
337       1   5934950    C    T  ALT    26.0   1.0    26.0
1032      1  20975547    G    A  ALT     1.0   0.0     1.0
1514      1  40562882    A    T  ALT     9.0   0.0     9.0
The AC column should contain the alternate allele count and the AN column the total number of alleles, so we expect AC <= AN and AF = AC / AN; however, these numbers don't make sense. In fact, they are simply other columns from the ExAC VCF.
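The expectation above can be written down as a small check. This is only a sketch: the function name and the tolerance are illustrative assumptions (the tolerance is loose because the AF values shown here are rounded), and nothing like it is claimed to exist in the repository.

```python
import math


def exac_counts_consistent(ac, an, af, rel_tol=0.05):
    """Return True if AC, AN and AF are mutually plausible.

    Hypothetical helper: checks 0 <= AC <= AN, AN > 0, and that AF is
    close to AC / AN within a relative tolerance.
    """
    if an <= 0 or ac < 0 or ac > an:
        return False
    return math.isclose(af, ac / an, rel_tol=rel_tol)
```

A check like this is necessary but not sufficient: a row in which all three values happen to be 1.0 still passes, so it flags obviously shifted columns rather than proving correctness.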
With this PR, the results with ExAC r0.3.1 are:
      chrom       pos  ref  alt  mut        AC        AN        AF
2         1    949739    G    T  ALT       1.0  121392.0  0.000008
3         1    955597    G    T  ALT    3786.0   16552.0  0.229000
4         1    955619    G    C  ALT     314.0   23028.0  0.014000
337       1   5935162    A    T  REF   99708.0  120838.0  0.825000
1032      1  21889635    T    C  REF  114692.0  121394.0  0.945000
1514      1  40735817    T    G  REF   12794.0  121400.0  0.105000
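For illustration, looking up ExAC INFO entries by key rather than by position could look like the sketch below. The function names and the default field tuple are assumptions made for this example and are not the code in add_exac_fields.py; values are kept as strings, and multi-allelic, comma-separated values are not handled.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'AC=1;AN=121392;DB' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        # Flag-style entries (no '=') are stored as True.
        info[key] = value if sep else True
    return info


def extract_fields(info_field, needed=("AC", "AN", "AF")):
    """Look up each needed key by name; keys that are absent become None."""
    info = parse_info(info_field)
    return {key: info.get(key) for key in needed}
```

Because each value is fetched by name, the result does not depend on the order in which the VCF lists its INFO entries.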
This PR also enables specifying the local ClinVar XML and variant summary table filenames. This was required because the latest version of variant_summary.txt.gz is missing the headers expected by join_data.R.
In particular, the headers for 2016-09 (<) compared to 2016-10 (>):
> ClinSigSimple
< HGVS(c.) # required for join_data.R
< HGVS(p.) # required for join_data.R
> HGNC_ID
> OriginSimple
< PhenotypeIDs
> PhenotypeIDS
> PhenotypeList
< VariantID # required for join_data.R
< nsv (dbVar)
> nsv/esv (dbVar)
If this is not a glitch and ClinVar continues to omit the required columns, join_data.R will have to be rewritten to accommodate the new format.
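Until that is settled, a quick header check can tell whether a given variant_summary.txt.gz still has the columns marked as required in the comparison above. This is a hedged sketch: the function names are hypothetical, and it assumes a tab-separated file whose first line is the header, optionally prefixed with '#'.

```python
import gzip

# Columns flagged above as required for join_data.R.
REQUIRED_COLUMNS = ("HGVS(c.)", "HGVS(p.)", "VariantID")


def missing_columns(header_line, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from a header line."""
    columns = header_line.rstrip("\r\n").lstrip("#").split("\t")
    present = set(columns)
    return [name for name in required if name not in present]


def check_variant_summary(path):
    """Read the first line of a gzipped table and report missing columns."""
    with gzip.open(path, "rt") as handle:
        return missing_columns(handle.readline())
```

An empty list means all required headers are present; anything else names the columns that join_data.R would fail to find.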
Use the --clinvar-xml/-X flag to specify a local ClinVarFullRelease.xml file and the --clinvar-variant-summary-table/-S flag to specify a local variant_summary.txt.gz file.
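As a rough sketch of how two such options can be declared with argparse: the flag names come from this PR, but the dest names, defaults, and help strings below are assumptions made for the example and may differ from what master.py actually uses.

```python
import argparse


def build_parser():
    """Build a parser with the two local-file options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Sketch of the local ClinVar input options.")
    parser.add_argument(
        "-X", "--clinvar-xml", dest="clinvar_xml", default=None,
        help="path to a local ClinVarFullRelease.xml file")
    parser.add_argument(
        "-S", "--clinvar-variant-summary-table",
        dest="clinvar_variant_summary_table", default=None,
        help="path to a local variant_summary.txt.gz file")
    return parser
```

With this shape, leaving either option unset (None) could signal the caller to fall back to its default input, though how master.py handles that case is not shown here.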
The datasets included in the PR were generated with the 2016-09 release and ExAC r0.3.1.
One minor tweak: all instances of ; in submitter names are now replaced with , because ; is used as the delimiter in the final all_submitters column.
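The idea behind this tweak, in a minimal form (the function name is hypothetical and the real code in the pipeline may be structured differently):

```python
def join_submitters(names, delimiter=";", replacement=","):
    """Join submitter names, first replacing the delimiter inside each name."""
    cleaned = [name.replace(delimiter, replacement) for name in names]
    return delimiter.join(cleaned)
```

Without the replacement, a single submitter whose name contains ; would be read back as two submitters when the column is split.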
Since it wasn't clear from the README that htslib and vt are required, I've added notes on this to the README and explicit checks for the existence of required executables in master.py.
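A check of this kind can be sketched with shutil.which. The executable names below are an assumption for illustration (tabix and bgzip ship with htslib, vt is the vt binary); the list actually checked in master.py may differ.

```python
import shutil

# Assumed executable names; adjust to whatever the pipeline really calls.
REQUIRED_EXECUTABLES = ("tabix", "bgzip", "vt")


def find_missing_executables(names=REQUIRED_EXECUTABLES):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]
```

Running this at startup and exiting with a clear message when the list is non-empty avoids a less obvious failure partway through the pipeline.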