macarthur-lab / clinvar

This repo provides tools to convert ClinVar data into a tab-delimited flat file, and also provides that resulting tab-delimited flat file.

Fix incorrect ExAC parsing and enable using older ClinVar data #18

Closed kristjaneerik closed 8 years ago

kristjaneerik commented 8 years ago

This PR fixes a bug in add_exac_fields.py, which assumes that the ExAC VCF has a particular set of fields in a particular order (given by NEEDED_EXAC_FIELDS). That assumption does not hold for the actual ExAC VCF, so the currently generated clinvar_with_exac.tsv.gz file is incorrect.
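As a rough illustration of the idea behind the fix (not the actual code in add_exac_fields.py), the INFO column can be parsed into a dictionary and the needed values looked up by key, so that field order no longer matters. The function names `parse_info` and `extract_fields` and the sample INFO strings are hypothetical, and multi-allelic (comma-separated) values are left as raw strings here.

```python
def parse_info(info_field):
    """Parse a VCF INFO string such as 'AC=3786;AN=16552;DB' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        # Flag-style entries (no '=') are stored as True.
        info[key] = value if sep else True
    return info


def extract_fields(info_field, needed_fields):
    """Look up the needed INFO values by key, independent of their order."""
    info = parse_info(info_field)
    # Missing keys map to None instead of silently shifting columns.
    return {key: info.get(key) for key in needed_fields}


# Illustrative INFO strings (made up for this sketch): same content, different order.
needed = ["AC", "AN", "AF"]
first = extract_fields("AC=3786;AF=0.229;AN=16552;DB", needed)
second = extract_fields("DB;AN=16552;AC=3786;AF=0.229", needed)
```

Both calls return the same mapping, which is the property the positional approach lacked.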

Looking at some of the ExAC-derived columns in the currently posted dataset:

     chrom       pos ref alt  mut      AC    AN      AF
2        1    949739   G   T  ALT     1.0   1.0     1.0
3        1    955597   G   T  ALT  3786.0  17.0  1654.0
4        1    955619   G   C  ALT   314.0   1.0   262.0
337      1   5934950   C   T  ALT    26.0   1.0    26.0
1032     1  20975547   G   A  ALT     1.0   0.0     1.0
1514     1  40562882   A   T  ALT     9.0   0.0     9.0

The AC column should contain the alternate allele counts and the AN column the total number of alleles, so we expect AC <= AN and AF = AC / AN; however, these numbers don't make sense. In fact, they are simply other columns from the ExAC VCF.
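The expectation AC <= AN and AF = AC / AN can be written as a small sanity check. This is only a sketch: the function name `is_consistent` is hypothetical, and the loose 5% relative tolerance is an assumption chosen because the AF values shown in these tables are rounded.

```python
import math


def is_consistent(ac, an, af, rel_tol=0.05):
    """Return True if AC, AN and AF agree with AC <= AN and AF ~= AC / AN."""
    if an <= 0 or ac < 0 or ac > an:
        return False
    # Loose tolerance (an assumption) because the reported AF is rounded.
    return math.isclose(af, ac / an, rel_tol=rel_tol)


# Row 3 of the currently posted dataset: AC exceeds AN, so the check fails.
bad_row = is_consistent(3786.0, 17.0, 1654.0)
# The same variant as reported with this PR.
good_row = is_consistent(3786.0, 16552.0, 0.229)
```

A check like this cannot catch every mix-up (a row with AC = AN = AF = 1.0 still passes), but it flags most of the rows shown above.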

With this PR, the results with ExAC r0.3.1 are:

     chrom       pos ref alt  mut        AC        AN        AF
2        1    949739   G   T  ALT       1.0  121392.0  0.000008
3        1    955597   G   T  ALT    3786.0   16552.0  0.229000
4        1    955619   G   C  ALT     314.0   23028.0  0.014000
337      1   5935162   A   T  REF   99708.0  120838.0  0.825000
1032     1  21889635   T   C  REF  114692.0  121394.0  0.945000
1514     1  40735817   T   G  REF   12794.0  121400.0  0.105000

This PR also enables specifying the local ClinVar XML and variant summary table filenames. This was required because the latest version of variant_summary.txt.gz is missing the headers expected by join_data.R.

In particular, the headers for 2016-09 (<) compared to 2016-10 (>):

> ClinSigSimple
< HGVS(c.)         # required for join_data.R
< HGVS(p.)         # required for join_data.R
> HGNC_ID
> OriginSimple
< PhenotypeIDs
> PhenotypeIDS
> PhenotypeList
< VariantID        # required for join_data.R
< nsv (dbVar)
> nsv/esv (dbVar)

If this is not a glitch and ClinVar continues to omit the required columns, join_data.R will have to be rewritten to accommodate the new format.
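One way to detect this situation early is to check the header line of the variant summary table before running the join. The sketch below is an assumption about how such a check could look, not code from this repo: `missing_columns` is a hypothetical helper, the required names are the three marked above as needed by join_data.R, and the sample headers are truncated illustrations built only from column names in the comparison above.

```python
REQUIRED_COLUMNS = ["HGVS(c.)", "HGVS(p.)", "VariantID"]


def missing_columns(header_line, required=REQUIRED_COLUMNS):
    """Return the required column names absent from a tab-delimited header."""
    # Tolerate a leading '#' and a trailing newline on the header line.
    columns = header_line.lstrip("#").rstrip("\r\n").split("\t")
    present = set(columns)
    return [name for name in required if name not in present]


# Truncated, illustrative headers (not the full real ones).
old_header = "#HGVS(c.)\tHGVS(p.)\tPhenotypeIDs\tVariantID\n"
new_header = "#ClinSigSimple\tHGNC_ID\tOriginSimple\tPhenotypeIDS\n"

old_missing = missing_columns(old_header)
new_missing = missing_columns(new_header)
```

With the real file, the header could be read with `gzip.open(path, "rt")` and `readline()` before passing it to the helper.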

The --clinvar-xml/-X flag can be used to specify the local ClinVarFullRelease.xml file, and the --clinvar-variant-summary-table/-S flag to specify the local variant_summary.txt.gz file.
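For readers unfamiliar with how such flags are typically wired up, here is a minimal sketch using the standard library's argparse. Only the flag names come from this PR; the use of argparse itself, the help texts, and the `None` defaults (meaning "no local file given") are assumptions, and master.py may define them differently.

```python
import argparse


def build_parser():
    """Build a parser with the two flags described in this PR."""
    parser = argparse.ArgumentParser(
        description="Sketch of the local-file options (assumed layout)."
    )
    parser.add_argument(
        "-X", "--clinvar-xml",
        default=None,  # assumption: None means no local file was given
        help="path to a local ClinVarFullRelease.xml file",
    )
    parser.add_argument(
        "-S", "--clinvar-variant-summary-table",
        default=None,  # assumption, as above
        help="path to a local variant_summary.txt.gz file",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["-X", "ClinVarFullRelease.xml", "-S", "variant_summary.txt.gz"]
)
```

argparse converts the dashes in long option names to underscores, so the values end up in `args.clinvar_xml` and `args.clinvar_variant_summary_table`.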

The datasets included in the PR were generated with the 2016-09 release and ExAC r0.3.1.

One minor tweak: all instances of ; in submitter names are now replaced with , because ; is used as the delimiter in the final all_submitters column.
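The reason for this tweak is easy to see in a small sketch. The helper name `join_submitters` and the sample submitter names are hypothetical; only the delimiter and its replacement come from the description above.

```python
def join_submitters(submitters, delimiter=";", replacement=","):
    """Join submitter names, first replacing the delimiter inside each name."""
    cleaned = [name.replace(delimiter, replacement) for name in submitters]
    return delimiter.join(cleaned)


# Made-up names: the first one contains the delimiter character.
names = ["Lab A; Dept B", "Lab C"]
all_submitters = join_submitters(names)
```

Without the replacement, splitting the joined column on ; would yield three entries for these two submitters.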

Since it wasn't clear from the README that htslib and vt are required, I've added notes on this to the README and explicit checks for the existence of required executables in master.py.
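A check of this kind could look roughly like the following. This is a sketch under assumptions, not the code added to master.py: it uses Python 3's `shutil.which`, the helper names are hypothetical, and the executable names `tabix`, `bgzip` (both shipped with htslib) and `vt` are my guess at what needs to be on the PATH.

```python
import shutil

# Assumed executable names; adjust to whatever the pipeline actually calls.
REQUIRED_EXECUTABLES = ["tabix", "bgzip", "vt"]


def find_missing_executables(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


def check_requirements(names=REQUIRED_EXECUTABLES):
    """Raise a descriptive error if any required executable is missing."""
    missing = find_missing_executables(names)
    if missing:
        raise RuntimeError(
            "Required executables not found on PATH: " + ", ".join(missing)
        )
```

Calling `check_requirements()` at startup makes the pipeline fail immediately with a clear message instead of partway through a long run.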

bw2 commented 8 years ago

Thank you for the PR and these fixes + updates.

Indeed, join_data.R will need to be rewritten, and I don't yet see a good way to make it work with the new format of variant_summary.txt.gz.