konradjk / exac_browser

Browser for ExAC consortium data
http://exac.broadinstitute.org
MIT License

Discrepancy between Website and VCF File for download #182

Closed: Steven-N-Hart closed 9 years ago

Steven-N-Hart commented 9 years ago

All, I found a discrepancy between how variants are reported on the website and in the VCF. The example on the website is here. This is the correct nomenclature for the variant. However, in the VCF the REF allele looks malformed, because the site is multi-allelic and the REF allele of one variant completely overlaps the REF allele of another.

From the r.03 VCF:

The REF allele for the first ALT is CTCACAGACTGATGACTCACAGGGGTCACAGACTGATGACCCACAGGGGTCAGGGTCTTTTCCCCAGGGG, but for the second ALT (C) the REF should be CTCACAGACTGATGACTCACAGGGG.

konradjk commented 9 years ago

The example you point out is a tricky one, but it is consistent. The page you linked is actually the first alternate allele (deletion of TCACAGACTGATGACTCACAGGGG). This sort of overlap does happen when indels overlap within a multi-allelic variant. When simplifying the variants by allele, we use a script to create what we call a minimal representation (described here: http://www.cureffi.org/2014/04/24/converting-genetic-variants-to-their-minimal-representation/ ). With that, the first allele comes out as CTCACAGACTGATGACTCACAGGGG -> C (this can be recovered by trimming the shared bases from each side). The second alt allele (C) is then http://exac.broadinstitute.org/variant/1-111957501-CTCACAGACTGATGACTCACAGGGGTCACAGACTGATGACCCACAGGGGTCAGGGTCTTTTCCCCAGGGG-C (deletion of the entire section), while the third is simply a SNP (http://exac.broadinstitute.org/variant/1-111957501-C-A).

While this occasionally creates horrific entries in the VCF, these are actually valid representations: for instance, GA -> G,GAA would be valid, even though the 2nd alt allele is actually A->AA. I've added a short description of this to the FAQ.
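
For anyone who wants to reproduce the trimming described above, here is a minimal Python sketch of a per-allele "minimal representation". It is not the ExAC browser's actual script. The function name `minimal_representation` is made up for illustration, and the order of trimming (shared suffix first, then shared prefix) is an assumption. The un-trimmed ALT strings are also reconstructed from the alleles named in this thread, since the original VCF line is not quoted here.

```python
def minimal_representation(pos, ref, alt):
    """Trim bases shared by one REF/ALT pair; return (pos, ref, alt)."""
    # Trim the shared suffix first, always leaving at least one base per allele.
    while min(len(ref), len(alt)) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim the shared prefix, shifting the position for each base removed.
    while min(len(ref), len(alt)) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


POS = 111957501
REF = "CTCACAGACTGATGACTCACAGGGGTCACAGACTGATGACCCACAGGGGTCAGGGTCTTTTCCCCAGGGG"
DELETED = "TCACAGACTGATGACTCACAGGGG"  # bases removed by the first alt allele

# ASSUMPTION: these are the ALT strings as they would appear against the
# shared multi-allelic REF; they are reconstructed, not copied from the VCF.
ALTS = [
    REF[0] + REF[1 + len(DELETED):],  # first alt: deletion of DELETED
    "C",                              # second alt: deletion of the entire section
    "A" + REF[1:],                    # third alt: SNP at the first base
]

for alt in ALTS:
    print(minimal_representation(POS, REF, alt))
```

Under these assumptions the first allele reduces to CTCACAGACTGATGACTCACAGGGG -> C, the second keeps the full REF, and the third reduces to C -> A, all at the same position. With suffix-first trimming, the GA -> GAA allele from the small example reduces to G -> GA, which describes the same one-base insertion as A -> AA.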