Open alecw opened 6 years ago
Dear @alecw, we recently started working on your issue. We reproduced the bug and analyzed the .vcf file that you validated and then converted to the .bcf format. We would like to clarify the order in which you ran the GATK tools to produce the final .vcf file.
If we understood correctly, you first ran HaplotypeCaller (see the documentation here: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_haplotypecaller_HaplotypeCaller.php) to produce a file in the .g.vcf format. You then ran VariantFiltration (see the documentation here: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_filters_VariantFiltration.php) to filter variant calls based on INFO and FORMAT annotations. After that, the original .g.vcf file was converted to the .vcf format. Unfortunately, this conversion is not valid at the moment: the resulting .vcf file still contains attributes of the .g.vcf file.
According to the GATK documentation: “GVCF stands for Genomic VCF. A GVCF is a form of VCF (see the spec documentation here: https://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-agvcf-and-how-is-it-different-from-a-regular-vcf ), but a Genomic VCF contains extra information”. Because ValidateVariants checks only the fields and attributes of a regular .vcf file, your file passes validation, and the error appears only at the conversion stage.
Dear @lbergelson @davidbenjamin, as a possible solution we suggest adding extra validation, at least by file extension, to the ValidateVariants class of GATK (link to GitHub: https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/walkers/variantutils/ValidateVariants.java). This could prevent errors like the one that occurred in this issue.
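To make the suggestion a bit more concrete, here is a minimal sketch of what an extension-based check could look like. It is written in Python for brevity and is not the ValidateVariants code; the function names, file names, and sample records are hypothetical. It assumes that the presence of the <NON_REF> symbolic allele in the ALT column is a usable marker of GVCF content, and it has not been run against the file from this issue, so it may or may not flag that particular file.

```python
NON_REF = "<NON_REF>"  # symbolic allele that GATK writes in GVCF records


def has_gvcf_records(vcf_lines):
    """Return True if any data line lists <NON_REF> among its ALT alleles."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 5:
            continue  # not a full record; ignored in this sketch
        if NON_REF in columns[4].split(","):  # column 5 is ALT
            return True
    return False


def name_disagrees_with_content(filename, vcf_lines):
    """Flag files that contain GVCF-style records but are not named *.g.vcf."""
    named_as_gvcf = filename.endswith((".g.vcf", ".g.vcf.gz"))
    return has_gvcf_records(vcf_lines) and not named_as_gvcf


# Hypothetical records, invented for illustration only.
header = "##fileformat=VCFv4.2"
plain_record = "\t".join(
    ["1", "100", ".", "A", "G", "50", "PASS", ".", "GT:AD", "0/1:12,7"])
gvcf_record = "\t".join(
    ["1", "100", ".", "A", "G,<NON_REF>", "50", "PASS", ".", "GT:AD", "0/1:12,7,0"])

flag_gvcf_in_vcf = name_disagrees_with_content("sample.vcf", [header, gvcf_record])
flag_gvcf_in_gvcf = name_disagrees_with_content("sample.g.vcf", [header, gvcf_record])
flag_plain_in_vcf = name_disagrees_with_content("sample.vcf", [header, plain_record])
print(flag_gvcf_in_vcf, flag_gvcf_in_gvcf, flag_plain_in_vcf)
```

A check like this only looks at the file name and one marker, so it would be a first line of defense rather than a full GVCF validation.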
I'm not on the engine team and Louis is on paternity leave. I'll let @droazen and @jamesemery sort this out.
Looks like there are a few things going on here. The ValidateVariants command above excludes all validation types:
gatk-launch ValidateVariants -R /broad/mccarroll/software/metadata/individual_reference/hg19/hg19.fasta -V buggy.vcf --validationTypeToExclude ALL
so no actual validation was being done there (this should probably issue a warning that no validation is being done).
But the VCF doesn't pass validation; there are problems with both AN and AD. The single variant in the test file has 3 values for AD (but there are only two alleles - ref and one alt); the BCF encoder should probably have detected this, but it doesn't and happily encodes all three values. On decode, the BCF decoder believes the header (which specifies the AD count as R, so one ref and one alt) and only decodes two values, leaving the third value (which happens to be 0) in the stream to be consumed for the next attribute, resulting in the NPE.
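The mismatch described above can be illustrated with a toy model. This is a sketch in Python, not the htsjdk BCF codec and not the real BCF2 byte layout: the flat value stream, the function names, and the AD/DP values are all made up for illustration. It only shows how an encoder that writes every value, paired with a decoder that trusts a header-derived count, leaves a stray value for the next field, and how a simple count check against Number=R would catch it earlier.

```python
def encode(fields):
    """Toy encoder: writes every value of every field, with no length check."""
    stream = []
    for _name, values in fields:
        stream.extend(values)
    return stream


def decode(stream, header_counts):
    """Toy decoder: reads exactly as many values per field as the header implies."""
    position = 0
    decoded = {}
    for name, count in header_counts:
        decoded[name] = stream[position:position + count]
        position += count
    return decoded, stream[position:]  # second item: values nobody consumed


def matches_number_r(n_alleles, values):
    """Number=R means one value per allele, counting the reference allele."""
    return len(values) == n_alleles


alleles = ["A", "G"]                          # ref plus one alt
fields = [("AD", [12, 7, 0]), ("DP", [19])]   # AD has one value too many
header_counts = [("AD", len(alleles)), ("DP", 1)]

stream = encode(fields)
decoded, leftover = decode(stream, header_counts)
ad_is_valid = matches_number_r(len(alleles), fields[0][1])
print(decoded, leftover, ad_is_valid)
```

In this toy run the decoder reads only two AD values, so the extra 0 is handed to DP and the real DP value is left unread. In the real decoder the stray value is interpreted as part of the next attribute, which is where things fall apart.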
Hi @cmnbroad,
Note that if --validationTypeToExclude ALL is removed from the command line, there are still no errors reported.
-Alec
@alecw Right - I'm not sure what the behavior should be in that case - you could argue no validation should be done if you exclude ALL, though that's probably not that helpful. Either way, there is a separate ticket https://github.com/broadinstitute/gatk/issues/5862 to address the GATK tool behavior.
NullPointerException reading BCF
I have a VCF with a single variant line in it (buggy.vcf.txt; note that the .txt extension should be removed to reproduce the problem). It appears to be valid, i.e. it passed ValidateVariants. I then converted it to BCF with VcfFormatConverter, and tried to validate the BCF. I got an NPE reading the BCF.
Your environment
Steps to reproduce
Expected behaviour
Validation of BCF should succeed.
Actual behaviour
NullPointerException validating BCF.
Details of invocations and output