samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
285 stars 243 forks source link

NullPointerException reading BCF #1117

Open alecw opened 6 years ago

alecw commented 6 years ago

NullPointerException reading BCF

I have a VCF with a single variant line in it (buggy.vcf.txt Note that .txt extension should be removed to reproduce problem). It appears to be valid, i.e. it passed ValidateVariants. I then converted to BCF with VcfFormatConverter, and tried to validate the BCF. I got an NPE reading the BCF.

Your environment

Steps to reproduce

Tell us how to reproduce this issue. If possible, include a short code snippet to demonstrate the problem.

  1. Validate VCF
  2. Convert VCF to BCF
  3. Validate BCF

Expected behaviour

Validation of BCF should succeed.

Actual behaviour

NullPointerException validating VCF.

Details of invocations and output

% gatk-launch ValidateVariants -R /broad/mccarroll/software/metadata/individual_reference/hg19/hg19.fasta -V buggy.vcf --validationTypeToExclude ALL
Using GATK jar /home/unix/nemesh/seq_tools/gatk-4.beta.5/gatk-package-4.beta.5-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true -jar /home/unix/nemesh/seq_tools/gatk-4.beta.5/gatk-package-4.beta.5-local.jar ValidateVariants -R /broad/mccarroll/software/metadata/individual_reference/hg19/hg19.fasta -V buggy.vcf --validationTypeToExclude ALL
16:13:48.426 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/unix/nemesh/seq_tools/gatk-4.beta.5/gatk-package-4.beta.5-local.jar!/com/intel/gkl/native/libgkl_compression.so
[May 16, 2018 4:13:48 PM EDT] ValidateVariants  --validationTypeToExclude ALL --variant buggy.vcf --reference /broad/mccarroll/software/metadata/individual_reference/hg19/hg19.fasta  --doNotValidateFilteredRecords false --warnOnErrors false --validateGVCF false --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --interval_merging_rule ALL --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 40 --cloudIndexPrefetchBuffer -1 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --gcs_max_retries 20 --disableToolDefaultReadFilters false
[May 16, 2018 4:13:48 PM EDT] Executing as alecw@sv01.broadinstitute.org on Linux 2.6.32-696.13.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13; Version: 4.beta.5
16:13:48.643 INFO  ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 1
16:13:48.643 INFO  ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:13:48.648 INFO  ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:13:48.648 INFO  ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:13:48.649 INFO  ValidateVariants - Deflater: IntelDeflater
16:13:48.649 INFO  ValidateVariants - Inflater: IntelInflater
16:13:48.649 INFO  ValidateVariants - GCS max retries/reopens: 20
16:13:48.649 INFO  ValidateVariants - Using google-cloud-java patch c035098b5e62cb4fe9155eff07ce88449a361f5d from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
16:13:48.649 INFO  ValidateVariants - Initializing engine
16:13:49.393 INFO  FeatureManager - Using codec VCFCodec to read file file:///humgen/cnp04/sandbox/alecw/dropseq_runs/dropulation_census/buggy.vcf
16:13:49.453 INFO  FeatureManager - Using codec VCFCodec to read file file:///humgen/cnp04/sandbox/alecw/dropseq_runs/dropulation_census/buggy.vcf
16:13:49.488 INFO  ValidateVariants - Done initializing engine
16:13:49.488 INFO  ProgressMeter - Starting traversal
16:13:49.488 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
16:13:49.557 INFO  ProgressMeter -             unmapped              0.0                     1            882.4
16:13:49.557 INFO  ProgressMeter - Traversal complete. Processed 1 total variants in 0.0 minutes.
16:13:49.557 INFO  ValidateVariants - Shutting down engine
[May 16, 2018 4:13:49 PM EDT] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=171442176

% java -jar picard-2.18.4.jar  VcfFormatConverter i=buggy.vcf o=buggy.bcf REQUIRE_INDEX=false
16:08:31.191 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/humgen/cnp04/sandbox/alecw/dropseq_runs/dropulation_census/picard-2.18.4.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed May 16 16:08:31 EDT 2018] VcfFormatConverter INPUT=buggy.vcf OUTPUT=buggy.bcf REQUIRE_INDEX=false    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=true CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Wed May 16 16:08:31 EDT 2018] Executing as alecw@sv01.broadinstitute.org on Linux 2.6.32-696.13.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.4-SNAPSHOT
[Wed May 16 16:08:31 EDT 2018] picard.vcf.VcfFormatConverter done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=117440512

% gatk-launch ValidateVariants -R /broad/mccarroll/software/metadata/individual_reference/hg19/hg19.fasta -V buggy.bcf --validationTypeToExclude ALL
Using GATK jar /home/unix/nemesh/seq_tools/gatk-4.beta.5/gatk-package-4.beta.5-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true -jar /home/unix/nemesh/seq_tools/gatk-4.beta.5/gatk-package-4.beta.5-local.jar ValidateVariants -R /broad/mccarroll/software/metadata/individual_reference/hg19/hg19.fasta -V buggy.bcf --validationTypeToExclude ALL
16:08:46.281 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/unix/nemesh/seq_tools/gatk-4.beta.5/gatk-package-4.beta.5-local.jar!/com/intel/gkl/native/libgkl_compression.so
[May 16, 2018 4:08:46 PM EDT] ValidateVariants  --validationTypeToExclude ALL --variant buggy.bcf --reference /broad/mccarroll/software/metadata/individual_reference/hg19/hg19.fasta  --doNotValidateFilteredRecords false --warnOnErrors false --validateGVCF false --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --interval_merging_rule ALL --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 40 --cloudIndexPrefetchBuffer -1 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --gcs_max_retries 20 --disableToolDefaultReadFilters false
[May 16, 2018 4:08:46 PM EDT] Executing as alecw@sv01.broadinstitute.org on Linux 2.6.32-696.13.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13; Version: 4.beta.5
16:08:46.422 INFO  ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 1
16:08:46.422 INFO  ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:08:46.422 INFO  ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:08:46.422 INFO  ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:08:46.422 INFO  ValidateVariants - Deflater: IntelDeflater
16:08:46.422 INFO  ValidateVariants - Inflater: IntelInflater
16:08:46.422 INFO  ValidateVariants - GCS max retries/reopens: 20
16:08:46.422 INFO  ValidateVariants - Using google-cloud-java patch c035098b5e62cb4fe9155eff07ce88449a361f5d from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
16:08:46.422 INFO  ValidateVariants - Initializing engine
16:08:46.971 INFO  FeatureManager - Using codec BCF2Codec to read file file:///humgen/cnp04/sandbox/alecw/dropseq_runs/dropulation_census/buggy.bcf
16:08:47.012 INFO  FeatureManager - Using codec BCF2Codec to read file file:///humgen/cnp04/sandbox/alecw/dropseq_runs/dropulation_census/buggy.bcf
16:08:47.036 INFO  ValidateVariants - Done initializing engine
16:08:47.036 INFO  ProgressMeter - Starting traversal
16:08:47.036 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
16:08:47.044 INFO  ValidateVariants - Shutting down engine
[May 16, 2018 4:08:47 PM EDT] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=175636480
java.lang.NullPointerException
    at htsjdk.variant.bcf2.BCF2LazyGenotypesDecoder.parse(BCF2LazyGenotypesDecoder.java:75)
    at htsjdk.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:158)
    at htsjdk.variant.bcf2.BCF2Codec.createLazyGenotypesDecoder(BCF2Codec.java:419)
    at htsjdk.variant.bcf2.BCF2Codec.decode(BCF2Codec.java:131)
    at htsjdk.variant.bcf2.BCF2Codec.decode(BCF2Codec.java:58)
    at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:365)
    at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.<init>(TribbleIndexedFeatureReader.java:334)
    at htsjdk.tribble.TribbleIndexedFeatureReader.iterator(TribbleIndexedFeatureReader.java:301)
    at org.broadinstitute.hellbender.engine.FeatureDataSource.iterator(FeatureDataSource.java:419)
    at java.lang.Iterable.spliterator(Iterable.java:101)
    at org.broadinstitute.hellbender.engine.VariantWalker.getSpliteratorForDrivingVariants(VariantWalker.java:37)
    at org.broadinstitute.hellbender.engine.VariantWalkerBase.traverse(VariantWalkerBase.java:106)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:838)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:119)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:176)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:195)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
    at org.broadinstitute.hellbender.Main.main(Main.java:233)
SHvatov commented 5 years ago

Dear @alecw, We have started working on your issue recently. During the work, we reproduced your bug and analyzed the .vcf file that you used during validation and subsequent conversion to the .bcf format. We would like to clarify the order of calling the utilities of GATK, which you used to get the final .vcf file. If we correctly understood, you called HaplotypeCaller (see the documentetion here: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_haplotypecaller_HaplotypeCaller.php) to get the file into the .g.vcf format. After that, you call the utility VariantFiltration(see the documentation here: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_filters_VariantFiltration.php) to filter variant calls based on INFO and FORMAT annotations. After that, your original .g.vcf file was converted to .vcf format. This transformation, unfortunately, is not valid at the moment. As a result, you get a file in the .vcf format, which contains the attributes of the .g.vcf file. In general, according to the documentation of the gatk: “GVCF stands for Genomic VCF. A VCF is a form of VCF (see the spec documentation here: https://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-agvcf-and-how-is-it-different-from-a-regular-vcf ), but a Genomic VCF contains extra information ”. As a result, due to the fact that ValidateVariants checks only fields and attributes corresponding to the .vcf file, your file is validated and an error occurs only at the conversion stage.

Dear @lbergelson @davidbenjamin, As a possible solution we suggest to add extra validation at least by extension to the ValidateVariants class of GATK (link to the Github: https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/walkers/variantutils/ValidateVariants.java). This can prevent errors like the one that occurred in this issue.

davidbenjamin commented 5 years ago

I'm not on the engine team and Louis is on paternity leave. I'll let @droazen and @jamesemery sort this out.

cmnbroad commented 5 years ago

Looks like there are a few things going on here. The ValidateVariants command above excludes all validation types:

gatk-launch ValidateVariants -R /broad/mccarroll/software/metadata/individual_reference/hg19/hg19.fasta -V buggy.vcf --validationTypeToExclude ALL

so no actual validation was being done there (this should probably issue a warning that no validation is being done).

But the VCF doesn't pass validation; there are problems with both AN and AD. The single variant in the test file has 3 values for AD (but there are only two alleles - ref and one alt); the BCF encoder should probably have have detected this, but it doesn't and happily encodes all three values. On decode, the BCF decoder believes the header (which specifies the AD count as R, so one ref and one alt) and only decodes two values, leaving the third value (which happens to be 0) in the stream to be consumed for the next attribute, resulting in the NPE.

alecw commented 5 years ago

Hi @cmnbroad ,

Note that if --validationTypeToExclude ALL is removed from command line, there are still no errors reported.

-Alec

cmnbroad commented 5 years ago

@alecw Right - I'm not sure what the behavior should be in that case - you could argue no validation should be done if you exclude ALL, though thatss probably not that helpful. Either way, there is a separate ticket https://github.com/broadinstitute/gatk/issues/5862 to address the GATK tool behavior.