samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
278 stars 244 forks source link

Add a better error message when the reference is too long for indexing #1651

Open lbergelson opened 1 year ago

lbergelson commented 1 year ago

There have been two seperate issues this week with people running into a confusing indexing error when using a long reference. We should give a better error. We should ideally also identify this problem upfront instead of crashing late.

See https://github.com/broadinstitute/gatk/issues/8192 and this gatk forum post

java.lang.ArrayIndexOutOfBoundsException: Index 32770 out of bounds for length 32770
        at htsjdk.samtools.BinningIndexBuilder.processFeature(BinningIndexBuilder.java:142)
        at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeFeature(TabixIndexCreator.java:106)
        at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeIndex(TabixIndexCreator.java:129)
        at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.close(IndexingVariantContextWriter.java:177)
        at htsjdk.variant.variantcontext.writer.VCFWriter.close(VCFWriter.java:233)
        at org.broadinstitute.hellbender.utils.variant.writers.GVCFWriter.close(GVCFWriter.java:71)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller.closeTool(HaplotypeCaller.java:277)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1101)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)
hjt1129 commented 1 year ago

hi, we have also meet this problem. So one question is, if we split the large chromosome in reference genome, the final location information of SNP on splitted chromosome will be different with the gff annotation file of previous vision reference genome?

lbergelson commented 1 year ago

@hjt1129 Yes, it's not a perfect solution. It definitely complicates annotation and comparison. You would need to either reverse the splitting before comparing with the gff or perform the same split on the gff in order to compare.

We'd like to support CSI for vcf but we haven't had the time to do so.

jd3234 commented 1 year ago

I have also run into this problem trying to use VariantQC for a VCF based on a wheat genome. These very large plant genomes require csi indices. Any chance to support this from your side? Plant world is catching up with sequencing ;)

WimSpee commented 1 year ago

@lbergelson The --create-output-variant-index false workaround unfortunately does not work. Still same Index 32770 out of bounds for length 32770 error.

This seems to be because GATK v4.4.0.0 HaplotypeCaller requires a BAI index file next to the input CRAM file. When no CRAI index file is next to the input CRAM file this error is given.

org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=285212672
***********************************************************************

A USER ERROR has occurred: Traversal by intervals was requested but some input files are not indexed.
Please index all input files:

BAI index is not possible because of the chromosome sizes.

I am not sure if an index is really needed for the input CRAM file for HaplotypeCaller to do its work. If it would be possible for HaplotypeCaller to do it work without an input CRAM index, that would be a preferable workaround for us, compared to splitting the chromosomes. Next to of course CRAI support :)

Thank you. Full command and error

gatk --java-options "-Xmx20g" HaplotypeCaller --create-output-variant-index false -R reference.fa -I input.cram -O output.gvcf.gz -ERC GVCF

java.lang.ArrayIndexOutOfBoundsException: Index 32770 out of bounds for length 32770
        at htsjdk.samtools.CRAMBAIIndexer$CRAMBAIIndexBuilder.processBAIEntry(CRAMBAIIndexer.java:391)
        at htsjdk.samtools.CRAMBAIIndexer$CRAMBAIIndexBuilder.access$200(CRAMBAIIndexer.java:254)
        at htsjdk.samtools.CRAMBAIIndexer.processBAIEntry(CRAMBAIIndexer.java:161)
        at htsjdk.samtools.cram.CRAIIndex.openCraiFileAsBaiStream(CRAIIndex.java:89)
        at htsjdk.samtools.SamIndexes.asBaiSeekableStreamOrNull(SamIndexes.java:94)
        at htsjdk.samtools.CRAMFileReader.initWithStreams(CRAMFileReader.java:203)
        at htsjdk.samtools.CRAMFileReader.<init>(CRAMFileReader.java:194)
        at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:432)
        at htsjdk.samtools.SamReaderFactory.open(SamReaderFactory.java:106)
        at org.broadinstitute.hellbender.engine.ReadsPathDataSource.<init>(ReadsPathDataSource.java:245)
        at org.broadinstitute.hellbender.engine.ReadsPathDataSource.<init>(ReadsPathDataSource.java:181)
        at org.broadinstitute.hellbender.engine.GATKTool.initializeReads(GATKTool.java:453)
        at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:724)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.onStartup(AssemblyRegionWalker.java:79)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:147)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)
lbergelson commented 1 year ago

@WimSpee I'm sorry. I gave you bad advice. I thought HaplotypeCaller would run without an index if you didn't give it any -L argument, most GATK tools do. It looks like it does some complicated internal sharding though, which ends up using an index. I don't see any easy work around here. It would be great to add CRAI/CSI support but I don't have any timeline for it.