samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/

Getting java.lang.InternalError while querying a VCFFileReader #1442

Closed Habush closed 4 years ago

Habush commented 4 years ago

Description of the issue:

I am trying to distribute VCF file contents across multiple machines so they can be accessed in memory using Apache Ignite. I am saving a VCFFileReader object and retrieving it later to query for a specific variant. Here is a snippet of code that might explain what I am trying to do:

VCFFileReader vcfReader = new VCFFileReader(new File(vcfiles[0].getPath()), new File(indexFiles[0].getPath()), true);
genomeRepo.save("1k", vcfReader); // this is where I save the object to a memory cache to retrieve it later
//.....

VCFFileReader vcfFileReader = genomeRepo.findById("1k"); //and I retrieve it here

vcfFileReader.query(chrom, startPos, endPos); //this is where the error occurs

Executing the above code results in the error below:

Exception in thread "main" java.lang.InternalError
        at java.util.zip.Inflater.reset(Native Method)
        at java.util.zip.Inflater.reset(Inflater.java:352)
        at htsjdk.samtools.util.BlockGunzipper.unzipBlock(BlockGunzipper.java:141)
        at htsjdk.samtools.util.BlockGunzipper.unzipBlock(BlockGunzipper.java:96)
        at htsjdk.samtools.util.BlockCompressedInputStream.inflateBlock(BlockCompressedInputStream.java:550)
        at htsjdk.samtools.util.BlockCompressedInputStream.processNextBlock(BlockCompressedInputStream.java:532)
        at htsjdk.samtools.util.BlockCompressedInputStream.nextBlock(BlockCompressedInputStream.java:468)
        at htsjdk.samtools.util.BlockCompressedInputStream.seek(BlockCompressedInputStream.java:380)
        at htsjdk.tribble.readers.TabixReader$IteratorImpl.next(TabixReader.java:428)
        at htsjdk.tribble.readers.TabixIteratorLineReader.readLine(TabixIteratorLineReader.java:46)
        at htsjdk.tribble.TabixFeatureReader$FeatureIterator.readNextRecord(TabixFeatureReader.java:170)
        at htsjdk.tribble.TabixFeatureReader$FeatureIterator.<init>(TabixFeatureReader.java:159)
        at htsjdk.tribble.TabixFeatureReader.query(TabixFeatureReader.java:133)
        at htsjdk.variant.vcf.VCFFileReader.query(VCFFileReader.java:322)

Expected behaviour

It should query the variant by position and return a CloseableIterator<VariantContext>.

Actual behaviour

It runs into java.lang.InternalError.

Any pointers as to why this is happening and how I can fix it would be great. Thanks.

lbergelson commented 4 years ago

@Habush I'm not familiar with Apache Ignite. VCFFileReaders are not generally thread safe or serializable, so if you're trying to share a reader across multiple threads or machines you're going to have trouble. I would recommend opening an independent reader in each thread and using an index to query multiple independent regions instead of trying to share one reader.
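
A minimal sketch of the one-reader-per-thread suggestion, using only the Java standard library. PerThreadResource is a hypothetical helper, not part of htsjdk or Apache Ignite. It assumes the shared cache stores something serializable, such as the file paths, and that each thread builds its own reader from them on first use.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of htsjdk): gives every thread its own
 * lazily created resource instead of sharing a single instance.
 */
final class PerThreadResource<T> {
    private final ThreadLocal<T> local;

    PerThreadResource(Supplier<T> factory) {
        // The factory runs once per thread, on that thread's first get().
        this.local = ThreadLocal.withInitial(factory);
    }

    /** Returns the calling thread's own instance, creating it on first use. */
    T get() {
        return local.get();
    }

    /**
     * Forgets the calling thread's instance. If the resource holds file
     * handles, the caller should close it before calling this.
     */
    void remove() {
        local.remove();
    }
}
```

With htsjdk on the classpath, the factory could be something like `() -> new VCFFileReader(vcfFile, indexFile, true)`, where vcfFile and indexFile are File objects local to that machine, and each thread would then call `get().query(chrom, startPos, endPos)` on its own reader. The readers still need to be closed when a thread is done with them, which this sketch leaves to the caller.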

If you're trying to read VCFs across multiple machines, there is a downstream project, DISQ, which implements distributed BAM/VCF reading using Apache Spark. You may be able to either use it directly or adopt some ideas from its implementation.

Habush commented 4 years ago

@lbergelson Thank you for pointing me to DISQ, I'll check it out.