samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/

Truncation issue with GZIPInputStream when reading bgzip-compressed streams #1691

Open jkmatila opened 9 months ago

jkmatila commented 9 months ago

Description of the issue

When reading bgzip-compressed data from a stream (e.g. over the network or from a pipe), htsjdk can stop early and silently process only part of the data instead of reading the stream all the way to the end. This is caused by java.util.zip.GZIPInputStream, which has a known bug (JDK-7036144) affecting the reading of concatenated gzip streams when the complete size of the stream is not known beforehand.
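
To make the failure mode concrete, here is a minimal, self-contained sketch of the underlying JDK bug, independent of htsjdk. GZIPInputStream decides whether another concatenated member follows partly by calling available() on the underlying stream, which can legitimately return 0 on a network stream or pipe; the ChunkedInputStream wrapper below is a hypothetical stand-in for such a stream. On affected JDKs, the second gzip member is silently dropped:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipTruncationDemo {

    // Hypothetical stand-in for a network stream or pipe: data arrives
    // byte by byte, and available() returns 0 because nothing is buffered yet.
    static final class ChunkedInputStream extends FilterInputStream {
        ChunkedInputStream(InputStream in) { super(in); }
        @Override public int available() { return 0; }
        @Override public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, 1));
        }
    }

    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Two concatenated gzip members -- the same layout bgzip produces.
        ByteArrayOutputStream concat = new ByteArrayOutputStream();
        concat.write(gzip("first member\n"));
        concat.write(gzip("second member\n"));

        InputStream in = new ChunkedInputStream(
                new ByteArrayInputStream(concat.toByteArray()));
        try (InputStream gz = new GZIPInputStream(in)) {
            byte[] decompressed = gz.readAllBytes();
            // On JDKs affected by JDK-7036144 this prints only "first member";
            // the second member is silently discarded, with no error.
            System.out.print(new String(decompressed, StandardCharsets.UTF_8));
        }
    }
}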

The problem was discussed earlier in the comment thread of broadinstitute/gatk issue #4224, but it seems that part of the recommended fix, replacing the use of the buggy GZIPInputStream class, was never adopted. There is also a comment in SamInputResource.java that references this problem and the aforementioned JDK bug.

For example, GzipCompressorInputStream from Apache Commons Compress would be an alternative that does not suffer from this problem, as noted in its Javadoc.
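
As a rough sketch of what that replacement could look like (this is illustrative code, not an htsjdk patch; the line-counting program and its class name are made up for the example), passing true as the second constructor argument makes GzipCompressorInputStream decompress all concatenated members:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

public class CountGzippedLines {

    public static void main(String[] args) throws IOException {
        URL url = new URL(args[0]);
        long lines = 0;
        try (InputStream raw = url.openStream();
             // decompressConcatenated = true: read every gzip member,
             // i.e. every bgzip block, not just the first one
             GzipCompressorInputStream gz = new GzipCompressorInputStream(raw, true);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(gz, StandardCharsets.UTF_8))) {
            while (reader.readLine() != null) {
                lines++;
            }
        }
        System.out.println(lines + " lines");
    }
}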

Your environment

OpenJDK 17 (Eclipse Temurin-17.0.8.1+1) and OpenJDK 21 (Eclipse Temurin-21+35), using the official Eclipse Temurin Docker images.

Steps to reproduce

Here is an example program that demonstrates one instance of the problem: VCFIteratorBuilder uses the problematic GZIPInputStream class when reading a bgzipped VCF stream, and only part of the stream gets read. (Note: the code attempts to read a 200 MB file from the Web, although in practice, due to the bug, it does not get very far.)

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import htsjdk.variant.vcf.VCFIterator;
import htsjdk.variant.vcf.VCFIteratorBuilder;

public class Repro {

    public static void main(String[] args) throws IOException {
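        // bgzip-compressed VCF (~200 MB) from the 1000 Genomes FTP site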
        URL url = new URL("https://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz");
        int expectedNumRecords = 1103547;

        int numRecords = 0;
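        // VCFIteratorBuilder sniffs the compressed stream and wraps it
        // in the problematic GZIPInputStream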
        try (InputStream in = url.openStream();
             VCFIterator vcfIterator = new VCFIteratorBuilder().open(in)) {
            while (vcfIterator.hasNext()) {
                vcfIterator.next();
                numRecords++;
                if (numRecords % 100000 == 0) {
                    System.out.println("Read " + numRecords + " records...");
                }
            }
        }

        if (numRecords != expectedNumRecords) {
            System.err.println("URL: " + url);
            System.err.println("Expected: " + expectedNumRecords + " records");
            System.err.println("Parsed: " + numRecords + " records");
            System.exit(1);
        }
    }

}

There are also other places in htsjdk where GZIPInputStream is used.

Expected behaviour

The compressed VCF stream should be read all the way to the end and all records parsed.

Actual behaviour

The code stops reading the stream at some point and does not parse all the records. No error is raised and parsing appears successful, but only part of the expected data is produced. The exact point where reading stops probably depends on how the bgzip block boundaries happen to coincide with the buffer size; it appears that with a large enough input file, reading will likely stop at some point before the end.

On both OpenJDK 17 (Eclipse Temurin-17.0.8.1+1) and OpenJDK 21 (Eclipse Temurin-21+35), using the official Eclipse Temurin Docker images, the above code outputs the following:

URL: https://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
Expected: 1103547 records
Parsed: 7395 records

lbergelson commented 7 months ago

Yeah, this bug is awful. We worked around it in GATK but never really fixed it in htsjdk. Last week I decided to message the Java development mailing list about it; I wish I had done that years ago, since someone responded really quickly. Some 12 years after it was first reported, this bug is finally going to be fixed. I'm hoping/asking that the fix be backported to Java 17, or at least to Java 21.

See https://github.com/openjdk/jdk/pull/17113