s-andrews / FastQC

A quality control analysis tool for high throughput sequencing data
GNU General Public License v3.0

error when opening valid .fastq.bz2 (Ran out of data in the middle of a fastq entry. Your file is probably truncated) #48

Open aushev opened 4 years ago

aushev commented 4 years ago

Trying to run FastQC on my bz2-compressed file (fastqc 08asp.fastq.bz2), I get the following error:

Started analysis of 08asp.fastq.bz2
Failed to process file 08asp.fastq.bz2
uk.ac.babraham.FastQC.Sequence.SequenceFormatException: Ran out of data in the middle of a fastq entry. Your file is probably truncated
	at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:179)
	at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:125)
	at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:77)
	at java.base/java.lang.Thread.run(Thread.java:834)

I checked the integrity of the archive with bzip2 -t 08asp.fastq.bz2 and it reports no errors; more importantly, if I first decompress the very same file (bzip2 -d 08asp.fastq.bz2) and then run fastqc 08asp.fastq, it works without any issues. A sample file can temporarily be downloaded here (181 MB).

s-andrews commented 4 years ago

I'll take a look at the file but I'm stuck with very limited internet for a while.

Do you know whether the file could have been created by concatenating several existing bz2 files initially? I know we had problems with the core decompressors for gzip when that had happened since it technically broke the spec, but the GNU tools were able to cope with it.

The test would be: if you decompress the file you have to a raw fastq, and then re-compress that to a bz2 file, can the result then be read correctly?

aushev commented 4 years ago

Thank you Simon! It does indeed work with the re-compressed bz2. I don't know whether the original file was created by concatenation (is there any way to find out for sure from the file itself?). Unfortunately I received these files from another source, so I can't change how they are created.
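There is a reasonable heuristic for answering that question from the file itself: every bzip2 stream begins with the magic bytes "BZh", a block-size digit 1-9, and the compressed-block magic 0x314159265359, so counting occurrences of that 10-byte pattern suggests how many streams were concatenated. A minimal sketch in Python (the function name is mine; note the pattern could in principle also occur by chance inside compressed payload, so a count above 1 is a strong hint rather than proof):

```python
import re

# Each bzip2 stream header: "BZh" + block-size digit + 0x314159265359
# (digits of pi, used by the format as the compressed-block magic).
STREAM_MAGIC = re.compile(rb"BZh[1-9]\x31\x41\x59\x26\x53\x59")

def count_bzip2_streams(path):
    """Return the apparent number of bzip2 streams in the file.
    Reads the whole file into memory, so only suitable as a quick check."""
    with open(path, "rb") as fh:
        return len(STREAM_MAGIC.findall(fh.read()))
```

A result of 2 or more would indicate the file was most likely produced by concatenating several .bz2 files.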

s-andrews commented 4 years ago

I did some testing and it looks like the issue is having multiple bzip2 stream headers in the middle of the file.

$ zcat file.fq.gz | head -1000000 > bzip1.fq
$ zcat file.fq.gz | head -1000000 > bzip2.fq
$ bzip2 bzip*fq
$ cat bzip1.fq.bz2 bzip2.fq.bz2 > bzip3.fq.bz2

$ bzcat bzip1.fq.bz2 | wc -l
1000000
$ bzcat bzip3.fq.bz2 | wc -l
2000000

$ fastqc bzip1.fq.bz2 bzip3.fq.bz2
Started analysis of bzip1.fq.bz2
Approx 5% complete for bzip1.fq.bz2
Approx 10% complete for bzip1.fq.bz2
Approx 15% complete for bzip1.fq.bz2
Approx 20% complete for bzip1.fq.bz2
Approx 25% complete for bzip1.fq.bz2
Approx 30% complete for bzip1.fq.bz2
Approx 35% complete for bzip1.fq.bz2
Approx 40% complete for bzip1.fq.bz2
Approx 45% complete for bzip1.fq.bz2
Approx 50% complete for bzip1.fq.bz2
Approx 55% complete for bzip1.fq.bz2
Approx 60% complete for bzip1.fq.bz2
Approx 65% complete for bzip1.fq.bz2
Approx 70% complete for bzip1.fq.bz2
Approx 75% complete for bzip1.fq.bz2
Approx 80% complete for bzip1.fq.bz2
Approx 85% complete for bzip1.fq.bz2
Approx 90% complete for bzip1.fq.bz2
Approx 95% complete for bzip1.fq.bz2
Approx 100% complete for bzip1.fq.bz2
Analysis complete for bzip1.fq.bz2

Started analysis of bzip3.fq.bz2
Approx 5% complete for bzip3.fq.bz2
Approx 10% complete for bzip3.fq.bz2
Approx 15% complete for bzip3.fq.bz2
Approx 20% complete for bzip3.fq.bz2
Approx 25% complete for bzip3.fq.bz2
Approx 30% complete for bzip3.fq.bz2
Approx 35% complete for bzip3.fq.bz2
Approx 40% complete for bzip3.fq.bz2
Approx 45% complete for bzip3.fq.bz2
Approx 100% complete for bzip3.fq.bz2
Analysis complete for bzip3.fq.bz2

So concatenated bzip2 files still decompress correctly with the Unix bzip2 tools, but the Java bzip2 library we're using closes the stream when it hits the first end-of-stream marker, which is why it jumps from 45% complete to 100%. I'm not sure why yours would crash, as I'd have thought that incomplete processing would be the more likely outcome.
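This stop-at-the-first-end-marker behaviour isn't specific to the Java library; any single-stream decompressor does the same. A small illustration using Python's stdlib bz2 module (a stand-in, not FastQC's actual code path), where BZ2Decompressor is single-stream and bz2.decompress is multistream-aware like bzcat:

```python
import bz2

# Two independently compressed members, concatenated -- the same shape
# as the `cat bzip1.fq.bz2 bzip2.fq.bz2 > bzip3.fq.bz2` test above.
member1 = bz2.compress(b"@r1\nACGT\n+\nFFFF\n")
member2 = bz2.compress(b"@r2\nTTTT\n+\nFFFF\n")
combined = member1 + member2

# Single-stream: stops at the first end-of-stream marker; everything
# after it is reported as unused_data rather than decompressed.
d = bz2.BZ2Decompressor()
first = d.decompress(combined)
assert first == b"@r1\nACGT\n+\nFFFF\n"
assert d.eof and d.unused_data == member2

# Multistream: all members are decompressed, as the Unix tools do it.
assert bz2.decompress(combined) == b"@r1\nACGT\n+\nFFFF\n@r2\nTTTT\n+\nFFFF\n"
```

The second member is not corrupt or truncated; it is simply never looked at once the first end-of-stream marker has been seen.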

I know we had to work around a similar limitation in gzip compression, which is why we have our own class for gzip decompression. I might be able to use the same strategy to work around the bzip2 decompressor's limitation, or it's possible that there is an updated version which can deal with this.

s-andrews commented 3 years ago

I had a look and it seems the jbzip2 library we're currently using doesn't support this, and it appears to be unmaintained. However, the Apache Commons Compress library will work, and its documentation says:

For the bzip2, gzip and xz formats as well as the framed lz4 format a single compressed file may actually consist of several streams that will be concatenated by the command line utilities when decompressing them. Starting with Commons Compress 1.4 the *CompressorInputStreams for these formats support concatenating streams as well, but they won't do so by default. You must use the two-arg constructor and explicitly enable the support.

So if we can switch to that, it looks like we can work around this problem. We might also be able to get rid of the kludge we have for gzip streams.

alanhoyle commented 1 year ago

We ran into this exact issue with files we downloaded from a collaborator. The original .bz2 fails with
uk.ac.babraham.FastQC.Sequence.SequenceFormatException: Ran out of data in the middle of a fastq entry. Your file is probably truncated

but we have no problem with the decompressed fastq, or with the same data recompressed as .bz2 or .gz.

bzip2 -t $FILE does not show any errors either.

Our short-term solution is to recompress everything into .gz files.
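That workaround is easy to automate. A minimal sketch (the function name is mine) using Python's stdlib: the bz2 reader is multistream-aware, so no reads are lost from a concatenated input, and the output is a single gzip stream that FastQC handles:

```python
import bz2
import gzip
import shutil

def bz2_to_gz(src, dst):
    """Recompress a possibly multistream .fastq.bz2 as a single-stream .fastq.gz."""
    with bz2.open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        # Stream in chunks rather than loading the whole fastq into memory.
        shutil.copyfileobj(fin, fout)
```

The same approach writing the output with bz2.open instead of gzip.open would produce a single-stream .bz2, which is what made the re-compressed file above work.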

cschu commented 1 year ago

Is there a patch available for this (i.e. "process only the first file of a concatenated bzip2")?

In my case I can see in the log that fastqc stops after 5% of a 6-part bzip2 fastq (which makes sense given the file sizes), but there is no other error message and the process doesn't seem to fail. (fastqc-0.11.9-hdfd78af_1 from bioconda)