s-andrews / FastQC

A quality control analysis tool for high throughput sequencing data
GNU General Public License v3.0

Feature suggestion: use smarter method to determine file types for decompression. #78

Status: Closed (closed by alanhoyle 2 years ago)

alanhoyle commented 3 years ago

We have a few files that have been named with the wrong extension, e.g. blah.R1.fastq.bz instead of blah.R1.fastq.bz2.

This causes FastQC to fail with the following error:

fastqc blah.R1.fastq.bz
Failed to process blah.R1.fastq.bz
uk.ac.babraham.FastQC.Sequence.SequenceFormatException: ID line didn't start with '@'
    at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:158)
    at uk.ac.babraham.FastQC.Sequence.FastQFile.<init>(FastQFile.java:89)
    at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:106)
    at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:62)
    at uk.ac.babraham.FastQC.Analysis.OfflineRunner.processFile(OfflineRunner.java:159)
    at uk.ac.babraham.FastQC.Analysis.OfflineRunner.<init>(OfflineRunner.java:121)
    at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:316)

However, the files are properly formatted bzip2 fastq files, just with the wrong extension:

$ file -z blah.R1.fastq.bz
blah.R1.fastq.bz: ASCII text (bzip2 compressed data, block size = 900k)
$ bzcat  blah.R1.fastq.bz | head -2
@ABC-DE1234:567:A1B2C4XX:1:2345:6889:01234 1:N:0:TCGATCGA
TCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA

The problem occurs because FastQC/Sequence/FastQFile.java determines the file type from the file extension rather than from the actual file contents:

    if (file.getName().startsWith("stdin")) {
        br = new BufferedReader(new InputStreamReader(System.in));
    }
    else if (file.getName().toLowerCase().endsWith(".gz")) {
        br = new BufferedReader(new InputStreamReader(new MultiMemberGZIPInputStream(fis)));
    }
    else if (file.getName().toLowerCase().endsWith(".bz2")) {
        br = new BufferedReader(new InputStreamReader(new BZip2InputStream(fis,false)));
    }
    else {
        br = new BufferedReader(new InputStreamReader(fis));
    }

I would suggest looking at the java.nio.file.Files.probeContentType() method or Apache's Tika library to determine file MIME types instead of relying on the filenames being accurate.
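For reference, a minimal sketch of what calling probeContentType() looks like. The class name ProbeContentTypeDemo and the helper describe() are hypothetical and not part of FastQC. The result is implementation-specific and may be null when the type cannot be determined, so the sketch falls back to the string "unknown".

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class; not part of FastQC.
public class ProbeContentTypeDemo {

    // Returns the probed MIME type, or "unknown" when the JDK cannot determine it.
    public static String describe(Path path) throws IOException {
        String type = Files.probeContentType(path); // may be null
        return (type != null) ? type : "unknown";
    }

    public static void main(String[] args) throws IOException {
        // Print the probed type for each file named on the command line.
        for (String name : args) {
            System.out.println(name + " -> " + describe(Paths.get(name)));
        }
    }
}
```

Running this against a few known gzip and bzip2 files on the target platform would show whether the installed detectors recognise them before any change is made to FastQFile.java.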

s-andrews commented 2 years ago

I had a look at this using the probeContentType() method. Although it works OK for gzipped files, it gives a null result for bzipped files, so it doesn't really work in this context. It's a shame, as this would have been a nice, easy addition, but I'm not going to spend a long time working out more robust auto-detection for something which is quite niche, sorry.
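One alternative that does not depend on probeContentType() is to peek at the first bytes of the stream: gzip data starts with the bytes 0x1f 0x8b, and bzip2 data starts with the ASCII characters "BZh". The sketch below is only an illustration of that idea, not FastQC code. The class CompressionSniffer, the Kind enum, and both methods are hypothetical names, and wiring the result into the existing MultiMemberGZIPInputStream / BZip2InputStream branches is left out.

```java
import java.io.BufferedInputStream;
import java.io.IOException;

// Hypothetical helper; not part of FastQC.
public class CompressionSniffer {

    public enum Kind { GZIP, BZIP2, PLAIN }

    // Classifies a stream from its first bytes; len is the number of valid bytes in head.
    public static Kind detect(byte[] head, int len) {
        if (len >= 2 && (head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b) {
            return Kind.GZIP;   // gzip magic number
        }
        if (len >= 3 && head[0] == 'B' && head[1] == 'Z' && head[2] == 'h') {
            return Kind.BZIP2;  // bzip2 header "BZh"
        }
        return Kind.PLAIN;      // anything else is treated as uncompressed
    }

    // Peeks at up to three bytes and then rewinds, so the caller can still read from the start.
    public static Kind sniff(BufferedInputStream in) throws IOException {
        byte[] head = new byte[3];
        in.mark(head.length);
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break;          // stream shorter than three bytes
            }
            n += r;
        }
        in.reset();
        return detect(head, n);
    }
}
```

Because a FASTQ record starts with '@', an uncompressed file should not collide with either signature. The extension check could remain as a fallback for any case where sniffing is not wanted.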

alanhoyle commented 2 years ago

Thanks for taking a glance.