s-andrews / FastQC

A quality control analysis tool for high throughput sequencing data
GNU General Public License v3.0

Error with SRA file #140

Open markziemann opened 1 month ago

markziemann commented 1 month ago

FastQC encounters an error with a FASTQ file from SRA with accession SRR29972717. The file has only 81 reads, but one of them is 5,176,638 bp long. Maybe FastQC should handle unusual files like this more gracefully.

$ fastq-dump SRR29972717

$ wc -l SRR29972717.fastq 
324 SRR29972717.fastq

$ fastqc SRR29972717.fastq 
Started analysis of SRR29972717.fastq
Terminating due to java.lang.OutOfMemoryError: Java heap space
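One way to spot this kind of pathological input before handing it to FastQC is to scan the read lengths first. A minimal sketch (the two-read FASTQ written here is a made-up illustration, not the SRA data):

```shell
# Report the number of reads and the longest read length in a FASTQ file.
# A FASTQ record is 4 lines; the sequence is line 2 of each record.
# sample.fastq is a synthetic two-read file for illustration only.
cat > sample.fastq <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
ACGT
+
IIII
EOF

awk 'NR % 4 == 2 { n++; if (length($0) > max) max = length($0) }
     END { print n " reads, longest " max " bp" }' sample.fastq
# prints: 2 reads, longest 8 bp
```

Run against SRR29972717.fastq, a scan like this would reveal the 5,176,638 bp read without starting a Java process at all.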
s-andrews commented 1 month ago

Yeah, there's not really a nice way to handle that. The problem is that just reading the next line from that file immediately exhausts the memory available to the program. There's no way to catch this in the FastQC code: the last operation it performed was simply a request to read the next line, and control never returns to the program. You can't catch out-of-memory events because the whole JVM is terminated when one occurs. We've set it so that the program's exit status will be nonzero, but that's as much as we can do.
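Since the nonzero exit status is the only signal the caller gets, a pipeline can at least check it explicitly. A hedged sketch, in which `false` stands in for a fastqc run that dies with OutOfMemoryError:

```shell
# Guard a FastQC step: a nonzero exit status is the only indication that
# the JVM died (e.g. from OutOfMemoryError), so check it explicitly.
# 'false' stands in here for a real invocation such as: fastqc input.fastq
run_qc() { false; }

if run_qc; then
    echo "QC completed"
else
    echo "QC failed (exit $?); input may contain corrupt or overlong reads" >&2
fi
```

In a real pipeline the failing sample can then be skipped or flagged instead of silently producing no report.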

For files with long sequences (you do get these with nanopore data, for example) we now allocate more memory than we used to, and you can use the --mem parameter to allocate more memory up front, but that's about as good as this is going to get.

In this case the file is corrupt - there's no way that's actually Illumina data, so I'm not going to be too concerned about trying to support it. Broken data is always going to do weird things.