theiagen / public_health_viral_genomics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of viral pathogens of concern, especially SARS-CoV-2
https://public-health-viral-genomics-theiagen.readthedocs.io/
GNU Affero General Public License v3.0

fastq-scan fails on large FASTQs #183

Open kapsakcj opened 1 year ago

kapsakcj commented 1 year ago

2 GB of RAM isn't enough when your FASTQ files are >11 GB in size, like those from a NovaSeq.

This line:

https://github.com/theiagen/public_health_viral_genomics/blob/main/tasks/quality_control/task_fastq_scan.wdl#L50

and this line:

https://github.com/theiagen/public_health_viral_genomics/blob/bd7f8a9936ccb3548d2e1d88302b2e0e4b7b8032/tasks/quality_control/task_fastq_scan.wdl#L87

should be upped to at least 8 GB.

That said, when I ran the 11 GB FASTQ file through the WDL on the command line, it consumed upwards of 18 GB of RAM, so even 8 GB may not be enough on the first attempt. If Terra kicks in the "memory retry" feature, these files should still get processed fine on the 2nd or 3rd attempt.
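To make the retry arithmetic concrete, here is a rough sketch of how many attempts it would take for the requested memory to reach the ~18 GB observed above. The function name `attempts_needed` and the multiplier of 2.0 are assumptions for illustration only; the real factor is whatever is configured for the workflow's memory retry, and actual memory use will vary by file.

```python
def attempts_needed(start_gb: float, needed_gb: float, multiplier: float) -> int:
    """Return the attempt number (1 = first run) at which the requested
    memory first meets or exceeds needed_gb, assuming each retry multiplies
    the previous request by `multiplier` (a hypothetical retry factor)."""
    if start_gb <= 0 or needed_gb <= 0:
        raise ValueError("memory values must be positive")
    if multiplier <= 1:
        raise ValueError("multiplier must be greater than 1")

    attempt = 1
    mem_gb = start_gb
    while mem_gb < needed_gb:
        mem_gb *= multiplier  # next attempt requests more memory
        attempt += 1
    return attempt


if __name__ == "__main__":
    # Starting from the proposed 8 GB: 8 -> 16 -> 32
    print(attempts_needed(8, 18, 2.0))  # 3
    # Starting from the current 2 GB: 2 -> 4 -> 8 -> 16 -> 32
    print(attempts_needed(2, 18, 2.0))  # 5
```

Under that assumed factor, bumping the default to 8 GB gets an 18 GB job through on the 3rd attempt, whereas starting at 2 GB would need 5 attempts.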

kapsakcj commented 1 year ago

For this particular failure, we are downsampling the FASTQs with RASUSA first, but it doesn't hurt to fix these potential issues anyway.

rpetit3 commented 1 year ago

Large FASTQs are now supported in fastq-scan (https://github.com/rpetit3/fastq-scan/releases/tag/v1.0.1). But I agree with your approach of subsampling to a reasonable coverage.
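For a sense of scale on the subsampling side, here is a minimal sketch of the underlying arithmetic: coverage is total sequenced bases divided by genome size, so the fraction of reads to keep is the target coverage over the current coverage. The function name `keep_fraction` and the example numbers are assumptions for illustration, not values from this workflow or from RASUSA itself.

```python
def keep_fraction(total_bases: int, genome_size: int, target_coverage: float) -> float:
    """Fraction of reads to keep so that expected coverage is roughly
    target_coverage. Returns 1.0 if the data is already at or below target."""
    if total_bases <= 0 or genome_size <= 0 or target_coverage <= 0:
        raise ValueError("all inputs must be positive")

    current_coverage = total_bases / genome_size
    return min(1.0, target_coverage / current_coverage)


if __name__ == "__main__":
    # Hypothetical example: a 29,903 bp genome (SARS-CoV-2 reference length)
    # sequenced to 1000x, downsampled to a target of 100x.
    genome_size = 29_903
    total_bases = genome_size * 1000
    print(keep_fraction(total_bases, genome_size, 100))
```

A NovaSeq-sized FASTQ for a ~30 kb genome is far deeper than most downstream steps need, which is why downsampling first sidesteps the memory problem in practice.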