s-andrews / FastQC

A quality control analysis tool for high throughput sequencing data
GNU General Public License v3.0

from a sysadmin perspective: Is FastQC CPU bound or I/O bound? #118

Closed: georgemarselis-nvi closed this issue 1 year ago

georgemarselis-nvi commented 1 year ago

Apologies, I skimmed through the manual but could not find a clear answer. I am not a scientist, so I apologize if I overlooked something.

Here at the Norwegian Veterinary Institute we use FastQC in a workflow, right after demultiplexing but before doing any analysis. The way I have programmed the workflow is pretty simple:

/usr/bin/fastqc -t 64 *bunch of .fastq.gz files*

It takes about 5 minutes from start to end, but I am greedy: I want to speed this up. So, what do I do? Add more cores, or buy some PCIe 5.0 NVMe drives and RAID them?

edit: I suspect the most expensive operation is unpacking the .gz files, and that it is disk-bound.
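One rough way to test that suspicion (my own suggestion, not from the FastQC docs) is to compare the cost of merely streaming the compressed bytes against streaming plus decompressing them. `sample.fastq.gz` below is a placeholder; substitute one of your real files instead of the generated stand-in data.

```shell
# Placeholder file; in practice point this at a real .fastq.gz
f=sample.fastq.gz
head -c 10000000 /dev/urandom | gzip > "$f"   # stand-in data for the demo

# I/O only: just read the compressed file
time cat "$f" > /dev/null

# I/O + CPU: read and decompress
time gzip -dc "$f" > /dev/null

rm -f "$f"
```

If the second command takes much longer than the first (run each twice so the page cache warms up), decompression CPU time dominates and faster disks alone will not help much.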

s-andrews commented 1 year ago

It's pretty close between the two to be honest - both the reading of data from the disk and the gzip decompression are fairly heavy. Most of the analyses FastQC runs are pretty quick, and the ones which aren't have been optimised to try to not make them limiting. Which factor ends up being limiting overall will depend on the performance of your storage, your CPU and the nature of your data (read length and complexity).

The parallelisation in FastQC is done per file, so the analysis of a single file is always done with a single thread. Throwing more threads at the program will just mean it can process more files in parallel, so there's no point in assigning more threads than files. If you have a large number of files then you'll see a benefit to adding more threads, up until the rate at which your machine can read data from all of those files becomes limiting, at which point there's not much more you can do. As the disk starts to become limiting, the additional performance per thread will diminish, so I guess you'll need to decide at what point it becomes too inefficient for you.
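The per-file rule above can be turned into a small wrapper (this is my illustrative sketch, not part of FastQC): cap `-t` at the smaller of the file count and the core count, since any threads beyond the number of files sit idle.

```shell
# Count the input files (glob is illustrative) and the available cores
nfiles=$(ls ./*.fastq.gz 2>/dev/null | wc -l)
cores=$(nproc 2>/dev/null || echo 64)

# threads = min(nfiles, cores): extra threads beyond nfiles do nothing
threads=$cores
[ "$nfiles" -lt "$cores" ] && threads=$nfiles

if [ "$nfiles" -gt 0 ]; then
    echo "fastqc -t $threads" ./*.fastq.gz
fi
```

Swap the `echo` for the real `fastqc` invocation once the thread count looks sensible for your machine.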

If this is for a sequencing pipeline then having a staging area on quick SSD storage would certainly be a good idea for the very early processing of the data, but I'd certainly advise ensuring that your data is backed up to a secondary location very early in the process to avoid any potential data loss if a drive does go out.
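The staging-plus-backup ordering might look like this in a pipeline script (all paths here are hypothetical): make the secondary copy of the raw run first, then stage it on fast NVMe scratch and run FastQC there.

```shell
RUN=/data/runs/run001          # where the sequencer wrote the data (hypothetical)
BACKUP=/backup/runs/run001     # secondary copy, made before any processing
SCRATCH=/nvme/scratch/run001   # fast SSD staging area

mkdir -p "$BACKUP" "$SCRATCH"
cp -a "$RUN/." "$BACKUP/"      # back up first, so a scratch-drive failure loses nothing
cp -a "$RUN/." "$SCRATCH/"
fastqc -t 64 "$SCRATCH"/*.fastq.gz
```

`rsync -a` would serve equally well for the copies, and restarts more gracefully if a transfer is interrupted.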

georgemarselis-nvi commented 1 year ago

> will depend on the performance of your storage, your CPU and the nature of your data (read length and complexity).

Awesome, good to know. We just purchased 20 TB of NVMe PCIe 5.0 storage.

> but I'd certainly advise ensuring that your data is backed up to a secondary location very early in the process to avoid any potential data loss if a drive does go out.

of course :smile: