pirovc / ganon

ganon2 classifies genomic sequences against large sets of references efficiently, with integrated download and update of databases (RefSeq/GenBank), taxonomic profiling (NCBI/GTDB), binning and hierarchical classification, customized reporting, and more
https://pirovc.github.io/ganon/
MIT License

possibility to integrate pigz #193

Closed rjsorr closed 2 years ago

rjsorr commented 2 years ago

I am classifying quite large gzipped fastq files and I'm wondering whether a large part of the runtime is spent unzipping/zipping the files rather than classifying. I am happy to add pigz to my pipeline to speed things up if need be, but I'm wondering whether pigz could be integrated into ganon, passing the same -t (threads) option?
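Something like this is what I have in mind, just as a sketch (the ganon classify flags used here, --db-prefix and --single-reads, are my assumptions and may not match the actual CLI; --threads is the option I mean by -t):

```python
import subprocess

threads = 10

# Hypothetical pre-decompression step: pigz unzips in parallel, keeping the
# original .gz file (--keep) and using the same thread count (--processes).
subprocess.run(
    ["pigz", "--decompress", "--keep", "--processes", str(threads), "sample.fastq.gz"],
    check=True,
)

# Classify the now-uncompressed reads. ganon also accepts the .gz directly,
# so this extra step only pays off if decompression is really the bottleneck.
subprocess.run(
    [
        "ganon", "classify",
        "--db-prefix", "my_db",          # hypothetical database prefix
        "--single-reads", "sample.fastq",
        "--threads", str(threads),
    ],
    check=True,
)
```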

regards

pirovc commented 2 years ago

From the tests I've done, unzipping does not take more time than the classification. It may become a bottleneck if you use a very large number of threads (>100), since the classification itself will then be very fast. You can watch thread usage with htop during classification; if the threads are not at 100% usage most of the time, they may be idle waiting on slow unzipping.

There is one way you could try to improve it: ganon reads the fastq file in batches through a queue system controlled by two hidden parameters, --n-reads 400 (how many reads are in a batch) and --n-batches 1000 (how many batches to keep in memory). Increasing those values may help.

I'm not familiar with pigz, and the unzipping of fastq files in ganon is handled by the seqan3 library, so an integration seems unlikely to me, especially since this is not a bottleneck AFAIK.

rjsorr commented 2 years ago

Thanks for the suggestion @pirovc. Do you have any guidelines for setting --n-batches in relation to the number of threads (e.g. when using 10 threads)? And is this an option that should perhaps be adjusted automatically based on the thread input?

regards

pirovc commented 2 years ago

To clarify how those variables work internally in ganon classify:

  1. there is one thread reading the input fastq files and adding reads into a queue of batches (--n-batches) of reads (--n-reads)
  2. there are N threads (--threads) removing those batches from the queue and processing them (classify, lca,...)

The usual scenario is that the queue is always full (holding --n-batches x --n-reads reads) and, as soon as batches are removed, it is filled again by the reading thread. Changing those parameters will only speed up the process when the queue is running empty, meaning step 2 is much faster than step 1.
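As a rough illustration of that producer/consumer queue (a Python sketch only, not ganon's actual C++/seqan3 code; the constants just mirror the parameters above):

```python
import queue
import threading

N_READS = 400      # reads per batch (--n-reads)
N_BATCHES = 1000   # maximum batches kept in memory (--n-batches)
N_WORKERS = 10     # classification threads (--threads)

# Bounded queue: the reader blocks once N_BATCHES batches are waiting,
# so at most roughly N_BATCHES * N_READS reads sit in memory at any time.
batches = queue.Queue(maxsize=N_BATCHES)

def classify(batch):
    """Stand-in for the real classification/LCA step."""
    pass

def reader(fastq_records):
    """Step 1: a single thread groups reads into batches and enqueues them."""
    batch = []
    for read in fastq_records:
        batch.append(read)
        if len(batch) == N_READS:
            batches.put(batch)   # blocks while the queue is full
            batch = []
    if batch:
        batches.put(batch)
    for _ in range(N_WORKERS):
        batches.put(None)        # sentinel: tell each worker there is no more input

def worker():
    """Step 2: N threads take batches off the queue and process them."""
    while True:
        batch = batches.get()
        if batch is None:
            break
        classify(batch)

reads = (f"read_{i}" for i in range(100_000))   # stand-in for fastq parsing
workers = [threading.Thread(target=worker) for _ in range(N_WORKERS)]
for w in workers:
    w.start()
t_reader = threading.Thread(target=reader, args=(reads,))
t_reader.start()
t_reader.join()
for w in workers:
    w.join()
```

If the workers ever find the queue empty, the reading side (step 1) is the limiting factor; otherwise tuning the batch parameters changes little.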

In my tests I usually use 48 threads on very powerful CPUs, and the default values were fine for my setup; in other words, the queue was always full, so the reading was not slowing down the classification. But this really depends on whether the file is gzipped or not and on how fast your machine/disk is.

Changing those values may speed things up by reducing the overhead of parsing batches of reads. In your example with 10 threads I would try --n-reads 1000 --n-batches 5000, but my intuition says it won't change much.
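For example (a sketch of the invocation; the flags other than --threads, --n-reads, and --n-batches are assumptions about your command line):

```python
import subprocess

# Hypothetical 10-thread run with larger batch settings; only worth trying
# if the batch queue is actually draining during classification.
subprocess.run(
    [
        "ganon", "classify",
        "--db-prefix", "my_db",              # hypothetical database prefix
        "--single-reads", "sample.fastq.gz",
        "--threads", "10",
        "--n-reads", "1000",
        "--n-batches", "5000",
    ],
    check=True,
)
```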

Ideally the tool should keep an eye on the queue and adjust those values dynamically, but that is not something planned yet.

rjsorr commented 2 years ago

cheers.