rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
BSD 3-Clause "New" or "Revised" License

Move to fewer threads per sample for multi-sample runs. #176

Open LeeBergstrand opened 1 month ago

LeeBergstrand commented 1 month ago

Problem Description

Currently, rotary gives every multithreaded rule the complete set of CPU threads. However, these jobs often do not use all the available threads for their entire runtime. For example, Flye has a sizeable multithreaded stage followed by a rate-limiting single-threaded stage. Because the rule is designated as requiring all threads, no other jobs can start while Flye is in this single-threaded stage. When processing multiple samples, the wall time of a rotary run could be reduced by giving each sample a subset of the total thread count, allowing the Snakemake scheduler to better interleave jobs from different samples and run them simultaneously. For example, if two samples were processed with Flye simultaneously rather than sequentially, then at least the single-threaded stage of each sample would overlap, leading to a speed-up.

Problem Solution

Create a function that computes the number of threads for rotary's multithreaded rules based on the total number of threads available and the number of samples to process.
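A minimal sketch of what such a function could look like (the name `threads_for_rule` and the even-split heuristic are hypothetical, not part of the rotary codebase):

```python
def threads_for_rule(total_threads: int, num_samples: int, min_threads: int = 1) -> int:
    """Split the available CPU threads evenly across samples so the
    Snakemake scheduler can interleave jobs from different samples.

    Hypothetical heuristic: total threads divided by sample count,
    never dropping below min_threads or exceeding total_threads.
    """
    per_sample = max(total_threads // max(num_samples, 1), min_threads)
    return min(per_sample, total_threads)

# e.g., 16 threads across 4 samples -> 4 threads per multithreaded rule
print(threads_for_rule(16, 4))
```

A rule's `threads:` directive could then call this function instead of using the full thread count.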

Notes

Applications like InterProScan run one thread per tool by default but run multiple tools simultaneously, so tuning the thread-to-job ratio may be required. Over-allocating threads can sometimes lead to performance gains if individual jobs complete quickly enough.

LeeBergstrand commented 1 month ago

If we have the time, we could use load-average numbers (https://scoutapm.com/blog/understanding-load-averages) from multiple runs to find the best sample-to-thread ratios. However, a function that does something like one sample per 1/4 of total threads should be sufficient for now. It also makes sense to leave some rules as they are because they already run quite efficiently multithreaded (this matters most for the later annotation-stage rules, so you don't end up waiting for one sample to finish annotation alone after it was assigned only 4 of 12 threads).

LeeBergstrand commented 1 month ago

Also, there are situations where some jobs plateau in performance. For example, HMMER3 plateaus after eight threads due to IO bottlenecks. This is also worth considering, especially when spreading jobs across a machine with a high CPU count.
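One way to handle such plateaus would be a per-tool ceiling applied on top of whatever allocation the sample-splitting logic produces. This is a hypothetical sketch; the cap table and function name are not from the rotary codebase, and the 8-thread HMMER3 figure is the empirical observation from this comment:

```python
# Hypothetical per-tool thread ceilings for stages known to stop scaling
# (HMMER3 is reported in this thread to plateau past ~8 threads due to IO).
TOOL_THREAD_CAPS = {
    "hmmer": 8,
}

def capped_threads(tool: str, allocated_threads: int) -> int:
    """Clamp an allocation to the tool's empirical scaling ceiling, if any."""
    return min(allocated_threads, TOOL_THREAD_CAPS.get(tool, allocated_threads))

# On a 32-core machine, HMMER-based rules would be capped at 8 threads
print(capped_threads("hmmer", 32))
```

Tools without an entry in the table simply keep their original allocation.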

LeeBergstrand commented 1 month ago

@jmtsuji Thoughts?

jmtsuji commented 1 month ago

@LeeBergstrand Great idea -- I agree that we will get better performance across multiple samples if we decrease the thread counts given to individual jobs.

Aside from the caveats you've already mentioned, one key thing to watch for will be RAM -- if multiple high-RAM processes run simultaneously (e.g., GTDB, possibly Flye, etc.), then the user's system could crash.

Like you mentioned, I think a good starting point could be to calculate a thread count (and "memory count" where applicable) for jobs that we think could run well in parallel. This count could be something like: minimum 4 threads (or the number of user-provided threads, if <4), maximum 1/4 of provided threads. We could then use this thread count for rules that we think could run well in parallel. Other rules (like GTDB, possibly Flye, etc.) could be given the full thread count.

In future, we could allocate resources in a more nuanced way like you mentioned, if we really wanted to optimize performance. (Another nuance we could add would be to include memory estimates for rules, so that users with tons of RAM could run multiple Flye or GTDB instances simultaneously, for example.) Thoughts?
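The heuristic proposed above (minimum 4 threads, or all user-provided threads if fewer than 4; maximum 1/4 of the provided total) could be sketched as follows. The function names and the analogous memory rule are hypothetical illustrations, not existing rotary code:

```python
def parallel_rule_threads(user_threads: int) -> int:
    """Thread count for rules expected to run well in parallel:
    at least 4 threads (or all user-provided threads if fewer than 4
    are available), at most 1/4 of the provided total."""
    if user_threads < 4:
        return user_threads
    return max(4, user_threads // 4)

def parallel_rule_memory_gb(total_mem_gb: int) -> int:
    """Analogous hypothetical "memory count": give parallel-friendly
    rules a quarter of available RAM, with a 1 GB floor."""
    return max(total_mem_gb // 4, 1)

# With 32 user-provided threads, parallel-friendly rules get 8 each
print(parallel_rule_threads(32))
```

High-RAM rules like GTDB (and possibly Flye) would bypass these functions and keep the full thread and memory allocation, as suggested above.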

LeeBergstrand commented 1 month ago

@jmtsuji Yes, memory is also a concern that complicates things. I will think about this in more detail.

LeeBergstrand commented 4 days ago

> @LeeBergstrand Great idea -- I agree that we will get better performance across multiple samples if we decrease the thread counts to individual jobs. Aside from the caveats you've already mentioned, one key thing to watch for will be RAM -- if multiple high-RAM processes run simultaneously (e.g., GTDB, possibly Flye, etc.), then the user's system could crash. Like you mentioned, I think a good starting point could be to calculate a thread count (and "memory count" where applicable) for jobs that we think could run well in parallel. This count could be something like minimum 4 threads (or # of user provided threads, if <4), maximum 1/4 of provided threads. We could then use this thread count for rules that we think could run well in parallel. Other rules (like GTDB, possibly Flye, etc.) could be given the full thread count. In future, we could allocate resources in a more nuanced way like you mentioned, if we really wanted to optimize performance. (Another nuance we could add would be to include memory estimates for rules, so that users with tons of RAM could run multiple Flye or GTDB instances simultaneously, for example.) Thoughts?

From the Flye FAQ docs:

> For a typical bacterial assembly with ~100x read coverage, Flye needs <10 Gb of RAM and finishes within an hour using ~30 threads. This will scale linearly with the increase in read coverage. If your coverage is above 100x, consider using --asm-coverage 100 to use the longest 100x reads for disjointig assembly - this should speed things up.

@jmtsuji This is what I have seen so far. Most of the bacterial genomes I've been processing use about 10 Gb of RAM or slightly less.