sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

is rayon turned on in conda-forge releases? #3314

Open ctb opened 2 months ago

ctb commented 2 months ago

@dkoslicki asks on matrix:

quick question: is sourmash sketch dna now multithreaded in 4.8.11? My understanding was that it was single threaded, but I just noticed via top and time that a single sourmash sketch dna command is taking something like 100+ threads on my server (i.e. %CPU around 20000%). And there is no update in the docs that mentions controlling the number of threads used...

first - sourmash-rs uses rayon for parallelism, and for rayon, the environment variable RAYON_NUM_THREADS can be used to control the number of threads in its thread pool. Setting this to e.g. 16 should limit rayon to using 16 threads.
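As a rough sketch of what that looks like in practice, here is one way to launch a sketch command with the variable set, from Python. The helper names (rayon_env, sketch_command, run_sketch) and the file names are made up for illustration; only RAYON_NUM_THREADS and the `sourmash sketch dna` command line come from rayon and sourmash themselves.

```python
import os
import subprocess


def rayon_env(num_threads: int) -> dict[str, str]:
    """Copy the current environment and cap rayon's thread pool size."""
    env = dict(os.environ)
    # rayon reads this variable when it builds its global thread pool.
    env["RAYON_NUM_THREADS"] = str(num_threads)
    return env


def sketch_command(fasta: str, output: str) -> list[str]:
    """Build a sketch command with three k-mer sizes (three sketch types)."""
    return ["sourmash", "sketch", "dna", "-p", "k=21,k=31,k=51", fasta, "-o", output]


def run_sketch(fasta: str, output: str, num_threads: int) -> None:
    """Run sourmash with the thread cap applied (hypothetical helper)."""
    subprocess.run(sketch_command(fasta, output), env=rayon_env(num_threads), check=True)
```

Calling run_sketch("genome.fa", "genome.sig", 16) should then keep rayon at 16 threads, assuming sourmash is installed and genome.fa exists. In a shell, the equivalent is to prefix the command with RAYON_NUM_THREADS=16.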


second - my understanding is that Rust-based parallelism is enabled conditionally in the Rust codebase, using the parallel feature (which is automatically turned on by the branchwater feature, as in e.g. the branchwater plugin). It looks like this may be enabled by default in pyproject.toml:

https://github.com/sourmash-bio/sourmash/blob/c7fc46012dcf9156003dfc58d60a124f0f480e9c/pyproject.toml#L153

which is cool, if so, but should probably be documented somewhere! ;)


aaaaand third - yes, it looks like sketching operates in parallel at the level of multiple sketch types, e.g. if you are doing a bunch of different k-mer sizes, then each k-mer size is sketched in parallel! see:

https://github.com/sourmash-bio/sourmash/blob/c7fc46012dcf9156003dfc58d60a124f0f480e9c/src/core/src/signature.rs#L661-L668


so that's cool :).
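To illustrate the shape of that per-sketch-type parallelism, here is a toy Python sketch. It is not the sourmash implementation (that lives in the Rust code linked above and uses rayon); the functions kmers and sketch_all are hypothetical, and a plain set of k-mers stands in for a real sketch. The point is just that each k-mer size becomes its own independent task.

```python
from concurrent.futures import ThreadPoolExecutor


def kmers(sequence: str, ksize: int) -> set[str]:
    """Return the distinct k-mers of one size (a toy stand-in for one sketch)."""
    return {sequence[i:i + ksize] for i in range(len(sequence) - ksize + 1)}


def sketch_all(sequence: str, ksizes: list[int], threads: int = 4) -> dict[int, set[str]]:
    """Process every k-mer size as a separate task, one task per sketch type."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so results line up with ksizes.
        results = pool.map(lambda k: kmers(sequence, k), ksizes)
        return dict(zip(ksizes, results))


result = sketch_all("ACGTACGTAC", [3, 5])
```

One consequence of this layout is that the useful thread count is bounded by the number of sketch types requested: a command with a single k-mer size has only one task to run at this level.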

So my end take is: I think we should probably document this somewhere!