sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

sourmash sketch & search use one thread only #2458

Open jianshu93 opened 1 year ago

jianshu93 commented 1 year ago

Dear Sourmash team,

I want to create sketches for all GTDB genomes, and I am using the following command according to the tutorial (I want one sketch per FASTA file):

time sourmash sketch dna -p k=16,noabund --from-file ./gtdb_v207_name.txt -o ./gtdb_v207_sourmash

The file gtdb_v207_name.txt contains the paths of all GTDB genome files. However, I noticed that sourmash always uses only one thread to sketch all the files. Is this the default, or do we need to parallelize at the task level ourselves (e.g., with GNU parallel) to use all cores/threads?

Thanks,

Jianshu

ctb commented 1 year ago

hi @jianshu93 yes, 'tis true!

right now there are two suggested solutions -

I've also built a simple plugin, sketchall to do it, but it's not really ready for anyone to use just yet 😓 - the plugin framework isn't released in any versions of sourmash yet, in particular!

tl;dr parallel should work great!
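For instance, a task-level run with GNU parallel might look like this (a sketch, assuming gtdb_v207_name.txt lists one FASTA path per line; the output directory and job count are illustrative):

```shell
# One sketch job per genome, 16 at a time; {} is the FASTA path,
# {/.} its basename without extension. Reads the path list via '::::'.
mkdir -p sketches
parallel -j 16 \
    sourmash sketch dna -p k=16,noabund -o sketches/{/.}.sig {} \
    :::: gtdb_v207_name.txt
```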

some backstory

The main blocker for me adding this into sourmash sketch has just been this issue: https://github.com/sourmash-bio/sourmash/issues/1911 - we don't have a good multiprocess/multithread way to write sketches to a single file, and I am not enthusiastic about writing up something more clever (multiple consumers, one producer).

Also relevant: https://github.com/sourmash-bio/sourmash/issues/1703 - not sure what's going on here!

jianshu93 commented 1 year ago

Hello Prof. C. Titus Brown,

Thanks for the quick response; it is helpful. I have no problems running sketch via parallel. However, the search command (after indexing the database, which is very fast, about 20 minutes for all GTDB genomes) is also not parallelized, meaning that when searching multiple queries I still have to use parallel to run multiple searches. I am curious: compared to parallelizing the search within the database (even for one query), task-level parallelism will be slower, right? We need to initialize 8000 jobs to search 8000 queries, and because processes cannot share memory with each other, we need (number of threads) * (database size) memory to search (number of threads) genomes at once. SBT search could be easily parallelized, right, since it is essentially a tree-like structure?

Thanks,

Jianshu

ctb commented 1 year ago

fantastic - glad the sketch stuff worked out!

Please see https://github.com/sourmash-bio/sourmash/issues/2071 re our previous answer on search parallelization!

The short version is:

There are other technologies coming along but we don't have them at a good level, I'm afraid!

jianshu93 commented 1 year ago

Hello Prof. C. Titus Brown,

A single-query search takes 4.20 minutes, and I use GNU parallel for process-level parallelism (initializing multiple jobs), which is much slower and requires much more memory; for example, searching 24 queries at the same time via GNU parallel requires 4.5 GB * 24 = 108 GB. It takes about 20 hours to search 8000 queries against GTDB. Is this normal, or am I missing something?

Thanks,

Jianshu

ctb commented 1 year ago

hi @jianshu93 per https://github.com/sourmash-bio/sourmash/issues/1958, this sounds about right; those benchmarks are not for entire GTDB, but the numbers align with my expectations!

You could potentially speed things up (while reducing sensitivity a bit) by using --scaled=10000. sqldb would also support faster search, but at the cost of more memory and a LOT more disk space.
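For reference, a coarser re-sketch might look like this (a sketch of the command; the parameter string is the only substantive change from the original sketching run, and the output path is illustrative):

```shell
# scaled=10000 keeps roughly 1 in 10000 k-mers instead of the default
# 1 in 1000, shrinking sketches and speeding search at some cost in
# sensitivity.
sourmash sketch dna -p k=16,scaled=10000,noabund \
    --from-file ./gtdb_v207_name.txt -o ./gtdb_v207_scaled10k
```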

Thanks for reporting this! Gives us some targets!

jianshu93 commented 1 year ago

Hello Prof. C. Titus Brown,

I find this paper very interesting, published recently: https://dl.acm.org/doi/abs/10.1145/3448016.3457333

It is not SBT, but it seems to beat SBT in many ways (O(sqrt(N) * log N), a very good sublinear algorithm). I am not aware of any Rust implementation of this data structure, though.

Thanks,

Jianshu

ctb commented 1 year ago

thank you!

(Fast Processing and Querying of 170TB of Genomics Data via a Repeated And Merged BloOm Filter (RAMBO))

also ref https://github.com/sourmash-bio/sourmash/issues/1110, https://github.com/sourmash-bio/sourmash/issues/545

jianshu93 commented 1 year ago

Hello Prof. C. Titus Brown,

I used sourmash index to index all NCBI prokaryotic assemblies/genomes with the same sketch step as above, that is, everything in RefSeq+GenBank, about 300k genomes in total; the index size is about 15 GB. If I want to search 24 queries at a time, I will need 24 * 15 = 360 GB, which is quite a lot for only 24 queries (I have 24 threads). Is there a way to reduce this somehow? For example, I could split the database into pieces, search each piece, collect the results from each piece, and sort them by the output distance. It seems to take some time to split the database, though; is there a better way to automate this process? I think having all the queries share access to the database at the same time, that is, parallelizing the search within the database, is quite important for reducing memory. The RAMBO paper mentions that SBT was designed for a single thread, which was the bottleneck. Is it still the bottleneck now?
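The split-and-merge idea could be sketched roughly as follows (a hypothetical helper, assuming each database piece was searched with `sourmash search -o chunk_N.csv` and that the result CSVs carry a `similarity` column as in sourmash's search output; file names are illustrative):

```python
# Merge per-chunk search results and keep the best hits overall.
import csv
import glob

def merge_results(pattern, top_n=10):
    """Collect rows from all CSVs matching `pattern`, best similarity first."""
    rows = []
    for path in glob.glob(pattern):
        with open(path, newline="") as fh:
            rows.extend(csv.DictReader(fh))
    # Sort all hits by similarity, highest first, and keep the top N.
    rows.sort(key=lambda r: float(r["similarity"]), reverse=True)
    return rows[:top_n]

if __name__ == "__main__":
    for hit in merge_results("chunk_*.csv", top_n=20):
        print(hit["similarity"], hit["name"])
```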

Thanks,

Jianshu

ctb commented 1 year ago

hi @jianshu93,

responses to a few of your questions - just remind me if I missed something!

On to some practical advice -

If you want to index just a subset of a large database, you can do that with picklists - see docs. Basically, you create one or more CSV files containing the names or identifiers for the subset you want, and then run sourmash index like so:

sourmash index subset.sbt.zip all_signatures.zip --picklist filename.csv:identCol:ident

where all_signatures.zip is the entire database and subset.sbt.zip is the subset SBT you want to build.
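For instance, filename.csv could be a minimal one-column CSV (the column name `identCol` and the accessions below are hypothetical; use whatever column holds your identifiers):

```csv
identCol
GCF_000005845.2
GCF_000006945.2
```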

Partly motivated by your interest, I made some progress this morning on a Rust-based manysearch plugin that builds on sourmash branchwater to do multithreaded searching - here are some stats:

| impl | time | memory | notes |
|------|------|--------|-------|
| sourmash search | 12m 43s | 3.67 GB | single genome x 65k |
| manysearch | 36s | 139 MB | 5 genomes x 65k; 32 threads |

It's not really ready for anyone but me to use yet, and there are a few drawbacks to it, but I will keep you in the loop in this issue as it matures!

(I'm working on it over here)

jianshu93 commented 1 year ago

Thanks for the info. I will try and report back.

Thanks,

Jianshu

ctb commented 1 year ago

hi @jianshu93 the pyo3_branchwater plugin is getting pretty mature - you might be interested in the manysearch and multisearch commands. In particular, you can do 80k x 80k genome comparisons in under 5 GB RAM in 90 minutes on 64 CPUs with multisearch. It's still got some inconveniences compared to the full sourmash CLI, but it's coming along!

jianshu93 commented 1 year ago

Hello Titus, it is really nice to hear about that new command, and I will definitely try it for 80K x 80K.

Thanks,

Jianshu

ctb commented 1 year ago

pyo3_branchwater now supports massively parallel sketching - e.g. all of GTDB rs17 in 40 minutes and 2.7 GB of RAM.

see https://github.com/sourmash-bio/pyo3_branchwater/issues/122 and https://github.com/sourmash-bio/pyo3_branchwater/pull/96#issuecomment-1709190601 for some numbers.

I'm leaving this open because it's not integrated into sourmash yet, tho :). That's coming eventually!

jianshu93 commented 1 year ago

Hello Titus,

This is amazing news, and it seems it is time for me to run a real dataset, e.g., the entire NCBI/RefSeq genome collection (318K). I will get back to you when I have some results.

Jianshu

ctb commented 1 year ago

great! please feel free to post questions here (in this issue tracker) since we monitor this more closely - and you can tag in @bluegenes if you like :)