Open kullrich opened 9 months ago
This is not a detailed answer.
First, the serial code you provided could be easily parallelized with the Python concurrent
or multiprocessing
libraries. GPU is really not the solution here, and I would recommend using the sourmash CLI instead of the API, as the CLI provides the required functionalities.
However, in your case of pairwise comparisons, you might be interested in other solutions provided by sourmash.
You can use sourmash sketch
to sketch your data separately from processing. Then, use sourmash compare
to compare your signatures, which supports parallelization. This will efficiently do the job for you.
Use the sourmash plugin branchwater, where you can find the pairwise
command that is more efficient than sourmash compare
.
Hi, thanks for the answer. I know about the multiprocessing library and I will give the branchwater plugin a try.
Anyhow, since on a GPU one would have more than 1000 cuda cores to calculate the jaccard distances, with CPU one would only have up to 64/128 threads to use in parallel.
Since the comparison is not only for two species but in best would be done on e.g. the ensembl vertebrate species (300 x 300 x 20.000 genes), in the end one would need to be as quick as diamond
or last
to be useful for the community.
hi @kullrich, a few more thoughts for you!
diamond and last do alignments, which might be more appropriate for distant homology/similarity analysis. It's not a use case we've explored specifically, since (at least for the moment) we're mostly focused on comparisons between longer sequences and/or higher similarity.
however, using k-mers has some distinct advantages over n**2 brute force comparisons in certain cases; in particular, you can use inverted indexes in situations with sparse similarities. For example, the branchwater plugin provides an inverted index that we have used to index TBs of metagenome data for realtime search - see branchwater.sourmash.bio for our demo case. @mr-eyes is in fact thinking about introducing that for pairwise
, see https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/219.
extending sourmash to use CUDA is not something we've really thought about in any serious way, I'm afraid.
Hi,
to use pairwise kmer for species comparison is implemeted in pantools
(https://pantools.readthedocs.io/en/stable/)
, see publication here:
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2362-4
super cool, thanks for the reference!
Dear sourmash-bio team, this is a feature request. Unlike nt-genomes I am comparing protein based minhashes. Here, every single protein of one species and its corresponding minhash is compared to all proteins from a second species. The jaccard distance will than be used to do network-based clustering. Anyhow, the CPU based comparison is limited in parallelsation, so my feature request is related to transfer the pairwise jaccard distance calculation via CUDA on a GPU, to speed up species comparison. Could you provide this feature in sourmash?
Thank you in anticipation
Best regards
Kristian
Simple CPU serial code would like like this: