sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

sourmash benchmarking thoughts #725

Open ctb opened 4 years ago

ctb commented 4 years ago

From a Slack conversation:

olga: Is there a benchmarky paper showing sourmash's performance in speed, CPU, memory, and accuracy for contamination detection? I'm trying to convince the nextflow core devs to use it for RNAseq pipelines.

titus: No, not as such. Benchmarking is conceptually challenging 🙂. I value correctness over speed, in general, but once your code is correct, speed becomes important. The choice of C++ underneath limits our flexibility there, which is one reason for moving to Rust. Re contamination detection, we know that the specificity is QUITE good. I'm less positive about sensitivity, especially to things that aren't exactly in the databases.

So it's partly a matter of defining what the specific question is. I do know that sourmash's speed (or lack thereof) has been a blocker for some people using it. Both reviews on our sourmash F1000 paper (https://f1000research.com/articles/8-1006) did highlight the question of how good our containment analysis approach was.

(Paper: "Large-scale sequence comparisons with sourmash" by N. Tessa Pierce, Luiz Irber, Taylor Reiter, Phillip Brooks, and C. Titus Brown, F1000Research.)

We know CPU and memory requirements are quite minimal, algorithmically (and in practice), but the real sales pitch for sourmash is that it is flexible, adaptable, customizable, and developed in a way that's not completely terrible 🙂
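One way to make the sensitivity/specificity question concrete for a benchmark is to score contamination calls against a labeled truth set. Below is a minimal sketch in Python. It does not call sourmash; the sample names and labels are made up for illustration, and `sensitivity_specificity` is a hypothetical helper, not part of any library.

```python
import math


def sensitivity_specificity(truth, calls):
    """Score contamination calls against ground truth.

    truth, calls: dicts mapping sample name -> bool (True = contaminated).
    Samples missing from `calls` are treated as "not called contaminated".
    """
    tp = fp = tn = fn = 0
    for name, is_contaminated in truth.items():
        called = calls.get(name, False)
        if is_contaminated and called:
            tp += 1
        elif is_contaminated and not called:
            fn += 1
        elif not is_contaminated and called:
            fp += 1
        else:
            tn += 1
    # Sensitivity: fraction of truly contaminated samples that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    # Specificity: fraction of clean samples that were correctly left alone.
    specificity = tn / (tn + fp) if (tn + fp) else math.nan
    return sensitivity, specificity


# Hypothetical labels: two contaminated samples, two clean ones.
truth = {"s1": True, "s2": True, "s3": False, "s4": False}
# Hypothetical tool output: one contaminant missed, no false alarms.
calls = {"s1": True, "s2": False, "s3": False, "s4": False}

sens, spec = sensitivity_specificity(truth, calls)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

In a real benchmark, the truth set would need to include contaminants that are absent from the reference databases, since that is the case where sensitivity is expected to be weakest.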

ctb commented 4 years ago

@luizirber is exploring this in his thesis.

The charcoal project is exploring decontamination in bacterial/archaeal genome bins, while the [2020-long-read-assembly-decontam](https://github.com/ctb/2020-long-read-assembly-decontam) project is exploring decontamination in long-read assemblies. I don't think we have any specific RNAseq plans or collaborations, but I see no reason it shouldn't work, I think.