sourmash benchmarking thoughts

from slack conversation:

olga: is there a benchmarky paper showing sourmash’s performance in speed, cpu, memory, accuracy for contamination detection? trying to convince nextflow core devs to use it for rnaseq pipelines

titus (1:33 PM): no, not as such. benchmarking is conceptually challenging :slightly_smiling_face:. I value correctness over speed, in general, but once your code is correct then speed becomes important. Choice of C++ underneath is limiting in terms of flexibility there, which is a reason for moving to Rust. Re contamination detection, we know that the specificity is QUITE good. Less positive about sensitivity, especially to things that aren’t exactly in the databases.

So it’s partly a matter of defining what the specific question is. I do know that sourmash’s speed (or lack thereof) has been a blocker for some using it. Both reviews on our sourmash f1000 paper (https://f1000research.com/articles/8-1006) did highlight the question of how good our containment analysis approach was.

(link preview) F1000Research article: "Large-scale sequence comparisons with sourmash", by N. Tessa Pierce, Luiz Irber, Taylor Reiter, Phillip Brooks, and C. Titus Brown.

we know cpu and memory requirements are quite minimal, algorithmically (and in practice), but the real sales pitch for sourmash is that it is flexible, adaptable, customizable, and developed in a way that’s not completely terrible :slightly_smiling_face:
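
One way to pin down "the specific question" for contamination detection might be to score calls against samples whose contamination status is already known, and report sensitivity and specificity separately. The sketch below is only an illustration in plain Python: the function name, the sample names, and the truth/call values are all hypothetical, and nothing here uses sourmash itself or reports real results.

```python
def sensitivity_specificity(truth, calls):
    """Score contamination calls against known truth.

    truth, calls: dicts mapping sample name -> bool (True = contaminated).
    Returns (sensitivity, specificity); a value is NaN if undefined.
    """
    tp = sum(1 for s in truth if truth[s] and calls[s])          # contaminated, detected
    fn = sum(1 for s in truth if truth[s] and not calls[s])      # contaminated, missed
    tn = sum(1 for s in truth if not truth[s] and not calls[s])  # clean, called clean
    fp = sum(1 for s in truth if not truth[s] and calls[s])      # clean, flagged
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Hypothetical toy data: two contaminated samples, two clean ones.
truth = {"s1": True, "s2": True, "s3": False, "s4": False}
calls = {"s1": True, "s2": False, "s3": False, "s4": False}
sens, spec = sensitivity_specificity(truth, calls)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this toy case nothing clean is flagged but one contaminant is missed, which is the shape of the trade-off described above (good specificity, weaker sensitivity). A real benchmark would also need to decide how contaminants absent from the database are represented in the truth set.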

sourmash benchmarking thoughts #725