sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
466 stars 79 forks source link

minimum number of reads/ bases needed for "good" detection? #923

Open phiweger opened 4 years ago

phiweger commented 4 years ago

not sure this is the right place to ask, but: is there a ball park estimate of how many reads/ bases are at least required to reliably identify an isolate of some bacterium? in terms of coverage, 1x should be sufficient right?

thx!

ctb commented 4 years ago

yep, 1x should be sufficient.

phiweger commented 4 years ago

thx!

ctb commented 4 years ago

wait wait, that was my too-quick answer-by-email. I want to give a better answer! please let me! ;)

(The below is all about scaled signatures. The mash, mash screen, and cmash papers answer the question for regular MinHash.)

specificity

In our experience, specificity with sourmash is largely about k-mer size and database: if the k size is long, and your k-mer database contains a match to your query, and the database is correct, then the match is real. In other words, for k=31 and k=51, species and strain level matches are accurate.

sensitivity

There are two different questions to ask about sensitivity. The first is genome size. If the genome is small relative to your scaled value, e.g. scaled=10000 and genome=5000, you're not likely to see a match no matter how deeply you sequence. This notebook shows you that Poisson statistics apply; the graphs are for scaled=1000 and genome sizes of 5000 and 10,000 respectively, and show you how many genomes of each size have how many hashes chosen for scaled=1000. tl;dr for a genome size of 5000, approx 99.3% of genomes have at least one hash when using a scaled=1000. (Conveniently since Poisson distributions shift linearly, you can adjust the stats by multiplying both numbers: if you are using a scaled of 2000, then 99.3% of 10kb long genomes will have at least one hash.)

The second question for sensitivity is the one you asked - how much sequencing? Assuming the matching genome is in the database, and the genome size is big enough that it has at least one hash, then I think similar statistics to the first sensitivity question apply: what's the likelihood that that hash has beens sampled? Well, it should be 99.3% for the first 5000 "correct" kmers, if you are using a scaled of 1000.

Given modern shotgun sequencing capacity, you should have enough k-mers, so it'll mostly be about whether your genome is in the database and the database is correct.

in summary

The major limitation to sensitivity in sourmash scaled matching is (1) database correctness and (2) genome size, not (3) sequencing depth.

appendix: benchmarks

You can look at @luizirber https://github.com/luizirber/2020-cami repository for some benchmark results.

ctb commented 4 years ago

a colleague asked why I keep on saying "if the database is correct". here's my response --

because the database often ISN’T, of course? :slightly_smiling_face: there are lots of reasons it might not be. the first is that many microbial genomes are still unknown. the second is that an increasing number of bacterial and archaeal genomes (and soon euk genomes) are known only from bioinformatics reconstruction from metagenomes, and we have well-informed suspicions that those, too, are substantially incomplete and/or contaminated.