sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
471 stars 80 forks source link

scaled size and match size? #2548

Open dodoflyy opened 1 year ago

dodoflyy commented 1 year ago

Hello, database docs said:

The signatures were calculated with a scaled of 1000, which robustly supports searches for ~10kb or larger matches.

I want to know how "~10kb" were estimated, and does this means genomes like some virus smaller than 10kb may be false classification/taxonomy?

ctb commented 1 year ago

hi @GuoYu-Peng, the math is straightforward (but not written down clearly anywhere yet) - here's my best effort :)

We can model the FracMinHash sketch algorithm as follows: for a stream of randomly generated k-mers (with k sufficiently large that the k-mers are effectively sampled without replacement (e.g. k=21)), FracMinHash with scaled=S adds hashes to the sketch at an average rate of one in every S k-mers.

The process of adding hashes to the sketch is a Poisson process, since for any given number of k-mers a discrete non-negative number of k-mers will be retained in the sketch, the events occur independently, and at most one k-mer is hashed at each time point.

Thus the distribution of output sketch sizes for a given number of input k-mers M is a Poisson distribution with lambda=M/S.

For genome comparison purposes, we want to know the average runlength of k-mers for which no hashes will be retained; this will inform the choice of the scaled parameter S for sensitively detecting overlaps of a size M k-mers. The average runlength can be easily be evaluated for random sequences (or, in genomic parlance, high complexity / non-repetitive genomic sequence) by calculating the probability of generating a sketch of size 0 given M k-mers.

Using Poisson statistics, this is:

Poisson(k=0, lambda=M / S) = exp(- M / S)

Then the probability of at least one hash being generated for a collection of k-mers M in size is:

1 - exp(-M / S)

and we see that this matches empirical observations using sourmash.


to answer your second question: small genomes will simply not be detected, so the risk is low sensitivity, not false classification.

Happy to answer more questions or clarify, let us know!