dodoflyy opened this issue 1 year ago
hi @GuoYu-Peng, the math is straightforward (but not written down clearly anywhere yet) - here's my best effort :)
We can model the FracMinHash sketch algorithm as follows: consider a stream of randomly generated k-mers, with k large enough (e.g. k=21) that the k-mers are effectively sampled without replacement. FracMinHash with scaled=S then adds hashes to the sketch at an average rate of one per S k-mers.
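The sampling rule can be sketched in a few lines. This is an illustrative mock, not sourmash's actual implementation: real sourmash uses a MurmurHash-based k-mer hash, while here SHA-1 stands in as an arbitrary uniform hash.

```python
import hashlib

MAX_HASH = 2**64  # hash values assumed uniform on [0, MAX_HASH)

def hash_kmer(kmer):
    # Illustrative stand-in for sourmash's MurmurHash-based k-mer hash;
    # any hash that is approximately uniform on [0, MAX_HASH) will do.
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")

def keep_in_sketch(kmer, scaled):
    # FracMinHash retains a k-mer iff its hash falls below MAX_HASH / scaled,
    # so on average one in every `scaled` distinct k-mers is kept.
    return hash_kmer(kmer) < MAX_HASH // scaled
```

Because the decision depends only on the k-mer's hash value, the same k-mer is kept or dropped consistently across datasets, which is what makes sketches with the same scaled value directly comparable.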
Adding hashes to the sketch is well modeled as a Poisson process: each k-mer is retained independently with probability 1/S, at most one hash is added per k-mer, and the number of hashes retained over any stretch of the stream is a discrete non-negative count.
Thus the distribution of output sketch sizes for a given number of input k-mers M is a Poisson distribution with lambda=M/S.
For genome comparison purposes, we want to know the average run length of k-mers for which no hashes will be retained; this informs the choice of the scaled parameter S for sensitively detecting overlaps of size M k-mers. The average run length can easily be evaluated for random sequences (or, in genomic parlance, high complexity / non-repetitive genomic sequence) by calculating the probability of generating a sketch of size 0 given M k-mers.
Using Poisson statistics, this is:
Poisson(k=0, lambda=M / S) = exp(- M / S)
Then the probability of at least one hash being generated for a collection of k-mers M in size is:
1 - exp(-M / S)
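This formula is easy to turn into a helper for picking S (a sketch of my own, not a sourmash API):

```python
import math

def prob_detect(M, S):
    # Probability that at least one hash from M k-mers lands in the sketch:
    # 1 - P(sketch size == 0) = 1 - exp(-M/S).
    return 1.0 - math.exp(-M / S)

# For example, with scaled=1000 a 10 kb overlap is detected with probability
# 1 - exp(-10), i.e. >99.99%, while a 500 bp overlap is detected only
# 1 - exp(-0.5), i.e. ~39% of the time.
```

So as a rule of thumb, choose S several times smaller than the smallest overlap you want to detect reliably.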
and we see that this matches empirical observations using sourmash.
to answer your second question: small genomes will simply not be detected, so the risk is low sensitivity, not false classification.
Happy to answer more questions or clarify, let us know!
Hello, database docs said: