Open ctb opened 1 year ago
questions in slack -
One question though, if I were to set $N=H/scaled-1$ would I receive the same result in both operations?
No - the n is the number kept , while scaled refers to the fraction of hashes kept. To make them (approx) equal you would need to know how many distinct kmers were in the original set. for example if you sketched every kmer possible, you would end up with H/s hashes kept for frac. But still only N for Minhash.
I think I understand your point. Frac is shrinking (scaling) the entire Hash space while Min is setting an arbitrary maximum to the number of hashes within the hash space. Is it possible to set the arbitrary maximum to the same value as the shrunk Hash Space of Frac?
yes, approximately, as long as you know how many k-mers you have total - then you can estimate. an alternative way to think about that is to say that a FracMinHash with a length L can always be converted into a MinHash of size L :man-shrugging: . and you can always take a MinHash with a size of N, pick the largest hash M that is in it, and then say “this is equivalent to a FracMinHash with scaled = H/M” that’s what https://sourmash.readthedocs.io/en/latest/api-example.html?highlight=conversion#converting-between-num-and-scaled-signatures is trying to say
https://hackmd.io/RScxLBU-SCiEjdNHrJql5w
FracMinHash vs MinHash pseudocode
FracMinHash
Here the
scaled
parameter is tunable, andH
is the size of the hash space (e.g. 2**64).Note, no hashval >=
H/scaled
is ever kept.MinHash
Here
N
is tunable - the number of kept hashes.Note,
len(sketch)
should never be > N.