sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
455 stars 78 forks source link

add Python pseudocode for MinHash and FracMinHash into docs #2253

Open ctb opened 1 year ago

ctb commented 1 year ago

https://hackmd.io/RScxLBU-SCiEjdNHrJql5w

FracMinHash vs MinHash pseudocode

FracMinHash

Here the scaled parameter is tunable, and H is the size of the hash space (e.g. 2**64).

Note, no hashval >= H/scaled is ever kept.

sketch = set()
for kmer in kmers:
    if hash(kmer) < H/scaled:
        sketch.add(kmer)

MinHash

Here N is tunable - the number of kept hashes.

Note, len(sketch) should never be > N.

sketch = set()
for kmer in kmers:
    hashval = hash(kmer)

    # is the sketch full? maybe remove one.
    if len(sketch) == N:
        if hashval < max(sketch):
            sketch.remove(max(sketch))

    # not yet full, or we removed one - add hashval.
    if len(sketch) < N:
        sketch.add(hashval)
ctb commented 1 year ago

questions in slack -

One question though, if I were to set $N=H/scaled-1$ would I receive the same result in both operations?

No - the n is the number kept , while scaled refers to the fraction of hashes kept. To make them (approx) equal you would need to know how many distinct kmers were in the original set. for example if you sketched every kmer possible, you would end up with H/s hashes kept for frac. But still only N for Minhash.

I think I understand your point. Frac is shrinking (scaling) the entire Hash space while Min is setting an arbitrary maximum to the number of hashes within the hash space. Is it possible to set the arbitrary maximum to the same value as the shrunk Hash Space of Frac?

yes, approximately, as long as you know how many k-mers you have total - then you can estimate. an alternative way to think about that is to say that a FracMinHash with a length L can always be converted into a MinHash of size L :man-shrugging: . and you can always take a MinHash with a size of N, pick the largest hash M that is in it, and then say “this is equivalent to a FracMinHash with scaled = H/M” that’s what https://sourmash.readthedocs.io/en/latest/api-example.html?highlight=conversion#converting-between-num-and-scaled-signatures is trying to say