sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
472 stars 80 forks source link

Can we explore refinements to the MinHash implementation? #1230

Open swamidass opened 3 years ago

swamidass commented 3 years ago

There are two refinements I'd like to explore with sourmash, which might improve on the current MinHash implementation.

  1. Count vectors can be used to estimate overlap between sets with more efficiency: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4340911/

  2. There is a fairly straight-forward correction that can applied to MinHash similarity that should improve overlap estimates: http://www.igb.uci.edu/~pfbaldi/publications/journals/2007/ci600526a.pdf

I would like to know the interest level in pursuing these. I'm an academic, and interested in publishing. Genome comparisons are not my normal area. Perhaps a collaboration might be interesting to someone more in that field.

CTB note: updated links

https://pubs.acs.org/doi/abs/10.1021/ci600526a https://pubmed.ncbi.nlm.nih.gov/25714898/

ctb commented 3 years ago

hi! thank you very much for posting these links!

I haven't had a chance to read them thoroughly yet, but I wanted to drop a note in here to say that we are using a non-standard MinHash for most of our sourmash work - Scaled MinHashes. This is discussed at length in @luizirber PhD thesis, and is also the subject of a (still draft) paper.

A quick skim of the second paper suggests that this approach might also work for Scaled MinHash, but that's an uninformed opinion at this point :)

This is also a topic that @dkoslicki might be interested in!

luizirber commented 3 years ago

There are two refinements I'd like to explore with sourmash, which might improve on the current MinHash implementation.

1. Count vectors can be used to estimate overlap between sets with more efficiency: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4340911/

I recently added a HyperLogLog to sourmash using the maximum-likelihood estimators from "New cardinality estimation algorithms for HyperLogLog sketches", which are also using the counts for estimating overlap (the "Joint MLE" in the paper).

In the MinHash case, is the suggestion to use the hash abundances for estimating overlap?

2. There is a fairly straight-forward correction that can applied to MinHash similarity that should improve overlap estimates: http://www.igb.uci.edu/~pfbaldi/publications/journals/2007/ci600526a.pdf

I would like to know the interest level in pursuing these. I'm an academic, and interested in publishing. Genome comparisons are not my normal area. Perhaps a collaboration might be interesting to someone more in that field.

(I need to read both papers more deeply, but so cool to see similar problems in other fields =])

swamidass commented 3 years ago

I'm glad to hear there is openness on this. I do think both I suggested are very easy to implement. Once you get a chance to read either paper more closely let me know.

swamidass commented 3 years ago

In the MinHash case, is the suggestion to use the hash abundances for estimating overlap?

Not exactly. The suggestion is to use eq. 12, 16 and 25 from this paper: http://www.igb.uci.edu/~pfbaldi/publications/journals/2007/ci600526a.pdf.

These formulas correct for over-saturation of the bloom vector. It should give you much more accurate estimates of the total overlap.

swamidass commented 3 years ago

Let me know if you need any information on this. It should be very easy to implement.