marbl / MashMap

A fast approximate aligner for long DNA sequences
Other
265 stars 39 forks source link

bottom K sketch time complexity #57

Closed jianshu93 closed 1 year ago

jianshu93 commented 1 year ago

Dear MashMap team,

The bottom-k sketch used in both MashMap2 and MashMap3, it's time complexity is still O(dk), where d is the number of elements in set while k is the number of MinHashes or hash functions. I read that in the newest paper, each kmer is only need to be hashed once, since there will be many hash functions, this will still be O(dk) right. I am comparing to the original Minhash.

Thanks,

Jianshu

bkille commented 1 year ago

Hi @jianshu93,

Minhash can be implemented in a number of ways. See this review by E. Cohen for a really nice overview. In particular, the minhash you are referencing is the k-mins sketch.

As opposed to the k-mins sketch, the bottom-k sketch uses a single hash function and samples the lowest k values. A bottom-k sketch of a single window of w elements therefore requires O( w log( k )) time. Computing the reference index requires sliding the bottom-k sketch of our length w window over_ a reference of n elements, and since we can update the sliding window in log( w ) time, this takes O( n log( w )) time total. There is an additional step that sorts the sampled minmers, but in practice it takes less time than the sampling.

jianshu93 commented 1 year ago

Hello @bkille,

So it is essentially O(nlog(w)w*log(k)) for n elements in a genome (in the case of genome genome comparison), and if we do it in a reciprocal way, should be double of the big O annotation mentioned. In the case of ANI, w is always 3000 elements in a window, k is always 10,000 to be accurate at high than 99% ANI (in Mash at lease), this should be much smaller in this case, maybe only several hundred. I am thinking, with new MinHash algorithms, instead of the bottom k sketch idea, 2 set with n elements, k minhashes or hash functions, it is only O(n+k), it is much faster than the bottom-k sketch idea.

Jianshu

bkille commented 1 year ago

Hi @jianshu93,

To avoid confusion with k-mer size, lets let s denote the sketch size. I'll also use n to refer to the reference genome length and m to refer to the query genome length.

While sketching a single window from scratch would require O(w log( s )) time, sliding an existing sketch one k-mer to the right only requires O(log(w)) time. Therefore, the complexity of constructing the reference index is O( n log(w)), and the complexity of sketching for each query segment is O( w log(s)). Since we have m / w segments, we end up with O(m log (s )) time complexity for the query sketching.

jianshu93 commented 1 year ago

Hello @bkille,

This is clear to me now. Many thanks for the detailed explanation. For a genome pair, we end up with query sketching + reference index, which is O(n log(w) + m log(s)), and when n and m is very close to each other for a pair, it should then be O(n (log(w)+log(s)). In the case of one genome to one genome comparison, e.g., find best hits in query genome (after chopping into fragements) in reference genome and estimate identity, what is big O for bottom sketch, k-min is O(n*k), where n is genome size and k is number of sketches. Can it be better than O(n + k)?

Another question is space complexity, I feel like in the case of many pairs to many pairs genome comparisons, both minimizer and minimer is very memory inefficient. It has something to do with FastANI implementation but I also noticed that MashMap3 is even worse than MashMap2. 1000 bacterial genomes agains 1000 bacterial genome, it will need more than 300G memory but those 2 collection of genomes is only like 4G or something, while MinHash like ones is very memory efficient than winnowing idea. Thinking on using hyperloglog like ones, despite less accurate than MinHash like ones.

Thanks,

Jianshu

bkille commented 1 year ago

Hi @jianshu93,

Hmmm... MashMap3 should not require 300gb of memory for a reference of that size. For example, indexing the human genome w/ a seed density of ~0.02 (sketch size 100, segment length 5000) can be done with ~12gb.

Could you share the command you're using to run MashMap3?

jianshu93 commented 1 year ago

Hello @bkille ,

I was using FastANI, essentially the idea is the same but just have one best target position mapped in the reference and then calculate identity. It should be also related to openMP parallelization in the fastANI implementation. You can try fastANI with 1000 genomes, all versus all mapping/calculation

Jianshu

bkille commented 1 year ago

Hi @jianshu93,

This seems like more of an issue w/ FastANI. My guess is that at the low ANI cutoff that FastANI uses (80/85%), it is finding and saving many candidate mappings in the MashMap step which is taking up significant amounts of memory. I have seen this happen w/ MashMap2, but that has been fixed by the probabilistic filter which MashMap3 uses.

bkille commented 1 year ago

Hi @jianshu93,

I'm closing this issue, but please feel free to contact me at my email if you want to discuss this further!

Best, Bryce

jianshu93 commented 11 months ago

Hello Bruce,

In the case of bottom-k sketch, will the variance be larger than the k-min implementation? I can imagine that using m hash functions (which are all a family of min-wise independent hash functions) and get the min value is different from that using just one hash function but store it and sort for all kmers later on to get the min value, especially when the set size is very different (not a problem for minmer-jaccard but for overall jaccard without using Minmer). Is there any theoretical analysis that this will not lose accuracy comparison to k-min one, with respect to different set size?

Thanks,

Jianshu

bkille commented 11 months ago

Hi Jianshu,

The bottom-k sketch has a variance which is at most the variance of the k-mins sketch. Here are some helpful slides showing a comparison between the two.

jianshu93 commented 11 months ago

Hello Bruce, Thanks for response but it seems the link is empty.

Jianshu

bkille commented 11 months ago

Ahh sorry! I've updated the hyperlink.

jianshu93 commented 11 months ago

Thanks! this is very clear now. But it remains not clear that the locality sensitive hashing property is preserved in bottom sketch (it is not because with one hash function, many position will be empty, thus not informative for LSH), it is one of the most important property for many downstream applications.

Jianshu

bkille commented 11 months ago

Hi @jianshu93,

I'm not sure I understand what you mean by "many positions will be empty," maybe we could meet over Zoom at some point to discuss in more detail?

jianshu93 commented 10 months ago

hello Bruce,

Sorry for the late response. I would be happy to talk more about Minhash et.al., my email is jianshu.zhao@gatech.edu

Thanks,

Jianshu