sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
466 stars 79 forks source link

Within signature distance? #33

Open anmwinter opened 8 years ago

anmwinter commented 8 years ago

I was wondering if it was worth looking at within signature differences to get an idea of the diversity of hashes? i.e using Jaccard, Hammings or Lenvestein metrics.

I need to check my understanding of how this whole hashing thing works for a metagenome: 1) For a given set of sequences the file is read in and we get kmers back for each sequence. 2) Each set of kmers is hashed into a signature. Are the kmers grouped before or after hashing? 3) Hashes from two signatures are compared and from the union we can calculate the Jaccard distance based on shared hashes.

I hacked together a BK-Tree script from someone else's code to look at which hashes might be similar within a given distance.

nmb85 commented 4 years ago

Sorry to raise an old issue from the dead, but this is the only mention of my question that I could find. Do the distances between hashes, i.e., the edit (hamming, levenshtein, etc) distance between, for example, 16571421240 <-> 111660087188, contain any underlying kmer <-> kmer distance information? It would be interesting to build a tree like suggested and use that for a unifrac distance calculation on a collection of signatures.

ctb commented 3 years ago

wow, this IS an old issue!

um, there is no k-mer <-> k-mer edit distance information available, I'm afraid. The hashing function explicitly breaks any link between similar k-mers, in fact!

ctb commented 3 years ago

to answer @anmwinter question --

first, major apologies for taking so long!

second, any metric that can operate on k-mers should work with k-mer hashes from scaled signatures. we've looked at things like accumulation curves (see comment) and and also @taylorreiter has thought a bit about diversity metrics for collections of k-mers. Happy to chat further if you have a specific calculation to do!

nmb85 commented 3 years ago

@taylorreiter (and anyone else, haha!) When you have a moment (I realize that you and your team are busy with your main projects), I was wondering what you thought about this:

Could one build a tree like the BK Tree mentioned by @anmwinter, place minhash abundances on that tree, and use the earth mover's distance to get a diversity-informed distance metric between minhash sets? Earth mover's distance is equivalent to unifrac, I think...

Regardless of whether or not it is phylogenetically informative, it still would give a more information-rich distance metric between datasets than Jaccard, right?

taylorreiter commented 3 years ago

I don't have an intuition about whether this would work or not. Can you describe a little more how this would be more information-rich than Jaccard? Jaccard is based on presence absence of the hash, so presumably this has more information because it accounts for abundances?

As Titus said above, if it works on k-mers, it should work on minhashes. It might be worth exploring a proof of concept in k-mer space?

nmb85 commented 3 years ago

Hi, @taylorreiter , thanks very much for your thoughts! First re your questions:

Can you describe a little more how this would be more information-rich than Jaccard?

Sure, as far as I understand it and in analogy only (I'm not fluent in symbolic math, to my chagrin): whereas Jaccard Distance tells you how many kmers/hashes are shared between samples, Earth Mover's Distance (EMD) would tell you how many kmers/hashes "piled up" at defined locations in a 2-D area (think of it as mounds, ponds, and slopes in your backyard: a landscape) that constitute one sample need to be moved to form the same landscape in another sample.

So, to give a practical example that you're probably familiar with, and please ignore if you already know this: in the case of 16s rRNA sequences, earth mover's distance is the same as the weighted unifrac distance. Unifrac is based on the phylogenetic distribution of the sequences across a phylogenetic tree (which incorporates not just the classification of the kmer/sequence but also the phylogenetic distance between the kmers/sequences). The phylogenetic tree that is formed from all samples from which measurements will be made is analogous to the 2-D area across which "dirt" is distributed in samples measured by EMD. Weighted unifrac is the most information-rich distance metric that I'm aware of for phylogenetic trees that contain count information.

With hashes, you can't build a phylogenetic tree of hash distances because hashes don't contain phylogenetic information per @ctb 's comment above - however - if there is still true diversity information contained in those hashes, i.e., hash similarity is proportional somehow to the similarity between the kmers represented by those hashes (which is something I don't know, and my main question to you), then the earth mover's distance of hash sets shared by samples would add at least some information about the similarity between the diversity distribution of the samples. So, as @anmwinter suggested hierarchically clustering hashes by their pairwise edit distances, it struck me that you may be able to use the diversity information contained in the BK-tree to inform an EMD between hash sets (analogously to phylogenetic trees + weighted unifrac for 16S sequences). That would provide a much more information-rich distance metric between samples than Jaccard.

Jaccard is based on presence absence of the hash, so presumably this has more information because it accounts for abundances?

Yes, plus it accounts for the difference in the structure of diversity between two samples, which is an additional source of information.

It might be worth exploring a proof of concept in k-mer space?

Would hierarchically clustering kmers like that be computationally practical?

So, just to clarify, my main question to you in this post is: "Is hash similarity proportional somehow to the similarity between the kmers represented by those hashes?"

Thanks, Taylor!

luizirber commented 3 years ago

Just dropping a quick note that we actually have hash counts too, and that might be useful information to use...

(Sorry for short message, will comment more next week)

nmb85 commented 3 years ago

Just dropping a quick note that we actually have hash counts too, and that might be useful information to use...

@luizirber For sure! I think count data is required for calculating the earth mover's distance (aka Wassertein or Kantorovich–Rubinstein).

An alternative distance measure that is still more informative than the simple Jaccard distance, if a tree representing hash diversity is not a useful measure of diversity, is the weighted Jaccard and is proposed in this paper on the Order MinHash (OMH). Have you heard of this type of hash? It wouldn't be as informative as EMD, but it might be a nice option if the EMD is impossible to calculate from hashes.

luizirber commented 3 years ago

I played with the weighted Jaccard in another codebase, and it is fairly straighforward to do the same for Scaled MinHash sketches. And while WJ is not well defined for regular MinHash (you would need something like CWS for it), it does work for Scaled MinHash sketches.

(deeper analysis on why WJ works for Scaled MinHash sketches pending =])

ctb commented 3 years ago

(note that we have some empirical evidence now that hash abundances are properly representative - see these (not necessarily permanent... sorry...) links, figure 8 -- hash abundance correlates with mapping for mock and high-coverage real data sets.

https://ctb.github.io/2020-grist-examples/reports/report-SRR606249.html https://ctb.github.io/2020-grist-examples/reports/report-p8808mo11.html

These are for sourmash gather output, not sourmash --containment output, but I can produce them for the latter easily enough should there be interest.

ctb commented 3 years ago

(key figures screenshotted for futureproofing -- )

Screen Shot 2020-11-07 at 7 11 01 AM Screen Shot 2020-11-07 at 7 10 55 AM