sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

thoughts on comparing metagenomes for "similarity" #2735

Open ctb opened 1 year ago

ctb commented 1 year ago

here are some thoughts I wrote up for a collaborator, thought I'd share broadly -

The default FracMinHash approach will yield overlap estimates when applied to two metagenomes. In technical terms this is a (quite good) estimate of the number of k-mers/De Bruijn graph nodes shared between the two metagenomes. Loosely translated into bioinformatics speak, though, overlap means “if we assembled both metagenomes perfectly, this number of bases would align at ~99% average nucleotide identity.” (Biologists/bioinformaticians find the second statement a bit more intuitive, I think?)
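To make the overlap estimate concrete, here is a minimal pure-Python sketch of the FracMinHash idea. It is not sourmash's implementation: the `hash64` helper, the forward-strand-only k-mers, and the toy `k` and `scaled` values are all assumptions chosen for illustration.

```python
import hashlib
import random


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, a simplification)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """Stand-in 64-bit hash (assumption; sourmash uses its own hash function)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq, k=21, scaled=10):
    """Keep only hashes in the lowest 1/scaled of the 64-bit hash space."""
    max_hash = 2**64 // scaled
    return {h for h in map(hash64, kmers(seq, k)) if h < max_hash}


def estimated_overlap(sketch_a, sketch_b, scaled):
    """Scale the shared-hash count back up to an estimated shared k-mer count."""
    return len(sketch_a & sketch_b) * scaled


rng = random.Random(0)  # fixed seed so the toy example is reproducible


def random_dna(n):
    return "".join(rng.choice("ACGT") for _ in range(n))


# Two toy "metagenomes" that share a 2000 bp stretch.
shared = random_dna(2000)
sample_a = shared + random_dna(1000)
sample_b = random_dna(1000) + shared

scaled = 10
estimate = estimated_overlap(
    frac_minhash(sample_a, scaled=scaled),
    frac_minhash(sample_b, scaled=scaled),
    scaled,
)
exact = len(set(kmers(sample_a, 21)) & set(kmers(sample_b, 21)))
print(f"estimated shared k-mers: {estimate}, exact: {exact}")
```

With `scaled=1` every hash is kept, so the estimate matches the exact shared k-mer count; larger `scaled` values trade accuracy for smaller sketches.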

You can normalize that overlap a few ways. If you normalize it by the total number of distinct k-mers across both metagenomes, you get a Jaccard similarity estimate (one minus that is the Jaccard distance, which is a proper distance metric). I tend to think this isn’t useful for metagenomes, because of a few things:

So Jaccard in particular is a fraught metric - it tends to underestimate the things we care about. We have other comparison outputs including max containment, happy to chat more about this.
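As a rough illustration of how the normalizations differ, here is a small sketch on plain Python sets standing in for hash sketches. The function names are hypothetical, and the size-mismatch scenario is just one assumed example of Jaccard looking low, not necessarily the situation meant above.

```python
def jaccard(a, b):
    """Overlap divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Fraction of a's hashes that are also present in b."""
    return len(a & b) / len(a) if a else 0.0


def max_containment(a, b):
    """Overlap divided by the size of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# A small sample entirely contained in a much larger one.
small = set(range(100))
large = set(range(10_000))

print("jaccard:", jaccard(small, large))
print("containment of small in large:", containment(small, large))
print("containment of large in small:", containment(large, small))
print("max containment:", max_containment(small, large))
```

Here the small sample is fully contained in the large one, yet Jaccard is only 0.01 because the union is dominated by the larger set; max containment reports 1.0 for the same pair.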

We honestly haven’t figured out a good way of talking about shotgun metagenome content comparison.

A few ideas -

Abundance filtering is easy to do computationally. Common hash filtering and min set covers are much harder.
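A minimal sketch of why abundance filtering is cheap, assuming it means dropping hashes seen fewer than some minimum number of times. The `abundance_filter` name and the `min_abund` threshold are hypothetical, and the hash values are toy integers.

```python
from collections import Counter


def abundance_filter(abunds, min_abund=2):
    """Keep only hashes observed at least min_abund times (single pass)."""
    return {h: count for h, count in abunds.items() if count >= min_abund}


# Toy abundance-tracked sketch: hash value -> number of times it was seen.
abunds = Counter([11, 11, 42, 97, 97, 97])
filtered = abundance_filter(abunds, min_abund=2)
print(filtered)
```

The filter is one pass over a single sample's hashes. Filtering hashes common across many samples, or computing a minimum set cover, needs information from all samples at once, which is presumably part of why those are harder.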


It is also relatively straightforward to compare genome/species-level gather catalogs for metagenomes. However, this will privilege non-environmental data sets.

ctb commented 1 year ago

my favorite response to this came from @mr-eyes -

[ You ... ] should include metadata-based clustering instead of working on many unrelated metagenomes. A specific metric (distance/containment) or threshold cannot get even clusters. Metagenomes are spaghetti, and ... [working with them at large scale is challenging].

ctb commented 11 months ago

ref https://github.com/sourmash-bio/sourmash/issues/1135