thoughts on comparing metagenomes for "similarity"

ctb commented 1 year ago

here are some thoughts I wrote up for a collaborator, thought I'd share broadly -

The default FracMinHash approach will yield overlap estimates when applied to two metagenomes. In technical terms this is a (quite good) estimate of the number of k-mers/De Bruijn graph nodes shared between the two metagenomes. Loosely translated into bioinformatics speak, though, overlap means “if we assembled both metagenomes perfectly, this number of bases would align at ~99% average nucleotide identity.” (Biologists/bioinformaticians find the second statement a bit more intuitive, I think?)
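
To make the overlap estimate concrete, here is a minimal sketch of the FracMinHash idea in plain Python. It is not sourmash's implementation: the SHA-256 based hash is a stand-in I chose for illustration, reverse-complement (canonical) k-mers are ignored, and the names `frac_minhash` and `estimate_shared_kmers` are hypothetical helpers, not library functions.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space used below


def kmers(seq, k):
    """Yield every k-mer (substring of length k) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash, assumed for this sketch)."""
    digest = hashlib.sha256(kmer.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def frac_minhash(seq, k=21, scaled=10):
    """Keep only hashes in the lowest 1/scaled fraction of the hash space."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def estimate_shared_kmers(sketch_a, sketch_b, scaled):
    """Estimate the number of shared k-mers from the shared hashes."""
    return len(sketch_a & sketch_b) * scaled


if __name__ == "__main__":
    rng = random.Random(42)
    shared = "".join(rng.choice("ACGT") for _ in range(5000))
    only_a = "".join(rng.choice("ACGT") for _ in range(5000))
    only_b = "".join(rng.choice("ACGT") for _ in range(5000))
    a = frac_minhash(shared + only_a)
    b = frac_minhash(shared + only_b)
    # Should land near the ~4980 k-mers in the shared region.
    print(estimate_shared_kmers(a, b, scaled=10))
```

Because both sketches keep the same fixed fraction of the hash space, the shared hashes are a random sample of the shared k-mers, which is why multiplying by `scaled` gives an overlap estimate.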

You can normalize that overlap a few ways. If you normalize it by the total number of distinct k-mers across both metagenomes (the union), you get Jaccard similarity estimates; one minus that is the Jaccard distance, which is a distance metric. I tend to think this isn’t useful for metagenomes, because of a few things:

- typically only metagenomes that have common origins / share a lot of starting material would have content mostly in common
- the total number of k-mers in a data set depends strongly on sequencing depth
- metagenomes may have strong functional similarity and still have low Jaccard similarity
- metagenomes may share a lot of genomic content while still having low Jaccard similarity

So Jaccard in particular is a fraught metric - it tends to underestimate the things we care about. We have other comparison outputs including max containment, happy to chat more about this.
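
A toy example of why Jaccard underestimates here, using plain Python sets of integers as stand-ins for hash sketches. The function names are mine for illustration, not the sourmash API. The "shallow" sample is entirely contained in the "deep" one, yet Jaccard is low simply because the deep sample has many more k-mers.

```python
def jaccard(a, b):
    """Shared hashes divided by the union of both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Fraction of a's hashes that are also found in b."""
    return len(a & b) / len(a) if a else 0.0


def max_containment(a, b):
    """Containment relative to the smaller set, whichever direction is larger."""
    return max(containment(a, b), containment(b, a))


# Toy data: a shallowly sequenced sample that is a subset of a deep one.
shallow = set(range(100))
deep = set(range(1000))

print(jaccard(shallow, deep))          # 0.1
print(containment(shallow, deep))      # 1.0
print(containment(deep, shallow))      # 0.1
print(max_containment(shallow, deep))  # 1.0
```

Containment is asymmetric, so it is not a distance metric, but max containment at least reports that everything in the smaller sample was seen in the larger one.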

We honestly haven’t figured out a good way of talking about shotgun metagenome content comparison.

A few ideas -

- abundance filtering within a data set (see sourmash sig filter). This tends to be fragile with respect to sequencing depth (results change a lot with depth) - samples sequenced at low depth just vanish.
- common hash filtering a la https://github.com/ctb/sourmash_plugin_commonhash - look for (even low abundance) k-mers that are shared across data sets. This gets rid of unique/incredibly rare and/or erroneous hashes. Not sure how to do this for 1m samples on an SRA scale😂
- defining minimum or approximate set covers, wherein we find a “representative” subset of metagenomes that covers (most of) the content of a bunch more metagenomes. Again, not sure how to do this for 1m samples.
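
The first two ideas can be sketched on toy data as follows. This assumes each sketch is a dict mapping hash to abundance; the helper names and the `min_abund` / `min_samples` parameters are hypothetical, and this is not a description of how `sourmash sig filter` or the commonhash plugin are implemented.

```python
from collections import Counter


def abundance_filter(sketch, min_abund=2):
    """Drop hashes seen fewer than min_abund times within one sample."""
    return {h: n for h, n in sketch.items() if n >= min_abund}


def common_hash_filter(sketches, min_samples=2):
    """Keep hashes present in at least min_samples samples, at any abundance."""
    seen = Counter()
    for sketch in sketches:
        seen.update(sketch.keys())  # each hash counts once per sample
    keep = {h for h, count in seen.items() if count >= min_samples}
    return [{h: n for h, n in s.items() if h in keep} for s in sketches]


# Toy data: the same community sequenced deeply and shallowly.
deep = {1: 10, 2: 8, 3: 1}
shallow = {1: 1, 2: 1, 4: 1}

print(abundance_filter(deep))     # {1: 10, 2: 8}
print(abundance_filter(shallow))  # {} - the shallow sample vanishes
print(common_hash_filter([deep, shallow]))
# [{1: 10, 2: 8}, {1: 1, 2: 1}]
```

The common hash version keeps the shallow sample's real content, but it needs a pass over every sample to count hashes before any filtering can happen, which is the part that gets hard at SRA scale.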

Abundance filtering is easy to do computationally. Common hash filtering and min set covers are much harder.
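
For the set cover idea, one standard approximation (not something we have implemented for this, just the textbook greedy heuristic) is to repeatedly pick the sample that adds the most not-yet-covered hashes. A minimal sketch, where `greedy_cover` and `target_fraction` are hypothetical names and each sample is a plain set of hashes:

```python
def greedy_cover(samples, target_fraction=0.9):
    """Pick samples greedily until target_fraction of all hashes is covered.

    samples: dict mapping sample name -> set of hashes.
    Returns (chosen sample names, fraction of all hashes covered).
    """
    universe = set().union(*samples.values())
    needed = target_fraction * len(universe)
    covered = set()
    chosen = []
    remaining = dict(samples)
    while len(covered) < needed and remaining:
        # Sample contributing the most hashes not yet covered.
        name = max(remaining, key=lambda n: len(remaining[n] - covered))
        gain = remaining.pop(name) - covered
        if not gain:
            break  # nothing left adds new content
        covered |= gain
        chosen.append(name)
    fraction = len(covered) / len(universe) if universe else 1.0
    return chosen, fraction


toy = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {5, 6}, "D": {1, 2}}
print(greedy_cover(toy, target_fraction=1.0))  # (['A', 'C'], 1.0)
```

Each round rescans every remaining sample, so this is roughly quadratic in the number of samples in the worst case, which is fine for a toy and is exactly the problem at 1m samples.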

It is also relatively straightforward to compare genome/species-level gather catalogs for metagenomes. However, this will privilege non-environmental data sets.

ctb commented 1 year ago

my favorite response to this came from @mr-eyes -

[ You ... ] should include metadata-based clustering instead of working on many unrelated metagenomes. A specific metric (distance/containment) or threshold cannot get even clusters. Metagenomes are spaghetti, and ... [working with them at large scale is challenging].

ctb commented 11 months ago

ref https://github.com/sourmash-bio/sourmash/issues/1135

sourmash-bio / sourmash

thoughts on comparing metagenomes for "similarity" #2735