Open ctb opened 1 year ago
my favorite response to this came from @mr-eyes -
[ You ... ] should include metadata-based
clustering instead of working on many unrelated metagenomes. A specific
metric (distance/containment) or threshold cannot get even clusters. Metagenomes are spaghetti, and ... [working with them at large scale is challenging].
here are some thoughts I wrote up for a collaborator, thought I'd share broadly -
The default FracMinHash approach will yield overlap estimates when applied to two metagenomes. In technical terms this is a (quite good) estimate of the number of k-mers/De Bruijn graph nodes shared between the two metagenomes. Loosely translated into bioinformatics speak, though, overlap means “if we assembled both metagenomes perfectly, this number of bases would align at ~99% average nucleotide identity.” (Biologists/bioinformaticians find the second statement a bit more intuitive, I think?)
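To make the overlap estimate concrete, here is a minimal pure-Python sketch of the FracMinHash idea. It is not sourmash's implementation: the helper names (`kmers`, `hash_kmer`, `fracminhash`, `estimated_shared_kmers`) are made up for illustration, BLAKE2b stands in for whatever hash function a real tool uses, and canonical (reverse-complement-aware) k-mers are skipped for brevity.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an illustrative stand-in)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def fracminhash(seq, k=21, scaled=10):
    """Keep only hashes in the bottom 1/scaled of the hash space."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def estimated_shared_kmers(sketch_a, sketch_b, scaled):
    """Each retained hash represents roughly `scaled` k-mers."""
    return len(sketch_a & sketch_b) * scaled


# Two toy "metagenomes" that share one 5 kb region.
rng = random.Random(42)
rand_seq = lambda n: "".join(rng.choice("ACGT") for _ in range(n))
shared = rand_seq(5000)
meta_a = rand_seq(20000) + shared
meta_b = shared + rand_seq(3000)

sk_a = fracminhash(meta_a, scaled=10)
sk_b = fracminhash(meta_b, scaled=10)
print("estimated shared k-mers:", estimated_shared_kmers(sk_a, sk_b, scaled=10))
```

The estimate should land near the number of k-mers in the shared region, and it gets noisier as `scaled` grows relative to the amount of shared sequence.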
You can normalize that overlap a few ways. If you normalize it by the total number of distinct k-mers across both metagenomes, you get a Jaccard similarity estimate; one minus that is the Jaccard distance, which is a proper distance metric. I tend to think this isn’t useful for metagenomes, because of a few things:
So Jaccard in particular is a fraught metric - it tends to underestimate the things we care about. We have other comparison outputs including max containment, happy to chat more about this.
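A toy illustration of why the choice of normalization matters, using plain Python sets of integers as stand-ins for hash sketches. The function names here are hypothetical helpers, not sourmash calls. When a small sample is entirely contained in a much larger one, Jaccard is tiny even though everything in the small sample is shared:

```python
def jaccard(a, b):
    """Overlap normalized by the union of both sketches."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Fraction of a's hashes that are also present in b."""
    return len(a & b) / len(a) if a else 0.0


def max_containment(a, b):
    """Overlap normalized by the smaller of the two sketches."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# A small sample that is a strict subset of a 100x larger one.
small = set(range(100))
large = set(range(10_000))

print(jaccard(small, large))          # 0.01
print(containment(small, large))      # 1.0
print(containment(large, small))      # 0.01
print(max_containment(small, large))  # 1.0
```

Containment is asymmetric, so which direction you report depends on the question being asked; max containment just picks the more generous of the two directions.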
We honestly haven’t figured out a good way of talking about shotgun metagenome content comparison.
A few ideas -
sourmash sig filter
). This tends to be fragile to (changes a lot with) sequencing depth - shallowly sequenced samples just vanish. Abundance filtering is easy to do computationally. Common hash filtering and min set cover are much harder.
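Here is a rough sketch of what abundance filtering does and why shallow samples vanish under it. This is an illustrative stand-in, not how sourmash implements it: `filter_by_abundance` and the toy hash-to-count dictionaries are assumptions made up for the example.

```python
def filter_by_abundance(hash_counts, min_abundance=2):
    """Drop hashes observed fewer than min_abundance times."""
    return {h: c for h, c in hash_counts.items() if c >= min_abundance}


# Toy sketches: hash value -> number of times the k-mer was seen.
deep_sample = {101: 12, 102: 9, 103: 1, 104: 15}   # well-covered sample
shallow_sample = {101: 1, 102: 1, 104: 1}          # same content, ~1x depth

deep_kept = filter_by_abundance(deep_sample, min_abundance=2)
shallow_kept = filter_by_abundance(shallow_sample, min_abundance=2)

print(len(deep_kept), "of", len(deep_sample), "hashes kept (deep)")
print(len(shallow_kept), "of", len(shallow_sample), "hashes kept (shallow)")
```

The shallow sample shares three hashes with the deep one before filtering and none after, which is the depth sensitivity described above.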
It is also relatively straightforward to compare genome/species-level gather catalogs for metagenomes. However, this will privilege non-environmental data sets.