can we take into account shared k-mers between genomes for abundance calculations?

ctb commented 2 years ago

over on twitter, ben woodcroft asked if/when sourmash would report the relative abundance of each genome, taking into account shared k-mers.

this is hard, and we've thought a fair bit about it, and so I wanted to write down some of the thoughts.

the first problem

once you look at well-sequenced species (species with many individual genomes in database) you'll inevitably see that many of the genomes within a species share quite a bit of nucleotide sequence, and hence also share many k-mers. so how can we take that into account when assigning abundances to a specific genome with (e.g.) sourmash gather?

ctb's naivest perspective

this is what gather estimates already, by using a greedy approach to pick out the large collection of hashes unique to each match. average_abund and median_abund then provide the abundance of those hashes in the metagenome data. Our taxonomic benchmarking results suggest that this is a pretty good measure, at least in ~simple situations that we benchmark. These numbers can then be used as a count for downstream relative abundance estimation.

but can we do better?

there are several potential failure modes that I can see -

one is where there is a particularly ugly mixture of strains such that there are no unique bits to pick out and measure. this will mess up our genome assignment approach. I have no idea how frequently this happens and no intuition about it beyond suspecting that it is not a frequent problem, because of the high discriminating ability of 31- and 51-mers (see graph in appendix at bottom of issue), which means that there are generally unique bits to measure.
a less ugly version of the first failure mode is one where there is a big mixture and the unique bits for each strain are very small and do not support sufficiently accurate measurement of abundance. This probably happens, but may be overshadowed by reference database issues - see the next two modes.
another failure mode is one where the "true" strain's genome is not in the database, and the k-mer mismatches b/t database and metagenome lead to inaccurate abundance estimation numbers. This certainly happens in many situations!
a final failure mode is where the "true" strain's genome is shattered amongst the reference genomes, and so k-mers are getting mis-assigned to strains. We suspect this is happening a lot, too.

high level thoughts

any reference-based technique is going to fail for reasons (3) and (4), and we have a lot of anecdata (and some actual data ;) suggesting that this happens a fair bit. mapping based approaches will have more tolerance for mismatches, but not a whole lot more.

for relative abundance calculations, I'm not sure that (3) matters that much, except for issues of detection rate/sensitivity - bias in the numbers should cancel out here, right?

possible solution: detecting strain variation with respect to reference

the [metapalette]() paper implements a really nice approach to detecting the presence of divergent strains by looking at how k-mer-based genome detection shifts with k-mer size. see https://github.com/sourmash-bio/sourmash/issues/267 for some discussion.

we can also now estimate ANI with respect to references directly from k-mers, using sourmash. still thinking about how to make use of that (see #267 and #2226).

another solution: aggregating to species level

@taylorreiter has repeatedly made the point to me there is no ground truth given that the thing in the database is almost never the thing in the metagenome. so instead she (a) summarizes to the species level where that's acceptable, and (b) works at the sequence level when it's not (k-mer, dominating set viz spacegraphcats, or gene). and that's what her phyloseq tutorial does - summarize to species level.

exploratory: use spacegraphcats

we are also using spacegraphcats to tackle (3) and (4) by recruiting accessory elements into the mix, and looking at abundances across and between k-mer collections.

but this is a heavyweight approach because sgc is, well, quite heavyweight. it is nonetheless the least reference-biased :).

other approaches

I haven't fully grokked the implications of it yet, but AGAMEMNON looks like an interesting approach that is salmon-like in its approach.

appendix

per @bluegenes, k-mers are genus, species, and VERY genome specific.

Screen Shot 2022-08-24 at 10 22 34 AM

ctb commented 2 years ago

one followup point - to do the aggregation to species level, you do need to pick a taxonomy. spacegraphcats can somewhat obviate that by aggregating to neighborhood.

ctb commented 2 years ago

lots of discussion in twitter thread: https://twitter.com/ctitusbrown/status/1562491361974046721

ctb commented 2 years ago

I think it would be interesting to look at the RNAseq approach with EM and see how and where it might fail.

Research idea - can we detect unusually complicated strain situations based on failure of an EM model to converge given a particular set of reference genomes?

sourmash-bio / sourmash