predicting the number of strains present in a sample using gather results alone

I just ran gather on some genomes (attaching a snippet of results below). These are all single genomes.

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
5.3 Mbp       44.2%   44.6%       1.1    GCA_006942235.1 Debaryomyces hansenii strain=NRRL Y-7426, Dha_7426
4.3 Mbp       10.0%   10.2%       1.0    GCA_006942205.1 Debaryomyces hansenii strain=J26, Dha_j26
2.1 Mbp        2.0%    1.9%       1.1    GCA_006942325.1 Debaryomyces hansenii strain=NRRL YB-155, Dha_155
4.3 Mbp        0.9%    0.9%       1.1    GCA_006942215.1 Debaryomyces hansenii strain=NRRL YB-398, Dha_398
1.7 Mbp        1.1%    0.7%       1.5    GCA_016097515.1 Debaryomyces hansenii strain=C0-11-Y2, CAU_DHC11_v01_pri
1.8 Mbp        0.5%    0.5%       1.0    GCA_016097625.1 Debaryomyces hansenii strain=C0-11-Y2, CAU_DHC11_v01_alt
the recovered matches hit 58.7% of the abundance-weighted query

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
35.2 Mbp      98.5%   98.2%       1.1    GCA_013138035.1 Penicillium sp. str. #12 strain=#12, ASM1313803v1
34.9 Mbp       0.5%    0.5%       1.0    GCA_008931925.1 Penicillium sp. BW_12 strain=BW_12, ASM893192v1
the recovered matches hit 99.0% of the abundance-weighted query

After staring at enough gather outputs, I have an intuition about whether multiple strains are present or only one strain is there and we are covering it’s full genome using the pangenome of the species. In this case, since these are actually genome sequences, we know that it’s the pangenome covering the genome.

What signals could we use to predict how many strains are actually present?

Some ideas:

In the case of a single sample, it’s more than likely a pangenome covering a single genome if:
- all of the genomes of a species have the ~same avg_abund
- the sum of f_unique (output in the gather csv) is less than one
In the case of multiple genomes,
- if the same genomes are predicted for a species over multiple samples, and those genomes have approximately the same abundance within a single sample and across samples, and the f_unique doesn't sum to more than 1, it’s probably a pangenome covering a single genome.
- if different genomes are predicted for a species over multiple samples, or the genomes predicted to cover a sample have different abundances within a sample, or the genomes predicted to cover a sample have an f_unique that sums to greater than 1, then it’s probably multiple strains present in the sample.

This is a summary of a DIB lab slack conversation. @ctb contributed the idea about summing over f_unique:

if, for a single sample, gather finds many different strains for a single species, you could look at (a) the abundances of what it finds and (b) the f_unique for the strains (not displayed in the stdout). If the f_unique or the unique intersect bp sums to more than the typical size of a genome, that also tells you that multiple strains are present.

I think I can whip up some R code to look at this patterns pretty easily. if/when I do I'll report back!

perhaps relevant:

https://www.biorxiv.org/content/10.1101/2022.10.18.512733v1

bioRxivbioRxiv Mora: abundance aware metagenomic read re-assignment for disentangling similar strains Taxonomic classification of reads obtained by metagenomic sequencing is often a first step for understanding a microbial community. While species level classification has become routine, correctly assigning sequencing reads to the strain or sub-species level has remained a challenging computational problem. We introduce Mora, a MetagenOmic read Re-Assignment algorithm capable of assigning short and long metagenomic reads with high precision, even at the strain level. Mora is able to accurately re-assign reads by first estimating abundances through an expectation-maximization algorithm and then utilizing abundance information to re-assign query reads. The key idea behind Mora is to maximize read re-assignment qualities while simultaneously minimizing the difference from estimated abundance levels, allowing Mora to avoid over assigning reads to the same genomes. On simulated data, this allows Mora to achieve F1 scores of >74% when assigning reads generated from three distinct E. coli strains, much higher than the 29% and 32% achieved by Pathoscope2 and Pufferfish. Furthermore, we show that the high penalty of over assigning reads to a common reference genome allows Mora to accurately identify the presence of low abundance strains and species. ### Competing Interest Statement The authors have declared no competing interest.

sourmash-bio / sourmash

predicting the number of strains present in a sample using gather results alone #2320