sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
467 stars 79 forks source link

predicting the number of strains present in a sample using gather results alone #2320

Open taylorreiter opened 1 year ago

taylorreiter commented 1 year ago

I just ran gather on some genomes (attaching a snippet of results below). These are all single genomes.

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
5.3 Mbp       44.2%   44.6%       1.1    GCA_006942235.1 Debaryomyces hansenii strain=NRRL Y-7426, Dha_7426
4.3 Mbp       10.0%   10.2%       1.0    GCA_006942205.1 Debaryomyces hansenii strain=J26, Dha_j26
2.1 Mbp        2.0%    1.9%       1.1    GCA_006942325.1 Debaryomyces hansenii strain=NRRL YB-155, Dha_155
4.3 Mbp        0.9%    0.9%       1.1    GCA_006942215.1 Debaryomyces hansenii strain=NRRL YB-398, Dha_398
1.7 Mbp        1.1%    0.7%       1.5    GCA_016097515.1 Debaryomyces hansenii strain=C0-11-Y2, CAU_DHC11_v01_pri
1.8 Mbp        0.5%    0.5%       1.0    GCA_016097625.1 Debaryomyces hansenii strain=C0-11-Y2, CAU_DHC11_v01_alt
the recovered matches hit 58.7% of the abundance-weighted query

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
35.2 Mbp      98.5%   98.2%       1.1    GCA_013138035.1 Penicillium sp. str. #12 strain=#12, ASM1313803v1
34.9 Mbp       0.5%    0.5%       1.0    GCA_008931925.1 Penicillium sp. BW_12 strain=BW_12, ASM893192v1
the recovered matches hit 99.0% of the abundance-weighted query

After staring at enough gather outputs, I have an intuition about whether multiple strains are present or only one strain is there and we are covering it’s full genome using the pangenome of the species. In this case, since these are actually genome sequences, we know that it’s the pangenome covering the genome.

What signals could we use to predict how many strains are actually present?

Some ideas:

This is a summary of a DIB lab slack conversation. @ctb contributed the idea about summing over f_unique:

if, for a single sample, gather finds many different strains for a single species, you could look at (a) the abundances of what it finds and (b) the f_unique for the strains (not displayed in the stdout). If the f_unique or the unique intersect bp sums to more than the typical size of a genome, that also tells you that multiple strains are present.

I think I can whip up some R code to look at this patterns pretty easily. if/when I do I'll report back!

ctb commented 1 year ago

perhaps relevant:

https://www.biorxiv.org/content/10.1101/2022.10.18.512733v1

bioRxivbioRxiv Mora: abundance aware metagenomic read re-assignment for disentangling similar strains Taxonomic classification of reads obtained by metagenomic sequencing is often a first step for understanding a microbial community. While species level classification has become routine, correctly assigning sequencing reads to the strain or sub-species level has remained a challenging computational problem. We introduce Mora, a MetagenOmic read Re-Assignment algorithm capable of assigning short and long metagenomic reads with high precision, even at the strain level. Mora is able to accurately re-assign reads by first estimating abundances through an expectation-maximization algorithm and then utilizing abundance information to re-assign query reads. The key idea behind Mora is to maximize read re-assignment qualities while simultaneously minimizing the difference from estimated abundance levels, allowing Mora to avoid over assigning reads to the same genomes. On simulated data, this allows Mora to achieve F1 scores of >74% when assigning reads generated from three distinct E. coli strains, much higher than the 29% and 32% achieved by Pathoscope2 and Pufferfish. Furthermore, we show that the high penalty of over assigning reads to a common reference genome allows Mora to accurately identify the presence of low abundance strains and species. ### Competing Interest Statement The authors have declared no competing interest.