Open taylorreiter opened 1 year ago
perhaps relevant:
https://www.biorxiv.org/content/10.1101/2022.10.18.512733v1
bioRxiv: "Mora: abundance aware metagenomic read re-assignment for disentangling similar strains". Taxonomic classification of reads obtained by metagenomic sequencing is often a first step for understanding a microbial community. While species level classification has become routine, correctly assigning sequencing reads to the strain or sub-species level has remained a challenging computational problem. We introduce Mora, a MetagenOmic read Re-Assignment algorithm capable of assigning short and long metagenomic reads with high precision, even at the strain level. Mora is able to accurately re-assign reads by first estimating abundances through an expectation-maximization algorithm and then utilizing abundance information to re-assign query reads. The key idea behind Mora is to maximize read re-assignment qualities while simultaneously minimizing the difference from estimated abundance levels, allowing Mora to avoid over assigning reads to the same genomes. On simulated data, this allows Mora to achieve F1 scores of >74% when assigning reads generated from three distinct E. coli strains, much higher than the 29% and 32% achieved by Pathoscope2 and Pufferfish. Furthermore, we show that the high penalty of over assigning reads to a common reference genome allows Mora to accurately identify the presence of low abundance strains and species.
I just ran gather on some genomes (attaching a snippet of results below). These are all single genomes.
After staring at enough gather outputs, I have an intuition about whether multiple strains are present, or whether only one strain is there and we are covering its full genome using the pangenome of the species. In this case, since these are actually genome sequences, we know that it's the pangenome covering the genome.
What signals could we use to predict how many strains are actually present?
Some ideas:
- `f_unique` (output in the gather csv) is less than one
- If `f_unique` doesn't sum to more than 1, it's probably a pangenome covering a single genome.
- If `f_unique` sums to greater than 1, then it's probably multiple strains present in the sample.

This is a summary of a DIB lab slack conversation. @ctb contributed the idea about summing over `f_unique`:
I think I can whip up some R code to look at these patterns pretty easily. If/when I do, I'll report back!
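In the meantime, here is a rough Python sketch of the summing idea (not the R code mentioned above). It assumes the gather results are a CSV with a column literally named `f_unique`, as written in this thread. Which actual gather CSV column that corresponds to is not stated here, so the column name is a parameter. The function names, the threshold of 1, and the toy values are all made up for illustration.

```python
import csv
import io


def sum_fraction(csv_text, column="f_unique"):
    """Sum one fractional column over all rows of a gather-style CSV.

    `column` is an assumption: pass whichever gather column you decide
    the thread's `f_unique` refers to.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(float(row[column]) for row in reader)


def strain_signal(total, threshold=1.0):
    """Apply the heuristic from the thread: a sum above 1 hints at multiple strains."""
    if total > threshold:
        return "possibly multiple strains"
    return "possibly one genome covered by the pangenome"


# Toy inputs (invented values, not real gather output).
toy_single = "name,f_unique\ngenomeA,0.70\ngenomeB,0.20\ngenomeC,0.05\n"
toy_multi = "name,f_unique\ngenomeA,0.75\ngenomeB,0.5\n"

for label, text in [("single", toy_single), ("multi", toy_multi)]:
    total = sum_fraction(text)
    print(label, round(total, 3), strain_signal(total))
```

For a real file, read its contents with `open(path).read()` and pass the text in. Grouping the rows by species before summing would probably be needed for actual metagenomes, but that requires lineage information that this sketch does not assume.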