Open wwood opened 1 week ago
Hi @wwood ,
With breadth of coverage I assume you mean getting the fraction of positions over a sequence with at least X reads mapped.
It is not possible with strobealign --aemb
. The only way I see this possible to compute is if running strobealign in default mode (to get a SAM file) and then run a downstream tool to do this computation (maybe CoverM has such a feature?).
I have heard this request from another user as well. Perhaps something to consider, although it would require significant memory overhead to store coverage arrays per contig.
CC @luispedro, as aemb is really his brainchild
Thanks for the quick response - yes you are right about what I was after.
And yes, CoverM can already do this (with -m covered_fraction
or covered_bases
). But as you say, this requires not using --aemb
so SAM is output, so a bit slower.
You can use the mosdepth data structure which is more efficient instead of an array of per-base coverages (need only to increment where the read starts mapping and decrementing where it stops, which makes it faster but not less memory intensive).
I idly wonder whether that data structure could be compressed for metagenomic read mapping where there is many (and many low coverage) contigs somehow. For a start the mosdepth array could be run-length encoded, but that would mean losing the constant time insertion. So maybe some tradeoff like splitting up each 10kb segment or something, or whitelisting contigs pass a threshold so there's no further need to keep a coverage array.
Another option would be to bin the contigs into say 500bp segments, and track how many of those segments are covered by a read. Not exactly the same, but would be an approximation for less memory intensity than a full coverage array and everything is constant time. For entire genomes 500kb+ it would probably be near-identical.
Anyway, curious what you think Luis.
I think it is not too difficult conceptually to extend AEMB to have some estimate of breadth. Particularly, if you are worried about extreme cases (eliminating very low coverage contigs/genomes) rather than a precise estimate of coverage (being able to distinguish 10% coverage from 8% or 15%)
You can probably just check if the strobemers in the reads cover enough of the strobemers from the reference (IIRC, kraken-uniq is based on a similar idea, but it's been a while since I looked at that).
I am happy to point people in the right direction, but I don't have the free cycles to write the code or test the idea, though.
Hi,
Just wondering whether this might be possible? Or no?
I ran into this trying to use
strobealign --aemb
for calculating the coverage of a genome in CoverM - by default it requires 10% of the genome to be covered by at least 1 read, otherwise coverage is set to 0.Thanks.