ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License

estimating breadth of coverage with strobealign --aemb #456

Open wwood opened 1 week ago

wwood commented 1 week ago

Hi,

Just wondering whether this might be possible? Or no?

I ran into this trying to use strobealign --aemb for calculating the coverage of a genome in CoverM - by default it requires 10% of the genome to be covered by at least 1 read, otherwise coverage is set to 0.

Thanks.

ksahlin commented 1 week ago

Hi @wwood ,

By breadth of coverage I assume you mean the fraction of positions in a sequence with at least X reads mapped.

It is not possible with strobealign --aemb. The only way I see to compute this is to run strobealign in default mode (to get a SAM file) and then run a downstream tool to do the computation (maybe CoverM has such a feature?).

I have heard this request from another user as well. Perhaps something to consider, although it would require significant memory overhead to store coverage arrays per contig.

CC @luispedro, as aemb is really his brainchild

wwood commented 1 week ago

Thanks for the quick response - yes you are right about what I was after.

And yes, CoverM can already do this (with -m covered_fraction or covered_bases). But as you say, this requires not using --aemb so that SAM is output, which is a bit slower.

Instead of an array of per-base coverages, you can use the mosdepth data structure, which is more efficient: you only need to increment where a read starts mapping and decrement where it stops. That makes it faster, though not less memory intensive.
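To make the mosdepth-style trick concrete, here is a minimal Python sketch (not strobealign's or mosdepth's actual implementation; `covered_fraction` and its interval format are made up for illustration). Each read contributes two O(1) updates to a difference array, and a single prefix-sum pass recovers per-base depth:

```python
def covered_fraction(contig_len, alignments, min_depth=1):
    """Fraction of contig positions covered by >= min_depth reads.

    alignments: iterable of (start, end) half-open intervals on the contig.
    """
    # Difference array: one extra slot so a read ending at contig_len fits.
    diff = [0] * (contig_len + 1)
    for start, end in alignments:
        diff[start] += 1   # O(1): depth rises where the read starts...
        diff[end] -= 1     # ...and falls where it stops.
    covered = 0
    depth = 0
    for pos in range(contig_len):
        depth += diff[pos]  # running prefix sum = per-base depth
        if depth >= min_depth:
            covered += 1
    return covered / contig_len
```

For example, two reads spanning (0, 5) and (3, 8) on a 10 bp contig cover positions 0 through 7, giving a covered fraction of 0.8. The memory cost is still one counter per base, which is exactly the overhead being discussed.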

I idly wonder whether that data structure could somehow be compressed for metagenomic read mapping, where there are many (and many low-coverage) contigs. For a start, the mosdepth array could be run-length encoded, but that would mean losing the constant-time insertion. So maybe some tradeoff, like splitting each contig into 10kb segments, or whitelisting contigs that pass a threshold so there's no further need to keep their coverage arrays.

Another option would be to bin the contigs into say 500bp segments and track how many of those segments are touched by a read. Not exactly the same, but it would be an approximation needing far less memory than a full coverage array, and everything stays (effectively) constant time. For entire genomes of 500kb+ it would probably be near-identical.
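A minimal sketch of that binned approximation (names and interface invented here, under the assumption that reads are much shorter than the bin size, so each read touches at most a couple of bins):

```python
def approx_covered_fraction(contig_len, alignments, bin_size=500):
    """Approximate breadth: fraction of bin_size segments touched by any read.

    alignments: iterable of (start, end) half-open intervals on the contig.
    Memory is one flag per bin instead of one counter per base.
    """
    n_bins = (contig_len + bin_size - 1) // bin_size  # ceil division
    hit = [False] * n_bins
    for start, end in alignments:
        # Mark every bin the read overlaps; for short reads this is
        # at most two bins, so effectively O(1) per read.
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            hit[b] = True
    return sum(hit) / n_bins
```

For a 2000 bp contig with 500 bp bins, reads at (0, 100) and (600, 700) touch bins 0 and 1, giving 0.5. The approximation overstates breadth when a single short read marks a whole bin, which is why it works best for long contigs.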

Anyway, curious what you think Luis.

luispedro commented 1 week ago

I think it is not too difficult conceptually to extend AEMB to produce some estimate of breadth, particularly if you are worried about extreme cases (eliminating very-low-coverage contigs/genomes) rather than a precise estimate (being able to distinguish 10% coverage from 8% or 15%).

You can probably just check whether the strobemers in the reads cover enough of the strobemers from the reference (IIRC, KrakenUniq is based on a similar idea, but it's been a while since I looked at that).
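To illustrate the idea only (not strobealign's seeding code), here is a toy Python sketch using plain k-mers as a stand-in for strobemers; `seed_breadth` and its parameters are hypothetical:

```python
def kmers(seq, k):
    """All k-length substrings of seq, as a set (stand-in for strobemer seeds)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def seed_breadth(reference, reads, k=8):
    """Fraction of reference seeds that also occur in at least one read.

    A cheap proxy for breadth of coverage: no per-base array is needed,
    only the seed sets that a seed-based mapper already computes.
    """
    ref_seeds = kmers(reference, k)
    read_seeds = set()
    for r in reads:
        read_seeds |= kmers(r, k)
    return len(ref_seeds & read_seeds) / len(ref_seeds)
```

For a reference "ABCDEFGH" with k=4 (five seeds) and a single read "ABCDE" (two of those seeds), the estimate is 0.4. As a screen for "is this genome present at all", a coarse threshold on this fraction would suffice, which matches the extreme-case use described above.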

I am happy to point people in the right direction, but I don't have the free cycles to write the code or test the idea.