mbhall88 / rasusa

Randomly subsample sequencing reads or alignments
https://doi.org/10.21105/joss.03941
MIT License
203 stars 17 forks source link

Feature: Min/max coverage threshold #37

Closed bbalog87 closed 2 years ago

bbalog87 commented 2 years ago

Hi, thank you for the great tool!

Sometimes, we need to subsample long reads (e.g HiFi) to retain only reads that are at least x times present in the reads set. Let's say I need to keep only reads with at least 40x coverage. This is particularly crucial to automatically exclude all erroneous reads/k-mers with potential contaminations. Because errors and contaminations do not usually have deep coverage in the genome. That is why erroneous k-mers are always at the beginning of the k-mers histograms with a low depth, usually < 20. Having an option to subsample only reads with a min coverage would maybe exclude most erroneous reads and contaminations, hence improving the assembly for instance.

mbhall88 commented 2 years ago

Correct me if I'm wrong, but to do this you would need an alignment (i.e., BAM/SAM file)? Because to know if two reads constitute 2x coverage, you need to know they come from the same part of the genome correct?

If that is the case, I don't currently support alignment subsampling (and likely wont), but https://github.com/mbhall88/rasusa/issues/17#issuecomment-708165373 links to a tool that can randomly subsample an alignment.

If I have the wrong idea could you please clarify?

bbalog87 commented 2 years ago

Hi @mbhall88 You are completely right. That was also my thought. Most of the tools out there including seqtk, seqkit etc. only subsample (randomly) a subset of reads to reduce the overall coverage. This is what is needed most of the time. Alignment-based subsampling is only a special usecase.

Thanks for the clarification and for the hint on issue 17.