s4hts / HTStream

A high throughput sequence read toolset using a streaming approach facilitated by Linux pipes
https://s4hts.github.io/HTStream/
Apache License 2.0

hts_SeqScreener enhancements for bigger references #227

Open samhunter opened 4 years ago

samhunter commented 4 years ago

hts_SeqScreener is meant to filter or identify reads originating from specific source sequences (PhiX by default, but also ribosomal sequences, adapters, etc.).

Is your enhancement request related to a problem? Please describe.

Currently hts_SeqScreener is not optimized for large references. It has been tested little, if at all, on human-sized genomes (~3 Gbp), but it is not expected to work well and would be very slow.

Describe the solution you'd like

A number of alternative algorithms and data structures have been designed to speed up similar processes. Read mapping is essentially the same problem:

Minimap2: https://github.com/lh3/minimap2#algo

Minimizer schemes:
https://www.biorxiv.org/content/10.1101/652925v1.full.pdf
https://homolog.us/blogs/bioinfo/2017/10/25/intro-minimizer/
https://pdfs.semanticscholar.org/18a3/3e90b5e6872d33e32c4b9bd6f2fe577be8d6.pdf

But there is also Kraken2: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0

Implementing something similar to what is used in one of these tools could make screening against a human-sized genome possible.
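To make the minimizer idea concrete, here is a rough Python sketch of how a screener could index a reference by (w, k)-minimizers instead of by every k-mer. This is only an illustration and not how hts_SeqScreener, Minimap2, or Kraken2 are implemented. The function names (`minimizers`, `hit_fraction`) and the parameters `k` and `w` are hypothetical, and it orders k-mers lexicographically for simplicity, where real tools typically order by a hash.

```python
# Minimal (w, k)-minimizer sketch. All names here are hypothetical and
# chosen for illustration; none of them are part of HTStream.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)


def minimizers(seq: str, k: int, w: int) -> set:
    """Return the set of canonical k-mers that are the smallest k-mer
    in at least one window of w consecutive k-mers."""
    kmers = [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) < w:
        # Sequence is shorter than one full window: use the single smallest k-mer.
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def hit_fraction(read: str, ref_index: set, k: int, w: int) -> float:
    """Fraction of the read's minimizers that also occur in the reference index."""
    mins = minimizers(read, k, w)
    if not mins:
        return 0.0
    return len(mins & ref_index) / len(mins)
```

A screener built this way would compute `minimizers(reference, k, w)` once, then flag a read when `hit_fraction` exceeds some threshold. Because only one k-mer per window is stored, the index is never larger than the full k-mer set and is usually much smaller, which is what matters for a ~3 Gbp reference. Using canonical k-mers means a read and its reverse complement produce the same minimizers.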

Additional context