hts_SeqScreener is meant to filter or identify reads originating from specific source sequences (PhiX by default, but also ribosomal sequences, adapters, etc.).
Is your enhancement request related to a problem? Please describe.
Currently hts_SeqScreener is not optimized for large references. It has had little or no testing on human-sized genomes (~3 Gbp); it is not expected to work well at that scale and would be very slow.
Describe the solution you'd like
A number of alternative algorithms and data structures have been designed to speed up similar processes. Mapping is essentially the same problem:
Minimap2: https://github.com/lh3/minimap2#algo
Minimizer schemes:
https://www.biorxiv.org/content/10.1101/652925v1.full.pdf
https://homolog.us/blogs/bioinfo/2017/10/25/intro-minimizer/
https://pdfs.semanticscholar.org/18a3/3e90b5e6872d33e32c4b9bd6f2fe577be8d6.pdf
But there is also Kraken2: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0
Implementing something similar to the approach used in one of these tools could make screening against a human-sized genome feasible.
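To make the minimizer idea concrete, here is a rough sketch in Python of how a reference could be reduced to a set of (w,k)-minimizers and a read scored against it. This is not how hts_SeqScreener, Minimap2, or Kraken2 are implemented; the function names (`minimizers`, `build_index`, `hit_fraction`), the parameter values, and the lexicographic k-mer ordering are all assumptions chosen for illustration, and reverse complements are ignored for brevity.

```python
def minimizers(seq, k=5, w=4):
    """Return the set of (w,k)-minimizers of seq.

    For every window of w consecutive k-mers, keep the lexicographically
    smallest k-mer. Real tools typically use a hash-based ordering and
    canonical (strand-independent) k-mers; this sketch does neither.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    selected = set()
    # If the sequence holds fewer than w k-mers, treat it as one short window.
    for start in range(max(1, len(kmers) - w + 1)):
        selected.add(min(kmers[start:start + w]))
    return selected


def build_index(reference, k=5, w=4):
    """Hypothetical index: just the set of reference minimizers."""
    return minimizers(reference, k, w)


def hit_fraction(read, index, k=5, w=4):
    """Fraction of the read's minimizers that also occur in the index."""
    mins = minimizers(read, k, w)
    if not mins:
        return 0.0
    return len(mins & index) / len(mins)


# Toy data, made up for this example.
reference = "ACGTTGCATGCCGATAGCTAGCTAGGATCC"
index = build_index(reference)
read_from_reference = reference[5:25]
unrelated_read = "T" * 20
```

Because only one k-mer per window is stored, the index is smaller than a full k-mer table, and a read drawn from the reference still shares its minimizers with the index. A screening threshold on `hit_fraction` would then play the role of the current k-mer hit percentage, though how well this scales to a ~3 Gbp reference would need to be measured.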
Additional context