vsbuffalo / scythe

A 3'-end adapter contaminant trimmer
MIT License
91 stars 38 forks source link

Proposal: bit-encoded sequence comparison #24

Closed kdm9 closed 9 years ago

kdm9 commented 9 years ago

Vince,

I've been thinking about how to further increase the speed of scythe.

One thing that I thought could be good is to 4-bit encode the adapter and each read (this is super fast, ~1Gbase per second encoding speed in my experience). Then you can use bit fiddling hacks to get a vector of mismatch/match (specifically, you do an XNOR then MASK-SHIFT-AND pattern to give you a vector of matches & mismatches, I have opimised C code for this already). I'd use 4 bits, meaning that you fit 16 bases in a uint64_t (and thus in a single instruction). This should be lightning fast; a colleague used this in a short read aligner to great effect (Can't find his paper). Note that this doesn't fix the issue of supporting indels.

Would you like me to have a go at implementing this over the next month or two as I have time?

Cheers, Kevin