Closed rhpvorderman closed 2 years ago
It’s indeed a bit difficult to replace the alignment algorithm because it is so customized. I have been tuning it over the years and just switching it out for something else would introduce many regressions. I have a couple of ideas for how to make it faster and I would like to explore these first.
Hi,
Were external libraries such as parasail or Striped Smith-Waterman (SSW) considered? The purpose would of course be to make cutadapt faster.
I have had a look at the Python bindings for these libraries, and parasail seems the most promising to me. It uses ctypes, so there is some call overhead, but I don't think that should matter much given the cost of Smith-Waterman itself.
scikit-bio has a maintained binding for the SSW library. If that were used, I would recommend just copying the binding rather than depending on scikit-bio, whose dependency list can best be summarized as
from PyPI install *
This will of course change the alignment characteristics. Parasail uses a nucleotide substitution table such as NUC.4.4 (see here), which might give more 'fair' match scores to wildcards.
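To make the point about 'fair' wildcard scores a bit more concrete, here is a minimal sketch of a substitution score that treats each IUPAC code as a set of possible bases and interpolates between a match and a mismatch score. The function name `iupac_score` and the values 5 and -4 are assumptions chosen for illustration. They are not the actual NUC.4.4 entries, and this is not how cutadapt scores wildcards.

```python
# IUPAC nucleotide codes mapped to the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Illustrative values only (assumption), not copied from NUC.4.4.
MATCH, MISMATCH = 5, -4


def iupac_score(a: str, b: str) -> float:
    """Expected score when each code is replaced by one of its bases,
    chosen uniformly at random (hypothetical helper)."""
    bases_a, bases_b = IUPAC[a.upper()], IUPAC[b.upper()]
    # Number of (x, y) base pairs with x == y equals the size of the intersection.
    matching_pairs = len(set(bases_a) & set(bases_b))
    p_match = matching_pairs / (len(bases_a) * len(bases_b))
    return p_match * MATCH + (1 - p_match) * MISMATCH


if __name__ == "__main__":
    for pair in [("A", "A"), ("A", "C"), ("R", "A"), ("N", "A")]:
        print(pair, iupac_score(*pair))
```

With a scheme like this, an `N` is neither a free match nor a full mismatch, so a wildcard-heavy adapter would score differently than under a plain match/mismatch count.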
EDIT: I should have read a bit more. I see now in the docstring that cutadapt's algorithm is fairly custom. In that case I don't know whether parasail can be applied properly, although it has both Smith-Waterman and semi-global alignment functions.
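For readers less familiar with the two alignment modes mentioned above, the following is a small pure-Python sketch of the textbook difference between local (Smith-Waterman) and semi-global scoring. It is neither cutadapt's algorithm nor parasail's implementation. The function name `align_score`, the linear gap penalty, and the default scores are all assumptions for illustration.

```python
def align_score(query, ref, mode="local", match=1, mismatch=-1, gap=-2):
    """Best alignment score with a linear gap penalty (illustrative sketch).

    mode="local":      Smith-Waterman; scores are clamped at zero and the
                       best cell anywhere in the matrix wins.
    mode="semiglobal": leading and trailing gaps are free, but everything
                       between them is scored.
    """
    if mode not in ("local", "semiglobal"):
        raise ValueError(f"unknown mode: {mode}")
    rows, cols = len(query) + 1, len(ref) + 1
    # First row and column stay 0 in both modes: leading gaps cost nothing.
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == ref[j - 1] else mismatch
            score = max(H[i - 1][j - 1] + sub,  # (mis)match
                        H[i - 1][j] + gap,      # gap in ref
                        H[i][j - 1] + gap)      # gap in query
            if mode == "local":
                score = max(score, 0)           # restart instead of going negative
                best = max(best, score)
            H[i][j] = score
    if mode == "local":
        return best
    # Trailing gaps are free: take the best score on the last row or last column.
    return max(max(H[-1]), max(row[-1] for row in H))


if __name__ == "__main__":
    print(align_score("ACGT", "TTACGTTT", "local"))
    print(align_score("ACGT", "TTACGTTT", "semiglobal"))
    print(align_score("AAAACCCC", "AAAAGGGG", "local"))
    print(align_score("AAAACCCC", "AAAAGGGG", "semiglobal"))
```

In the first pair both modes give 4, because the query sits entirely inside the reference. In the second pair the local mode still reports 4 for the `AAAA` island, while the semi-global mode gives 0, since any alignment that reaches a sequence end has to pay for the mismatching tails. Differences of this kind are why swapping in a library routine could change which adapter matches are reported.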