marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

External library options for alignment #610

Closed rhpvorderman closed 2 years ago

rhpvorderman commented 2 years ago

Hi,

Have external libraries such as parasail or the Striped Smith-Waterman (SSW) library been considered? The purpose would of course be to make cutadapt faster.

I have had a look at the Python bindings for these libraries, and parasail seems the most promising to me. Its binding uses ctypes, so there is some per-call overhead, but I don't think that matters much given the cost of Smith-Waterman itself.

scikit-bio has a maintained binding for the SSW library. If that were used, I would recommend just copying the binding rather than depending on scikit-bio, whose dependency list can best be summarized as "from PyPI install *".

This will of course change the alignment characteristics. Parasail uses a nucleotide substitution matrix such as NUC.4.4, which might give 'fairer' match scores to wildcards.
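To make the wildcard point concrete, here is a minimal sketch in plain Python. It is not parasail's API and not cutadapt's scoring. It assumes the A/C/G/T/N entries of NUC.4.4 are match 5, mismatch -4, N against a base -2 and N against N -1 (worth checking against the NCBI matrix file), and the function names are made up for illustration. It compares an ungapped score under plain match/mismatch scoring with the score under that subset.

```python
def plain_score(a, b):
    """Plain match/mismatch scoring: N is treated like any other letter."""
    return 5 if a == b else -4


def nuc44_subset_score(a, b):
    """Assumed A/C/G/T/N subset of NUC.4.4 (hypothetical helper)."""
    if a == "N" and b == "N":
        return -1
    if a == "N" or b == "N":
        return -2  # a wildcard is penalised less than a real mismatch
    return 5 if a == b else -4


def ungapped_score(x, y, score):
    """Sum the per-position scores of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(score(a, b) for a, b in zip(x, y))


adapter = "ACGT"
read_part = "ANGT"  # one wildcard in the read
print(ungapped_score(adapter, read_part, plain_score))         # N counted as a mismatch
print(ungapped_score(adapter, read_part, nuc44_subset_score))  # N counted as a mild penalty
```

With the matrix, an N costs less than a true mismatch but still scores below a real match, which is the 'fairer' behaviour meant above.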

EDIT: I should have read a bit more. I see now in the docstring that cutadapt's algorithm is fairly custom. In that case I don't know whether parasail can be applied properly, although it has both Smith-Waterman and semi-global alignment functions.
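For readers unfamiliar with the two alignment modes, a small pure-Python sketch of the difference follows. It is a textbook dynamic-programming version with a linear gap penalty, not parasail's implementation and not cutadapt's customized algorithm. The function name and the scoring parameters are chosen for illustration only.

```python
def align_score(query, ref, mode="semiglobal", match=1, mismatch=-1, gap=-2):
    """Best alignment score of query vs. ref with a linear gap penalty.

    mode="local":      Smith-Waterman; the alignment may start and end anywhere.
    mode="semiglobal": gaps at the ends of either sequence are free, but the
                       alignment must run from one sequence border to another.
    """
    if mode not in ("local", "semiglobal"):
        raise ValueError(f"unknown mode: {mode!r}")
    cols = len(ref) + 1
    prev = [0] * cols  # row 0 is all zeros: leading gaps cost nothing
    best = 0
    for q in query:
        cur = [0] * cols  # column 0 is zero as well
        for j in range(1, cols):
            diag = prev[j - 1] + (match if q == ref[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if mode == "local":
                score = max(score, 0)  # local alignment may restart anywhere
            cur[j] = score
        if mode == "local":
            best = max(best, max(cur))  # may end in any cell
        else:
            best = max(best, cur[-1])   # may end in the last column
        prev = cur
    if mode == "semiglobal":
        best = max(best, max(prev))     # or anywhere in the last row
    return best


# The shared core "ACGT" is flanked by mismatching bases on both sides.
print(align_score("GGACGTGG", "TTACGTTT", mode="local"))
print(align_score("GGACGTGG", "TTACGTTT", mode="semiglobal"))
```

Local alignment simply ignores the mismatching flanks, while the semi-global score has to pay for them. That difference is one reason why swapping one mode for the other changes which adapter matches are found.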

marcelm commented 2 years ago

It’s indeed a bit difficult to replace the alignment algorithm because it is so customized. I have been tuning it over the years and just switching it out for something else would introduce many regressions. I have a couple of ideas for how to make it faster and I would like to explore these first.