rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License
2.69k stars 119 forks source link

Add support for Smith Waterman algorithm #175

Open maxbachmann opened 2 years ago

maxbachmann commented 2 years ago

The Smith Waterman algorithm is a commonly used metric to compare strings. It would be useful to add it to RapidFuzz.

hongduosun commented 2 years ago

Hi, will this algorithm be added recently?

maxbachmann commented 2 years ago

I am not sure yet. I think I could add a simple implementation in the close future. When matching long sequences you would probably want to use a more optimized implementation like: https://github.com/jeffdaily/parasail (has python bindings) I do not think I will have the time to write an implementation that is even close to this level of optimization.

hongduosun commented 2 years ago

Got it, thanks for the great work!

maxbachmann commented 2 years ago

After looking at the paraseil python bindings I found them way to hard to use. E.g.

Be careful using the attributes of the Result object - especially on Result instances constructed on the fly. For example, calling parasail.sw_trace("asdf", "asdf", 11, 1, parasail.blosum62).cigar.seq returns a numpy.ndarray that wraps a pointer to memory that is invalid because the Cigar is deallocated before the seq statement. You can avoid this problem by assigning Result instances to variables as in the example above.

Is not really acceptable in my opinion. So I decided to add at least a simple implementation for people who just want to have a properly working implementation.