rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License
2.69k stars 119 forks source link

Other string matching algorithms? #13

Closed shashi-netra closed 2 years ago

shashi-netra commented 4 years ago

Have you considered other algorithms such as Smith-Waterman, Jaro-Winkler, or even the new/improved Levenshtein-Damerau?

maxbachmann commented 4 years ago

My first goal with this library was to reimplement the functionality of FuzzyWuzzy since I needed it for rhasspy, so I only implemented the algorithms relevant for this. Now that this is done I am open to suggestions on other cool string matching algorithms that could be added in future versions. (I always love seeing Pull Requests adding them aswell ^^)

logannc commented 4 years ago

Related to #8 and this issue, I am working on https://github.com/logannc/fuzzywuzzy-rs, if you're looking to collaborate. (please forgive the mess, I ported it years ago without much understanding and now am trying to clean it up)

Relevant for this issue, something I'm looking at is whether certain functions are living up to their purpose and, if not, whether an algorithm change is appropriate.

Specifically, partial_ratio has been documented as attempting to take the ratio of the 'optimal string alignment'. However, I suspect that the alignment might not be optimal (or I am misunderstanding what the metric used to judge optimality is).

For example, take strings that are the (human readable) mirror of each other:

"what about supercalifragilisticexpialidocious"
            ||||||||||||||||||||||||||||||||||
           "supercalifragilisticexpialidocious about what"

My expectation of the optimal local alignment (because that seems to implicitly be the unmentioned kind of alignment we want) for strings like this whose opposite ends are strongly aligned should be just that end piece. That is, for my example strings, the optimal local alignment would just be "supercalifragilisticexpialidocious" for both, yielding a ratio of 100% when compared. Thoughts?

It's unclear to me what algorithm rapidfuzz and fuzzywuzzy are actually using to align strings, so I'd love to be enlightened on that point. Let me know if you'd like to take this conversation elsewhere (a different issue, email, discord, IRC, etc).

maxbachmann commented 4 years ago

fuzz.partial_ratio uses difflib.get_matching_blocks to find the optimal alignment between two strings and then calculates a ratio on this. However it is only allowed to shift the shorter one to the right side, so when you have two strings of equal length as you do it would either find the alignment

"what about supercalifragilisticexpialidocious"
            ||||||||||||||||||||||||||||||||||
           "supercalifragilisticexpialidocious about what"

or

"what about supercalifragilisticexpialidocious"
"supercalifragilisticexpialidocious about what"

depending on the way you pass it the strings. Afterwards it will take a part of the long string starting where they align, that has the length of the shorter string and match them. When there are not enough characters left in the longer string it will simply stop when there are no more characters. So in your example it would match either:

"supercalifragilisticexpialidocious"
"supercalifragilisticexpialidocious about what"

or

"what about supercalifragilisticexpialidocious"
"supercalifragilisticexpialidocious about what"

So in this scenario it does perfom really bad and it even matters which way around you pass it the strings. Generally it is ment to be used when one string is a lot smaller than the other string (e.g. in fuzz.WRatio it is used when one string is 1.5 times longer than the other string). So while it might not be perfect I guess it is good enough for this use case. I personally only implemented it this way, since people already use fuzzywuzzy and so I want to give them a similar behaviour (I would however be absolutely fine with having another algorithm that actually optimally aligns them ;) )

Btw for strings that are e.g. mirrored like yours theres fuzz.token_sort_ratio

logannc commented 4 years ago

However it is only allowed to shift the shorter one to the right side, so when you have two strings of equal length as you do it would either find the alignment

This is a wonderful clarification, thank you!

Yea, I'm not particularly satisfied with that behaviour. I may implement a new version using Smith-Waterman and have a legacy version of partial_ratio for compatibility.

maxbachmann commented 4 years ago

Yes I think Smith-Waterman would be much better suited for this purpose.

maxbachmann commented 2 years ago

Jaro-Winkler similarity and Jaro similarity got added in one of the recent releases. The other requested distance will be a added and are tracked in their own issues: