logannc / fuzzywuzzy-rs

port of https://github.com/seatgeek/fuzzywuzzy
GNU General Public License v2.0
40 stars 3 forks source link

Implement optimal string alignment algorithm #19

Open logannc opened 3 years ago

logannc commented 3 years ago

See https://github.com/maxbachmann/rapidfuzz/issues/13 for generous details from another fuzzywuzzy compatible project author.

Essentially, partial_ratio attempts to align strings optimally, then take the ratio of the aligned string subsets. The method of alignment used in fuzzywuzzy is.. pretty bad, actually.

Implement Smith-Waterman for sequence alignment to power partial_ratio. Legacy partial_ratio behaviour will be relegated to a compatibility function (which, if there are more, might be put into a compatibility module).

logannc commented 3 years ago

Example of a random python Smith-Waterman alignment:

>>> sw.align("what about supergreatfantastic theatre", "superduperfantastic theater about what").dump()
Query:  1 superdupe-rfantastic theater 27
          ||||| ..| .||||||||||||||| |
Ref  : 12 super-greatfantastic theat-r 37

Score: 38
Matches: 22 (78.6%)

This seems clearly superior to the existing alignment algorithm.

logannc commented 3 years ago

In particular when comparing the following:

>>> sw.align("what about superduperfantastic", "superduperfantastic about what").dump()
Query:  1 superduperfantastic 19
          |||||||||||||||||||
Ref  : 12 superduperfantastic 30

Score: 38
Matches: 19 (100.0%)
Mismatches: 0
CIGAR: 19M
>>> partial_ratio("what about superduperfantastic", "superduperfantastic about what")
63