rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License
2.61k stars 116 forks source link

Could partial_ratio support Levenshtein.normalized_similarity? #401

Open rocke2020 opened 1 month ago

rocke2020 commented 1 month ago

rapidfuzz 3.9.6 and python 3.10. I have carefully read the rapidfuzz docs and made tests. I find: partial_ratio use Indel.normalized_similarity. Could partial_ratio also support Levenshtein.normalized_similarity? Maybe default to use Indel.normalized_similarity, but we can choose to use Levenshtein.normalized_similarity. In my use case, I indeed need partial_ratio to use Levenshtein.normalized_similarity. thanks!!

from icecream import ic
from rapidfuzz import fuzz
from rapidfuzz.distance import Levenshtein, Indel

a = '34cdef16z'
c = '09cdef78'
ic(fuzz.partial_ratio(a, c))
ic(fuzz.partial_ratio_alignment(a, c))
ic(fuzz.ratio(a[0:6], c))
ic(Indel.normalized_similarity(a[0:6], c) * 100)
ic(Levenshtein.normalized_similarity(a[0:6], c) * 100)

ic| test.py:25 in <module>- fuzz.partial_ratio(a, c): 57.14285714285714
ic| test.py:26 in <module>- fuzz.partial_ratio_alignment(a, c): ScoreAlignment(score=57.14285714285714, src_start=0, src_end=6, dest_start=0, dest_end=8)
ic| test.py:27 in <module>- fuzz.ratio(a[0:6], c): 57.14285714285714
ic| test.py:28 in <module>- Indel.normalized_similarity(a[0:6], c) * 100: 57.14285714285714
ic| test.py:29 in <module>- Levenshtein.normalized_similarity(a[0:6], c) * 100: 50.0
maxbachmann commented 3 weeks ago

In theory we could support this, but this comes with a couple of issues: 1) partial_ratio is used in a lot of the "combined" algorithms like WRatio. So it's not completely clear what to do about them 2) would this only allow Indel and Levenshtein which seems a bit arbitrary or more metrics? For more metrics it gets a lot more complex to handle things like the different kinds of normalizations 3) This has a couple of optimizations to keep the sliding window algorithm remotely fast by skipping some unneeded calculations. This doesn't work for the generic case though. Probably this could have a generic fallback, but it would likely need optimized versions for algorithms where this is possible.

One more possible setting could be whether the sliding window allows for shorter substrings at the start / end of the longer string.

Not completely sure whether this should be added under a different name either.

I assume a generic implementation shouldn't be too much work and optimized versions could be added over time. Still not sure when I will get around to it though.

rocke2020 commented 3 weeks ago

thanks a lot for maxbachmann. You have a global view on this issue! As Indel and Levenshtein are very similar, I think it would be easier only support Indel and Levenshtein in partial_ratio, to save your time. For me, I hope to only use Levenshtein in most NLP process.