rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License
2.69k stars 119 forks source link

Quick Question #389

Closed kd10041 closed 4 months ago

kd10041 commented 4 months ago

Where can I see the implementation of .partial_ratio() ? Can you let me know the logic which is utilized for this method. Thanks in advance!

kd10041 commented 4 months ago

I am new to python! So I went through the codebase of rapidfuzz. So basically the implementation of partial is divided to 2 parts

1. short needle implementation length<=64

Here a sliding window of length=min(len(s1),len(s2)) is used. and fuzz.ratio() is calculated on the all the alignments possible. for example: I took this example from this issue

s1='real'
s2='barcelona'
fuzz.ratio(s1,s2)

So the best alignment here is window size is equals to 4 here.

r_eal
rce_l 

This requires two operations output is 1-2/(4+4) = 75 which is the exact output as given by rapidfuzz.

2. long needle implementation length>64

This is similar to as implemented by fuzzywuzzy. The logic here is find the best alignment from shorter string to the Longest common substring of longer string. and find similarity score using fuzz.ratio() from the needle to longest common substring.

Can anyone give example clear this part up? Any help would be appreciated regarding this.

maxbachmann commented 4 months ago

Oh the documentation is simply outdated. In the past I did use two implementations since I didn't have a way to make the implementation for long needles reasonably fast. However this did mean that the implementation for longer needles was similar to whats done in fuzzywuzzy, which doesn't always provide the correct results.

I have since found a better way to filter out impossible results and so I use the "correct" implementation both for short and long needles. You will still notice a drop in performance once the needle has more than 64 characters though.

From a user perspective it's simply a sliding window where the substring taken from the longer string has a length of.

The pure Python fallback implementation is: https://github.com/rapidfuzz/RapidFuzz/blob/9359be25a503ed5cb1fb75defac924c5e379048a/src/rapidfuzz/fuzz_py.py#L118 The C++ implementation is https://github.com/rapidfuzz/rapidfuzz-cpp/blob/10426d24cd7479df0fe8c78b17877e756e1c3cd5/rapidfuzz/fuzz_impl.hpp#L68

The actual implementation doesn't actually check all alignments since it can use knowledge about the maximum distance change per shift of the sliding window to filter out some comparisons.

kd10041 commented 4 months ago

Thank you @maxbachmann for clear explanation. Do you plan on updating the documentation anytime soon?

maxbachmann commented 4 months ago

Yes I will probably fix the docs at some point this week