rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License
2.61k stars 116 forks source link

Compute pairwise similarity #371

Closed thomasryde closed 5 months ago

thomasryde commented 5 months ago

Resolve #344

We have made a function called cpdist which compares the distance between corresponding elements of 2 lists of strings of equal length. The PR aims to remove the overhead of calling a distance/similarity function from python every time but handling everything inside C++.

The functionality is similar to cdist except that we do not compute the similarity between the cross product of elements but the corresponding element in the 2 lists.

thomasryde commented 5 months ago
Screenshot 2024-04-02 at 23 57 30

I have made a benchmark script with the rapidfuzz loop vs. cpdist. Note that this is slightly different to the ones made for cdist where we look at fuzzywuzzy loop vs cdist. Also note that I have not done an apples to apples comparison due to the amount of workers set in the cpdist version.

maxbachmann commented 5 months ago

I have made a benchmark script with the rapidfuzz loop vs. cpdist. Note that this is slightly different to the ones made for cdist where we look at fuzzywuzzy loop vs cdist.

I probably need to overhaul the benchmarks at some point. The comparison with fuzzywuzzy has become a bit meaningless, since fuzzywuzzy is developed under the name thefuzz now. thefuzz is basically just a compatibility wrapper around rapidfuzz at this point and so is relatively close in performance.

Also note that I have not done an apples to apples comparison due to the amount of workers set in the cpdist version.

Yes seems like for ratio and QRatio it actually helps to switch less frequently between Python and C++. For the others it looks to be mostly the threading. Sadly a lot of the optimizations used in cdist can't be applied when comparing individual elements.

maxbachmann commented 5 months ago

Figure_1

I am actually seeing larger improvements on my laptop :+1:

maxbachmann commented 5 months ago

I am currently working on a different improvement which only does the None check for input data if it's not already a string. For cpdist this change improves the result to:

Figure_2

Aside from the matrix flattening the change looks good now.

maxbachmann commented 5 months ago

At least for the fastest metrics this still appears to be bottle-necked by the conversion of the Python list of strings into my internal strings. I don't think there is much room for improvement there unfortunately, since all this does is:

This part can't be parallelized either, since the CPython interpreter can only be called from a single thread.

thomasryde commented 5 months ago

I am currently working on a different improvement which only does the None check for input data if it's not already a string. For cpdist this change improves the result to:

Figure_2

Aside from the matrix flattening the change looks good now.

Really looking forward to this update!

thomasryde commented 5 months ago

I am currently working on a different improvement which only does the None check for input data if it's not already a string. For cpdist this change improves the result to:

Figure_2

Aside from the matrix flattening the change looks good now.

Really looking forward to this update!

thomasryde commented 5 months ago

One thing that I thought was odd was that providing a dtype in cdist & cpdist provided no performance improvement. I could see in the code that this is because the calculation in fuzz functions are all computed using single or double precision (don't remember which). This could be a future improvement if we want to squeeze some more performance.

maxbachmann commented 5 months ago

One thing that I thought was odd was that providing a dtype in cdist & cpdist provided no performance improvement. I could see in the code that this is because the calculation in fuzz functions are all computed using single or double precision (don't remember which). This could be a future improvement if we want to squeeze some more performance.

Right now they are all calculated using double precision. However this is mostly just a couple of calculations. E.g. for fuzz.ratio it's:

(1.0 - dist / (len1 + len2)) * 100.0

The difference between float and double is probably not measurable there. On the flipside differentiating between the two would double the binary size. The dtype is mostly intended to allow users to reduce the memory consumption of the results. For cpdist this is probably not really needed, but for cdist this can make a large difference.

performance analysis

Looking at the runtime of cpdist with fuzz.ratio we spend about the following time on each string pair (on my CPU): 1) retrieving both strings out of the two lists: 130ns 2) comparing the two strings: 90ns

1 gets improved to 90ns with the None check improvement and I do have another local patch which appears to improve it down to 77ns. Both of them are still work in progress, so these numbers could change. It can't be parallelized unless you start multiple Python interpreters. 2 can use multiple threads to improve the performance. E.g. I got 45ns with 2 threads, 26ns with 4 threads and 16ns with 8 threads. cpdist could save some memory allocations since it doesn't require the cached scorers. Not sure how much of an improvement this would make for the runtime.

thomasryde commented 5 months ago

performance analysis

Looking very much forward to the above metioned performance improvement. I very much need the speed in my use-cases of the library.

maxbachmann commented 5 months ago

The PR looks good now. Thanks for the work you did put into it. I will probably get some more of my performance improvements in today as well.

maxbachmann commented 5 months ago

Looking very much forward to the above metioned performance improvement. I very much need the speed in my use-cases of the library.

depending on where you get the data from you might be able to lower even more of your implementation to C++. This could allow you to skip the conversion completely.