Closed thomasryde closed 5 months ago
I have made a benchmark script with the rapidfuzz loop vs. cpdist. Note that this is slightly different from the ones made for cdist, where we look at a fuzzywuzzy loop vs. cdist. Also note that I have not done an apples-to-apples comparison, due to the number of workers set in the cpdist version.
> I have made a benchmark script with the rapidfuzz loop vs. cpdist. Note that this is slightly different from the ones made for cdist, where we look at a fuzzywuzzy loop vs. cdist.
I probably need to overhaul the benchmarks at some point. The comparison with fuzzywuzzy has become a bit meaningless, since fuzzywuzzy is developed under the name thefuzz now. thefuzz is basically just a compatibility wrapper around rapidfuzz at this point, and so is relatively close in performance.
> Also note that I have not done an apples-to-apples comparison, due to the number of workers set in the cpdist version.
Yes, it seems like for ratio and QRatio it actually helps to switch less frequently between Python and C++. For the others it looks to be mostly the threading. Sadly, a lot of the optimizations used in cdist can't be applied when comparing individual elements.
I am actually seeing larger improvements on my laptop :+1:
I am currently working on a different improvement which only does the None check for input data if it's not already a string. For cpdist this change improves the result to:
Aside from the matrix flattening the change looks good now.
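The fast-path idea mentioned above, skipping the None check whenever the element is already a string (the common case), can be sketched in pure Python. This is a hypothetical illustration, not the actual rapidfuzz C++ code; the function name `prepare` is invented for the example.

```python
# Hypothetical sketch of the fast-path idea: only fall into None handling
# when the element is not already a string. Not the actual rapidfuzz code.

def prepare(elem):
    if isinstance(elem, str):  # fast path: most inputs are strings
        return elem
    if elem is None:           # slow path: handle missing data
        return None
    raise TypeError(f"expected str or None, got {type(elem).__name__}")
```

The point of the reordering is that the common case pays for a single type check instead of a None check plus a type check.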
At least for the fastest metrics, this still appears to be bottlenecked by the conversion of the Python list of strings into my internal strings. I don't think there is much room for improvement there, unfortunately, since all this does is:
This part can't be parallelized either, since the CPython interpreter can only be called from a single thread.
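The pipeline described above, a serial conversion step followed by parallelizable comparisons, can be sketched structurally in Python. This is only an illustration of the shape of the work: in rapidfuzz the conversion goes through the CPython C API on one thread, and the comparisons run in C++ threads with the GIL released, whereas a pure-Python `ThreadPoolExecutor` stays GIL-bound.

```python
# Structural sketch of the cpdist pipeline: a serial conversion step,
# then comparisons that can be spread over worker threads. Illustrative
# only; the real conversion/comparison code lives in C++.
from concurrent.futures import ThreadPoolExecutor

def convert(strings):
    # Serial part: every element has to be touched via the interpreter.
    return [s.encode("utf-8") for s in strings]

def compare_all(a, b, scorer, workers=4):
    conv_a, conv_b = convert(a), convert(b)  # single-threaded
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Parallelizable part (truly parallel only with the GIL released).
        return list(pool.map(scorer, conv_a, conv_b))
```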
Really looking forward to this update!
One thing I thought was odd was that providing a dtype in cdist & cpdist gave no performance improvement. I could see in the code that this is because the calculations in the fuzz functions are all computed using single or double precision (I don't remember which). This could be a future improvement if we want to squeeze out some more performance.
Right now they are all calculated using double precision. However, this is mostly just a couple of calculations. E.g. for fuzz.ratio it's:

(1.0 - dist / (len1 + len2)) * 100.0

The difference between float and double is probably not measurable there. On the flip side, differentiating between the two would double the binary size. The dtype is mostly intended to allow users to reduce the memory consumption of the results. For cpdist this is probably not really needed, but for cdist it can make a large difference.
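To make the normalization above concrete, here is a pure-Python sketch that derives the Indel distance from an LCS length and applies that exact formula. This is only an illustration; rapidfuzz's actual implementation is bit-parallel C++, not this quadratic DP.

```python
# Sketch of how fuzz.ratio normalizes the Indel distance (illustration
# only, not rapidfuzz's actual implementation).

def lcs_length(s1, s2):
    """Classic O(len1*len2) dynamic-programming LCS length."""
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        cur = [0]
        for j, ch2 in enumerate(s2, 1):
            cur.append(prev[j - 1] + 1 if ch1 == ch2 else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def ratio(s1, s2):
    len1, len2 = len(s1), len(s2)
    if len1 + len2 == 0:
        return 100.0
    # Indel distance = insertions + deletions needed, derived from the LCS.
    dist = len1 + len2 - 2 * lcs_length(s1, s2)
    # The double-precision normalization quoted above:
    return (1.0 - dist / (len1 + len2)) * 100.0
```

E.g. "abcd" vs. "abce" has an Indel distance of 2 over a combined length of 8, giving (1 - 2/8) * 100 = 75.0.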
Looking at the runtime of cpdist with fuzz.ratio, we spend about the following time on each string pair (on my CPU):

1) retrieving both strings out of the two lists: 130 ns
2) comparing the two strings: 90 ns
Step 1 is improved to 90 ns by the None-check improvement, and I have another local patch which appears to bring it down to 77 ns. Both are still work in progress, so these numbers could change. This part can't be parallelized unless you start multiple Python interpreters. Step 2 can use multiple threads to improve performance: e.g. I got 45 ns with 2 threads, 26 ns with 4 threads and 16 ns with 8 threads. cpdist could save some memory allocations, since it doesn't require the cached scorers. I am not sure how much of an improvement this would make to the runtime.
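Putting those two figures together gives a quick per-pair estimate. The model below assumes ideal scaling of the comparison step, which is an assumption the measured figures (45/26/16 ns) only approximately follow, and it shows why the serial lookup becomes the floor as threads increase.

```python
# Back-of-the-envelope model of the per-pair timings quoted above.
# Assumes the comparison scales ideally with threads; the lookup is
# serial, so it bounds the total from below (Amdahl's law).

LOOKUP_NS = 90    # retrieving both strings (post None-check improvement)
COMPARE_NS = 90   # comparing the two strings (parallelizable)

def per_pair_ns(threads):
    return LOOKUP_NS + COMPARE_NS / threads

for t in (1, 2, 4, 8):
    print(f"{t} thread(s): ~{per_pair_ns(t):.1f} ns per pair")
```

Even with perfect scaling of step 2, the total can never drop below the ~90 ns serial lookup, which is why the lookup-path patches matter.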
Looking very much forward to the above-mentioned performance improvement. I very much need the speed in my use cases of the library.
The PR looks good now. Thanks for the work you put into it. I will probably get some more of my performance improvements in today as well.
Depending on where you get the data from, you might be able to lower even more of your implementation into C++. This could allow you to skip the conversion completely.
Resolve #344
We have added a function called cpdist, which computes the distance between corresponding elements of two lists of strings of equal length. The PR aims to remove the overhead of calling a distance/similarity function from Python every time by handling everything inside C++.
The functionality is similar to cdist, except that we do not compute the similarity between the cross product of elements, but between the corresponding elements of the two lists.
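The cross-product vs. pairwise distinction can be sketched in pure Python with any scorer. This is an illustration of the semantics only; the real functions are implemented in C++ and take rapidfuzz scorers and extra options.

```python
# Pure-Python sketch of the cdist/cpdist semantics (illustrative only).

def cdist(queries, choices, scorer):
    # Cross product: a len(queries) x len(choices) score matrix.
    return [[scorer(q, c) for c in choices] for q in queries]

def cpdist(queries, choices, scorer):
    # Pairwise: lists must have equal length; one score per pair.
    if len(queries) != len(choices):
        raise ValueError("cpdist requires lists of equal length")
    return [scorer(q, c) for q, c in zip(queries, choices)]
```

For N strings per list, cdist makes N² scorer calls while cpdist makes N, which is why the per-element overheads discussed above dominate its runtime.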