Closed 153957 closed 3 months ago
It appears to work fine on my machine:
>>> from rapidfuzz.process import cdist
>>> import pandas
>>> cdist(['example1', '', None, pandas.NA], ['example1', '', None, pandas.NA])
array([[100., 0., 0., 0.],
[ 0., 100., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.]], dtype=float32)
Are you using the latest version of rapidfuzz:
>>> rapidfuzz.__version__
'3.9.2'
And on which platform + Python version?
On my Apple Silicon (ARM) Mac in a Python 3.12 Docker container.
$ python --version
Python 3.12.2
$ pip freeze | grep rapid
rapidfuzz==3.9.1
Just upgraded to Rapidfuzz 3.9.2, and I can still reproduce the same issue.
Let me know if you'd like me to run any other tests.
I had a look into the different code paths. There was a bug in the non SIMD branch. This should be fixed in v3.9.3
Thank you! it works as expected now.
There appears to be a bug in how empty strings,
None
, andpandas.NA
are handled when they are either the first or second argument:(also tested with
DamerauLevenshtein.normalized_similarity
as scorer, same inconsistency) I would expect the result to be symmetric.Note, when the array is stored in a variable rapidfuzz uses the 'single_list' function which makes the result symmetric, with the result scores that I would expect.
Directly calling a scorer gives the expected results: