rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License
2.61k stars 116 forks source link

Inconsistent comparison between empty string and None/NA #387

Closed 153957 closed 3 months ago

153957 commented 3 months ago

There appears to be a bug in how empty strings, None, and pandas.NA are handled when they are either the first or second argument:

>>> cdist(['example1', '', None, pandas.NA], ['example1', '', None, pandas.NA])
array([[100.,   0.,   0.,   0.],
       [  0., 100.,   0.,   0.],
       [  0., 100.,   0.,   0.],
       [  0., 100.,   0.,   0.]], dtype=float32)

(also tested with DamerauLevenshtein.normalized_similarity as scorer, same inconsistency) I would expect the result to be symmetric.

Note, when the array is stored in a variable rapidfuzz uses the 'single_list' function which makes the result symmetric, with the result scores that I would expect.

>>> data = ['example1', '', None, pandas.NA]
>>> cdist(data, data)
array([[100.,   0.,   0.,   0.],
       [  0., 100.,   0.,   0.],
       [  0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.]], dtype=float32)

Directly calling a scorer gives the expected results:

>>> DamerauLevenshtein.normalized_similarity(None, '')
0.0
>>> DamerauLevenshtein.normalized_similarity('', None)
0.0
maxbachmann commented 3 months ago

It appears to work fine on my machine:

>>> from rapidfuzz.process import cdist
>>> import pandas
>>> cdist(['example1', '', None, pandas.NA], ['example1', '', None, pandas.NA])
array([[100.,   0.,   0.,   0.],
       [  0., 100.,   0.,   0.],
       [  0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.]], dtype=float32)

Are you using the latest version of rapidfuzz:

>>> rapidfuzz.__version__
'3.9.2'

And on which platform + Python version?

153957 commented 3 months ago

On my Apple Silicon (ARM) Mac in a Python 3.12 Docker container.

$ python --version
Python 3.12.2
$ pip freeze | grep rapid
rapidfuzz==3.9.1

Just upgraded to Rapidfuzz 3.9.2, and I can still reproduce the same issue.

Let me know if you'd like me to run any other tests.

maxbachmann commented 3 months ago

I had a look into the different code paths. There was a bug in the non SIMD branch. This should be fixed in v3.9.3

153957 commented 3 months ago

Thank you! it works as expected now.