jamesturk / jellyfish

🪼 a python library for doing approximate and phonetic matching of strings.
https://jamesturk.github.io/jellyfish/
MIT License
2.04k stars 157 forks source link

Inconsistent handling of digits in Python versus C implementation of Jaro-Winkler #147

Closed juliangilbey closed 2 years ago

juliangilbey commented 2 years ago

Hi,

In the Python implementation, there is the following code to determine the prefix-length for weight boosting: https://github.com/jamesturk/jellyfish/blob/ac6a64a1b37ba6ff648efe397951e5ed58d80f3f/jellyfish/_jellyfish.py#L104-L105

On the other hand, in the C implementation, we have: https://github.com/jamesturk/cjellyfish/blob/29f1f023584004c26ae3299b7c5f23ab57c0b905/jaro.c#L118:

        for (i=0; ((i<j) && (ying[i] == yang[i]) && (NOTNUM(ying[i]))); i++);

Here, the NOTNUM macro excludes any characters which are digits. This seems a rather strange thing to do (and is an undocumented behaviour); why should a comparison of the strings "0" and "00" be treated differently from a comparison of the strings "@" and "@@", for example?

As a result, the code jellyfish.jaro_winkler_similarity("00", "0") returns 0.833..., whereas jellyfish.jaro_winkler_similarity("00", "0") gives 0.85.

It looks as though the and s1[i] part in the Python might have been intended to perform a similar role, but it is actually redundant, as a length 1 string always evaluates to True in a boolean context.

My suggestion is to drop this special-casing completely: remove the and s1[i] from the Python version and the two uses of NOTNUM (there is a second use a little later, which is equally problematic) and its definition from the C version. I can make PRs for this if you would like.

Best wishes,

Julian

jamesturk commented 2 years ago

Thanks for this, I'm not sure what that NOTNUM piece is supposed to be for either.

A PR would be great, it'd also be great to update the test data here to include an example.

juliangilbey commented 2 years ago

OK, I've just made three PRs, one on each of jellyfish, cjellyfish and jellyfish-testdata. My local tests have been successful with these. I've made exactly the changes described above.

Best wishes, Julian

jamesturk commented 2 years ago

thanks for this, just released 0.8.9 with this and one other fix

juliangilbey commented 2 years ago

Super, thanks! Amazingly quick response :-)

maxbachmann commented 2 years ago

@jamesturk @juliangilbey note that this behavior stems from the original implementation of Winkler from here: https://web.archive.org/web/20100227020019/http://www.census.gov/geo/msb/stand/strcmp.c.

The corresponding paper only states that it boosts the similarity for up to 4 common characters at the start of the strings: https://files.eric.ed.gov/fulltext/ED325505.pdf. He mentions, that the census data matched includes:

first name, middle initial, last name, house number, street name, rural route number, postal box number, conglomerated address, telephone number, age, sex, marital status, relationship to head of household, and race So I would assume that her came to the conclusion, that boosting the similarity based on the prefix does not improve the metric for things like telephone numbers.

However since it is not explicitly mentioned and people could directly use the Jaro similarity when they do not want this behavior, I guess it is reasonable to drop this behavior.