Closed juliangilbey closed 2 years ago
Thanks for this, I'm not sure what that NOTNUM piece is supposed to be for either.
A PR would be great, it'd also be great to update the test data here to include an example.
OK, I've just made three PRs, one on each of jellyfish, cjellyfish and jellyfish-testdata. My local tests have been successful with these. I've made exactly the changes described above.
Best wishes, Julian
thanks for this, just released 0.8.9 with this and one other fix
Super, thanks! Amazingly quick response :-)
@jamesturk @juliangilbey note that this behavior stems from the original implementation of Winkler from here: https://web.archive.org/web/20100227020019/http://www.census.gov/geo/msb/stand/strcmp.c.
The corresponding paper only states that it boosts the similarity for up to 4 common characters at the start of the strings: https://files.eric.ed.gov/fulltext/ED325505.pdf. He mentions, that the census data matched includes:
first name, middle initial, last name, house number, street name, rural route number, postal box number, conglomerated address, telephone number, age, sex, marital status, relationship to head of household, and race So I would assume that her came to the conclusion, that boosting the similarity based on the prefix does not improve the metric for things like telephone numbers.
However since it is not explicitly mentioned and people could directly use the Jaro similarity when they do not want this behavior, I guess it is reasonable to drop this behavior.
Hi,
In the Python implementation, there is the following code to determine the prefix-length for weight boosting: https://github.com/jamesturk/jellyfish/blob/ac6a64a1b37ba6ff648efe397951e5ed58d80f3f/jellyfish/_jellyfish.py#L104-L105
On the other hand, in the C implementation, we have: https://github.com/jamesturk/cjellyfish/blob/29f1f023584004c26ae3299b7c5f23ab57c0b905/jaro.c#L118:
Here, the
NOTNUM
macro excludes any characters which are digits. This seems a rather strange thing to do (and is an undocumented behaviour); why should a comparison of the strings "0" and "00" be treated differently from a comparison of the strings "@" and "@@", for example?As a result, the code
jellyfish.jaro_winkler_similarity("00", "0")
returns 0.833..., whereasjellyfish.jaro_winkler_similarity("00", "0")
gives 0.85.It looks as though the
and s1[i]
part in the Python might have been intended to perform a similar role, but it is actually redundant, as a length 1 string always evaluates toTrue
in a boolean context.My suggestion is to drop this special-casing completely: remove the
and s1[i]
from the Python version and the two uses ofNOTNUM
(there is a second use a little later, which is equally problematic) and its definition from the C version. I can make PRs for this if you would like.Best wishes,
Julian