jamesturk / jellyfish

🪼 a python library for doing approximate and phonetic matching of strings.
https://jamesturk.github.io/jellyfish/
MIT License

License issue #149

Closed juliangilbey closed 2 years ago

juliangilbey commented 2 years ago

@jamesturk @juliangilbey note that this behavior stems from the original implementation of Winkler from here: https://web.archive.org/web/20100227020019/http://www.census.gov/geo/msb/stand/strcmp.c.

The corresponding paper only states that it boosts the similarity for up to 4 common characters at the start of the strings: https://files.eric.ed.gov/fulltext/ED325505.pdf. He mentions that the census data matched includes:

"first name, middle initial, last name, house number, street name, rural route number, postal box number, conglomerated address, telephone number, age, sex, marital status, relationship to head of household, and race"

So I would assume that he came to the conclusion that boosting the similarity based on the prefix does not improve the metric for things like telephone numbers.

However, since the paper does not mention this explicitly, and people can use the plain Jaro similarity directly when they do not want the prefix boost, I guess it is reasonable to drop this behavior.
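To make the behavior under discussion concrete, here is a minimal Python sketch of Jaro similarity with the Winkler prefix boost. It is not the code from strcmp95.c, jellyfish, or cjellyfish. The function names, the `prefix_scale=0.1` default, and the `skip_numeric_prefix` flag are assumptions made for illustration; the flag only approximates the idea described above (no prefix boost for digits), and other refinements of the original implementation are left out.

```python
def jaro(s1: str, s2: str) -> float:
    """Plain Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4,
                 skip_numeric_prefix: bool = False) -> float:
    """Jaro similarity boosted for a shared prefix of up to max_prefix chars.

    skip_numeric_prefix is a hypothetical flag for this sketch: when True,
    the shared prefix stops being counted at the first digit.
    """
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        if skip_numeric_prefix and a.isdigit():
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1.0 - sim)


if __name__ == "__main__":
    print(jaro("MARTHA", "MARHTA"))
    print(jaro_winkler("MARTHA", "MARHTA"))
    # Two phone-number-like strings: with the flag set, no boost is applied.
    print(jaro_winkler("5551234", "5551243"))
    print(jaro_winkler("5551234", "5551243", skip_numeric_prefix=True))
```

With the flag set, two strings that share only a numeric prefix score exactly their plain Jaro similarity, which is the same result a caller gets by using Jaro directly.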

Originally posted by @maxbachmann in https://github.com/jamesturk/jellyfish/issues/147#issuecomment-955586499

juliangilbey commented 2 years ago

Oops, this is a bug against cjellyfish, not jellyfish.

juliangilbey commented 2 years ago

Ah, cjellyfish doesn't have issues enabled, so I'll report it here. This comment from @maxbachmann is really concerning. Now having looked at the original code, I see that the cjellyfish code for at least this algorithm is essentially copied directly from the original Winkler implementation without any acknowledgement of this. Furthermore, since that code did not have any license statement, this implementation is likely to be a copyright infringement, unless the code was relicensed to you and/or Sunlight Labs. Please could you clarify the situation? Thanks!

maxbachmann commented 2 years ago

It acknowledges that it copied the code:

/* borrowed heavily from strcmp95.c
 *    http://www.census.gov/geo/msb/stand/strcmp.c
 */

https://github.com/jamesturk/cjellyfish/blob/b90750ee0624515004eab4db5c0b2e3b45370bc2/jaro.c#L7-L9

That said, I am not sure about the original license, since the file does not include a license statement.

jamesturk commented 2 years ago

The original code is a product of the federal government and is in the public domain.

juliangilbey commented 2 years ago

Ah, OK, thanks. That makes things a lot clearer. It would be good to include this information in the LICENSE file.