Closed juliangilbey closed 2 years ago
Oops, this is a bug against cjellyfish, not jellyfish.
Ah, cjellyfish doesn't have issues enabled, so I'll report it here. This comment from @maxbachmann is really concerning. Now having looked at the original code, I see that the cjellyfish code for at least this algorithm is essentially copied directly from the original Winkler implementation without any acknowledgement of this. Furthermore, since that code did not have any license statement, this implementation is likely to be a copyright infringement, unless the code was relicensed to you and/or Sunlight Labs. Please could you clarify the situation? Thanks!
It acknowledges that it copied the code:
/* borrowed heavily from strcmp95.c
* http://www.census.gov/geo/msb/stand/strcmp.c
*/
https://github.com/jamesturk/cjellyfish/blob/b90750ee0624515004eab4db5c0b2e3b45370bc2/jaro.c#L7-L9
However, I am not sure about the original license, since the original code does not include a license statement.
The original code is a product of the federal government and is in the public domain.
Ah, OK, thanks. That makes things a lot clearer. It would be good to include this information in the LICENSE file.
@jamesturk @juliangilbey note that this behavior stems from the original implementation of Winkler from here: https://web.archive.org/web/20100227020019/http://www.census.gov/geo/msb/stand/strcmp.c.
The corresponding paper only states that it boosts the similarity for up to 4 common characters at the start of the strings: https://files.eric.ed.gov/fulltext/ED325505.pdf. He mentions that the census data matched includes:
However, since it is not explicitly mentioned, and people can use the Jaro similarity directly when they do not want this behavior, I guess it is reasonable to drop this behavior.
Originally posted by @maxbachmann in https://github.com/jamesturk/jellyfish/issues/147#issuecomment-955586499
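For readers unfamiliar with the prefix boost mentioned in the quoted comment, here is a minimal Python sketch of the idea: an already computed Jaro similarity is raised according to the number of matching leading characters, capped at 4. The function name `winkler_boost` and its parameters are hypothetical, and the scaling factor of 0.1 is an assumption (a commonly used value, not taken from this thread). It is not the jellyfish, cjellyfish, or strcmp95 code.

```python
def winkler_boost(jaro_sim: float, s1: str, s2: str,
                  prefix_weight: float = 0.1, max_prefix: int = 4) -> float:
    """Apply a Winkler-style prefix boost to a precomputed Jaro similarity.

    Hypothetical helper for illustration only. prefix_weight=0.1 is an
    assumed scaling factor; max_prefix=4 follows the "up to 4 common
    characters" description in the thread.
    """
    # Count matching characters at the start of both strings, up to max_prefix.
    prefix_len = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix_len += 1

    # Move the similarity toward 1.0 in proportion to the shared prefix length.
    return jaro_sim + prefix_len * prefix_weight * (1.0 - jaro_sim)
```

With a shared prefix of length 0 the function returns the Jaro similarity unchanged, which is why callers who do not want any prefix-related behavior can use the plain Jaro similarity instead.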