Closed stephantul closed 4 years ago
To be honest, I was confused by this a few weeks ago. So maybe renaming it would be helpful.
EDIT: Ok jaro_distance is correct since 0 means absolute dissimilar and 1 absolute similar. For the jaro similarity it should be vice versa.
@Niecke I don't think that's true. Here's a simple test that shows that the logic is inverted.
import numpy as np
from jellyfish import levenshtein_distance, jaro_distance
a = "aabb"
b = "abab"
# We expect the output to be 1, since a has a smaller distance to itself than to something else
levenshtein = np.argmin([levenshtein_distance(a, b), levenshtein_distance(a, a)])
# Jaro gives the opposite result.
jaro = np.argmin([jaro_distance(a, b), jaro_distance(a, a)])
# But inverting it gives us the correct answer
jaro_inv = np.argmin([1 - jaro_distance(a, b), 1 - jaro_distance(a, a)])
This shows that the jaro_distance
function is, in fact, doing the opposite of what it says it is doing, and is calculating a similarity. Effectively, switching levenshtein_distance
to jaro_distance
requires you to switch around all logic statements (i.e., every min
becomes max
and vice versa).
looking at ways to do this without breaking everyone, it'll probably require a major version upgrade to be final, but I could add the renamed function in the interim
as of 0.8.2 this is now the correct name, the old names still function but raise a warning
I will fix the _distance variants in 1.0
The
jaro_distance
function returns a similarity (1 for exact matches) instead of a distance (0 for exact matches), which is the opposite of what the name implies. Would it be possible to rename the function to readjaro_similarity
instead to avoid confusion?