jamesturk / jellyfish

🪼 a python library for doing approximate and phonetic matching of strings.
https://jamesturk.github.io/jellyfish/
MIT License
2.04k stars 157 forks source link

rename jaro_distance to jaro_similarity #122

Closed stephantul closed 4 years ago

stephantul commented 5 years ago

The jaro_distance function returns a similarity (1 for exact matches) instead of a distance (0 for exact matches), which is the opposite of what the name implies. Would it be possible to rename the function to read jaro_similarity instead to avoid confusion?

Niecke commented 5 years ago

To be honest, I was confused by this a few weeks ago. So maybe renaming it would be helpful.

EDIT: Ok jaro_distance is correct since 0 means absolute dissimilar and 1 absolute similar. For the jaro similarity it should be vice versa.

stephantul commented 5 years ago

@Niecke I don't think that's true. Here's a simple test that shows that the logic is inverted.

import numpy as np
from jellyfish import levenshtein_distance, jaro_distance

a = "aabb"
b = "abab"

# We expect the output to be 1, since a has a smaller distance to itself than to something else
levenshtein = np.argmin([levenshtein_distance(a, b), levenshtein_distance(a, a)])
# Jaro gives the opposite result.
jaro = np.argmin([jaro_distance(a, b), jaro_distance(a, a)])
# But inverting it gives us the correct answer
jaro_inv = np.argmin([1 - jaro_distance(a, b), 1 - jaro_distance(a, a)])

This shows that the jaro_distance function is, in fact, doing the opposite of what it says it is doing, and is calculating a similarity. Effectively, switching levenshtein_distance to jaro_distance requires you to switch around all logic statements (i.e., every min becomes max and vice versa).

jamesturk commented 4 years ago

looking at ways to do this without breaking everyone, it'll probably require a major version upgrade to be final, but I could add the renamed function in the interim

jamesturk commented 4 years ago

as of 0.8.2 this is now the correct name, the old names still function but raise a warning

I will fix the _distance variants in 1.0