kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

cut.a for stringdist.method == "lv" #71

Closed wbakerrobinson closed 1 year ago

wbakerrobinson commented 1 year ago

Hello, I assume that "lv" is Levenshtein distance. My understanding is that the Levenshtein distance ranges from 0 to the Hamming distance when the two strings are of equal length. I am comparing standardized dates of birth with "lv", so this range should hold. The documentation states that cut.a is the lower bound for a full string-distance match, ranging between 0 and 1. If I want to treat Levenshtein distances of 0 or 1 as matches, what should I pass as the cut.a argument? If it should be a number between 0 and 1, how is the Levenshtein distance mapped to that range?

Thanks! Will

tedenamorado commented 1 year ago

Hi Will,

Yes, "lv" refers to Levenshtein. However, its values are rescaled to fit within the range of 0 to 1. A value of 0 indicates that the strings are completely different, while a value of 1 means they are identical. By default, the threshold for agreement (cut.a) is set at 0.94, but you can modify this value if you find it too strict or too lenient.

If you have any other questions, please don't hesitate to let us know.

All my best,

Ted

wbakerrobinson commented 1 year ago

Hi Ted, Thank you for the quick response. Can you tell me how you map values from Levenshtein distance to the range [0, 1]? In the examples below what would a Levenshtein distance of 1 map to? What would a Levenshtein distance of 2 map to?

library(stringdist)
dobA <- "1900-01-01"
dobB <- "1900-01-01"
dobC <- "1900-02-01"
dobD <- "1901-02-01"
dobE <- "3333 33 33"

# How are these distances mapped to [0, 1]?
# Returns 0; fastLink maps this to 1
stringdist(dobA, dobB, method = "lv")

# Returns 1; maps to ?
stringdist(dobA, dobC, method = "lv")

# Returns 2; maps to ?
stringdist(dobA, dobD, method = "lv")

# Returns 10; fastLink maps this to 0
stringdist(dobA, dobE, method = "lv")

Thanks, Will

aalexandersson commented 1 year ago

Disclaimer: I am a regular fastLink user, not a fastLink developer.

Are you interested in partial string matches? If yes, see the fastLink function gammaCKpar for the mapping. If no, see the fastLink function gammaCK2par for the mapping.

Caution: I do not recommend using "lv" with string variable dob because "lv" assumes that every character is equally important. Usually this assumption is not true with dob. In the example 1900-01-01 (yyyy-mm-dd), a 1-character change to 1920-01-01 (a 20-year change) is usually more important than the change to 1900-01-21 (a 20-day change). For partial matching, I instead recommend using the numeric variable age. For exact matching, dob works well.
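
The caution above can be checked with a small sketch. It uses base R's adist() as a stand-in for stringdist(..., method = "lv") so that it runs without extra packages; the variable names are illustrative only.

```r
# Two one-character changes to the same date of birth.
dob      <- "1900-01-01"
dob_year <- "1920-01-01"  # tens digit of the year changed
dob_day  <- "1900-01-21"  # tens digit of the day changed

# utils::adist() computes the Levenshtein distance in base R.
lv_year <- drop(adist(dob, dob_year))
lv_day  <- drop(adist(dob, dob_day))

# Difference in days when the values are treated as dates, not strings.
days_year <- as.numeric(as.Date(dob_year) - as.Date(dob))
days_day  <- as.numeric(as.Date(dob_day) - as.Date(dob))

print(c(lv_year = lv_year, lv_day = lv_day))
print(c(days_year = days_year, days_day = days_day))
```

Both changes have the same Levenshtein distance, while the gap in days differs by orders of magnitude, which is the reason a numeric comparison may suit partial matching on dates better.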

wbakerrobinson commented 1 year ago

I looked into the function gammaCKpar and found the following:

lv = 1 - (stringdist(stringA, stringB, method = "lv") * 1/max(nchar(stringA), nchar(stringB)))

For example 1 above (a Levenshtein distance of 1): lv = 1 - (1 x 1/10) = 0.9
For example 2 above (a Levenshtein distance of 2): lv = 1 - (2 x 1/10) = 0.8
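
For anyone who wants to reproduce these numbers, here is a minimal sketch of the rescaling described above. It is not fastLink's own code: lv_similarity is a hypothetical helper, and base R's adist() is used in place of stringdist(..., method = "lv") so that no extra package is needed.

```r
# Hypothetical helper: Levenshtein distance rescaled to a [0, 1] similarity
# by dividing by the number of characters in the longer string.
# Empty strings are not handled (the denominator would be 0).
lv_similarity <- function(a, b) {
  d <- drop(adist(a, b))  # base-R Levenshtein distance
  1 - d / max(nchar(a), nchar(b))
}

dobA <- "1900-01-01"
dobB <- "1900-01-01"
dobC <- "1900-02-01"
dobD <- "1901-02-01"
dobE <- "3333 33 33"

# Similarity of dobA to each of the other example strings.
sims <- c(
  AB = lv_similarity(dobA, dobB),
  AC = lv_similarity(dobA, dobC),
  AD = lv_similarity(dobA, dobD),
  AE = lv_similarity(dobA, dobE)
)
print(sims)
```

Under this assumption the four comparisons come out as 1, 0.9, 0.8 and 0, matching the hand calculation for distances of 0, 1, 2 and 10 on 10-character strings.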

I appreciate your unsolicited feedback on use of matching variables, but I am merely trying to replicate a linkage done by my coworker in another linkage software. The other linkage software has a string comparison that allows for "typos", and this seems to be most closely replicated in fastLink by the use of "lv". Now that I understand how the "lv" is mapped to [0,1] I can set a threshold which allows for a certain number of differences. Some may also find this helpful for a field like zip code.
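
One way to turn "allow up to k typos" into a cut.a value is sketched below. It assumes the similarity is 1 minus the Levenshtein distance divided by the string length, and that all values of the field have the same length. cut_for_edits is a hypothetical helper, not part of fastLink. It returns the midpoint between the similarity at k edits and at k + 1 edits, so the result does not depend on whether the comparison against cut.a is strict or on floating-point rounding at the boundary.

```r
# Hypothetical helper: a cut.a value that accepts up to `k` edits for
# strings of `n_chars` characters, assuming similarity = 1 - distance / n_chars.
cut_for_edits <- function(k, n_chars) {
  sim_k    <- 1 - k / n_chars        # similarity at exactly k edits
  sim_next <- 1 - (k + 1) / n_chars  # similarity at k + 1 edits
  (sim_k + sim_next) / 2             # midpoint between the two
}

cut_dob <- cut_for_edits(k = 1, n_chars = 10)  # dates written as "yyyy-mm-dd"
cut_zip <- cut_for_edits(k = 1, n_chars = 5)   # 5-digit zip codes
print(c(cut_dob = cut_dob, cut_zip = cut_zip))

# Possible use (dfA, dfB and the column name "dob" are placeholders):
# out <- fastLink(dfA, dfB, varnames = "dob",
#                 stringdist.match = "dob",
#                 stringdist.method = "lv", cut.a = cut_dob)
```

If the field's values vary in length, the same number of edits gives a different similarity for each length, so a single threshold no longer corresponds to a fixed number of typos.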