kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage
replacement has 2 rows, data has 0 #17

Closed kuriwaki closed 6 years ago

kuriwaki commented 6 years ago

I'm getting an error when merging on two columns of names, and can't quite figure out why.

Here is a toy example that replicates the error. In this example, increasing the number of observations solves the problem, but it does not in my real case. However, adding a third column seems to solve the problem.

Would anyone know what's going on?

library(fastLink)

## setup toy data
nobs.a <- 30
set.seed(66455) # needs to be a particular draw to replicate error
dfA.0 <- data.frame(firstname = sample(c("JOHN", "GEORGE"), nobs.a, TRUE, c(0.7, 0.3)),
                    lastname = sample(c("MILLER", "HILL"), nobs.a, TRUE, c(0.7, 0.3)))

dfB.0 <- data.frame(firstname = c("JOHN", "OLIVER", "CHARLES", "FRANCIS", "JOHN"),
                    lastname = c("HILL", "YOUNG", "KEEL", "MCNEAL", "KOONS"))
## throws error 
fL.0 <- fastLink(dfA.0, dfB.0, varnames = c("firstname", "lastname"))
#> 
#> ==================== 
#> fastLink(): Fast Probabilistic Record Linkage
#> ==================== 
#> 
#> Calculating matches for each variable.
#> Getting counts for zeta parameters.
#> Running the EM algorithm.
#> Getting the indices of estimated matches.
#> Warning in min(em.obj$weights[em.obj$zeta.j >= l.t]): no non-missing
#> arguments to min; returning Inf
#> Deduping the estimated matches.
#> Error in `$<-.data.frame`(`*tmp*`, roworder, value = c(1L, 0L)): replacement has 2 rows, data has 0

## larger N solves the problem here, though it doesn't in my real data
dfA.1 <- data.frame(firstname = sample(c("JOHN", "GEORGE"), nobs.a*100, TRUE, c(0.7, 0.3)),
                    lastname = sample(c("MILLER", "HILL"), nobs.a*100, TRUE, c(0.7, 0.3)))
fL.1 <- fastLink(dfA.1, dfB.0, varnames = c("firstname", "lastname"))
#> 
#> ==================== 
#> fastLink(): Fast Probabilistic Record Linkage
#> ==================== 
#> 
#> Calculating matches for each variable.
#> Getting counts for zeta parameters.
#> Running the EM algorithm.
#> Getting the indices of estimated matches.
#> Deduping the estimated matches.

## adding a superfluous third column solves the problem, even when a larger sample does not in my real case
dfA.2 <- cbind(dfA.0, noise = c("noiseA", rep("noiseB", nobs.a - 1)))
dfB.2 <- cbind(dfB.0, noise = c("noiseA", rep("noiseB", nrow(dfB.0) - 1)))
fL.2 <- fastLink(dfA.2, dfB.2, varnames = c("firstname", "lastname", "noise"))
#> 
#> ==================== 
#> fastLink(): Fast Probabilistic Record Linkage
#> ==================== 
#> 
#> Calculating matches for each variable.
#> Getting counts for zeta parameters.
#> Running the EM algorithm.
#> Getting the indices of estimated matches.
#> Deduping the estimated matches.

tedenamorado commented 6 years ago

Hi Shiro,

The problem is that, by default, the cutoff for declaring a pair of observations a match is 0.85. In your example, even when a pair agrees on both first name and last name, the maximum probability that a pair of records is a match is only ~0.81.

Try:

fL.0 <- fastLink(dfA.0, dfB.0, varnames = c("firstname", "lastname"), threshold.match = 0.75)
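
To illustrate why two agreeing fields can still fall short of the cutoff, here is a minimal base-R sketch of the posterior match probability under a conditional-independence mixture model. The function `posterior_match` and the values of `lambda`, `m`, and `u` are hypothetical: they were picked by hand so the two-field result lands near the ~0.81 mentioned above, and they are not the EM estimates from this example. The three-field case suggests one plausible reading of why adding a third column helped.

```r
# Posterior probability that a pair is a match, given that it agrees on
# every field, assuming fields are independent conditional on match status.
# Hypothetical helper for illustration; not a fastLink function.
posterior_match <- function(lambda, m, u) {
  num <- lambda * prod(m)                # P(match) * P(all fields agree | match)
  num / (num + (1 - lambda) * prod(u))   # Bayes' rule
}

lambda <- 0.05      # assumed share of true matches among all pairs
m2 <- c(0.9, 0.9)   # assumed P(agree | match) for two fields
u2 <- c(0.1, 0.1)   # assumed P(agree | non-match) for two fields

# Two agreeing fields versus three agreeing fields
p_two   <- posterior_match(lambda, m2, u2)
p_three <- posterior_match(lambda, c(m2, 0.9), c(u2, 0.1))

threshold <- 0.85   # default cutoff quoted in the comment above
print(c(two_fields = p_two, three_fields = p_three))
print(c(two_fields = p_two >= threshold, three_fields = p_three >= threshold))
```

With these assumed inputs, the two-field posterior stays below 0.85 while the three-field posterior clears it, so no pair is declared a match until either the threshold is lowered or another informative field is added.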

kosukeimai commented 6 years ago

@tedenamorado Is it possible to issue an informative warning message?

kuriwaki commented 6 years ago

Thanks Ted! I was fiddling with cut.a but forgot about the more important threshold.match. This resolves the immediate question, and I agree that having that informative message at some point would be great.

tedenamorado commented 6 years ago

We will add that warning in our next release. Thanks for the feedback!
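
For reference, a guard of the kind discussed here could look roughly like the sketch below. It is only an assumption about what such a check might do, not the code that went into fastLink: `check_threshold` is a hypothetical helper and the `zeta` values are made up. It also shows where the `min()` warning in the original report comes from: when no posterior reaches the cutoff, the subset is empty and `min()` returns `Inf`.

```r
# Hypothetical helper, not part of fastLink: warn when no posterior match
# probability reaches the threshold. Assumes zeta is non-empty with no NAs.
check_threshold <- function(zeta, threshold = 0.85) {
  if (!any(zeta >= threshold)) {
    warning(sprintf(
      "No pair has a posterior match probability >= %.2f (maximum is %.2f). Consider lowering threshold.match.",
      threshold, max(zeta)
    ), call. = FALSE)
    return(invisible(FALSE))
  }
  invisible(TRUE)
}

# Made-up posterior probabilities, one per agreement pattern
zeta <- c(0.01, 0.12, 0.81)

# Without a guard: the subset is empty, so min() warns and returns Inf
empty_min <- suppressWarnings(min(zeta[zeta >= 0.85]))
print(empty_min)

# With the guard: an explicit warning at the default cutoff, none at 0.75
res <- tryCatch(check_threshold(zeta), warning = function(w) conditionMessage(w))
print(res)
print(check_threshold(zeta, threshold = 0.75))
```

Running the check before the matched indices are extracted would let the function stop or return early with a clear message, rather than failing later in the dedupe step.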