kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Small sample errors #34

Closed by sysilviakim 5 years ago

sysilviakim commented 5 years ago

Hello all,

Thank you again for the amazing package. fastLink works very well with moderate to large datasets, but it sometimes breaks down when faced with an extremely small sample. For instance, as of the Sep 5, 2018 commit on gamma functions, the following works:

library(fastLink)
data(samplematch)
## One underlying true match
matches.out <- fastLink(
  dfA = dfA[c(1, 3), ], dfB = dfA[c(1, 2), ], 
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname")
)

The following still gives an indexing error (no underlying true matches):

## No underlying true match
matches.out <- fastLink(
  dfA = dfA[c(1, 3), ], dfB = dfA[c(4, 2), ], 
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname")
)

Also, when there happens to be only a single observation in both dfA and dfB, ncol(patterns) - 1 is not correctly recognized inside the function emlinkMARmov (line 7), and the following also breaks:

## One underlying true match
matches.out <- fastLink(
  dfA = dfA[1, ], dfB = dfA[1, ], 
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname") 
) 
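For illustration only: one common way for an expression like ncol(patterns) - 1 to misbehave with a single row in R is dimension dropping, where subsetting a matrix down to one row silently returns a plain vector. Whether this is what actually happens inside emlinkMARmov has not been checked against the fastLink source; the `patterns` object below is a made-up stand-in, not the package's internal data.

```r
# Hypothetical stand-in for a table of agreement patterns plus a count
# column; this is NOT fastLink's actual internal object.
patterns <- matrix(c(2, 2, 0, 1,
                     2, 0, 0, 3),
                   nrow = 2, byrow = TRUE)

n_fields <- ncol(patterns) - 1        # 3, as expected for a 2-row matrix

# Subsetting to a single row drops the matrix to a plain vector ...
one_row <- patterns[1, ]
is.null(ncol(one_row))                # TRUE: a vector has no dim attribute
length(ncol(one_row) - 1)             # 0: NULL - 1 evaluates to numeric(0)

# ... unless drop = FALSE is used to keep the matrix structure.
one_row_kept <- patterns[1, , drop = FALSE]
n_fields_kept <- ncol(one_row_kept) - 1   # 3 again
```

If this is the cause, any downstream indexing that relies on the number of columns would then fail only in the one-row case.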

As for the last case, it does work when dfA = dfA[1, ] and dfB = dfA[c(1, 2), ], so I really don't know what the issue is. It would be great to get a warning and an empty output with the same structure instead of a failure, since the last setup doesn't make sense for probabilistic matching anyway. Such small samples sometimes come up in a dynamic setting.
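In the meantime, the requested behavior can be approximated on the caller's side. The sketch below is a hypothetical wrapper, not part of fastLink: `link_or_empty` and its `link_fun` argument are names invented here, and the empty return value assumes a simplified output shape (a `matches` data frame with `inds.a` and `inds.b`, plus `nobs.a` and `nobs.b`), not the full object that fastLink returns.

```r
# Hypothetical helper, not part of fastLink. `link_fun` is whatever linkage
# function the caller uses (for example fastLink::fastLink); it is passed in
# so that this sketch runs without the package installed.
link_or_empty <- function(dfA, dfB, link_fun, ...) {
  # Guard only the degenerate case where both inputs have fewer than 2 rows.
  if (nrow(dfA) < 2L && nrow(dfB) < 2L) {
    warning("Too few observations for probabilistic linkage; ",
            "returning an empty result.")
    # Assumed minimal output shape: row indices of matched pairs, none here.
    return(list(
      matches = data.frame(inds.a = integer(0), inds.b = integer(0)),
      nobs.a  = nrow(dfA),
      nobs.b  = nrow(dfB)
    ))
  }
  link_fun(dfA = dfA, dfB = dfB, ...)
}

# Usage with a stub in place of a real linkage call.
one  <- data.frame(firstname = "ann")
stub <- function(dfA, dfB, ...) "linked"
res  <- link_or_empty(one, one, link_fun = stub)  # warns, returns empty result
nrow(res$matches)                                 # 0
```

Passing the linkage function as an argument keeps the guard independent of any particular package version; with a one-row dfA and a two-row dfB the call is forwarded unchanged, in line with the case that already works.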

Sincerely, Silvia

bfifield commented 5 years ago

Thanks, Silvia! This is a good catch: an edge case that, as you mention, will come up a lot with dynamic merges. We've added handling for this case in the EM in the newest commit, and will make this part of the next submission to CRAN.