kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

"undefined columns selected" in dedupeMatches #10

Closed by kuriwaki 7 years ago

kuriwaki commented 7 years ago

When dfA contains a variable that is irrelevant for matching and is not in dfB, dedupeMatches breaks here: https://github.com/kosukeimai/fastLink/blob/021fa0e435cf19074788f4d0640becdae4ca77d1/R/dedupeMatches.R#L267

It looks like specifying two types of colnames.df earlier in the code (instead of one based on matchesA) would prevent this from happening?
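
To illustrate the failure mode outside of fastLink, here is a minimal base-R sketch. The data frames and variable names (dfA_toy, dfB_toy, colnames_shared, colnames_A, colnames_B) are hypothetical and are not taken from dedupeMatches.R. The sketch only assumes that the error comes from selecting columns of one data frame using the column names of the other.

```r
# Hypothetical toy data; this is NOT the fastLink internals.
dfA_toy <- data.frame(firstname = c("ann", "bob"),
                      lastname  = c("lee", "ray"),
                      extra     = 1:2)          # column present only in A
dfB_toy <- data.frame(firstname = c("ann", "bob"),
                      lastname  = c("lee", "roy"))

# A single set of column names derived from A only.
colnames_shared <- colnames(dfA_toy)

# Selecting A's column names from B fails, because `extra` is not in B.
# tryCatch() captures the error message instead of stopping the script.
res <- tryCatch(dfB_toy[, colnames_shared],
                error = function(e) conditionMessage(e))
print(res)

# Keeping one set of column names per data frame avoids the problem.
colnames_A <- colnames(dfA_toy)
colnames_B <- colnames(dfB_toy)
subA <- dfA_toy[, colnames_A]
subB <- dfB_toy[, colnames_B]
```

With separate name vectors, each data frame is only ever indexed by columns it actually contains, so an extra variable on one side cannot break the subsetting on the other.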

MWE:

library(fastLink)
data(samplematch)

dfAextra <- data.frame(dfA, extra = 1:nrow(dfA)) # this var not used for matching
out <- fastLink(dfAextra, dfB, varnames = c("firstname", "middlename", "lastname"))

# ==================== 
# fastLink(): Fast Probabilistic Record Linkage
# ==================== 
# 
# Calculating matches for each variable.
# Getting counts for zeta parameters.
# Parallelizing gamma calculation using 1 cores.
# Running the EM algorithm.
# Getting the indices of estimated matches.
# Parallelizing gamma calculation using 1 cores.
# Deduping the estimated matches.
# Error in `[.data.frame`(x, r, vars, drop = drop) : 
#   undefined columns selected

bfifield commented 7 years ago

Thanks, Shiro! Just tested - your fix is totally correct. Pushed the commit to fix.