kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

dedupeMatches() fails on single-variable matches #13

Closed tpaskhalis closed 7 years ago

tpaskhalis commented 7 years ago

Thank you for the package! Its release was very timely for my work.

When I tried it on my datasets, matching on only one string variable failed. The same error can be reproduced with the package's sample data:

> library(fastLink)
> data(samplematch)
> matches.out <- fastLink(
+   dfA = dfA, dfB = dfB, 
+   varnames = c("firstname"),
+   stringdist.match = c("firstname"),
+   partial.match = c("firstname")
+ )

==================== 
fastLink(): Fast Probabilistic Record Linkage
==================== 

Calculating matches for each variable.
Getting counts for zeta parameters.
(Using OpenMP to parallelize calculation. 1 threads out of 4 are used.)
Running the EM algorithm.
Getting the indices of estimated matches.
(Using OpenMP to parallelize calculation. 1 threads out of 4 are used.)
Deduping the estimated matches.

 Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column 

The culprit is line 158 in tableCounts.R, which coerces the data.frame into a vector when only one column is left after dropping the counts column, because single-bracket indexing defaults to drop = TRUE.

It should be:

na.data.new <- data.new.1[, -c(nc), drop = FALSE]
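
A minimal illustration of the difference, using a toy data frame rather than the package's actual internal object (the column names and values here are made up, and nc is assumed to index the counts column):

```r
# Toy stand-in for the internal table: one comparison column plus counts.
# Column names and values are hypothetical, for illustration only.
data.new.1 <- data.frame(firstname = c(2, 0, NA), counts = c(5, 3, 1))
nc <- ncol(data.new.1)  # assumed position of the counts column

# Default drop = TRUE: the single remaining column collapses to a vector,
# so the column name is lost and later merges by name cannot find it.
dropped <- data.new.1[, -c(nc)]
is.data.frame(dropped)  # FALSE

# drop = FALSE keeps the data.frame structure and the column name.
kept <- data.new.1[, -c(nc), drop = FALSE]
is.data.frame(kept)     # TRUE
names(kept)             # "firstname"
```

With two or more comparison variables, both forms return a data.frame, which is presumably why the problem only shows up in the single-variable case.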

A warning for single-variable matches might also be useful.