kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Reproducing RLdata500 deduplication results #24

Closed 1danjordan closed 6 years ago

1danjordan commented 6 years ago

Hi,

First of all, thank you for your work on fastLink. I've been trying to recreate the results from your paper:

Steorts (2015) reports that the FNR and FDR of her methodology are 0.02 and 0.04, respectively. When applying our algorithm, we use three categories for the string valued variables (first and last names), i.e., exact (or nearly identical) match, partial match, and disagreement, based on the Jaro-Winkler string distance with 0.94 and 0.88 as the cutpoints as recommended by Winkler (1990). For the numeric valued fields (day, month, and year of birth), we use a binary comparison, based on exact matches. Using fastLink, we found both FNR and FDR of our methodology to be exactly zero.
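
For readers following along, here is a minimal sketch of the comparison scheme the quoted passage describes: three agreement levels for string fields and a binary exact comparison for numeric fields. The helper `agreement_level` and the similarity scores are made up for illustration; they are not part of fastLink, and whether fastLink treats the cutpoints as inclusive is an assumption here.

```r
# Hypothetical helper (not a fastLink function): map a similarity score in
# [0, 1] to an agreement level.
# 2 = exact or nearly identical, 1 = partial match, 0 = disagreement.
# Treating the cutpoints as inclusive (>=) is an assumption.
agreement_level <- function(sim, cut.a = 0.94, cut.p = 0.88) {
  ifelse(sim >= cut.a, 2L, ifelse(sim >= cut.p, 1L, 0L))
}

# Made-up Jaro-Winkler-style similarity scores for four name pairs.
sims <- c(1.00, 0.95, 0.90, 0.60)
levels.string <- agreement_level(sims)

# Binary exact comparison for a numeric field such as year of birth
# (made-up values).
by.a <- c(1949, 1968, 1930)
by.b <- c(1949, 1969, 1930)
levels.numeric <- as.integer(by.a == by.b)
```

In the package itself, the cutpoints are passed as `cut.a` and `cut.p`, as in the `gammaCKpar` calls later in this thread.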

I've included a reproducible example below:

library(dplyr)
library(fastLink)
data("RLdata500", package = "RecordLinkage")

matching_vars <- names(RLdata500)
matches <- fastLink(
  dfA              = RLdata500,
  dfB              = RLdata500,
  varnames         = matching_vars[c(1, 3, 4, 5, 6, 7)],
  stringdist.match = matching_vars[c(1, 3, 4)],
  partial.match    = matching_vars[c(1, 3, 4)]
)

matches$matches
# inds.a inds.b
#1     59     59
#2    173    173
#3    219    219
#4    336    336
#5    360    360
#6    419    419
#7    455    455
#8    479    479

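# rl_data: RLdata500 plus a row-number column n to join on
rl_data <- RLdata500 %>% mutate(n = row_number())

dups <- matches$matches %>% 
    filter(inds.a != inds.b) %>% 
    inner_join(rl_data, by = c("inds.a" = "n")) %>% 
    inner_join(rl_data, by = c("inds.b" = "n"))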

As you can see, I've made no changes to the default settings of fastLink. The result contains only exact self-matches (each row matched to itself), so the dups data frame is empty and no duplicates are discovered. Do you have any suggestions for recreating your results? I'm assuming I just need to tweak some parameters?

Thanks, Dan

tedenamorado commented 6 years ago

Hi Dan,

Thanks for using fastLink! Believe me, it is through the feedback of the users that we have been able to improve the package in very meaningful ways.

There is a typo in the paper; it should read: first and last name. In the RLdata500 dataset, there are two sources for last names. However, one of them (lname_c2) has missing values almost everywhere - there are only 8 observed values out of 500. We do not use that variable in our replication code. In the simulation settings, we have shown that a large amount of missing information can lead to problems - basically, the estimated parameters of the model can be far from the truth.

However, even if you were to exclude that variable from your code, I just found out that the wrapper has a bug when performing a deduplication exercise. We will fix that soon and let you know when the wrapper has been fixed.

Wrapper aside, the following lines of code should reproduce the exercise we did in the paper. The code below follows the step-by-step procedure we describe here.

library('RecordLinkage')
RLdata500$id <- identity.RLdata500

library('fastLink')
## Create Agreement Vectors
g1 <- gammaCKpar(RLdata500$fname_c1, RLdata500$fname_c1, cut.a = 0.94, cut.p = 0.88)
g2 <- gammaCKpar(RLdata500$lname_c1, RLdata500$lname_c1, cut.a = 0.94, cut.p = 0.88)
g3 <- gammaKpar(RLdata500$by, RLdata500$by)
g4 <- gammaKpar(RLdata500$bm, RLdata500$bm)
g5 <- gammaKpar(RLdata500$bd, RLdata500$bd)
nr <- nrow(RLdata500)

## Count Patterns + EM
counts <- tableCounts(list(g1, g2, g3, g4, g5), nobs.a = nr, nobs.b = nr)
resEM <- emlinkMARmov(counts, nobs.a = nr, nobs.b = nr)

## Matches
matches <- matchesLink(list(g1, g2, g3, g4, g5), nobs.a = nr, nobs.b = nr, em = resEM, thresh = 0.85)

## Duplicates: expect 600 matched rows = 500 perfect self-matches + 100 duplicate rows.
## The data contain only 50 duplicates, but each one is reported twice:
## finding that row 1 in A is a duplicate of row 2 in B
## is equivalent to finding that row 2 in A is a duplicate of row 1 in B.
matches.1 <- RLdata500[matches$inds.a, ]
matches.2 <- RLdata500[matches$inds.b, ]
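
As a rough sketch of what could come next, the symmetric output can be reduced to unique duplicate pairs and scored against the true ids. Everything below uses made-up toy data in place of `RLdata500$id` and the `matchesLink` output, and the FDR/FNR definitions (share of declared pairs that are wrong, share of true pairs that were missed) are an assumption about how the paper's figures were computed.

```r
# Toy stand-in for RLdata500$id: records 1 and 4 are the same entity,
# as are records 2 and 5.
id <- c(10, 20, 30, 10, 20, 40)

# Toy stand-in for the matchesLink() output: six self-matches plus each
# duplicate pair reported in both directions.
matches <- data.frame(
  inds.a = c(1, 2, 3, 4, 5, 6, 1, 4, 2, 5),
  inds.b = c(1, 2, 3, 4, 5, 6, 4, 1, 5, 2)
)

# Keep one orientation of each pair; this also drops the self-matches.
pairs <- matches[matches$inds.a < matches$inds.b, ]

# True duplicate pairs: all i < j that share an id.
all.pairs <- t(combn(length(id), 2))
truth <- all.pairs[id[all.pairs[, 1]] == id[all.pairs[, 2]], , drop = FALSE]

# Compare declared and true pairs through a simple "i j" key.
found.key <- paste(pairs$inds.a, pairs$inds.b)
truth.key <- paste(truth[, 1], truth[, 2])
tp <- sum(found.key %in% truth.key)

fdr <- 1 - tp / length(found.key)  # declared pairs that are not true duplicates
fnr <- 1 - tp / length(truth.key)  # true duplicates that were not declared
```

Enumerating all pairs with `combn` is fine for 500 records but grows quadratically, so a larger dataset would need a grouped approach instead.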

I hope the code above helps! If you have further questions, just let us know.

Ted

1danjordan commented 6 years ago

Hi Ted,

Thanks a million for your quick response! After a good bit of fiddling and reading, I realised that I wasn't using the fastLink wrapper correctly because I wasn't passing the birth date variables into the numeric.match argument. Doing this resulted in an error; here's the code and the traceback:

library(dplyr)
library(fastLink)
data("RLdata500", package = "RecordLinkage")

# prep data 
rl_data <- RLdata500 %>% 
    as_tibble %>% 
    mutate_if(is.factor, as.character) %>% 
    mutate(n = row_number())

matching_vars     <- c("fname_c1", "lname_c1", "by", "bm", "bd")

rl_matches <- fastLink(
  dfA                = rl_data, 
  dfB                = rl_data,
  varnames           = c("fname_c1", "lname_c1", "by", "bm", "bd"),
  stringdist.match   = c("fname_c1", "lname_c1"),
  numeric.match      = c("by", "bm", "bd")
  )
Error in calcPWDcpp(matchesA[, varnames[i]], matchesB[, varnames[i]]) : 
  Not a matrix.
4. stop(structure(list(message = "Not a matrix.", call = calcPWDcpp(matchesA[, varnames[i]], matchesB[, varnames[i]]), cppstack = NULL), .Names = c("message", "call", "cppstack"), class = c("Rcpp::not_a_matrix", "C++Error", "error", "condition")))
3. calcPWDcpp(matchesA[, varnames[i]], matchesB[, varnames[i]])
2. dedupeMatches(matchesA = dfA[matches$inds.a, ], matchesB = dfB[matches$inds.b, ], EM = resultsEM, matchesLink = matches, varnames = varnames, stringdist.match = stringdist.match, numeric.match = numeric.match, partial.match = partial.match, linprog = linprog.dedupe, ...
1. fastLink(dfA = rl_data, dfB = rl_data, varnames = c("fname_c1", "lname_c1", "by", "bm", "bd"), stringdist.match = c("fname_c1", "lname_c1"), numeric.match = c("by", "bm", "bd"))

I'm assuming that this is the error that you have run into?

Also, I'll go ahead and use the functions directly from the package like you've suggested. Thank you for the example above.

Cheers, Dan

tedenamorado commented 6 years ago

Exactly, Dan! We are working on fixing that issue. However, the code I posted does what we describe in the paper. When we wrote the paper, we did not have a function to compare distances for numeric variables; now we have one.

We are constantly trying to incorporate new functions that help with record linkage projects, so if you have any suggestions, do not hesitate to let us know.

Cheers!

Ted

tedenamorado commented 6 years ago

Hi Dan,

We have pushed a fix that solves the issue. Please install fastLink again from GitHub and try the lines you wrote above.

If anything, please let us know.

Ted

katharinax commented 6 years ago

Hi,

I am a new user of fastLink. @tedenamorado, thank you for developing this very useful package, and @dandermotj, thank you for starting this active thread.

As listed below, I am still experiencing the two issues you mentioned. I am using fastLink version 0.3.1, published on 2018-02-01, on R version 3.4.4. I wasn't able to find a package version newer than this. Is it still under development?

  1. Passing in any numeric.match arguments will result in Error in calcPWDcpp(matchesA[, varnames[i]], matchesB[, varnames[i]]) : Not a matrix.

  2. Unable to return all matches without de-duplication. More specifically, when I specify return.all = FALSE, an error message pops up during the step for Calculating the posterior for each pair of matched observations, saying that Error in fix.by(by.x, x) : 'by' must match number of columns. However, when return.all is set to TRUE, I believe dedupe.matches gets overridden to TRUE as well.

Appreciate it.

Best, Katie

aalexandersson commented 6 years ago

@katharinax Katie, I assure you as an active user that fastLink is very much under active development by the developers. Use the latest development version on GitHub if you need something newer than the stable version on CRAN. If you still have a problem, please make sure to provide a reproducible example.

Anders

katharinax commented 6 years ago

@aalexandersson I see; will try the development version. And I'll provide reproducible examples if I have further questions. Thanks!

tedenamorado commented 6 years ago

@katharinax thanks for using fastLink. As noted above by @aalexandersson, we have fixed the issue on GitHub - we are planning to push a new version to CRAN soon with that fix included.

If you are using a PC or Linux machine, then installing from GitHub via devtools should be straightforward. If you are using a Mac, then installing from GitHub requires an additional step (happy to help if that is the case).

Please keep us posted!
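
For reference, a minimal sketch of the devtools route, assuming the repository path `kosukeimai/fastLink` shown at the top of this page. The wrapper name `install_fastlink_dev` is hypothetical and only there so that nothing is installed until you call it.

```r
# Hypothetical convenience wrapper (not part of fastLink or devtools).
install_fastlink_dev <- function() {
  # Install devtools from CRAN first if it is not available yet.
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  # Build and install the development version from GitHub.
  devtools::install_github("kosukeimai/fastLink")
}

# Call install_fastlink_dev() once, then restart R and run library(fastLink).
```

Because the package includes compiled code, the install step needs a working compiler toolchain on the machine.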

All the best,

Ted

katharinax commented 6 years ago

Hi @tedenamorado,

I am using a Mac and I'm actually not that familiar with R.

My current plan is to git clone your repo and then link the source code to my own project by running source("localGitRepoRootFolder/R/fastLink.R"). Not sure if this sounds stupid, ha ;) I'd love to hear your suggestions!

Thanks, Katie

tedenamorado commented 6 years ago

Hi Katie,

The problem is that you need OpenMP to work on your Mac. The following is a fantastic explanation of how to make that happen:

http://thecoatlessprofessor.com/programming/openmp-in-r-on-os-x/

The other thing that you might need to do is update the command line tools:

xcode-select --install

Hope this helps! If anything, let us know.

All the best,

Ted

katharinax commented 6 years ago

@tedenamorado This is very enlightening. Will look into this now. I am also downloading the devtools package. Thanks for all the pointers! Much appreciated!

Katie