Closed 1danjordan closed 6 years ago
Hi Dan,
Thanks for using fastLink! Believe me, it is through the feedback of the users that we have been able to improve the package in very meaningful ways.
There is a typo in the paper; it should read: first and last name. In the RLdata500 dataset, there are two sources for last names. However, one of them (lname_c2) has missing values almost everywhere: there are only 8 observed values out of 500. We do not use that variable in our replication code. In the simulation settings, we have shown that a large amount of missing information can lead to problems: basically, the parameters of the model can be way off the truth.
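To see how sparse a field is before deciding whether to include it, the number of observed values can be counted per column. The sketch below is only illustrative: `observed_counts` is a hypothetical helper, not part of fastLink or RecordLinkage, and the toy data frame stands in for RLdata500.

```r
# Hypothetical helper (not part of fastLink): count non-missing values per column.
observed_counts <- function(df) {
  vapply(df, function(col) sum(!is.na(col)), integer(1))
}

# Toy stand-in for a linkage file with a mostly empty second last-name field.
toy <- data.frame(
  lname_c1 = c("MUELLER", "SCHMIDT", "MEYER", "WEBER"),
  lname_c2 = c(NA, NA, "BAUER", NA),
  stringsAsFactors = FALSE
)

obs <- observed_counts(toy)
print(obs)

# With the real data one could run, assuming RecordLinkage is installed:
# data("RLdata500", package = "RecordLinkage")
# observed_counts(RLdata500)
```

A column with very few observed values, like lname_c2 here, is a candidate for exclusion from the matching variables.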
However, even if you were to exclude that variable from your code, I just found out that the wrapper has a bug when performing a deduplication exercise. We will fix that soon and let you know when the wrapper has been fixed.
Wrapper aside, the following lines of code should reproduce the exercise we did in the paper. The code below follows the step-by-step procedure we describe here.
library('RecordLinkage')
data(RLdata500)  ## loads RLdata500 and identity.RLdata500
RLdata500$id <- identity.RLdata500
library('fastLink')
## Create Agreement Vectors
g1 <- gammaCKpar(RLdata500$fname_c1, RLdata500$fname_c1, cut.a = 0.94, cut.p = 0.88)
g2 <- gammaCKpar(RLdata500$lname_c1, RLdata500$lname_c1, cut.a = 0.94, cut.p = 0.88)
g3 <- gammaKpar(RLdata500$by, RLdata500$by)
g4 <- gammaKpar(RLdata500$bm, RLdata500$bm)
g5 <- gammaKpar(RLdata500$bd, RLdata500$bd)
nr <- nrow(RLdata500)
## Count Patterns + EM
counts <- tableCounts(list(g1, g2, g3, g4, g5), nobs.a = nr, nobs.b = nr)
resEM <- emlinkMARmov(counts, nobs.a = nr, nobs.b = nr)
## Matches
matches <- matchesLink(list(g1, g2, g3, g4, g5), nobs.a = nr, nobs.b = nr, em = resEM, thresh = 0.85)
## Duplicates: there should be 600 matched pairs,
## 500 perfect (self) matches + 100 duplicate pairs,
## while there are only 50 duplicates in the data:
## finding that row 1 in A is a duplicate of row 2 in B
## is equivalent to finding that row 2 in A is a duplicate of row 1 in B
matches.1 <- RLdata500[matches$inds.a, ]
matches.2 <- RLdata500[matches$inds.b, ]
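Since each duplicate pair shows up twice and every record also matches itself, one way to get down to the distinct duplicate pairs is to keep only the pairs where the index in A is smaller than the index in B. This is a minimal sketch, not code from the paper: `distinct_pairs` is a hypothetical helper, and the toy index vectors stand in for `matches$inds.a` and `matches$inds.b`.

```r
# Hypothetical helper: drop self-matches (a == b) and mirrored pairs (keep a < b only).
distinct_pairs <- function(inds_a, inds_b) {
  keep <- inds_a < inds_b
  data.frame(a = inds_a[keep], b = inds_b[keep])
}

# Toy example: 3 records, where records 1 and 3 are duplicates of each other.
inds_a <- c(1, 2, 3, 1, 3)
inds_b <- c(1, 2, 3, 3, 1)

pairs <- distinct_pairs(inds_a, inds_b)
print(pairs)  # one row: a = 1, b = 3

# With the objects created above one could call:
# distinct_pairs(matches$inds.a, matches$inds.b)
```

Applied to the 600 matched pairs described above, this kind of filter would be expected to leave the 50 distinct duplicate pairs, assuming the match set is symmetric.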
I hope the code above helps! If you have further questions, just let us know.
Ted
Hi Ted,
Thanks a million for your quick response! After a good bit of fiddling and reading, I realised that I wasn't using the fastLink wrapper correctly because I wasn't passing the birth date variables into the numeric.match argument. Doing this resulted in an error; here's the traceback:
library(dplyr)     # %>%, as_tibble, mutate_if, mutate, row_number
library(fastLink)
data("RLdata500", package = "RecordLinkage")
# prep data
rl_data <- RLdata500 %>%
  as_tibble %>%
  mutate_if(is.factor, as.character) %>%
  mutate(n = row_number())
matching_vars <- c("fname_c1", "lname_c1", "by", "bm", "bd")
rl_matches <- fastLink(
  dfA = rl_data,
  dfB = rl_data,
  varnames = c("fname_c1", "lname_c1", "by", "bm", "bd"),
  stringdist.match = c("fname_c1", "lname_c1"),
  numeric.match = c("by", "bm", "bd")
)
Error in calcPWDcpp(matchesA[, varnames[i]], matchesB[, varnames[i]]) :
Not a matrix.
4. stop(structure(list(message = "Not a matrix.", call = calcPWDcpp(matchesA[, varnames[i]], matchesB[, varnames[i]]), cppstack = NULL), .Names = c("message", "call", "cppstack"), class = c("Rcpp::not_a_matrix", "C++Error", "error", "condition")))
3. calcPWDcpp(matchesA[, varnames[i]], matchesB[, varnames[i]])
2. dedupeMatches(matchesA = dfA[matches$inds.a, ], matchesB = dfB[matches$inds.b, ], EM = resultsEM, matchesLink = matches, varnames = varnames, stringdist.match = stringdist.match, numeric.match = numeric.match, partial.match = partial.match, linprog = linprog.dedupe, ...
1. fastLink(dfA = rl_data, dfB = rl_data, varnames = c("fname_c1", "lname_c1", "by", "bm", "bd"), stringdist.match = c("fname_c1", "lname_c1"), numeric.match = c("by", "bm", "bd"))
I'm assuming that this is the error that you have run into?
Also, I'll go ahead and use the functions directly from the package like you've suggested. Thank you for the example above.
Cheers, Dan
Exactly, Dan! We are working on fixing that issue. However, the code I posted does what we describe in the paper. When we wrote the paper, we did not have a function to compare distances for numeric variables; now we have one.
We are constantly trying to incorporate new functions that help with record linkage projects, so if you have any suggestions, do not hesitate to let us know.
Cheers!
Ted
Hi Dan,
We have pushed a fix that solves the issue. Please install fastLink again from GitHub and try the lines you wrote above.
If anything, please let us know.
Ted
Hi,
I am a new user of fastLink. @tedenamorado, thank you for developing this very useful package, and @dandermotj, thank you for starting this active thread.
As listed below, I am still experiencing the two issues you mentioned. The fastLink version I am using is 0.3.1, published on 2018-02-01, running on R version 3.4.4. I wasn't able to find a package version newer than this. Is it still under development?
Passing in any numeric.match arguments will result in Error in calcPWDcpp(matchesA[, varnames[i]], matchesB[, varnames[i]]) : Not a matrix.
Unable to return all matches without de-duplication. More specifically, when I specify return.all = FALSE, an error message pops up during the step for Calculating the posterior for each pair of matched observations, saying that Error in fix.by(by.x, x) : 'by' must match number of columns. However, when return.all is set to TRUE, I believe dedupe.matches gets overridden to TRUE as well.
Appreciate it.
Best, Katie
@katharinax Katie, I assure you as an active user that fastLink is very much under active development by the developers. Use the latest development version on GitHub if you need something newer than the stable version on CRAN. If you still have a problem, please make sure to provide a reproducible example.
Anders
@aalexandersson I see; will try the development version. And I'll provide reproducible examples if I have further questions. Thanks!
@katharinax thanks for using fastLink. As noted above by @aalexandersson, we have fixed the issue on GitHub - we are planning to push a new version to CRAN soon with that fix included.
If you are using a PC or Linux machine, then installing from GitHub via devtools should be straightforward. If you are using a Mac, then installing from GitHub requires an additional step (happy to help if that is the case).
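For reference, a GitHub install via devtools typically looks like the sketch below. The repository path `kosukeimai/fastLink` is an assumption here, since the thread does not spell it out; adjust it if the development version lives elsewhere.

```r
# Install devtools first if it is not already available.
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Assumption: the development version is hosted at kosukeimai/fastLink.
devtools::install_github("kosukeimai/fastLink")

# Confirm which version ended up installed.
packageVersion("fastLink")
```

Restarting R after the install helps make sure the newly installed version is the one that gets loaded.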
Please, keep us posted!
All the best,
Ted
Hi @tedenamorado,
I am using a Mac and I'm actually not that familiar with R.
My current plan is to git clone your repo, and then link the source code to my own project by doing source("localGitRepoRootFolder/R/fastLink.R")
Not sure if this sounds stupid, ha ;) Love to hear your suggestion!
Thanks, Katie
Hi Katie,
The problem is that you need OpenMP to work on your Mac. The following is a fantastic explanation of how to make that happen:
http://thecoatlessprofessor.com/programming/openmp-in-r-on-os-x/
The other thing that you might need to do is to update the command line tools:
xcode-select --install
Hope this helps! If anything, let us know.
All the best,
Ted
@tedenamorado This is very enlightening. Will look into this now. I am also downloading the devtools package. Thanks for all the pointers! Much appreciated!
Katie
Hi,
First of all, thank you for your work on fastLink. I've been trying to recreate your results from your paper:
I've included a reproducible example below:
As you can see, I've made no changes to the default settings of fastLink. The rl_dups dataframe contains only exact matches and discovers no duplicates. Do you have any suggestions for recreating your results? I'm assuming I just need to tweak some parameters?
Thanks, Dan