Closed jamesmartherus closed 1 year ago
I have a similar question. I figured out a workaround: I use the number of indices in each block to work out which block corresponds to which value of the blocking variable. From there, I find the matches within each block and then bind them together. This really only works when the blocking variable(s) have a small number of unique values. It would be great to have a more systematic option.
Thank you for making and maintaining such a great package.
(This is the same example as above with my approach pasted at the bottom. Note that I dropped gender from varnames to make the merge work.)
library(fastLink)
library(foreach)
library(tidyverse)
data <- data.frame(gender = c(1, 2, 1, 1, 1, 1, 2, 2, 1, 2),
                   age = c(18, 25, 18, 35, 45, 55, 65, 76, 87, 98))
blocks <- blockData(data, data, varnames = c("gender"))
# Leave two cores free, but never request fewer than one worker
tmp_clus <- parallel::makeCluster(spec = max(1, parallel::detectCores() - 2),
                                  type = 'PSOCK')
doParallel::registerDoParallel(tmp_clus)
em_list <- foreach::foreach(i = seq_along(blocks), .verbose = FALSE) %dopar%
  {
    library(fastLink)
    # Subset to the rows of block i (dfA and dfB are the same data here)
    data_block <- data[blocks[[i]]$dfA.inds, ]
    fastLink(
      dfA = data_block, dfB = data_block,
      varnames = c("age")
    )
  }
parallel::stopCluster(tmp_clus)
# Count rows per gender to compare against the block sizes below
data %>%
  group_by(gender) %>%
  summarise(n = n())
pluck(em_list, 1, 'nobs.a')
pluck(em_list, 2, 'nobs.a')
# the first value in em_list corresponds to the gender 1 block
# the second value in em_list corresponds to the gender 2 block
matches_1 <- getMatches(filter(data, gender == 1),
                        filter(data, gender == 1),
                        em_list[[1]]) %>%
  mutate(dedupe.ids = str_c("gender_1_",
                            dedupe.ids))
matches_2 <- getMatches(filter(data, gender == 2),
                        filter(data, gender == 2),
                        em_list[[2]]) %>%
  mutate(dedupe.ids = str_c("gender_2_",
                            dedupe.ids))
all_matches <- rbind(matches_1,
                     matches_2)
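A possibly more systematic way to label the blocks than comparing counts is to read the blocking variable directly at each block's row indices. The sketch below is base R only and makes one loud assumption: the `blocks` list is a hand-built stand-in for the output of blockData(), keeping only the `dfA.inds` and `dfB.inds` elements used in the code above. It has not been run against a real blockData() result.

```r
data <- data.frame(gender = c(1, 2, 1, 1, 1, 1, 2, 2, 1, 2),
                   age = c(18, 25, 18, 35, 45, 55, 65, 76, 87, 98))

# ASSUMPTION: hand-built stand-in for blockData(data, data, varnames = "gender").
# Only the dfA.inds / dfB.inds elements used in the thread are mimicked.
blocks <- list(
  list(dfA.inds = which(data$gender == 1), dfB.inds = which(data$gender == 1)),
  list(dfA.inds = which(data$gender == 2), dfB.inds = which(data$gender == 2))
)

# Look up the blocking value behind each block from its own row indices.
block_labels <- vapply(blocks, function(b) {
  vals <- unique(data$gender[b$dfA.inds])
  stopifnot(length(vals) == 1)  # exact blocking: one value per block
  as.character(vals)
}, character(1))

# The same indices give the per-block subsets, so no filter() on a
# guessed value is needed.
block_data <- lapply(blocks, function(b) data[b$dfA.inds, ])
names(block_data) <- paste0("gender_", block_labels)
```

With this, each call to getMatches() could be given `block_data[[i]]` together with `em_list[[i]]`, and the prefix for dedupe.ids could come from `names(block_data)[i]`, so the number of unique blocking values no longer matters.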
Disclaimer: I am a regular fastLink user, not a fastLink developer.
@jamesmartherus @bengoehring Are you both "merely" asking how to extract matches when using blocking? I know how to do that. But I am not sure how to relate to your very specific code, which seems complicated and convoluted to me.
Yes.
I wrote "merely" in quotation marks because this is a known issue that the fastLink developers are working on.
Ted helped me with a similar question a few years ago, and thanks to him I regularly use code similar to the example below. It should be much simpler to do this in fastLink, and maybe I messed something up. But, as asked for, the sample code extracts matches when using blocking (3 blocks in my example). I also added comments, and code for the confusion table, because that is a related known issue when using blocking.
library(fastLink)
data(samplematch)
df1 <- dfA
df2 <- dfB
# blocking
block_out <- blockData(df1, df2,
                       varnames = c("firstname"),
                       kmeans.block = "firstname", nclusters = 3) # 3 blocks
# linkage
linkvars <- c("firstname", "lastname", "housenum", "streetname", "birthyear")
gammas <- c("gamma.1", "gamma.2", "gamma.3", "gamma.4", "gamma.5")
# Loop over blocks and merge
match_out <- vector(mode = "list", length = length(block_out))
flobj_out <- vector(mode = "list", length = length(block_out))
for (i in seq_along(block_out)) {
  print(paste("Block number is", i))
  # Subset data
  sub1 <- df1[block_out[[i]]$dfA.inds, ]
  sub2 <- df2[block_out[[i]]$dfB.inds, ]
  # Run fastLink (capture.output() hides the console messages)
  hide <- capture.output(fl_out <- fastLink(
    dfA = sub1, dfB = sub2,
    varnames = linkvars,
    return.all = TRUE))
  # Get matches, store
  match_out[[i]] <- getMatches(
    dfA = sub1, dfB = sub2, fl.out = fl_out,
    threshold.match = 0.95, combine.dfs = FALSE) # NB 0.95
  flobj_out[[i]] <- fl_out
}
saveRDS(flobj_out, file="flobj_out.rds") # save object as file
# Extract matches
match_df1 <- do.call("rbind", lapply(match_out, "[[", "dfA.match"))
match_df2 <- do.call("rbind", lapply(match_out, "[[", "dfB.match"))
# confusion table
out <- readRDS("flobj_out.rds")
confusion(out, threshold = 0.95)
The confusion table of the example should look like this:
$confusion.table
                     'True' Matches 'True' Non-Matches
Declared Matches               50.0                0.0
Declared Non-Matches            0.3              299.7

$addition.info
                                results
Max Number of Obs to be Matched  350.00
Sensitivity (%)                   99.40
Specificity (%)                  100.00
Positive Predicted Value (%)     100.00
Negative Predicted Value (%)      99.90
False Positive Rate (%)            0.00
False Negative Rate (%)            0.60
Correctly Classified (%)          99.91
F1 Score (%)                      99.70
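The stacking step in the code above (do.call("rbind", lapply(match_out, "[[", "dfA.match"))) drops the information about which block each matched row came from. A minimal base-R sketch of one way to keep it is below. The `match_out` list here is made-up toy data that only imitates the shape used in the loop (a list per block with `dfA.match` and `dfB.match` data frames), and `tag_block` is a hypothetical helper, not a fastLink function.

```r
# ASSUMPTION: toy stand-in for the per-block getMatches(..., combine.dfs = FALSE)
# results stored in match_out; the names and values are invented.
match_out <- list(
  list(dfA.match = data.frame(firstname = c("ann", "bob")),
       dfB.match = data.frame(firstname = c("ann", "bob"))),
  list(dfA.match = data.frame(firstname = "cy"),
       dfB.match = data.frame(firstname = "cy"))
)

# Hypothetical helper: add a block id column to both match tables.
# rep() keeps this safe for blocks that produced zero matches.
tag_block <- function(m, i) {
  m$dfA.match$block <- rep(i, nrow(m$dfA.match))
  m$dfB.match$block <- rep(i, nrow(m$dfB.match))
  m
}

tagged <- Map(tag_block, match_out, seq_along(match_out))

# Same stacking as in the thread, now with the block id carried along.
match_df1 <- do.call("rbind", lapply(tagged, "[[", "dfA.match"))
match_df2 <- do.call("rbind", lapply(tagged, "[[", "dfB.match"))
```

If the rows of dfA.match and dfB.match are aligned pairwise within each block, as the code above appears to rely on, the stacked tables stay aligned as well, and the `block` column makes it possible to trace any pair back to its fastLink object in flobj_out.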
Hi there, I am trying to identify duplicates in a large dataset. I am blocking on several variables, aggregating with aggregateEM() and then trying to extract the matches with getMatches(). It looks like getMatches() won't work with the fastLink.aggregate class. Is there some other way to get the same functionality?
Reprex: