kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Extracting matches when using blocking #63

Closed jamesmartherus closed 1 year ago

jamesmartherus commented 1 year ago

Hi there, I am trying to identify duplicates in a large dataset. I am blocking on several variables, aggregating with aggregateEM() and then trying to extract the matches with getMatches(). It looks like getMatches() won't work with the fastLink.aggregate class. Is there some other way to get the same functionality?

Reprex:

library(fastLink)
library(foreach)

data <- data.frame(gender = c(1,2,1,1,1,1,2,2,1,2),
                   age = c(18, 25, 18, 35, 45, 55, 65, 76, 87, 98))

blocks <- blockData(data, data, varnames = c("gender"))

# Leave two cores free, but never request fewer than one worker
tmp_clus <- parallel::makeCluster(spec = max(1, parallel::detectCores() - 2, na.rm = TRUE),
                                  type = 'PSOCK')
doParallel::registerDoParallel(tmp_clus)

em_list <- foreach::foreach(i = seq_along(blocks), .verbose = FALSE) %dopar%
  {
    library(fastLink)
    data_block <- data[blocks[[i]]$dfA.inds,]

    fastLink(
      dfA = data_block, dfB = data_block, 
      varnames = c("gender", "age")
      )
  }
parallel::stopCluster(tmp_clus)

em_aggregated <- aggregateEM(em_list)

data_dedupe <- getMatches(dfA = data, dfB = data,
                          fl.out = em_aggregated)

# Error in getMatches(dfA = data, dfB = data, fl.out = em_aggregated) : 
#   dfA and dfB are identical, but fl.out is not of class 'fastLink.dedupe.' Please check your inputs.

bengoehring commented 1 year ago

I have a similar question. I figured out a workaround: I used the number of indices in each block to work out which block corresponds to which value of the blocking variable. From there, I found the matches within each block and then bound them together. This really only works when the blocking variable(s) have a small number of unique values. It would be great to have a more systematic option.

Thank you for making and maintaining such a great package.

(This is the same example as above with my approach pasted at the bottom -- note that I dropped gender from varnames to make the merge work.)


library(fastLink)
library(foreach)
library(tidyverse)

data <- data.frame(gender = c(1,2,1,1,1,1,2,2,1,2),
                   age = c(18, 25, 18, 35, 45, 55, 65, 76, 87, 98))

blocks <- blockData(data, data, varnames = c("gender"))

# Leave two cores free, but never request fewer than one worker
tmp_clus <- parallel::makeCluster(spec = max(1, parallel::detectCores() - 2, na.rm = TRUE),
                                  type = 'PSOCK')
doParallel::registerDoParallel(tmp_clus)

em_list <- foreach::foreach(i = seq_along(blocks), .verbose = FALSE) %dopar%
  {
    library(fastLink)
    data_block <- data[blocks[[i]]$dfA.inds,]

    fastLink(
      dfA = data_block, dfB = data_block, 
      varnames = c("age")
    )
  }
parallel::stopCluster(tmp_clus)

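# Compare the row count per gender with the number of observations in each
# block to work out which element of em_list belongs to which gender
data %>% 
  group_by(gender) %>% 
  summarise(n = n())
pluck(em_list, 1, 'nobs.a')
pluck(em_list, 2, 'nobs.a')
# the first element of em_list corresponds to the gender 1 block
# the second element of em_list corresponds to the gender 2 block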

matches_1 <- getMatches(filter(data, gender == 1), 
                        filter(data, gender == 1), 
                        em_list[[1]]) %>% 
  mutate(dedupe.ids = str_c("gender_1_", 
                            dedupe.ids))
matches_2 <- getMatches(filter(data, gender == 2), 
                        filter(data, gender == 2), 
                        em_list[[2]]) %>% 
  mutate(dedupe.ids = str_c("gender_2_", 
                            dedupe.ids))

all_matches <- rbind(matches_1,
                     matches_2)
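
A more systematic variant of this workaround keys each block by the value of the blocking variable instead of inferring it from block sizes. The following is only a minimal base-R sketch of that bookkeeping: `dedupe_block()` is a hypothetical stand-in for the per-block `fastLink()` plus `getMatches()` step (here it simply groups identical ages), and `orig_row` is an illustrative column name. Neither comes from fastLink.

```r
# Toy data from the example above
data <- data.frame(gender = c(1, 2, 1, 1, 1, 1, 2, 2, 1, 2),
                   age = c(18, 25, 18, 35, 45, 55, 65, 76, 87, 98))

# Row indices of the full data, split by the blocking variable.
# The list names ("1", "2") record which block is which.
block_rows <- split(seq_len(nrow(data)), data$gender)

# Hypothetical placeholder for the real per-block deduplication:
# rows with the same age receive the same block-local id.
dedupe_block <- function(df) {
  match(df$age, unique(df$age))
}

labelled <- lapply(names(block_rows), function(key) {
  rows <- block_rows[[key]]
  sub <- data[rows, , drop = FALSE]
  sub$orig_row <- rows  # remember the position in the full data
  # Prefix block-local ids with the block key so they are globally unique
  sub$dedupe.ids <- paste0("gender_", key, "_", dedupe_block(sub))
  sub
})

# Stack the blocks and restore the original row order
all_labelled <- do.call(rbind, labelled)
all_labelled <- all_labelled[order(all_labelled$orig_row), ]
print(all_labelled)
```

Because the block key travels with each subset, the same loop should work unchanged for blocking variables with many unique values, which is where matching blocks by size becomes impractical.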

aalexandersson commented 1 year ago

Disclaimer: I am a regular fastLink user, not a fastLink developer.

@jamesmartherus @bengoehring Are you both "merely" asking how to extract matches when using blocking? I know how to do that. But I am not sure how to relate to your very specific code, which seems complicated and convoluted to me.

bengoehring commented 1 year ago

Yes.

aalexandersson commented 1 year ago

I wrote "merely" in quotation marks because this is a known issue that the fastLink developers are working on.

Ted helped me with my similar question a few years ago, and thanks to him I regularly use code similar to the example below. It should be much simpler to do this in fastLink, and maybe I messed up something. But, as asked for, the sample code extracts matches when using blocking (3 blocks in my example). I also added comments, and code for the confusion table because that is a related known issue when using blocking.

library(fastLink)
data(samplematch)

df1 <- dfA
df2 <- dfB

# blocking
block_out <- blockData(df1, df2, 
    varnames = c("firstname"),
    kmeans.block = "firstname", nclusters = 3)  # 3 blocks

# linkage
linkvars <- c("firstname", "lastname", "housenum", "streetname", "birthyear")
gammas <- c("gamma.1", "gamma.2", "gamma.3", "gamma.4", "gamma.5")

# Loop over blocks and merge 
match_out <- vector(mode = "list", length = length(block_out))
flobj_out <- vector(mode = "list", length = length(block_out))

for (i in seq_along(block_out)) {
  print(paste("Block number is", i))

  # Subset data to the rows belonging to block i
  sub1 <- df1[block_out[[i]]$dfA.inds, ]
  sub2 <- df2[block_out[[i]]$dfB.inds, ]

  # Run fastLink (capture.output suppresses the console progress messages)
  hide <- capture.output(fl_out <- fastLink(
    dfA = sub1, dfB = sub2,
    varnames = linkvars,
    return.all = TRUE))

  # Get matches, store
  match_out[[i]] <- getMatches(
    dfA = sub1, dfB = sub2, fl.out = fl_out,
    threshold.match = 0.95, combine.dfs = FALSE)  # NB: 0.95, the same threshold passed to confusion() below
  flobj_out[[i]] <- fl_out
}
saveRDS(flobj_out, file="flobj_out.rds") # save object as file

# Extract matches
match_df1 <- do.call("rbind", lapply(match_out, "[[", "dfA.match"))
match_df2 <- do.call("rbind", lapply(match_out, "[[", "dfB.match"))

# confusion table
out <- readRDS("flobj_out.rds")  
confusion(out, threshold = 0.95)

The confusion table of the example should look like this:

$confusion.table
                     'True' Matches 'True' Non-Matches
Declared Matches               50.0                0.0
Declared Non-Matches            0.3              299.7

$addition.info
                                results
Max Number of Obs to be Matched  350.00
Sensitivity (%)                   99.40
Specificity (%)                  100.00
Positive Predicted Value (%)     100.00
Negative Predicted Value (%)      99.90
False Positive Rate (%)            0.00
False Negative Rate (%)            0.60
Correctly Classified (%)          99.91
F1 Score (%)                      99.70
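
The summary rows can be checked by hand from the four cells of the confusion table. The sketch below assumes the standard textbook definitions of each metric; it is not taken from the source of `confusion()`, and the variable names are illustrative.

```r
# Cells of the confusion table above
tp <- 50.0   # declared matches that are 'true' matches
fp <- 0.0    # declared matches that are 'true' non-matches
fn <- 0.3    # declared non-matches that are 'true' matches
tn <- 299.7  # declared non-matches that are 'true' non-matches

sens <- tp / (tp + fn)  # sensitivity
spec <- tn / (tn + fp)  # specificity
ppv  <- tp / (tp + fp)  # positive predicted value
npv  <- tn / (tn + fn)  # negative predicted value

metrics <- c(
  sensitivity = sens,
  specificity = spec,
  ppv = ppv,
  npv = npv,
  fpr = fp / (fp + tn),
  fnr = fn / (fn + tp),
  correctly_classified = (tp + tn) / (tp + fp + fn + tn),
  f1 = 2 * ppv * sens / (ppv + sens)
)

# Express as percentages rounded to two decimals, as in the table
print(round(100 * metrics, 2))
```

Under these definitions the rounded values agree with the table, for example sensitivity = 50 / 50.3, which rounds to 99.40%, and the four cells sum to the 350 observations reported as the maximum number to be matched.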