kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage
272 stars 48 forks source link

window blocking performance #70

Open bengoehring opened 1 year ago

bengoehring commented 1 year ago

Hello,

Thank you for making and maintaining such a helpful package.

I am reaching out with a conceptual question about the window blocking option in blockData --- and a possible performance improvement suggestion. This all stems from trying and failing to window block a dataset with a few million rows and a dataset with about 10 million rows using a cluster with 10 cores and 180GB of RAM. It timed out after 24 hours.

Based on the documentation of window blocking (i.e., "a given observation in dataset A will be compared to all observations in dataset B where the value of the blocking variable is within ±K of the value of the same variable in dataset A"), I would expect the window blocking option to return a list of N lists --- where N refers to the number of unique values of the window blocking variable in dataset A. Each of the N lists will then contain two vectors of indices. The first vector will include the indices of dataset A where the window blocking variable equals the nth unique value of the window blocking variable in dataset A. The second vector will include the indices of dataset B where the window blocking variable is +/- K the nth unique value of the window blocking variable in dataset A.

I hope that makes sense.

It appears, however, that the window blocking option is doing something different. For instance, If I run:

library(tidyverse)
library(microbenchmark)
library(fastLink)

# make test datasets for window blocking
test_a <- diamonds
test_b <- arrange(diamonds, cut)

window_fast <- blockData(test_a, 
                         test_b,
                         varnames = "depth",
                         window.block = 'depth')

length(unique(test_a$depth))
#84

length(window_fast)
#3390

The number of separate blocks (3390) is much higher than I would expect (184). Would you be able to expand upon where I am misunderstanding?

I am guessing I am just misunderstanding something, but if the logic above is (miraculously) correct, I went ahead and implemented it in a separate function. It appears that it outperforms the default window blocking option in terms of speed (~50 times faster in this example). Please just let me know If I am onto something and you would like me to submit a pull request. My apologies if I am totally off base with this!!

my_window_blocking <- function(data_a,
                               data_b,
                               window_blocking_var,
                               window_size = 1) {

  # return the vector containing the window blocking variable in each dataset 
  data_a_window_values <- pull(select(data_a,
                                      {{window_blocking_var}}))
  data_b_window_values <- pull(select(data_b,
                                      {{window_blocking_var}}))

  # unique values of the first datasets window blocking variable for looping
  data_a_unique_vals <- sort(unique(data_a_window_values))

  # find a and b indices 
  out_a_indices <- vector('list',
                          length = length(data_a_unique_vals))
  out_b_indices <- vector('list',
                          length = length(data_a_unique_vals))

  for(i in 1:length(data_a_unique_vals)) {
    # a indices
    out_a_indices[[i]] <- which(data_a_window_values == data_a_unique_vals[i])

    # b indices
    min_window_value <- data_a_unique_vals[i] - window_size
    max_window_value <- data_a_unique_vals[i] + window_size

    out_b_indices[[i]] <- which(data_b_window_values %in% seq(min_window_value, 
                                                              max_window_value))
  }

  # return final data
  final_list <- vector('list',
                       length = length(data_a_unique_vals))

  for (k in 1:length(data_a_unique_vals)) {
    final_list[[k]][[1]] <- out_a_indices[[k]]
    final_list[[k]][[2]] <- out_b_indices[[k]]

    names(final_list[[k]]) <- c('dfA.inds', 
                                'dfB.inds')
  }
  return(final_list)
}

microbenchmark(my_window_blocking(test_a, 
                                  test_b,
                                  depth),
               blockData(test_a, 
                         test_b,
                         varnames = "depth",
                         window.block = 'depth'),
               times = 10)

Thank you for your time and all of your hard work maintaining this great package. It is much appreciated.

Best, Ben

tedenamorado commented 1 year ago

Thanks so much for sharing this with us! We will try your function but feel free to make a pull request.

In blockData we are using a function that goes after not the unique values but the unique pairs of values, that is why we end up with more windows, but blocking operations are linear, not quadratic, so your approach is better.

We are revising many of our functions and I will make sure this issue is addressed in our next release.

As always, if anything, do not hesitate to let us know.

Ted