window blocking performance

Hello,

Thank you for making and maintaining such a helpful package.

I am reaching out with a conceptual question about the window blocking option in blockData --- and a possible performance improvement suggestion. This all stems from trying and failing to window block a dataset with a few million rows and a dataset with about 10 million rows using a cluster with 10 cores and 180GB of RAM. It timed out after 24 hours.

Based on the documentation of window blocking (i.e., "a given observation in dataset A will be compared to all observations in dataset B where the value of the blocking variable is within ±K of the value of the same variable in dataset A"), I would expect the window blocking option to return a list of N lists --- where N refers to the number of unique values of the window blocking variable in dataset A. Each of the N lists will then contain two vectors of indices. The first vector will include the indices of dataset A where the window blocking variable equals the nth unique value of the window blocking variable in dataset A. The second vector will include the indices of dataset B where the window blocking variable is +/- K the nth unique value of the window blocking variable in dataset A.

I hope that makes sense.

It appears, however, that the window blocking option is doing something different. For instance, If I run:

library(tidyverse)
library(microbenchmark)
library(fastLink)

# make test datasets for window blocking
test_a <- diamonds
test_b <- arrange(diamonds, cut)

window_fast <- blockData(test_a, 
                         test_b,
                         varnames = "depth",
                         window.block = 'depth')

length(unique(test_a$depth))
#84

length(window_fast)
#3390

The number of separate blocks (3390) is much higher than I would expect (184). Would you be able to expand upon where I am misunderstanding?

I am guessing I am just misunderstanding something, but if the logic above is (miraculously) correct, I went ahead and implemented it in a separate function. It appears that it outperforms the default window blocking option in terms of speed (~50 times faster in this example). Please just let me know If I am onto something and you would like me to submit a pull request. My apologies if I am totally off base with this!!

my_window_blocking <- function(data_a,
                               data_b,
                               window_blocking_var,
                               window_size = 1) {

  # return the vector containing the window blocking variable in each dataset 
  data_a_window_values <- pull(select(data_a,
                                      {{window_blocking_var}}))
  data_b_window_values <- pull(select(data_b,
                                      {{window_blocking_var}}))

  # unique values of the first datasets window blocking variable for looping
  data_a_unique_vals <- sort(unique(data_a_window_values))

  # find a and b indices 
  out_a_indices <- vector('list',
                          length = length(data_a_unique_vals))
  out_b_indices <- vector('list',
                          length = length(data_a_unique_vals))

  for(i in 1:length(data_a_unique_vals)) {
    # a indices
    out_a_indices[[i]] <- which(data_a_window_values == data_a_unique_vals[i])

    # b indices
    min_window_value <- data_a_unique_vals[i] - window_size
    max_window_value <- data_a_unique_vals[i] + window_size

    out_b_indices[[i]] <- which(data_b_window_values %in% seq(min_window_value, 
                                                              max_window_value))
  }

  # return final data
  final_list <- vector('list',
                       length = length(data_a_unique_vals))

  for (k in 1:length(data_a_unique_vals)) {
    final_list[[k]][[1]] <- out_a_indices[[k]]
    final_list[[k]][[2]] <- out_b_indices[[k]]

    names(final_list[[k]]) <- c('dfA.inds', 
                                'dfB.inds')
  }
  return(final_list)
}

microbenchmark(my_window_blocking(test_a, 
                                  test_b,
                                  depth),
               blockData(test_a, 
                         test_b,
                         varnames = "depth",
                         window.block = 'depth'),
               times = 10)

Thank you for your time and all of your hard work maintaining this great package. It is much appreciated.

Best, Ben

kosukeimai / fastLink

window blocking performance #70