kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Window blocking errors when the variable in `window.block` is integer #80

Open etiennebacher opened 5 months ago

etiennebacher commented 5 months ago

Hello, thanks for this package, I might use it for a project (still exploring and comparing with others for now).

Bug

I noticed that window blocking fails when the variables in window.block are of class integer. Interestingly, it fails only when several variables are specified in varnames (in other words, it doesn't fail when length(varnames) == length(window.block) == 1). Below is a small reproducible example:

library(fastLink)
data(samplematch)

# just take 20 obs to reduce computing time
dfA <- dfA[1:20,]
dfB <- dfB[1:20,]
class(dfA$birthyear)
#> [1] "numeric"

### Numeric columns: works fine
blockdata_out <- blockData(dfA, dfB, varnames = "birthyear", window.block = "birthyear", window.size = 1)
#> 
#> ==================== 
#> blockData(): Blocking Methods for Record Linkage
#> ==================== 
#> 
#> Blocking variables.
#>     Blocking variable birthyear using window blocking.
#> 
#> Combining blocked variables for final blocking assignments.

### Integer column when only one blocking variable: works fine
dfA$birthyear <- as.integer(dfA$birthyear)
dfB$birthyear <- as.integer(dfB$birthyear)

class(dfA$birthyear)
#> [1] "integer"
blockdata_out <- blockData(dfA, dfB, varnames =  "birthyear", window.block = "birthyear", window.size = 1)
#> 
#> ==================== 
#> blockData(): Blocking Methods for Record Linkage
#> ==================== 
#> 
#> Blocking variables.
#>     Blocking variable birthyear using window blocking.
#> 
#> Combining blocked variables for final blocking assignments.

### Integer column for window.block when several blocking variables
blockdata_out <- blockData(dfA, dfB, varnames = c("firstname", "birthyear"), window.block = "birthyear", window.size = 1)
#> 
#> ==================== 
#> blockData(): Blocking Methods for Record Linkage
#> ====================
#> Error in blockData(dfA, dfB, varnames = c("firstname", "birthyear"), window.block = "birthyear", : You have specified that a variable be blocked using window blocking, but that variable is not of class 'numeric'. Please check your variable classes.

Cause

The problem comes from these lines:

https://github.com/kosukeimai/fastLink/blob/da2e889448f70b4140a7cdebbe80ab963e867008/R/blockData.R#L129-L134

There are actually two issues here:

  1. lapply(dfA[,varnames], class) doesn't return the expected output when length(varnames) == 1. This is why the code above works when length(varnames) == length(window.block) == 1. For example:
mtcars <- mtcars[1:2, ]

# returns the class of *each value* in "mpg"
lapply(mtcars[, "mpg"], class)
#> [[1]]
#> [1] "numeric"
#> 
#> [[2]]
#> [1] "numeric"

# returns the class of the "mpg" column
lapply(mtcars[, "mpg", drop = FALSE], class)
#> $mpg
#> [1] "numeric"
  2. Instead of checking that class(...) == "numeric", you could use is.numeric():
float <- c(1, 2)
ints <- 1:2

# integers are not detected as numeric
class(float) == "numeric"
#> [1] TRUE
class(ints) == "numeric"
#> [1] FALSE

# integers are detected as numeric
is.numeric(float)
#> [1] TRUE
is.numeric(ints)
#> [1] TRUE
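To illustrate the first point on a data frame with mixed column types, here is a minimal sketch. The toy data and the helper name `col_classes` are made up for this example and are not part of fastLink. It relies on list-style indexing (`df[cols]`), which always returns a data frame, so the result has one entry per column whether one or several columns are selected:

```r
# Toy data, made up for illustration
toy <- data.frame(
  firstname = c("ann", "bob"),
  birthyear = c(1990L, 1985L),  # stored as integer
  stringsAsFactors = FALSE
)

# Hypothetical helper: one class label per selected column.
# df[cols] never drops to a vector, unlike df[, cols] with a single column.
col_classes <- function(df, cols) {
  vapply(df[cols], function(x) class(x)[1L], FUN.VALUE = character(1L))
}

col_classes(toy, "birthyear")
col_classes(toy, c("firstname", "birthyear"))
```

Using `df[, cols, drop = FALSE]` instead of `df[cols]` should behave the same way here.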


Proposed fix

There are many ways to fix this; I'm suggesting just one here:

```r
window_block_A_is_num <- vapply(dfA[, window.block, drop = FALSE], is.numeric, FUN.VALUE = logical(1L))
window_block_B_is_num <- vapply(dfB[, window.block, drop = FALSE], is.numeric, FUN.VALUE = logical(1L))
if(!all(window_block_A_is_num) || !all(window_block_B_is_num)){
  stop("You have specified that a variable be blocked using window blocking, but that variable is not of class 'numeric'. Please check your variable classes.")
}
```
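To sanity-check the idea outside of blockData(), the same test can be wrapped in a small function and run on toy data. This is only a sketch: `window_block_is_numeric` and the `toyA` data frame are hypothetical names introduced for this example, not existing fastLink objects.

```r
# Hypothetical helper (not part of fastLink): TRUE if every column named in
# `window.block` is numeric (integer or double) in `df`.
window_block_is_numeric <- function(df, window.block) {
  all(vapply(df[, window.block, drop = FALSE], is.numeric, FUN.VALUE = logical(1L)))
}

# Toy data, made up for illustration
toyA <- data.frame(
  firstname = c("ann", "bob"),
  birthyear = c(1990L, 1985L),  # integer column
  age       = c(33.5, 38.2),    # double column
  stringsAsFactors = FALSE
)

window_block_is_numeric(toyA, "birthyear")            # one integer column
#> [1] TRUE
window_block_is_numeric(toyA, c("birthyear", "age"))  # integer and double columns
#> [1] TRUE
window_block_is_numeric(toyA, "firstname")            # character column
#> [1] FALSE
```

With `drop = FALSE`, the check gives the same answer whether window.block names one column or several.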

Thanks again for the package