Hello, thanks for this package. I might use it for a project (I'm still exploring and comparing it with others for now).
### Bug
I noticed that window blocking fails when the variables in `window.block` are integer. Interestingly, this only fails when several variables are specified in `varnames` (in other words, it doesn't fail when `length(varnames) == length(window.block) == 1`). Below is a small reproducible example:
library(fastLink)
data(samplematch)
# just take 20 obs to reduce computing time
dfA <- dfA[1:20,]
dfB <- dfB[1:20,]
class(dfA$birthyear)
#> [1] "numeric"
### Numeric columns: works fine
blockdata_out <- blockData(dfA, dfB, varnames = "birthyear", window.block = "birthyear", window.size = 1)
#>
#> ====================
#> blockData(): Blocking Methods for Record Linkage
#> ====================
#>
#> Blocking variables.
#> Blocking variable birthyear using window blocking.
#>
#> Combining blocked variables for final blocking assignments.
### Integer column when only one blocking variable: works fine
dfA$birthyear <- as.integer(dfA$birthyear)
dfB$birthyear <- as.integer(dfB$birthyear)
class(dfA$birthyear)
#> [1] "integer"
blockdata_out <- blockData(dfA, dfB, varnames = "birthyear", window.block = "birthyear", window.size = 1)
#>
#> ====================
#> blockData(): Blocking Methods for Record Linkage
#> ====================
#>
#> Blocking variables.
#> Blocking variable birthyear using window blocking.
#>
#> Combining blocked variables for final blocking assignments.
### Integer column for window.block when several blocking variables
blockdata_out <- blockData(dfA, dfB, varnames = c("firstname", "birthyear"), window.block = "birthyear", window.size = 1)
#>
#> ====================
#> blockData(): Blocking Methods for Record Linkage
#> ====================
#> Error in blockData(dfA, dfB, varnames = c("firstname", "birthyear"), window.block = "birthyear", : You have specified that a variable be blocked using window blocking, but that variable is not of class 'numeric'. Please check your variable classes.
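Until the check is changed, one workaround that seems to follow from the behaviour above is to convert integer window-blocking columns to double before calling `blockData()`. This is only a sketch: `to_double_cols()` is a hypothetical helper written for this issue, not part of fastLink, and the toy data frame stands in for `dfA` and `dfB`.

```r
# Hypothetical helper (not part of fastLink): convert the named columns
# from integer to double so that class() reports "numeric".
to_double_cols <- function(df, cols) {
  for (col in cols) {
    if (is.integer(df[[col]])) {
      df[[col]] <- as.numeric(df[[col]])
    }
  }
  df
}

# Toy data standing in for dfA
toy <- data.frame(
  firstname = c("ann", "bob"),
  birthyear = c(1980L, 1975L),
  stringsAsFactors = FALSE
)

class(toy$birthyear)   # "integer"
toy <- to_double_cols(toy, "birthyear")
class(toy$birthyear)   # "numeric"
```

The same call would be applied to both data frames before passing them to `blockData()`.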
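`lapply(dfA[, varnames], class)` doesn't return the expected output when `length(varnames) == 1`: with a single column name, `dfA[, varnames]` drops to a plain vector, so `class` is applied to each value rather than to the column. This is why the code above works when `length(varnames) == length(window.block) == 1`. For example: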
mtcars <- mtcars[1:2, ]
# returns the class of *each value* in "mpg"
lapply(mtcars[, "mpg"], class)
#> [[1]]
#> [1] "numeric"
#>
#> [[2]]
#> [1] "numeric"
# returns the class of the "mpg" column
lapply(mtcars[, "mpg", drop = FALSE], class)
#> $mpg
#> [1] "numeric"
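To make the difference concrete, the sketch below compares the two subsetting forms on a toy data frame (the data frame and its column names are made up for illustration). With `drop = FALSE` the result is always a data frame, so `vapply()` returns one named class per column whether one or several columns are selected.

```r
toy <- data.frame(year = c(1980L, 1975L), age = c(40.5, 45.2))

# One column, default drop = TRUE: a bare vector, so the column name is lost
is.data.frame(toy[, "year"])          # FALSE
names(lapply(toy[, "year"], class))   # NULL

# One column, drop = FALSE: still a data frame, so the name is kept
one <- vapply(toy[, "year", drop = FALSE],
              function(x) class(x)[1], character(1))
one                                   # c(year = "integer")

# Several columns: same shape of result
two <- vapply(toy[, c("year", "age"), drop = FALSE],
              function(x) class(x)[1], character(1))
two                                   # c(year = "integer", age = "numeric")
```

Because the result is named in both cases, a lookup by column name behaves the same way regardless of how many variables are passed.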
Instead of checking that `class(...) == "numeric"`, you could use `is.numeric()`, which is also `TRUE` for integer vectors:
float <- c(1, 2)
ints <- 1:2
# integers are not detected as numeric
class(float) == "numeric"
#> [1] TRUE
class(ints) == "numeric"
#> [1] FALSE
# integers are detected as numeric
is.numeric(float)
#> [1] TRUE
is.numeric(ints)
#> [1] TRUE
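As a quick sanity check that `is.numeric()` would not make the test too permissive, the sketch below runs it on a few column types (the vectors are made up for illustration). It accepts doubles and integers but still rejects factors and dates, even though both are stored as numbers internally.

```r
x_double  <- c(1980, 1975)
x_integer <- c(1980L, 1975L)
x_factor  <- factor(c("1980", "1975"))
x_date    <- as.Date(c("1980-01-01", "1975-06-15"))

checks <- c(
  double  = is.numeric(x_double),
  integer = is.numeric(x_integer),
  factor  = is.numeric(x_factor),
  date    = is.numeric(x_date)
)
checks
# double and integer are TRUE; factor and date are FALSE
```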
### Proposed fix
There are many ways to fix this; I'm just suggesting one here:
```r
# Check each window-blocking column with is.numeric(); drop = FALSE keeps a data frame even for one column
window_block_A_is_num <- vapply(dfA[, window.block, drop = FALSE], is.numeric, FUN.VALUE = logical(1L))
window_block_B_is_num <- vapply(dfB[, window.block, drop = FALSE], is.numeric, FUN.VALUE = logical(1L))
if (!all(window_block_A_is_num) || !all(window_block_B_is_num)) {
  stop("You have specified that a variable be blocked using window blocking, but that variable is not of class 'numeric'. Please check your variable classes.")
}
```
### Cause
The problem comes from these lines:
https://github.com/kosukeimai/fastLink/blob/da2e889448f70b4140a7cdebbe80ab963e867008/R/blockData.R#L129-L134
There are actually two issues here: `lapply(dfA[, varnames], class)` does not return per-column classes when `length(varnames) == 1`, and the `class(...) == "numeric"` comparison rejects integer columns that `is.numeric()` would accept.
Thanks again for the package