kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

pre-processing dplyr/data.table columns #2

Closed kuriwaki closed 7 years ago

kuriwaki commented 7 years ago

Datasets stored as dplyr or data.table objects, rather than plain data.frame objects, are not handled correctly by the gammaCKpar() call inside fastLink: the selected column is never reduced to a vector, so no matches are returned.

It might not be worth changing the code because data.frame is the standard. I thought I'd post because there was no warning and it took me a while to figure this out when using fastLink. The code below should be reproducible.

library(fastLink)

## data frame management packages some people use
library(dplyr)
library(data.table) 

## example data
data(samplematch) 
class(dfA) # data.frame object

## Suppose these were dplyr tibbles, not plain data frames
## e.g. dplyr (as_tibble() replaces the deprecated tbl_df())
dfA.dp <- as_tibble(dfA)
dfB.dp <- as_tibble(dfB)

## e.g. data.table
dfA.dt <- as.data.table(dfA)
dfB.dt <- as.data.table(dfB)

class(dfA.dp) 
head(dfA.dp[, "firstname"]) # a one-column tibble, not a vector; the data will be ignored in gammaCKpar
class(dfA.dt) 
head(dfA.dt[, "firstname"]) # same for data.table

## Run gammaCKpar for a given variable, using the column syntax from fastLink()
varname.i <- "firstname"

agr    <- gammaCKpar(dfA[, varname.i], dfB[, varname.i]) 
agr.dp <- gammaCKpar(dfA.dp[, varname.i], dfB.dp[, varname.i]) # no error message
agr.dt <- gammaCKpar(dfA.dt[, varname.i], dfB.dt[, varname.i]) # no error message

length(agr$matches1)    # well-populated
length(agr.dp$matches1) # empty
length(agr.dt$matches1) # empty
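
A possible workaround, sketched here in base R only: extracting the column with `[[` instead of `[, ]` returns a plain vector for data.frame, tibble, and data.table alike. The toy data frame and the `drop = FALSE` call are assumptions used to mimic the one-column table that a tibble or data.table returns; they are not taken from fastLink.

```r
## Toy data (hypothetical); stands in for dfA from samplematch
toy <- data.frame(firstname = c("ana", "ben", "cy"), stringsAsFactors = FALSE)

## Plain data.frame: single-column selection drops to a character vector
as_vec <- toy[, "firstname"]

## drop = FALSE mimics a tibble / data.table: the result is still a table
as_frame <- toy[, "firstname", drop = FALSE]

## `[[` always extracts the column itself, whatever the table class
as_vec2 <- toy[["firstname"]]

is.character(as_vec)        # TRUE
is.data.frame(as_frame)     # TRUE
identical(as_vec, as_vec2)  # TRUE
```

Under that assumption, calling gammaCKpar(dfA.dp[[varname.i]], dfB.dp[[varname.i]]) should pass the same kind of vector that the plain data.frame version passes.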

kosukeimai commented 7 years ago

Thanks. @tedenamorado and @bfifield, let's make sure the code can handle both types, and note this in the documentation too.
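
One way to handle both types might be to coerce the inputs to plain data frames before any column is selected. The sketch below is only an illustration: `to_plain_df` is a hypothetical helper, not a fastLink function, and the tibble is imitated by setting the class vector by hand so the example runs without dplyr.

```r
## Hypothetical helper (not part of fastLink): strip any tibble or
## data.table classes so that x[, "col"] drops to a vector again.
to_plain_df <- function(x) {
  stopifnot(inherits(x, "data.frame"))
  as.data.frame(x, stringsAsFactors = FALSE)
}

## Stand-in for a tibble: same class vector, built without loading dplyr
toy <- data.frame(firstname = c("ana", "ben"), stringsAsFactors = FALSE)
fake_tbl <- structure(toy, class = c("tbl_df", "tbl", "data.frame"))

plain <- to_plain_df(fake_tbl)
class(plain)                        # "data.frame"
is.character(plain[, "firstname"])  # TRUE: the column is a vector again
```

Coercing once at the top of the function would keep the existing dfA[, varname.i] syntax unchanged further down.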

tedenamorado commented 7 years ago

Will do! Thanks for pointing that out, Shiro. We will add a warning for inputs that are not a data.frame, a data.table, or a dplyr tibble.
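
A minimal sketch of what such a warning could look like, assuming a hypothetical `check_link_input` helper (this is not fastLink's actual code). It relies on the fact that tibbles and data.tables both inherit from "data.frame", so a single `inherits()` test covers all three accepted types.

```r
## Hypothetical input check (not fastLink's implementation)
check_link_input <- function(x, name = "input") {
  if (!inherits(x, "data.frame")) {
    warning(name, " has class '", class(x)[1],
            "'; expected a data.frame, data.table, or tibble",
            call. = FALSE)
    return(FALSE)
  }
  TRUE
}

## A data.frame passes; a matrix triggers the warning and returns FALSE
ok_df  <- check_link_input(data.frame(a = 1:2), "dfA")
ok_mat <- suppressWarnings(check_link_input(matrix(1:4, 2), "dfB"))
```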