kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Question - Matching dataset against itself #29

Closed (mclurkles closed this issue 5 years ago)

mclurkles commented 6 years ago

Hi,

This is a fantastic package. Thanks to you all.

One of the problems I'm trying to solve is finding duplicates within a single dataset. Do you have any ideas on how to use fastLink to identify same-set duplicates?

Cheers, Ewen

tedenamorado commented 6 years ago

Hi Ewen,

Thanks a lot for your kind words! We constantly work on fastLink to make it user-friendly and intuitive to use.

De-duplication is a problem that can be addressed using fastLink. The lines below present an example and should give you a good place to start:

library('fastLink')
## This dataset has 500 observations 
## 50 observations are duplicates
data("RLdata500", package = "RecordLinkage")

## Note that matching the dataset against itself should yield
## 500 self-matches + 50 * 2 mirrored duplicate pairs = 600 matched pairs

## In this case we know the truth, we will use this to 
## check performance:
RLdata500$true_id <- identity.RLdata500 

## We create an ID for each observation (rownumber)
RLdata500$id <- 1:nrow(RLdata500)

## Using fastLink for finding duplicates within a dataset
rl_matches <- fastLink(
  dfA              = RLdata500,
  dfB              = RLdata500,
  varnames         = c("fname_c1", "lname_c1", "by", "bm", "bd"),
  stringdist.match = c("fname_c1", "lname_c1"),
  dedupe.matches   = FALSE,
  return.all       = FALSE
)

id1 <- RLdata500$id[rl_matches$matches$inds.a]
id2 <- RLdata500$id[rl_matches$matches$inds.b]

trueID1 <- RLdata500$true_id[rl_matches$matches$inds.a]
trueID2 <- RLdata500$true_id[rl_matches$matches$inds.b]

## You can check that we find 598 out of the 600
## matches. In other words we miss one duplicated
## observation.
sum(trueID1 == trueID2)

## Getting a UNIQUE ID
## The problem in this exercise is symmetric: if observation 1 in A
## matches 2 in B, then observation 2 in A also matches 1 in B, and
## every observation matches itself. We keep only the pairs below the
## diagonal (id1 > id2), which drops self-matches and mirrored pairs.
keep <- id1 > id2

## link between original ID and the duplicated ID
id.duplicated <- id1[keep]
id.original <- id2[keep]

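## We create a new id and replace the ID for the duplicates.
## match() pairs each row with its own duplicate, so the result
## does not depend on the order in which the matches are returned.
RLdata500$id_new <- RLdata500$id
pos <- match(RLdata500$id, id.original)
RLdata500$id_new[!is.na(pos)] <- id.duplicated[pos[!is.na(pos)]]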
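
The ID replacement above links each record to a single duplicate. If a record has several duplicates, or matches form a chain (1 matches 2, 2 matches 3), one option is to assign a single ID per connected cluster of matched pairs. The following is only a minimal base R sketch of that idea on toy indices; `assign_cluster_ids` is a hypothetical helper and not part of fastLink, and the toy pairs merely imitate the shape of `inds.a` and `inds.b`.

```r
## Hypothetical helper (not a fastLink function): give every record the
## smallest row number found in its cluster of matched records.
assign_cluster_ids <- function(n, inds_a, inds_b) {
  inds_a <- as.integer(inds_a)
  inds_b <- as.integer(inds_b)
  parent <- seq_len(n)  # each record starts as its own cluster

  ## Follow parent pointers until reaching the cluster root
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  for (k in seq_along(inds_a)) {
    ra <- find_root(inds_a[k])
    rb <- find_root(inds_b[k])
    ## Merge two clusters, keeping the smaller row number as the root
    if (ra != rb) parent[max(ra, rb)] <- min(ra, rb)
  }

  vapply(seq_len(n), find_root, integer(1))
}

## Toy self-match output for 6 records: self-matches, mirrored pairs,
## and a chain 1-2-3 (made-up indices for illustration only)
toy_a <- c(1:6, 2, 1, 3, 2, 6, 5)
toy_b <- c(1:6, 1, 2, 2, 3, 5, 6)

cluster_ids <- assign_cluster_ids(6, toy_a, toy_b)
cluster_ids
## 1 1 1 4 5 5
```

Records 1, 2 and 3 end up sharing one ID even though 1 and 3 were never matched directly. Whether chained matches should be merged this way is a modelling decision that depends on the data.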

If anything is unclear, please let us know.

All my best,

Ted

PS. We will add a thorough discussion of this example and the overall problem of finding duplicates.

bfifield commented 5 years ago

Hi Ewen,

To follow up on this, we've taken the code that Ted posted above and folded it into the newest version of the package. If you run the following on the current version on master:

library(fastLink)
data(samplematch)

## Add duplicates
dfA <- rbind(dfA, dfA[sample(1:nrow(dfA), 10, replace = FALSE),])

## Run fastLink
fl_out_dedupe <- fastLink(
  dfA = dfA, dfB = dfA,
  varnames = c("firstname", "lastname", "housenum",
               "streetname", "city", "birthyear")
)

## Run getMatches
dfA_dedupe <- getMatches(dfA = dfA, dfB = dfA, fl.out = fl_out_dedupe)

## Look at the IDs of the duplicates
names(table(dfA_dedupe$dedupe.ids)[table(dfA_dedupe$dedupe.ids) > 1])

## Show duplicated observation
dfA_dedupe[dfA_dedupe$dedupe.ids == 501,]

it will de-duplicate dataset A using probabilistic record linkage (PRL). We will be pushing this to CRAN in the newest version of fastLink within the next few days. Thanks again for suggesting this!
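
Once a `dedupe.ids` column is available, collapsing the data to one row per ID takes only base R. The sketch below uses a made-up data frame in place of the real `getMatches()` output; only the column name `dedupe.ids` is taken from the code above, and everything else is assumed for illustration.

```r
## Toy stand-in for the de-duplicated output; the values are made up
toy <- data.frame(
  firstname  = c("ann", "bob", "ann", "cara"),
  dedupe.ids = c(1, 2, 1, 3),
  stringsAsFactors = FALSE
)

## IDs shared by more than one row mark duplicated records
id_counts <- table(toy$dedupe.ids)
dup_ids <- names(id_counts)[id_counts > 1]

## Keep the first row for each ID to get one record per entity
toy_unique <- toy[!duplicated(toy$dedupe.ids), ]
```

Keeping the first row is the simplest rule. If the duplicates differ in some fields, a rule such as keeping the most complete record may be preferable.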