kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

question / documentation #45

Open kalakaru opened 3 years ago

kalakaru commented 3 years ago

How do I get a table with the following information:

rownumber dfA: row numbers of dfA
rownumber dfB: row numbers of dfB that can be linked to the corresponding row of dfA
similarity measure: how well the row of dfB can be linked to the corresponding row of dfA

For example, dfA has 10 rows and dfB has 5 rows. These data frames can be linked as follows:

rownumber dfA   rownumber dfB   similarity measure
1               2               0.94
2               NA              0
3               NA              0
4               1               0.93
5               4               0.92
6               5               0.92
7               NA              0
8               NA              0
9               NA              0
10              3               0.98

Legend: row 2 of dfB can be linked to row 1 of dfA. The similarity measure of this link is 0.94.

tedenamorado commented 3 years ago

Hi,

I hope all is well. fastLink does calculate the similarity measures for every pair of observations in the cross-product of the two datasets, but it does not keep them: the numbers are recycled, because storing them is the most expensive part of the task in terms of memory.
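To give a rough sense of why keeping every score is costly, here is a back-of-the-envelope calculation. The dataset sizes are hypothetical, and the calculation assumes one 8-byte double per pair in a dense matrix, which is only an illustration and not a description of fastLink's internals.

```r
# Hypothetical sizes, chosen only for illustration
n_a <- 1e5   # rows in dfA
n_b <- 1e5   # rows in dfB

n_pairs <- n_a * n_b        # number of pairs in the cross-product
bytes   <- n_pairs * 8      # assumption: one 8-byte double per pair
gib     <- bytes / 1024^3   # convert bytes to GiB

cat(sprintf("%.0f pairs, about %.1f GiB per compared variable\n",
            n_pairs, gib))
```

Under these assumptions, a single compared variable would already need tens of GiB if every score were stored.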

If your interest is to recover these numbers after the observations have been linked, fastLink provides you with the agreement pattern for that specific pair of records. For example, in the case of rows 1 and 10 in your data, it would say that they AGREE, and for rows 4:6 it would say that they PARTIALLY AGREE. In fastLink, you can select two cutpoints to make this distinction. For string-valued variables the defaults are 0.94 and 0.88, so a score of 0.94 or above is considered AGREE, and a score from 0.88 up to (but not including) 0.94 is considered PARTIALLY AGREE. The function getMatches() can return such agreement values for you.
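As a small illustration of how two cutpoints turn a score into an agreement level, the sketch below applies 0.94 and 0.88 to the five non-zero scores from the table in the question. It uses base R only and is not fastLink's own code. The names `cut_a`, `cut_p`, and `level` are made up for this example, and treating a score that sits exactly on a cutpoint as belonging to the higher category is an assumption.

```r
# Similarity scores taken from the example table in the question
scores <- c(0.94, 0.93, 0.92, 0.92, 0.98)

cut_a <- 0.94  # cutpoint for agreement
cut_p <- 0.88  # cutpoint for partial agreement

# Assumption: a score exactly on a cutpoint falls in the higher category
level <- ifelse(scores >= cut_a, "Agree",
         ifelse(scores >= cut_p, "Partially Agree", "Disagree"))

data.frame(score = scores, level = level)
```

With these cutpoints, the scores 0.94 and 0.98 come out as "Agree" and the three scores in between come out as "Partially Agree", which matches the description of rows 1, 10 and 4:6 above.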

Please, if anything, do not hesitate to let us know.

All my best,

Ted

kalakaru commented 3 years ago

Hi Ted,

Thanks for the fast reply! I looked at your answer, but I still couldn't figure out how to get the table mentioned above for the data frames dfA and dfB. Could you maybe give me some example code?

Cheers!

tedenamorado commented 3 years ago

No problem! I think this matches your request when focusing on first names in our sample data.

## Load the package and data
library(fastLink)
data(samplematch)

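## First name: compare the values in dfA with the values in dfB
g1 <- gammaCKpar(dfA$firstname,
                 dfB$firstname,
                 cut.a = 0.94,  # lower bound for a full match ("Agree")
                 cut.p = 0.88   # lower bound for a partial match ("Partially Agree")
                 )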

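## Pairs that agree: each element of g1$matches2 holds row numbers of dfA ([1])
## and row numbers of dfB ([2]); expand.grid() lists every combination of them
temp2 <- list()
for(i in seq_along(g1$matches2)) {  # seq_along() is safe if the list is empty
  temp2[[i]] <- expand.grid(unlist(g1$matches2[[i]][1]),
                            unlist(g1$matches2[[i]][2]))
}

table2 <- do.call('rbind', temp2)
table2$similarity <- "Agree"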

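## Pairs that partially agree: same layout, taken from g1$matches1
temp1 <- list()
for(i in seq_along(g1$matches1)) {  # seq_along() is safe if the list is empty
  temp1[[i]] <- expand.grid(unlist(g1$matches1[[i]][1]),
                            unlist(g1$matches1[[i]][2]))
}

table1 <- do.call('rbind', temp1)
table1$similarity <- "Partially Agree"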

## Stack both sets of pairs; named link_table to avoid masking base::table()
link_table <- rbind(table2, table1)
colnames(link_table) <- c("Row Number dfA",
                          "Row Number dfB",
                          "Similarity level")
head(link_table)
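The code above lists only the pairs that agree or partially agree. To get the layout asked for in the question, with one line for every row of dfA and NA where no link exists, one option is a left join in base R. The sketch below is self-contained and uses a small hypothetical pair table (`pairs_found`) that mirrors the example in the question, not real fastLink output. The names `row_a`, `row_b`, `level`, and `full` are also made up for the illustration.

```r
# Hypothetical linked pairs, standing in for the pair table produced above
pairs_found <- data.frame(
  row_a = c(1, 4, 5, 6, 10),
  row_b = c(2, 1, 4, 5, 3),
  level = c("Agree", "Partially Agree", "Partially Agree",
            "Partially Agree", "Agree"),
  stringsAsFactors = FALSE
)

n_a <- 10  # number of rows in dfA in the question's example

# Left join: keep every dfA row, fill unlinked rows with NA
all_rows <- data.frame(row_a = seq_len(n_a))
full <- merge(all_rows, pairs_found, by = "row_a", all.x = TRUE)
full <- full[order(full$row_a), ]

# Label rows without any link explicitly
full$level[is.na(full$level)] <- "No link"

print(full, row.names = FALSE)
```

With real data, a row of dfA can pair with more than one row of dfB, in which case it appears on several lines of the result.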

Please, if anything, do not hesitate to let us know.

All my best,

Ted