kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

How to avoid the error "You have no exact matches ... Please drop this variable"? #19

Closed aalexandersson closed 6 years ago

aalexandersson commented 6 years ago

I tried to reproduce a linkage I had done with RecordLinkage (R) in order to get realistic testing, but in fastLink on Windows I get the error "You have no exact matches ... Please drop this variable". Because I do not want to drop the variable from my analysis, how can I avoid the error?

The error message comes from lines 225-227 of fastLink.r:

        if(sum(dfA[,varnames[i]] %in% dfB[,varnames[i]]) == 0){
            stop(paste0("You have no exact matches for ", varnames[i], ". Please drop this variable from your analysis."))
        }

I have an average PC with 8 GB of RAM. However, my datasets have about 75,000 and 3 million observations, which is about 100 times more than in your example on pages 24-25 of the paper. Therefore, to reduce the datasets by a factor of about 100, I used clusterMatch() with 100 clusters. If I use "only" 50 clusters, then the error disappears, but the run time is much slower.

The problem does not occur in RecordLinkage because RecordLinkage allows more blocking than just gender. The error in fastLink occurred for the 9-digit Social Security Number (SSN), which overall is the best matching variable, so I do not want to drop it.

I know you are working hard on other things such as the confusion matrix, but the error is problematic. One possible quick fix might be to change the error to a warning only. Another possibility is that I drop the offending cluster(s), but that seems like a very crude workaround (see the sketch below). A third possible solution, which I suggest is better but harder to implement, would be to allow more restrictive blocking than only on gender, in order to enable smaller blocks of pairs, less need for many clusters, and a faster run time.
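A crude sketch for finding the offending clusters up front, using the same overlap check that fastLink.r performs (the `cluster` column comes from clusterMatch(), as in my follow-up code below):

```r
# Flag clusters in which the two datasets share no exact SSN values;
# this mirrors the check at lines 225-227 of fastLink.r quoted above.
bad <- sapply(1:100, function(i) {
  a <- subset(pat, cluster == i)
  b <- subset(req, cluster == i)
  sum(a$ssn %in% b$ssn) == 0
})
which(bad) # clusters that would trigger the error
```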

--Anders

aalexandersson commented 6 years ago

Here is a follow-up comment with my (too slow) fastLink code on the two real, confidential datasets. The probabilistic record linkage on the same datasets and the same matching variables worked in RecordLinkage, thanks to it allowing multiple blocking variables rather than only gender (at the cost of multiple passes):

```r
library(fastLink)
library(readstata13) # for using Stata data

# Real test data
pat <- read.dta13("patient.dta")   # ~75,000 obs
req <- read.dta13("requestor.dta") # 3 million obs

# Take 5% random sample
pat.s <- pat[sample(1:nrow(pat), nrow(pat) * 0.05), ]
req.s <- req[sample(1:nrow(req), nrow(req) * 0.05), ]

# Run EM algorithm on the random samples
rs.out <- fastLink(
  dfA = pat.s, dfB = req.s,
  varnames = c("ssn", "zip", "stnum", "dob", "sex", "nysf", "nysl"),
  stringdist.match = c("ssn", "zip", "stnum", "dob", "nysf", "nysl"),
  estimate.only = TRUE
)

# Create 100 clusters based on first name
cl.out <- clusterMatch(pat$nysf, req$nysf, nclusters = 100) # ~ max.n = 30,000
pat$cluster <- cl.out$clusterA
req$cluster <- cl.out$clusterB

# Apply fastLink() to the 100 clusters
N <- 100 # 50 creates too large datasets
x <- vector("list", N)
for (i in 1:N) {
  link.i <- fastLink(
    dfA = subset(pat, cluster == i),
    dfB = subset(req, cluster == i),
    varnames = c("ssn", "zip", "stnum", "dob", "sex", "nysf", "nysl"),
    stringdist.match = c("ssn", "zip", "stnum", "dob", "nysf", "nysl"),
    partial.match = c("ssn"), # no exact match in cluster 8
    gender.field = c("sex"),
    threshold.match = 0.85, # default
    em.obj = rs.out,
    verbose = TRUE
  )
  x[[i]] <- link.i
}

# Aggregate multiple matches together
linklist <- setNames(x, paste0("link.", 1:N))
agg.out2 <- aggregateEM(em.list = linklist)

# Summarize the results
summary(agg.out2)
```

aalexandersson commented 6 years ago

I should be able to speed things up by doing a small deterministic match beforehand. That will reduce the size of the two datasets and the speed-related issues. I will let you know if it fails.

tedenamorado commented 6 years ago

Hi Anders,

Please keep us posted. The random sampling strategy works well whenever you think there is a large overlap between the datasets.

It looks like when you split the data, some clusters end up with no information in the SSN field. Do you have missing values in that variable?

Why are you treating SSN as a string variable? Using a string comparison for a numeric variable is not ideal: e.g., the string 108 is close to 180, while the numbers are far apart.
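A quick illustration of the point (using the stringdist package purely for demonstration; it is not a fastLink dependency):

```r
library(stringdist) # for illustration only

# As strings, "108" and "180" look fairly similar ...
stringsim("108", "180", method = "jw", p = 0.1) # Jaro-Winkler similarity: 0.6
# ... but as numbers they are far apart:
abs(108 - 180) # 72
```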

One small suggestion: if you split the data by gender, that would cut the number of comparisons roughly in half. Adding another for loop over gender might help, as in the sketch below.
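Something along these lines (a sketch reusing your cluster loop; the "M"/"F" coding of sex is an assumption):

```r
# Loop over gender within each cluster so that fastLink() only
# compares same-sex record pairs ("M"/"F" coding assumed);
# sex is dropped from varnames since it is used for blocking.
x <- list()
for (g in c("M", "F")) {
  for (i in 1:N) {
    x[[paste0(g, ".", i)]] <- fastLink(
      dfA = subset(pat, cluster == i & sex == g),
      dfB = subset(req, cluster == i & sex == g),
      varnames = c("ssn", "zip", "stnum", "dob", "nysf", "nysl"),
      stringdist.match = c("nysf", "nysl"),
      verbose = TRUE
    )
  }
}
```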

Keep us posted!

Ted

aalexandersson commented 6 years ago

Ted,

Thank you for the encouraging and helpful comment. I will continue to test your program and keep you posted.

I treated SSN as a string variable to avoid numeric precision problems. I agree that SSN should not be treated as a partial string, so this line should be removed: `partial.match = c("ssn")`. I had it in the test code only because I had no good alternative: which other variable would I test as a partial string match? For example, I used phonetic NYSIIS versions of the first and last names (double metaphone is better, though) and zip and street number for the address. Therefore, those four variables are not useful for that purpose either.

However, I am not sure there is a consensus on how to best treat SSN. The ADGN article (now accepted at https://doi.org/10.1080/2330443X.2017.1389620) does not discuss the issue but uses SSN4 in Table 1, which I interpret as an exact match on the last 4 digits only. Is fastLink faster for matching exact numbers than for exact strings? If yes, then I agree SSN should be numeric -- as should zip, stnum ("street number"), and dob ("date of birth"); dob ideally should be matched as a date data type, perhaps partially as a date range.

Anders

tedenamorado commented 6 years ago

Hi Anders,

You raise a valid point about SSNs; still, the way most string distances are calculated does not work well with numbers. fastLink is much faster for exact matches, regardless of whether they are strings or numbers.

For a first pass, if I were you, I would use exact matching (gammaKpar) on zip code, SSN, and street number. In the case of dob, as you suggest, ideally one would calculate the age of an individual and use a numeric distance to separate comparisons into agreement categories. For now, I would advise you to use the exact match comparison for that field too. We have code for a function that implements comparisons for numeric variables based on an absolute difference, but we have not incorporated it into fastLink yet, as we are making sure it works properly.
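A sketch of such a first pass, reusing the variable names from your code above (any variable in varnames that is not listed in stringdist.match is compared exactly):

```r
# First pass: exact (gammaKpar) comparisons on ssn, zip, stnum, dob, and sex;
# string-distance comparisons only on the phonetic name fields.
pass1 <- fastLink(
  dfA = pat.s, dfB = req.s,
  varnames = c("ssn", "zip", "stnum", "dob", "sex", "nysf", "nysl"),
  stringdist.match = c("nysf", "nysl")
)
```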

As noted before, please keep us posted about your project! Your feedback is really important to us.

Ted

aalexandersson commented 6 years ago

Hi Ted,

I used set.seed(123) and was able to reproduce the error message for SSN. There were some missing SSNs in both datasets, but I think the issue is instead that the overlap between the two datasets is very small (zero overlap when the error occurs). In the original linkage using RecordLinkage (R), the result was about 3,000 links.

Thank you for the suggestion of "a first pass". It is easier for me to troubleshoot this in Stata: when I use an exact match (Stata command merge 1:1) on zip code, SSN, and street number, the result is about 2,000 links. What is your idea with "a first pass"? I see two possibilities: (A) a probabilistic first pass -- multiple passes tend to result in complex code, which I hope to avoid. (B) a deterministic first pass -- the simple idea is to implement more blocking than fastLink() allows in order to reduce the datasets to more manageable sizes. For example, if I do not match on street number, then the result is two equal datasets with about 58,000 observations. Here is example Stata code showing how to implement the blocking on zip code and SSN9:

```stata
// prepare for blocking
use requestor, clear
rename pid pid_requestor
save requestor_tmp, replace
use patient, clear
rename pid pid_patient
save patient_tmp, replace

// block on zip + ssn
use requestor_tmp, clear
joinby zip ssn using patient_tmp // adding stnum -> n=2,042
isid pid_requestor pid_patient
keep pid_requestor pid_patient
save block_zip_ssn, replace
count // n=58,533

// restrict original datasets to blocking: n=58,533
use block_zip_ssn, clear
merge m:1 pid_patient using patient_tmp, keep(match) nogen
save patient_block, replace

use block_zip_ssn, clear
merge m:1 pid_requestor using requestor_tmp, keep(match) nogen
save requestor_block, replace
```

After running this Stata code, I have two datasets already matched on ZIP5 and SSN9, with about 58,000 observations each, down from 75,000 and 3,000,000 observations. I think this should avoid the reported error in fastLink and also drastically improve the speed. My main concern is that the blocking in the example is somewhat too restrictive, since typos in SSNs are common, but I think some fine-tuning is possible. Perhaps ZIP5 and SSN4 (the last 4 digits) is a better blocking.
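For completeness, the same deterministic blocking could be done directly in R; here is a minimal base R sketch (the zip, ssn, pat, and req names are from my code above, and the shared key table is my own illustration):

```r
# Keep only records whose (zip, ssn) pair appears in both datasets.
keys <- merge(unique(pat[, c("zip", "ssn")]),
              unique(req[, c("zip", "ssn")]))
pat.block <- merge(pat, keys, by = c("zip", "ssn"))
req.block <- merge(req, keys, by = c("zip", "ssn"))
```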

My main point is that blocking can effectively reduce the problem with big data for probabilistic record linkage, and that a prior deterministic match can be used as blocking if the software does not allow sufficient blocking. I will let you know how it goes. Please let me know what you think of the general idea.

Anders

aalexandersson commented 6 years ago

I made two changes:

1) I had wrongly blocked also on missing values. Removing missing values of SSN and ZIP in the blocking results in (2,181; 2,181) observations, which is too restrictive.

2) I relaxed the blocking from (SSN and ZIP) to (SSN4 and ZIP), where SSN4 is the last four digits of the SSN. The result is (27,220; 27,220) observations.

This is easily managed by fastLink(), so I may even want to block less before using fastLink(). Perhaps block only on SSN4, or only on ZIP?
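In R, the two changes amount to something like this (a sketch; it assumes ssn is stored as a 9-character string):

```r
# 1) Drop records with missing SSN or ZIP before blocking.
pat2 <- subset(pat, !is.na(ssn) & ssn != "" & !is.na(zip) & zip != "")
req2 <- subset(req, !is.na(ssn) & ssn != "" & !is.na(zip) & zip != "")

# 2) Block on SSN4 (last four digits) and ZIP instead of the full SSN.
pat2$ssn4 <- substr(pat2$ssn, 6, 9)
req2$ssn4 <- substr(req2$ssn, 6, 9)
keys <- merge(unique(pat2[, c("zip", "ssn4")]),
              unique(req2[, c("zip", "ssn4")]))
```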

Will keep you posted! Anders

aalexandersson commented 6 years ago

After the pre-blocking on SSN4 and ZIP, I used a 5% random sample as before and 3 clusters (down from 100), with a run time of about 20 minutes, which is about twice as fast as my RecordLinkage run. Recall that the original results using RecordLinkage were about 3,000 matches (I do not want to publish the exact number here).

Here is the fastLink() code only:

```r
link.i <- fastLink(
  dfA = subset(pat, cluster == i),
  dfB = subset(req, cluster == i),
  varnames = c("ssn", "stnum", "dob", "sex", "nysf", "nysl"),
  gender.field = c("sex"),
  threshold.match = 0.50,
  em.obj = rs.out,
  verbose = TRUE
)
```

Whether threshold.match = 0.75 or 0.50 (as above), the result is 1,061 exact matches and 1,738 matches at 95% confidence.

```
summary(agg.out)
                 95%     85%     75%   Exact
1 Match Count   1738    1738    1738    1061
2 Match Rate  6.385%  6.385%  6.385%  3.898%
3 FDR         0.002%  0.002%  0.002%
4 FNR         0.004%  0.004%  0.004%
```

I think I am making great progress, but I may have to compare the differences closely. However, I am still concerned that I do not yet fully understand fastLink: is there a way to increase the number of matches other than lowering the cutoff for the option threshold.match?

tedenamorado commented 6 years ago

Lowering the threshold has the problem of increasing the False Discovery Rate. The only other way I can think of is to not remove duplicates. However, if what you want is a 1:1 match, I would not advise you to do so.
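If you do want to try that, I believe it corresponds to the dedupe.matches argument of fastLink() (TRUE by default); a minimal sketch, reusing the variable names from Anders's code above:

```r
# Keep duplicate matches instead of enforcing a 1:1 match.
link.i <- fastLink(
  dfA = subset(pat, cluster == i),
  dfB = subset(req, cluster == i),
  varnames = c("ssn", "stnum", "dob", "sex", "nysf", "nysl"),
  dedupe.matches = FALSE
)
```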