kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Long runtime on sampled data #53

Open emcghee73 opened 3 years ago

emcghee73 commented 3 years ago

I'm trying to get fastLink to merge two copies of the California voter file that are four years apart: one from 2012 and one from 2016. My strategy is to use the method from the APSR paper (at least as I understand it), but I'm getting stuck on what I thought would be the fast part.

I'm running fastLink first on 5% samples of each file. I then plan to block the full files on gender plus as many bins of first name as it takes to get down to about 250K cases in each bin (again, copying the APSR paper). I assumed it was best to run the sampled stage without blocking: if I blocked the samples the same way as the full files, each block would contain too few units to match, because the samples are so much smaller.
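For the full-file stage, I'm picturing something like the `blockData()` sketch below (untested; the gender column name is whatever it is in my files, and `nclusters` would be tuned until each bin is around 250K):

```r
library(fastLink)

## Sketch of the planned full-file blocking stage (untested):
## exact-block on gender, then k-means cluster first names into
## enough bins that each block holds roughly <= 250K records.
blocks <- blockData(
  dfA = d12, dfB = d16,
  varnames = c("gender", "fname"),
  kmeans.block = "fname",
  nclusters = 40   # tune until each block is the right size
)

## Each block records which rows of d12 and d16 it contains:
## blocks$block.1$dfA.inds, blocks$block.1$dfB.inds, and so on.
```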

Bottom line is that I'm just brute-forcing the sampling stage. Here's the code (each file has been de-duped beforehand):

```r
library(dplyr)     # for sample_frac()
library(fastLink)

d12.sub <- sample_frac(d12, size = 0.05)
d16.sub <- sample_frac(d16, size = 0.05)

rs.out <- fastLink(
  dfA = d12.sub,
  dfB = d16.sub,
  varnames = c("lname", "fname", "mname", "latlong", "bdate"),
  stringdist.match = c("lname", "fname"),
  partial.match = c("lname", "fname"),
  estimate.only = TRUE
)
```

Now that I look at this, I realize that "latlong" is a string but wasn't identified as one in stringdist.match. "mname" is also a string, but of length one (just a middle initial). I'm not sure whether that creates problems. At any rate, this has been running for the last six days and is stuck here:

```
====================
fastLink(): Fast Probabilistic Record Linkage
====================

If you set return.all to FALSE, you will not be able to calculate a confusion table as a summary statistic.
Calculating matches for each variable.
Getting counts for parameter estimation.
    Parallelizing calculation using OpenMP. 54 threads out of 55 are used.
```

As you can see, I'm running this on servers with a lot of parallel capacity. Am I doing something wrong? If not, are there any recommendations for speeding this up? Does it make sense for something like this to run so long? It has been running so long that I'm scared to stop it and experiment, for fear it's just about to finish. Thanks in advance!
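P.S. If those unspecified string fields do turn out to matter, I assume the fix is just to declare "latlong" for string-distance comparison as well, something like this (untested):

```r
## Untested revision: fields left out of stringdist.match are compared
## for exact equality, so adding latlong relaxes that to the default
## Jaro-Winkler comparison. A one-character mname is probably fine as
## an exact match.
rs.out <- fastLink(
  dfA = d12.sub, dfB = d16.sub,
  varnames = c("lname", "fname", "mname", "latlong", "bdate"),
  stringdist.match = c("lname", "fname", "latlong"),
  partial.match = c("lname", "fname"),
  estimate.only = TRUE
)
```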

tedenamorado commented 2 years ago

Hi @emcghee73,

Matching voter files is not an easy task, so I understand how complicated this can get. It is possible, though; it is just a matter of fine-tuning the approach.

My suggestion would be to block on gender first and then sample observations within each block to obtain parameter estimates. Since you have a lot of data, 5% may be too high: there are more than 16 million registered voters in California for the years you mentioned.
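A quick back-of-the-envelope with that figure shows why the unblocked sampled stage takes so long:

```r
## Rough count of candidate pairs in the unblocked sampled stage.
n_file   <- 16e6              # registered voters in CA (approximate)
n_sample <- 0.05 * n_file     # a 5% sample: 800,000 rows per file
n_pairs  <- n_sample^2        # cross-file pairs that must be scored
n_pairs                       # 6.4e11, a very large comparison space
```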

Since the data is so large, blocking is always a good idea. An alternative approach we have followed is:

  1. Block by Gender and District
  2. Then conduct the merge in each block

Steps 1 and 2 will recover the observations that either moved within their district or did not move at all. Then, for those you could not find via steps 1 and 2, search for them across districts in California. Since many observations will already have been matched, this last step requires a much smaller search.
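In code, that strategy might look like the sketch below. The district column name and the matching variables are placeholders based on this thread, and the loop is untested:

```r
library(fastLink)

## Pass 1: block on gender and district, then match within each block.
blocks <- blockData(d12, d16, varnames = c("gender", "district"))

matched.12 <- matched.16 <- integer(0)
for (b in blocks) {
  out <- fastLink(
    dfA = d12[b$dfA.inds, ], dfB = d16[b$dfB.inds, ],
    varnames = c("lname", "fname", "mname", "latlong", "bdate"),
    stringdist.match = c("lname", "fname"),
    partial.match = c("lname", "fname")
  )
  ## Map the within-block match indices back to rows of the full files.
  matched.12 <- c(matched.12, b$dfA.inds[out$matches$inds.a])
  matched.16 <- c(matched.16, b$dfB.inds[out$matches$inds.b])
}

## Pass 2: search for the leftovers across districts; most records are
## already matched, so this comparison space is far smaller.
left.12 <- d12[setdiff(seq_len(nrow(d12)), matched.12), ]
left.16 <- d16[setdiff(seq_len(nrow(d16)), matched.16), ]
```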

If we can be of any help, do not hesitate to let us know.

All my best,

Ted

emcghee73 commented 2 years ago

Thank you, Ted! I'll give these ideas a try and report back with progress.

Cheers, Eric

mpr1255 commented 2 years ago

...curious about the progress!