kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage


Log(X) NaNs produced #81

Open kslungaardmumma opened 6 months ago

kslungaardmumma commented 6 months ago

Hello,

I am running a (looped) script using fastLink. The script runs (and seems to work), but at the end I get a list of 50 warnings: "In log(x): NaNs produced." I assumed that this probably has to do with the likelihood function and isn't generally something to be concerned about in terms of affecting the output. Does that seem right? I am not able to provide a reproducible example here since this project uses restricted-use data, and I am unable to reproduce the issue with the sample data.

Thanks!

aalexandersson commented 6 months ago

Disclaimer: I am a regular user of fastLink, not a developer.

I am not aware of a best way to handle this warning message. Are there negative values in the dataset? Are you able to show the script (code only, no data)?

kslungaardmumma commented 6 months ago

Yes - attached. I wrote in some "XXX"'s for file paths. There should not be negative values in any fields.


aalexandersson commented 6 months ago

Sorry, I cannot see your attached file. Maybe just paste it? Make sure to preview before sending. Markdown is supported.

kslungaardmumma commented 6 months ago

install.packages("fastLink")
install.packages("purrr")
install.packages("tidyverse")
install.packages("tidyr")
install.packages("stringdist")
install.packages("Matrix")
install.packages(c("fastLink", "xtable", "tidyverse", "ggthemes",
                   "gridExtra", "grid", "data.table", "knitr",
                   "doParallel",
                   "parallel", "lattice", "stringdist", "RecordLinkage"))

rm(list = ls())
library(fastLink)
library(purrr)
library(tidyverse)
library(foreign)
library(dplyr)
library(tidyr)

start.time <- Sys.time()

############################################################################
# fuzzy match: kid/birth records
# adjust the states as needed to run
states <- list("IN", "IL", "KY")
states <- list("KY")
yrlist = c(1980, 1990, 2000)

for (stat in states) {
  for (yr in yrlist) {
    yr2 = yr + 9
    setwd("XXX")

    dfA <- read.csv("students_fuzzy.csv")
    dfA <- subset(dfA, birthyear >= yr & birthyear <= yr2)

    # this is the path for the voting data
    dfBname <- paste("XXX", stat, sep = "")
    dfBname <- paste(dfBname, "XXX", sep = "")
    dfBname <- paste(dfBname, stat, sep = "")
    dfBname <- paste(dfBname, yr, sep = "")
    dfBname <- paste(dfBname, yr2, sep = "")
    dfBname <- paste(dfBname, ".csv", sep = "")

    dfB <- read.csv(dfBname)

    names(dfB)[names(dfB) == "voters_male"] <- "male"
    names(dfB)[names(dfB) == "birthyr"] <- "birthyear"

    dfA$ID <- seq.int(nrow(dfA))
    dfB$ID2 <- seq.int(nrow(dfB))

    dfA <- transform(dfA, birthyear = as.numeric(birthyear), birthmonth = as.numeric(birthmonth), birthday = as.numeric(birthday))
    dfB <- transform(dfB, birthyear = as.numeric(birthyear), birthmonth = as.numeric(birthmonth), birthday = as.numeric(birthday))

    blockgroups <- blockData(dfA, dfB, varnames = c("birthyear", "male"))

    dfA_allblocks <- list()
    dfB_allblocks <- list()
    matches_old <- data.frame()

    for (i in 1:length(blockgroups)) {

      dfA_allblocks[[i]] <- dfA[blockgroups[[i]]$dfA.inds, ]
      dfA_block <- dfA[blockgroups[[i]]$dfA.inds, ]
      dfB_allblocks[[i]] <- dfB[blockgroups[[i]]$dfB.inds, ]
      dfB_block <- dfB[blockgroups[[i]]$dfB.inds, ]

      matches.out <- fastLink(
        dfA = dfA_block, dfB = dfB_block,
        varnames = c("firstname", "lastname", "middlein", "fullname", "birthmonth", "birthday"),
        stringdist.match = c("firstname", "lastname", "fullname", "middlein"),
        numeric.match = c("birthmonth", "birthday"),
        partial.match = c("firstname", "lastname", "fullname"),
        verbose = TRUE,
        threshold.match = 0.855
      )

      matchesA_other <- dfA_block[matches.out$matches$inds.a, ]
      matchesB_other <- dfB_block[matches.out$matches$inds.b, ]
      print("Here")
      matches_other <- matchesB_other
      if (exists("matches.out")) {
        matchesA_other <- dfA_block[matches.out$matches$inds.a, ]
        matchesB_other <- dfB_block[matches.out$matches$inds.b, ]
        print("Here")
        matches_other <- matchesB_other
      }
      print("here2")
      if (exists("matches_other") & !is.null(matches_other)) {
        matches_other$pattern <- do.call(paste, matches.out$patterns)
        print("diagnose me")
        matches_other$posterior <- matches.out$posterior
        print("diagnose me2")
        matches_other$student_alternate_id <- matchesA_other$student_alternate_id
        matches_other$studfirstname <- matchesA_other$firstname
        matches_other$studmiddlename <- matchesA_other$middlename
        matches_other$studmiddlein <- matchesA_other$middlein
        matches_other$studlastname <- matchesA_other$lastname
        matches_other$studbirth_date <- matchesA_other$birth_date
        matches_other$studbirthyear <- matchesA_other$birthyear
        matches_other$studbirthmonth <- matchesA_other$birthmonth
        matches_other$studbirthday <- matchesA_other$birthday
        matches_other$studfullname <- matchesA_other$fullname
        print("diagnose me3")

        matches_other <- rbind(matches_old, matches_other)
        matches_other$posterior <- format(matches_other$posterior, decimal.mark = ".", digits = 4)

        print(i)
        matches_old <- matches_other
        #rm(matches.out)
      }
    }

    setwd("XXX")
    print("writing out")
    outname <- paste("FLkidsvote", stat, "", yr, "", yr2, ".csv", sep = "")
    write.csv(matches_old, outname, row.names = FALSE)
  }
}
end.time <- Sys.time()
time.taken2 <- round(end.time - start.time, 2)
time.taken2


aalexandersson commented 6 months ago

Do the warning messages occur from the for loop, and/or from the code before or after the for loop?

Are all the variables mostly complete (little missing data) -- even "middlein"? Also, it seems excessively redundant to link on both "fullname" and, at the same time, all the name parts: "firstname", "lastname", "middlein".

kslungaardmumma commented 6 months ago

Hi-

Middlein is missing a lot of data.

The warnings only display after I run the full code (including the loop). Is there a way to tell where the warning is traced to? The message I get is just "50 warnings recorded - use warnings() to display" and then it shows this same warning again and again. I assume it must be related to fastLink because I can't see where else logs come into play.

I include both full name and each name field separately because I have some concerns about which field middle/last are reported (especially for two part last names, like “Lopez Garcia”).

Does that help?


aalexandersson commented 6 months ago

In practice, I have found fastLink to be unreliable with a lot of missing data, say >30%. Are the other linkage variables more complete? Do the warning messages disappear if variable "middlein" is omitted?
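One way to check how complete each linkage variable is before linking is sketched below. The toy data frame and the `missing_share` helper are hypothetical, treating empty strings as missing is an assumption of this sketch, and the 30% cutoff is only the rule of thumb from the comment above, not a fastLink setting.

```r
# Hypothetical toy data standing in for dfA; column names follow the posted script.
dfA_toy <- data.frame(
  firstname = c("ANA", "BEN", NA, "DEE", "EVA"),
  lastname  = c("LOPEZ", "KIM", "SMITH", "", "JONES"),
  middlein  = c(NA, "", "J", NA, ""),
  stringsAsFactors = FALSE
)

# Share of missing values per column, counting NA and empty strings as missing
# (an assumption of this sketch).
missing_share <- function(df, vars) {
  sapply(df[vars], function(x) mean(is.na(x) | trimws(as.character(x)) == ""))
}

shares <- missing_share(dfA_toy, c("firstname", "lastname", "middlein"))
print(round(shares, 2))

# Variables above the rule-of-thumb cutoff of 30% missing.
too_sparse <- names(shares)[shares > 0.30]
print(too_sparse)
```

Running the same check on both datasets, and within each block, would show whether a variable such as "middlein" is sparse everywhere or only in some blocks.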

You can convert warnings to errors, and then trace the errors. See, for example, https://adv-r.hadley.nz/debugging.html#non-error-failures.
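As a rough illustration of locating the source of a warning inside a loop, the sketch below records which iteration raised it using base R's `withCallingHandlers()`. The function `link_block()` is a hypothetical stand-in for the per-block `fastLink()` call, not part of the package. Setting `options(warn = 2)` is the simpler alternative: it turns every warning into an error, after which `traceback()` shows the call stack.

```r
# Hypothetical stand-in for one block's fastLink() call: it warns for block 2 only.
link_block <- function(i) {
  if (i == 2) warning("NaNs produced")
  paste("result for block", i)
}

warned_blocks <- integer(0)
results <- vector("list", 3)

for (i in 1:3) {
  results[[i]] <- withCallingHandlers(
    link_block(i),
    warning = function(w) {
      # Record the block index and the message, then let the call continue.
      warned_blocks <<- c(warned_blocks, i)
      message("Block ", i, ": ", conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
}

print(warned_blocks)
```

Unlike `options(warn = 2)`, this approach lets the loop finish, so the blocks that warned can be inspected afterwards alongside their output.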

Best practices for linking on names is an important and difficult issue. I am concerned about using highly correlated variables, especially while having warning messages. I would try hard to get rid of the warning messages first (before optimizing the linkage). That is, start with a simple record linkage configuration that works without warning messages. Then expand from it, as needed until you can reproduce the issue. The current code seems overly complicated, for example why use both dfA_allblocks and dfA_block?

tedenamorado commented 6 months ago

Thank you, @aalexandersson, for your valuable insights as always.

If you remove middlein from the merge, do you still receive the same warnings?

It is important to note that to prevent numerical underflow caused by calculating extremely small probabilities, we use logarithmic transformations of all model parameters. At each iteration of the EM algorithm, we convert each parameter estimate back to its original scale. The issue might be that some probabilities are exceptionally tiny. For each block, you can verify this by examining matches.out$EM to see if the model parameters are too small.
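The base R sketch below shows why working on the log scale helps and how a "NaNs produced" warning can arise. It does not reproduce fastLink's internals; the probabilities are made up, and the slightly negative value is a hypothetical example of a quantity that should have been a tiny non-negative probability.

```r
# 400 hypothetical per-field probabilities of 0.01 each.
p <- rep(0.01, 400)

direct   <- prod(p)       # underflows to exactly 0 in double precision
on_logs  <- sum(log(p))   # the same quantity on the log scale stays finite
print(c(direct = direct, on_logs = on_logs))

# log(0) is -Inf and raises no warning ...
log_zero <- log(0)

# ... whereas log() of a negative number is NaN and raises "NaNs produced".
tiny_negative <- -1e-12   # hypothetical rounding artefact
log_negative <- suppressWarnings(log(tiny_negative))
print(c(log_zero = log_zero, log_negative = log_negative))
```

So the warning points at a negative (or already NaN) argument reaching `log()`, which is consistent with, but does not prove, the explanation that some estimated probabilities are extremely small.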

Another possibility is that one of your blocks contains only a few observations for one of the datasets.
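A quick way to look for small blocks is to tabulate the number of records on each side of every block. The sketch below uses a hand-built list in place of the `blockData()` output; only the element names `dfA.inds` and `dfB.inds` come from the posted script, while the block names and the minimum size of 50 are arbitrary assumptions.

```r
# Hypothetical stand-in for the list returned by blockData(); each element
# holds row indices into dfA and dfB, as used in the posted script.
blockgroups_toy <- list(
  block.1 = list(dfA.inds = 1:500,   dfB.inds = 1:800),
  block.2 = list(dfA.inds = 501:503, dfB.inds = 801:1500),
  block.3 = list(dfA.inds = 504:900, dfB.inds = 1501:1510)
)

block_sizes <- data.frame(
  block = names(blockgroups_toy),
  n_a   = sapply(blockgroups_toy, function(b) length(b$dfA.inds)),
  n_b   = sapply(blockgroups_toy, function(b) length(b$dfB.inds)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(block_sizes)

# Blocks where either side falls below an arbitrary minimum (assumed here: 50).
min_n <- 50
small_blocks <- block_sizes$block[pmin(block_sizes$n_a, block_sizes$n_b) < min_n]
print(small_blocks)
```

Cross-referencing the flagged blocks with the blocks that produce warnings would show whether block size is actually the driver.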

Please keep us updated!

Ted

kslungaardmumma commented 6 months ago

Hi Ted and Anders,

Thank you both for your input!

1) If I remove "fullname" (which is highly correlated with the other fields) for a subsample of data, I get a different warning message (4 times): "1: In emlinkMARmov(patterns = counts, nobs.a = nr_a, nobs.b = nr_b, : The EM algorithm has run for the specified number of iterations but has not converged yet."

2) If I also remove "middlein" (which has a lot of missingness), I get the same message but fewer instances of it (1 time).

3) If I examine the output for a subsample using matches.out$EM, I do see that there is a very small probability of finding a match (e.g. $p.m 1.029798566225282e-05).

Some additional context: this may be an instance where there are NOT many matches to be found. One dataset contains records for a smaller sample of individuals and the other contains voting records from a full state, so it's very possible that there are few matches in some pairings across states/years/genders. There may also be few observations in some of the blocks (e.g. few people by gender × birthyear), but the blocking is very helpful for speed.

Is it still appropriate (at least: not highly inappropriate) for me to use fastLink for this type of matching? It seems like there is still output created even when this warning occurs. This is a setting where there are many exact matches but I was attracted to fastLink because it provided a speedy way to also facilitate some "fuzzy" matching. (And it's so fast!)

Best,

Kirsten


aalexandersson commented 6 months ago

What is the approximate run time with and without blocking? How many records are in each dataset? Do the warning and error messages disappear if you reduce the amount of blocking, for example, if you use the variable birthyear in the record linkage step rather than in the blocking step?

kslungaardmumma commented 6 months ago

If I take a subset of data and run it with blocking as in the original (block by year AND gender), it takes 18.1 minutes and I do get errors (depending on the subset of data).
If I revise the code and run it without ANY blocking -- code follows -- it takes 51.83 minutes. I don't get errors (at least not in the subsets of data I tried).

matches.out <- fastLink(
  dfA = dfA, dfB = dfB,
  varnames = c("firstname", "lastname", "birthmonth", "birthday", "birthyear"),
  stringdist.match = c("firstname", "lastname"),
  numeric.match = c("birthmonth", "birthday", "birthyear"),
  partial.match = c("firstname", "lastname"),
  verbose = TRUE,
  threshold.match = 0.855
)
If I reduce my code to block just on gender (and match on birthyear), I do still get the error ("Warning messages: 1: In emlinkMARmov(patterns = counts, nobs.a = nr_a, nobs.b = nr_b,") (at least in some subsets of data).
There are something like 700-800 K observations in dfA overall and about 1 million in dfB (though that depends on the state).

Getting rid of blocking did seem to get rid of the error message. But since the blocking saves a lot of time I'm inclined to want to keep it in because I have a lot of matching to conduct.

My question, then, is this: what is the warning trying to tell me could be going on? (What would be "wrong" about my output, given this warning)? I take the fastLink matches and then subject them to further processes for refinement (i.e., requiring that they exactly match on last name or birth date, etc.). Given that, should I be concerned?

Kirsten


aalexandersson commented 6 months ago

To learn more about the warning messages, you could either (as I suggested before) convert them to errors and then trace the errors, or you could compare the matched datasets to identify which records differ and how because of the difference in blocking.
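Comparing the matched sets from two specifications can be done with plain set operations on the pair identifiers. The sketch below is a minimal illustration with made-up pairs; the column names `ID` and `ID2` follow the posted script, and the `key` helper is hypothetical.

```r
# Hypothetical matched pairs (row IDs from dfA and dfB) under two specifications.
pairs_blocked   <- data.frame(ID = c(1, 2, 3, 5), ID2 = c(10, 20, 30, 50))
pairs_unblocked <- data.frame(ID = c(1, 2, 4, 5), ID2 = c(10, 20, 40, 55))

# Build one key per pair so the two sets can be compared directly.
key <- function(d) paste(d$ID, d$ID2, sep = "-")

only_blocked   <- pairs_blocked[!key(pairs_blocked) %in% key(pairs_unblocked), ]
only_unblocked <- pairs_unblocked[!key(pairs_unblocked) %in% key(pairs_blocked), ]
in_both        <- pairs_blocked[key(pairs_blocked) %in% key(pairs_unblocked), ]

print(only_blocked)
print(only_unblocked)
print(in_both)
```

This only works if the identifiers refer to the same rows in both runs, so they should be assigned once, before any blocking or subsetting.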

Ted suggested two possible causes, and I agree. A third possible cause is that you have too few linkage variables for the EM algorithm to reach a stable, global maximum. Instead you may have unstable, local maxima. I suggest this since removing the blocking also removed all warnings and errors. Could you add more linkage variables, not correlated with the existing variables? Examples are social security number, phone number, email address, and street address.

I think the bigger question is: how many false positives (count or rate) are you willing to accept? Why did you change the threshold from the default 0.85 to 0.855? Is the third decimal a typo or intentional? Personally, I always run fastLink with a much higher threshold than the default; I typically use 0.95, 0.98, or even 0.99 because, in my job dealing with sensitive cancer data, I am much more concerned about wrong matches (false positives) than missed matches (false negatives). With the relatively low threshold of around 0.85, my main concern would be false positives, not a few warning messages from blocking. You may have different concerns.
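To get a feel for how the threshold trades off kept matches against likely false positives, one can tabulate both over a few candidate thresholds. The posterior values below are invented, the `>=` comparison is an assumption about how a threshold would be applied, and the expected false-match share is only meaningful if the posteriors are reasonably well calibrated.

```r
# Hypothetical posterior match probabilities for eight candidate pairs.
posterior <- c(0.999, 0.992, 0.981, 0.960, 0.930, 0.880, 0.856, 0.851)

thresholds <- c(0.85, 0.855, 0.95, 0.98, 0.99)

# Number of pairs kept at each threshold.
kept <- sapply(thresholds, function(t) sum(posterior >= t))

# Rough expected share of false matches among the kept pairs:
# the mean of (1 - posterior) over those pairs.
expected_fdr <- sapply(thresholds, function(t) mean(1 - posterior[posterior >= t]))

print(data.frame(threshold = thresholds, kept = kept,
                 expected_fdr = round(expected_fdr, 3)))
```

In this toy example, moving from 0.85 to 0.855 drops a single pair, while moving to 0.95 halves the number of kept pairs.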

If you want a simpler solution, then I recommend running the code without blocking since it does the job without warnings and errors and in less than 1 hour.

kslungaardmumma commented 6 months ago

This has all been exceptionally helpful - thank you so much! I will take a look at how matching differs across different specifications.

Unfortunately, I don't have other variables I can use for matching.

However, I should note that I use fastLink as a "first pass" to generate matches. I then only accept matches that meet certain criteria (including exact matching on last name, birth date, and/or full name) to further refine the matches I accept as "true." Given that, I may be less concerned about false positives in the output from fastLink than other users and more willing to accept some (inevitable) measurement error.
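A second-pass filter of this kind might look like the sketch below. The data are made up, the column names follow the posted script, and the acceptance rule (exact last name, or exact birth month and day) is an assumed example rather than the rule actually used in this project.

```r
# Hypothetical first-pass output; column names follow the posted script.
matches_toy <- data.frame(
  lastname       = c("LOPEZ GARCIA", "KIM", "SMITH", "JONES"),
  studlastname   = c("LOPEZ", "KIM", "SMYTH", "JONES"),
  birthmonth     = c(3, 7, 11, 1),
  studbirthmonth = c(3, 7, 11, 2),
  birthday       = c(14, 2, 30, 9),
  studbirthday   = c(14, 2, 29, 9),
  stringsAsFactors = FALSE
)

same_last  <- matches_toy$lastname == matches_toy$studlastname
same_birth <- matches_toy$birthmonth == matches_toy$studbirthmonth &
  matches_toy$birthday == matches_toy$studbirthday

# Accept a pair only if it agrees exactly on last name OR on birth month and day.
# which() drops rows where the comparison is NA because a field is missing.
accepted <- matches_toy[which(same_last | same_birth), ]
print(accepted)
```

Here the "SMITH"/"SMYTH" pair is rejected because neither criterion holds, while the two-part last name pair survives on birth date alone.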

I will play around with this some more and see if I can land on the solution that seems to output matches that meet my needs (ideally minimizing pesky warnings).

Thanks!

Kirsten


kslungaardmumma commented 6 months ago

One follow-up here: I have converted warnings into errors. It is clear the error comes from the fastLink call. It occurs in some cases (not all), even when the block groups are large for both dfA and dfB.

The output looks like this:

"Running the EM algorithm
Iteration number 100....
Maximum difference in log-likelihood = 0.2412
Iteration number 5000
Maximum difference in log-likelihood = 0.2412
Error in emlinkMARmov(patterns = counts, nobs.a = nr_a, nobs.b = nr_b, : (converted from warning) The EM algorithm has run for the specified number of iterations but has not converged yet."

So it seems like the EM algorithm is not converging in some cases -- it certainly could be related to calculating very small probabilities. This is a case where there are very few matches that are likely to be found. The warning doesn't kill the function and a (small) number of matches are found, even for the blocks where this warning appears to occur. I tried fiddling with tol.em and that did not seem to make a difference. Excluding missing variables also didn't consistently help.

Am I right that what this means is that the algorithm has not found a stable solution, but it is just outputting whatever it has at the end of the specified number of iterations (5000)? As a "good enough" solution, this might do -- I am getting match rates that are in line with my expectations for these "low match" situations.
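The general pattern in question, an EM loop that stops at an iteration cap and returns its last estimate with a warning, can be illustrated with a toy example. The sketch below is not fastLink's `emlinkMARmov`; it is a deliberately simple EM that estimates only the mixing weight of two normal components whose densities are assumed known, with made-up data, tolerance, and iteration limits.

```r
set.seed(1)
# Toy data: 5% of draws come from N(3, 1), the rest from N(0, 1).
x <- c(rnorm(950, mean = 0), rnorm(50, mean = 3))

# EM for the mixing weight only; both component densities are treated as known.
em_weight <- function(x, lambda = 0.5, tol = 1e-8, maxit = 1000) {
  f1 <- dnorm(x, mean = 3)
  f0 <- dnorm(x, mean = 0)
  ll_old <- -Inf
  converged <- FALSE
  for (iter in seq_len(maxit)) {
    mix <- lambda * f1 + (1 - lambda) * f0
    ll <- sum(log(mix))
    # Stop once the log-likelihood changes by less than the tolerance.
    if (abs(ll - ll_old) < tol) {
      converged <- TRUE
      break
    }
    ll_old <- ll
    # E-step (posterior membership) and M-step (its mean) in one line.
    lambda <- mean(lambda * f1 / mix)
  }
  if (!converged) warning("EM stopped at maxit without converging")
  list(lambda = lambda, iterations = iter, converged = converged)
}

fit_full   <- em_weight(x)
fit_capped <- suppressWarnings(em_weight(x, maxit = 3))

print(fit_full)
print(fit_capped)
```

The capped run still returns an estimate, but it is simply wherever the iterations happened to stop, so it can differ from the converged value. Comparing estimates from runs with different iteration caps is one way to judge how much that matters in a given block.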


aalexandersson commented 6 months ago

My understanding is that the EM model must converge to have valid, stable results. Does the EM model converge when there is no blocking? Does the model converge when you remove the problematic variable middlein?