kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Guidance on improving chances EM algorithm will converge? #61

Open zross opened 2 years ago

zross commented 2 years ago

@tedenamorado and @kosukeimai thanks so much for making all of this hard work available in this R package! I'm wondering if you have published any guidance or suggestions on which situations lead the EM algorithm to fail to converge.

Unfortunately, my data is not shareable so I'm having trouble giving you a reprex but, broadly, I'm linking birth data with hospitalization data for many different years and I'm having trouble pinpointing what is causing a failure to converge. Sometimes it converges, sometimes it doesn't.

It does seem that if I exclude any record with any NA value I get convergence more often. But I'd really like to keep these records and the proportion of NA in the variables (max 4.5%) does not "seem" too high. Excluding NA values, in any case, is not a solution that works often.

I'm running the linkage, in many cases, on a 200k subsample in my efforts to figure out where the issue is. Some facts:

  1. In most cases, I'm using DOB, last name, first name, race and municipal code
  2. None of these variables is more than 4.5% missing

Any guidance on what I might do to improve the chances the EM algorithm will converge?

lnk <- fastLink::fastLink(
  dfA = dfA,
  dfB = dfB,
  varnames = c("lk_dob", "lk_last", "lk_first", "lk_race", "lk_muni_res"),
  # dob as string match with cut of 0.95 will give a match for a
  # one-digit difference in last few numbers
  stringdist.match = c("lk_dob", "lk_last", "lk_first"),
  cut.a = 0.95,
  dedupe.matches = FALSE,
  threshold.match = 0.975,
  verbose = TRUE
)

aalexandersson commented 2 years ago

Disclaimer: I am a regular user, not a fastLink developer.

Does a simpler model without partial matching converge more reliably?

I would use age instead of date of birth (dob). Does dropping the race variable lead to more frequent convergence? (I work for the Florida cancer registry and almost never link on race because it is not reliable enough as a linkage variable.) It would help if you could add a linkage variable with more values such as SSN (in the US) or street number+ZIP code.
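One way to compare candidate linkage variables on the two properties discussed here (share of missing values and number of distinct values) is a small profiling helper. This is a minimal base R sketch on invented toy data; `profile_link_vars` is a hypothetical helper, not part of fastLink, and only the column names come from the thread.

```r
# Toy data: column names follow the thread, values are invented.
dfA <- data.frame(
  lk_last = c("smith", "jones", NA, "garcia", "smith"),
  lk_race = c("1", "1", "2", NA, "1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each linkage variable, report the percentage of
# NA values and the number of distinct non-missing values.
profile_link_vars <- function(df, vars) {
  data.frame(
    variable    = vars,
    pct_missing = vapply(vars, function(v) 100 * mean(is.na(df[[v]])),
                         numeric(1), USE.NAMES = FALSE),
    n_distinct  = vapply(vars, function(v) length(unique(na.omit(df[[v]]))),
                         integer(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

profile <- profile_link_vars(dfA, c("lk_last", "lk_race"))
print(profile)
```

A variable with few distinct values and a high share of missing values adds little discriminating power, so it is a natural first candidate to drop when experimenting.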

zross commented 2 years ago

I appreciate this. A few answers:

  1. I don't find partial vs non-partial matching changes things
  2. I can't use age since we might be, for example, looking at the mom in 1973 and seeing a hospitalization in 1980, so the age changes. I did compute year of birth from age, but as this is less precise I'd prefer to use DOB
  3. Dropping race does not help consistently. I'll try looking again. Agreed it's not a reliable linkage variable but it can be a tiebreaker or add confidence. It's the variable with the most NA in many cases so I can do more experimentation.
  4. Yes! SSN, street etc would help. Unfortunately, especially in older births these variables don't exist. My kingdom for stronger linkage variables!

aalexandersson commented 2 years ago

Exact matching is much faster and simpler to compute, so it should converge without problems. How many exact matches are there?

How much is the overlap between the two datasets? fastLink struggles if you have close to 0% or 100% overlap. Imbalance matters too -- how large are the two datasets?
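A quick way to get a feel for the number of exact matches, and for how small the overlap is relative to the two dataset sizes, is a plain join on the linkage variables before running fastLink at all. The sketch below uses base R `merge` on invented toy data; the choice of `key_vars` is an assumption, and the exact-match count is only a rough indication of overlap, not an estimate of the true match rate.

```r
# Toy data: column names follow the thread, values are invented.
dfA <- data.frame(
  lk_dob   = c("1950-01-02", "1951-03-04", "1952-05-06", "1953-07-08"),
  lk_last  = c("smith", "jones", "garcia", "lee"),
  lk_first = c("ann", "beth", "carla", "dee"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  lk_dob   = c("1950-01-02", "1952-05-06", "1960-09-10"),
  lk_last  = c("smith", "garcia", "kim"),
  lk_first = c("ann", "karla", "eve"),
  stringsAsFactors = FALSE
)
key_vars <- c("lk_dob", "lk_last", "lk_first")

# Drop rows with any NA in the keys first: merge() would otherwise
# treat NA as a value that matches another NA.
a <- unique(dfA[complete.cases(dfA[key_vars]), key_vars])
b <- unique(dfB[complete.cases(dfB[key_vars]), key_vars])

# Distinct key combinations that appear in both datasets.
exact   <- merge(a, b, by = key_vars)
n_exact <- nrow(exact)

summary_counts <- c(
  n_A = nrow(dfA),
  n_B = nrow(dfB),
  n_exact = n_exact,
  share_of_smaller = n_exact / min(nrow(dfA), nrow(dfB))
)
print(summary_counts)
```

If `n_exact` is tiny relative to both datasets, the low-overlap situation described above is a plausible explanation for the convergence trouble.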

Did you count all missing as missing for sure? Often administrative datasets have hard-coded values such as 99 for missing which need to be recoded to NA before using fastLink.
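Recoding hard-coded missing values can be done in one pass per column. The following is a minimal base R sketch on invented data; the sentinel codes in `sentinels` and the helper `recode_missing` are hypothetical and should be replaced with whatever the data dictionary for each source actually specifies.

```r
# Toy data with typical "missing" placeholders: values are invented.
dfB <- data.frame(
  lk_race     = c("1", "99", "2", "", "9"),
  lk_muni_res = c("0101", "9999", "0203", "0101", "   "),
  stringsAsFactors = FALSE
)

# Hypothetical sentinel codes per column.
sentinels <- list(
  lk_race     = c("9", "99"),
  lk_muni_res = c("9999")
)

# Hypothetical helper: trim whitespace, then set empty strings and
# sentinel codes to NA in each listed column.
recode_missing <- function(df, sentinels) {
  for (v in names(sentinels)) {
    x <- trimws(as.character(df[[v]]))
    x[which(x == "" | x %in% sentinels[[v]])] <- NA
    df[[v]] <- x
  }
  df
}

dfB_clean <- recode_missing(dfB, sentinels)

# Share of missing values per column after recoding.
print(colMeans(is.na(dfB_clean)))
```

Comparing the missing share before and after recoding is a cheap check for sentinel values that slipped through.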

Is birth sex available as a linkage variable?

zross commented 2 years ago

I really appreciate the time you've put in here, thank you.

There is very little overlap in many cases. So only a few new moms from 1980 would show up in hospitalization data from, say, 1990 in the same state. That could very well be a big part of the issue. In my initial testing I was testing on data that was closer in time assuming it would work as the gap got larger. So in initial testing I had convergence in many cases.

Answers to your questions:

  1. I'm not sure about exact matches in these datasets since I've been including partials. I will take a look and see.
  2. Yes, missing are missing. We actually have 16 different types of administrative data, live birth, fetal death, hospitalizations and mortality (with different time slices, each of which has a different format) so it took a lot of time to recode the "99", empty strings etc. It's not impossible something slipped through but I have a test/review in place for the linkage variables and I don't think this is an issue.
  3. I don't need birth sex at this point. I'm looking at moms so, of course, female in live birth data and I limit in hospitalization to female.

I'll experiment with removing the string matches, but I suspect you're right that in some of these datasets there will be very few true matches and this will be an issue.

aalexandersson commented 2 years ago

  1. The easiest way probably is to look at $patterns.w as described at https://github.com/kosukeimai/fastLink. I am interested in the rows with positive weights and "gammas" with value 2.
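Filtering that table for positive weights and full agreement might look like the sketch below. It runs on a toy data frame standing in for `$patterns.w`; the column names (`gamma.*`, `counts`, `weights`) and the coding of 2 as agreement, 0 as disagreement and NA as missing are assumptions about the shape of the real output, so check them against your own fastLink object first.

```r
# Toy stand-in for the patterns table: values are invented and the
# column names are an assumption about the real output.
patterns <- data.frame(
  gamma.1 = c(2, 2, 0, 2, 0),
  gamma.2 = c(2, 0, 0, 2, NA),
  gamma.3 = c(2, 2, 0, NA, 0),
  counts  = c(12, 40, 90000, 3, 500),
  weights = c(18.2, 1.4, -9.7, 11.0, -6.3)
)
gamma_cols <- grep("^gamma", names(patterns), value = TRUE)

# Patterns with a positive weight, strongest first.
positive <- patterns[patterns$weights > 0, ]
positive <- positive[order(-positive$weights), ]
print(positive)

# Patterns where every non-missing field agrees (gamma == 2),
# and the number of record pairs that fall into them.
all_agree <- apply(patterns[gamma_cols], 1,
                   function(g) all(g[!is.na(g)] == 2))
n_agree_pairs <- sum(patterns$counts[all_agree])
print(n_agree_pairs)
```

Pasting the filtered rows into the thread would also give other readers concrete output to comment on without sharing any record-level data.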

I am concerned that you will not be able to get useful results without stronger linkage variables.

bengoehring commented 1 year ago

Hi @zross -- just wanted to follow up to see if you gleaned any more tips for getting the EM algorithm to converge. Thanks!

zross commented 1 year ago

Not really. The missing values definitely play a role sometimes, and it seems like over 20% or 30% missing will be a problem, but not all of the non-convergence seemed related to this.

aalexandersson commented 1 year ago

Closed issue #30 seems similar, and there Ted gave some additional advice not yet mentioned here, e.g., changing the tolerance criteria. However, to me, the basic issue here still is that we have no output to comment on.

Re the amount of missing data, my experience is the same: it causes a convergence issue only if it is over 30% or so.
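For the tolerance suggestion, the `fastLink()` wrapper has a `tol.em` argument for the EM convergence tolerance (I believe the default is 1e-04). The sketch below repeats the call from the first post with a looser tolerance; the value 1e-03 and the choice to loosen rather than tighten are assumptions for illustration, not the advice given in issue #30. The call is guarded so the block only runs the linkage when fastLink is installed and `dfA` and `dfB` exist in the session.

```r
# Assumed value for illustration: looser than the presumed default of 1e-04.
tol_em <- 1e-03

link_vars   <- c("lk_dob", "lk_last", "lk_first", "lk_race", "lk_muni_res")
string_vars <- c("lk_dob", "lk_last", "lk_first")

# Run only when the package and both data frames are available.
can_run <- requireNamespace("fastLink", quietly = TRUE) &&
  exists("dfA") && exists("dfB")

if (can_run) {
  lnk <- fastLink::fastLink(
    dfA = dfA,
    dfB = dfB,
    varnames = link_vars,
    stringdist.match = string_vars,
    cut.a = 0.95,
    dedupe.matches = FALSE,
    threshold.match = 0.975,
    tol.em = tol_em,   # EM convergence tolerance
    verbose = TRUE
  )
}
```

A looser tolerance only changes when the algorithm stops, so results from such a run deserve the same scrutiny of the match patterns as any other.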

tedenamorado commented 1 year ago

Hi,

Having a large number of missing values in one field can affect the model's ability to converge since it must rely on the available information. Another issue is when merging many fields that only have a few possible values, such as race or gender. In such cases, the model will rely on fields that provide more discriminating power, like first and last names.

One suggestion is to use partial matching instead of binary comparison for string-valued fields. Another idea is to provide different starting values for the relevant parameters. Currently, our fastLink wrapper function does not have an argument for different starting values, but we are revising it and plan to add them to the new version we will release this summer.
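Partial matching in `fastLink()` is switched on per field through the `partial.match` argument, with `cut.p` as the lower similarity threshold for a partial agreement. The sketch below adapts the call from the first post; which fields to treat as partial and the `cut.p` value of 0.88 are assumptions for illustration. The call is guarded so it only runs when fastLink is installed and `dfA` and `dfB` exist in the session.

```r
link_vars    <- c("lk_dob", "lk_last", "lk_first", "lk_race", "lk_muni_res")
string_vars  <- c("lk_dob", "lk_last", "lk_first")
# Assumed choice: allow a partial-agreement category for the name fields only.
partial_vars <- c("lk_last", "lk_first")

# Fields compared with partial matching must also be string-distance fields.
stopifnot(all(partial_vars %in% string_vars))

# Run only when the package and both data frames are available.
can_run <- requireNamespace("fastLink", quietly = TRUE) &&
  exists("dfA") && exists("dfB")

if (can_run) {
  lnk <- fastLink::fastLink(
    dfA = dfA,
    dfB = dfB,
    varnames = link_vars,
    stringdist.match = string_vars,
    partial.match = partial_vars,
    cut.a = 0.95,   # similarity at or above this counts as agreement
    cut.p = 0.88,   # assumed value: lower bound for partial agreement
    dedupe.matches = FALSE,
    threshold.match = 0.975,
    verbose = TRUE
  )
}
```

Note that the call in the first post sets `stringdist.match` without `partial.match`, so the name fields there are compared as agree or disagree at `cut.a` rather than with a separate partial category.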

If anything, do not hesitate to let us know.

All my best,

Ted