kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Using reweight.names in fastLink() returns only completely NA rows #62

Open brittlh opened 2 years ago

brittlh commented 2 years ago

I've run fastLink() both with and without the reweight.names option to confirm that the data otherwise match without issue.

Code:

fastLink(
  dfA = dfA, dfB = dfB,
  varnames = c("first", "last", "company"),
  stringdist.match = c("first", "last", "company"),
  stringdist.method = "lv",
  return.df = TRUE,
  reweight.names = TRUE,
  firstname.field = "first",
  dedupe.matches = FALSE,
  verbose = TRUE
)

The matched data output includes NA cases; in those rows, every field is NA. (The original post showed a screenshot of the output here.)

Any idea what's gone wrong here? Thank you for looking into this.
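
A minimal sketch of how the all-NA rows described above could be counted and set aside, using base R only. The helper `all_na_rows()` and the toy `matched` table are hypothetical stand-ins; with `return.df = TRUE`, the same check could presumably be applied to the returned match data frames, but that has not been tested against this report.

```r
# Hypothetical helper: flag rows in which every column is NA.
all_na_rows <- function(df) {
  rowSums(!is.na(df)) == 0
}

# Toy stand-in for a returned match table (not the reporter's data).
matched <- data.frame(
  first   = c("ann", NA, "bob", NA),
  last    = c("lee", NA, NA, NA),
  company = c("acme", NA, "initech", NA),
  stringsAsFactors = FALSE
)

flag <- all_na_rows(matched)

# Number of rows that are NA in every field.
n_all_na <- sum(flag)

# Rows with at least one non-NA field; partially missing rows are kept.
matched_clean <- matched[!flag, , drop = FALSE]
```

Counting the all-NA rows with and without `reweight.names = TRUE` on the same inputs would show whether the option alone changes the number.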

tedenamorado commented 2 years ago

Hi,

Your code looks OK. Do you happen to have a reproducible example you could share with us? More than happy to take a look.

All my best,

Ted

brittlh commented 2 years ago

I wasn't able to create a reproducible scaled-down example, so I took a simple random sample (SRS) of the two datasets I'm working with (10% of each) and tried again. This time I got 18 rows back: 8 were all-NA rows and 10 were matched rows. Is it possible the issue is linked to the size of the datasets? (dfA has about 1k rows, dfB about 220k.)
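
A sketch of the 10% simple random sample described above, in base R. The helper `sample_rows()`, the seed, and the toy data frames are assumptions made for illustration; they are not the reporter's code or data.

```r
set.seed(62)  # arbitrary seed, only so the same sample can be redrawn

# Hypothetical helper: simple random sample of rows, without replacement.
sample_rows <- function(df, frac) {
  n <- max(1, floor(nrow(df) * frac))
  df[sample.int(nrow(df), n), , drop = FALSE]
}

# Toy stand-ins for dfA and dfB (much smaller than the real dfB).
dfA_toy <- data.frame(id = 1:1000,
                      first = sample(letters, 1000, replace = TRUE),
                      stringsAsFactors = FALSE)
dfB_toy <- data.frame(id = 1:5000,
                      first = sample(letters, 5000, replace = TRUE),
                      stringsAsFactors = FALSE)

dfA_small <- sample_rows(dfA_toy, 0.10)
dfB_small <- sample_rows(dfB_toy, 0.10)
```

One caveat with this design: when both files are sampled independently at 10%, a true one-to-one match survives only if both of its records are drawn, which happens with probability 0.1 × 0.1 = 0.01. Sampling only one file, or sampling known matched pairs together, keeps more overlap in the scaled-down example.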

tedenamorado commented 1 year ago

Hi,

Are there NAs in the name variable?

All my best,

Ted

brittlh commented 1 year ago

Ted,

I did the check: there are no NAs, but there were 2 blank strings (""). I filtered those out for testing, reran fastLink, and got the same result as I described above.

Appreciate your help. I'm going to keep looking into this in my spare time and see if any other data anomalies catch my attention that might trigger this issue.
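
A sketch of the check described above: counting NA and blank values in each linkage field, then recoding blanks to NA. The field names come from the original call; the helper `is_blank()` and the toy data are hypothetical. Whether blank strings play any role in this issue is not established in the thread.

```r
# Toy stand-in for one of the input files (not the reporter's data).
dfA_toy <- data.frame(
  first   = c("ann", "", "bob", NA, "  "),
  last    = c("lee", "kim", "", "cho", "ray"),
  company = c("acme", "acme", "initech", "", "hooli"),
  stringsAsFactors = FALSE
)
fields <- c("first", "last", "company")

# Hypothetical helper: TRUE for empty or whitespace-only strings, FALSE for NA.
is_blank <- function(x) !is.na(x) & trimws(x) == ""

# Per-field counts of NA values and of blank strings.
na_counts    <- vapply(dfA_toy[fields], function(x) sum(is.na(x)), integer(1))
blank_counts <- vapply(dfA_toy[fields], function(x) sum(is_blank(x)), integer(1))

# One option: recode blanks to NA so they are handled as missing values
# rather than as a real (empty) name.
dfA_recode <- dfA_toy
dfA_recode[fields] <- lapply(dfA_recode[fields], function(x) {
  x[is_blank(x)] <- NA
  x
})
```

Recoding keeps the rows in the data, unlike filtering them out, so row counts stay comparable between runs.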

aalexandersson commented 1 year ago

Disclaimer: I am a regular fastLink user, not a fastLink developer.

Is the scaled-down dataset dfA about 1K rows or about 100 rows? Do the datasets look fine after you read them in? Approximately how much missingness is there? How many exact matches are there? Can you show the linkage patterns for the 18 returned rows? Little or no overlap could be the cause...

Anders
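
Two of the questions above, the share of missing values and the number of exact matches, can be answered with base R before running any linkage. This is a sketch on toy data; the data frames are hypothetical and only the field names are taken from the original call.

```r
fields <- c("first", "last", "company")

# Toy stand-ins for the two input files (not the reporter's data).
dfA_toy <- data.frame(
  first   = c("ann", "bob", "cat", NA),
  last    = c("lee", "kim", "ray", "cho"),
  company = c("acme", "initech", NA, "hooli"),
  stringsAsFactors = FALSE
)
dfB_toy <- data.frame(
  first   = c("ann", "bob", "dan"),
  last    = c("lee", "kim", "poe"),
  company = c("acme", "initek", "hooli"),
  stringsAsFactors = FALSE
)

# Share of missing values per linkage field in dfA.
miss_A <- colMeans(is.na(dfA_toy[fields]))

# Exact matches on all three fields. Rows with any NA are dropped first,
# because merge() would otherwise treat NA as equal to NA.
A_complete <- dfA_toy[complete.cases(dfA_toy[fields]), fields]
B_complete <- dfB_toy[complete.cases(dfB_toy[fields]), fields]
exact   <- merge(A_complete, B_complete, by = fields)
n_exact <- nrow(exact)
```

A very small `n_exact` relative to the size of dfA would be consistent with the little-or-no-overlap explanation, although it does not confirm it.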