kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Matching Dates #52

Open geneh0 opened 3 years ago

geneh0 commented 3 years ago

I have a column converted to date format (without time) using as.Date(), lubridate::as_date(), and even as.Date.character().

> df$date_of_contact[1]
[1] "2020-11-15"
> class(df$date_of_contact)
[1] "Date"

When I try to use that as a variable in fastLink, it throws the error:

Error in charToDate(x) : 
  character string is not in a standard unambiguous format

How do I include dates as part of the match? I'm trying to do an exact match to deduplicate a dataframe.

aalexandersson commented 3 years ago

Disclaimer: I am a regular user, not a developer, of fastLink.

The problem is discussed on Stack Overflow here. The easiest solution for you might be to use a character type, not a date type. Try this before running fastLink:

df$date_of_contact <- as.character(df$date_of_contact)

geneh0 commented 3 years ago

That's what I currently have in my script, but I'm not sure what the best way of doing it would be if it were an inexact match, say up to a 13-day difference (updated my original post). String distance matching doesn't seem to make sense, since transposed digits don't really have any significance in a date.

aalexandersson commented 3 years ago

I would convert dates to a numeric variable such as total_days, as discussed on Stack Overflow here. Then you can use fastLink with range arguments for a numeric match, such as

numeric.match = "total_days", cut.a.num = 0.4, cut.p.num = 6.5
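
A minimal sketch of that conversion, assuming a toy data frame. The `id` column, the sample dates, and the `dedupe_by_date()` wrapper are hypothetical names made up for illustration, and the cutoffs are simply the values from the line above; I have not verified how fastLink interprets them.

```r
# Toy data frame; the column names and values are made up for illustration.
df <- data.frame(
  id = 1:3,
  date_of_contact = as.Date(c("2020-11-15", "2020-11-20", "2021-01-02"))
)

# A Date is stored as the number of days since 1970-01-01,
# so as.numeric() gives a day count that can be compared numerically.
df$total_days <- as.numeric(df$date_of_contact)

# Gap between the first two records, in days.
gap <- df$total_days[2] - df$total_days[1]
print(gap)  # [1] 5

# Hypothetical wrapper (not part of fastLink): deduplicate by passing the
# same data frame as dfA and dfB. The default cutoffs are the ones quoted above.
dedupe_by_date <- function(data, varnames = "total_days",
                           cut.a.num = 0.4, cut.p.num = 6.5) {
  fastLink::fastLink(
    dfA = data, dfB = data,
    varnames = varnames,
    numeric.match = "total_days",
    cut.a.num = cut.a.num,
    cut.p.num = cut.p.num
  )
}
```

Calling `dedupe_by_date(df)` requires fastLink to be installed. In practice you would pass your other matching variables in `varnames` as well, since a single variable on three rows is unlikely to give a meaningful model.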

In the special case where your date variable is a date of birth, you can instead calculate age using eeptools::age_calc().

The documentation for cut.a.num and cut.p.num says

cut.a.num Lower bound for full numeric match. Default is 1
cut.p.num Lower bound for partial numeric match. Default is 2.5

I have two concerns with that documentation:

  1. Why is cut.a.num = 0 not allowed? The resulting error is not user-friendly:

    Error in { : 
    task 1 failed - "error in evaluating the argument 'x' in selecting a method for function 'which': subscript out of bounds" 
  2. What is the upper bound for cut.a.num and cut.p.num (that is, the opposite of the "Lower bound")?

I think you mention a relatively simple but practically important issue which deserves better documentation.