Open geneh0 opened 3 years ago
Disclaimer: I am a regular user, not a developer, of fastLink.
The problem is discussed on Stack Overflow here. The easiest solution for you might be to use a character type, not a date type. Try this before running fastLink:
df$date_of_contact <- as.character(df$date_of_contact)
That's what I currently have in my script, but I'm not sure what the best approach would be for an inexact match, say up to a 13-day difference (I updated my original post). String-distance matching doesn't seem to make sense, since transposed digits don't really have any significance.
I would convert dates to a numeric variable such as total_days, as discussed on Stack Overflow here. Then you can use fastLink with range arguments for a numeric match, such as
numeric.match = "total_days", cut.a.num = 0.4, cut.p.num = 6.5
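For what it's worth, here is a minimal sketch of the conversion step in base R. The example data and the reference date `origin` are assumptions made up for illustration; only the column names date_of_contact and total_days and the fastLink arguments come from this thread. The fastLink call is left as a comment because I have not verified how these cutoffs map to a window of days.

```r
# Hypothetical example data; only the column name comes from the thread.
df <- data.frame(
  date_of_contact = as.Date(c("2020-01-01", "2020-01-10", "2020-03-15"))
)

# Days elapsed since a fixed reference date. The origin is arbitrary
# (an assumption): only differences between records matter for matching.
origin <- as.Date("2000-01-01")
df$total_days <- as.numeric(df$date_of_contact - origin)

# The numeric column can then be passed to fastLink, e.g. (not run here;
# "name" is a hypothetical second matching variable):
# fastLink::fastLink(dfA, dfB,
#                    varnames = c("name", "total_days"),
#                    numeric.match = "total_days",
#                    cut.a.num = 0.4, cut.p.num = 6.5)
```

Subtracting two Date objects gives a difference in days, so total_days is a plain number that no longer trips the date-type error.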
In the special case where your date variable is a date of birth, you can instead calculate age using eeptools::age_calc().
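If you would rather not add a dependency, a rough base-R equivalent for age in completed years could look like the sketch below. This is not eeptools::age_calc() itself; the helper name `age_years` and its arguments are hypothetical.

```r
# Hypothetical helper: age in completed years on a given date.
age_years <- function(dob, on = Sys.Date()) {
  dob <- as.POSIXlt(dob)
  on <- as.POSIXlt(on)
  age <- on$year - dob$year
  # Subtract one year if the birthday has not yet occurred in the year of `on`.
  not_yet <- (on$mon < dob$mon) | (on$mon == dob$mon & on$mday < dob$mday)
  age - as.integer(not_yet)
}

age_years(as.Date("1990-06-15"), as.Date("2020-06-14"))
```

The resulting age column is numeric, so it can be used with numeric.match in the same way as total_days.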
The documentation for cut.a.num and cut.p.num says:
cut.a.num Lower bound for full numeric match. Default is 1
cut.p.num Lower bound for partial numeric match. Default is 2.5
I have two concerns with that documentation:
1. Why is cut.a.num = 0 not allowed? The resulting error is not user-friendly:
Error in { :
task 1 failed - "error in evaluating the argument 'x' in selecting a method for function 'which': subscript out of bounds"
2. What is the upper bound for cut.a.num and cut.p.num (that is, the opposite of the "Lower bound")?
I think you mention a relatively simple but practically important issue which deserves better documentation.
I have a column converted to date format (without time) using as.Date(), lubridate::as_date(), and even as.Date.character(). When I try to use that as a variable in fastLink, it throws the error:
How do I include dates as part of the match?
I'm trying to do an exact match to deduplicate a dataframe.