felixhaass closed this issue 5 years ago
@felixhaass Thanks for this, and sorry about the delay in responding! We're working on this in a separate branch (referenced above) and should push a fix to GitHub in the next few days. The fix will be in the next CRAN release as well. We really appreciate the catch.
Great, thanks for the reply & fixing the issue. Looking forward to the release!
Hi,
thanks for the great package & the terrific documentation. It's extremely useful and I think many people are going to use it--I definitely will!
I'm having issues when deduplicating a dataset where I have only a few variables to match on. The goal is to generate a common ID for observations with a very similar string identifier. When I follow the deduplication procedure you sketch in the README, but remove all numeric variables and keep only a handful of the string variables, the `getMatches()` function produces a warning and the common ID is scrambled.
Reproducible Example
Here's a reproducible example to illustrate the problem:
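A minimal sketch of that setup, assuming the `samplematch` data bundled with fastLink and README-style object names (`fl_out_dedupe`, `dfA_dedupe`). The seed, the number of appended duplicates, and the `stringdist.match` choice are illustrative assumptions, not necessarily the exact values used:

```r
library(fastLink)

# Load the example data shipped with the package (provides dfA and dfB).
data(samplematch)

# Assumption: append a random sample of rows so that dfA contains duplicates.
set.seed(1)
dfA <- rbind(dfA, dfA[sample(seq_len(nrow(dfA)), 10, replace = FALSE), ])

# Deduplicate by linking dfA to itself on three string variables only.
fl_out_dedupe <- fastLink(
  dfA = dfA, dfB = dfA,
  varnames = c("firstname", "lastname", "city"),
  # Assumption: string-distance matching on the two name fields.
  stringdist.match = c("firstname", "lastname")
)

# Build the deduplicated data frame; the common ID is stored in dedupe.ids.
dfA_dedupe <- getMatches(dfA = dfA, dfB = dfA, fl.out = fl_out_dedupe)
```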
This is basically the code from the "deduplication" section of the README file, but I've reduced the matching variables to only three: `firstname`, `lastname`, and `city`.
Problem description
Running
however, results in the following message:
When we look at the resulting data frame, it's clear that the matched IDs are somehow wrongly assigned:
Clearly, joseph and david don't match on any of the chosen variables.
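One way to check for this kind of mismatch systematically rather than by eye is sketched below in base R. The data frame is made up for illustration (it is not output from the package), and `flag_suspect_ids()` is a hypothetical helper name. It only tests for exact agreement, so it flags ID groups whose rows share no identical value on any matching variable:

```r
# Hypothetical deduplicated output; dedupe.ids is the common ID column.
deduped <- data.frame(
  firstname  = c("joseph", "david", "joseph", "josef"),
  lastname   = c("smith", "jones", "miller", "miller"),
  city       = c("austin", "boston", "denver", "denver"),
  dedupe.ids = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)

match_vars <- c("firstname", "lastname", "city")

# Return the IDs whose rows agree exactly on none of the matching variables.
flag_suspect_ids <- function(df, vars, id_col = "dedupe.ids") {
  groups <- split(df[, vars, drop = FALSE], df[[id_col]])
  suspect <- vapply(groups, function(g) {
    # A group with a single row cannot be a wrong merge.
    if (nrow(g) < 2) return(FALSE)
    # Suspect if no variable has a repeated value within the group.
    !any(vapply(g, function(col) anyDuplicated(col) > 0, logical(1)))
  }, logical(1))
  names(suspect)[suspect]
}

flag_suspect_ids(deduped, match_vars)
```

In this toy data, ID 1 pairs a joseph with a david who differ on every variable, so it is flagged. ID 2 is not flagged because both rows share the same last name and city.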
User-written function works
Interestingly, this problem seems connected to #36. In #36, @mbcann01 provides a user-written function to extract matched pairs from the `fastLink` object.
Specifically, if we run the `fmr_add_unique()` function provided in #36 and follow the procedure described there, we can retrieve the correct IDs.
gives us
where `id` indicates the common ID for duplicated matches (similar to `dedupe.ids`). Here the correct (i.e. the most similar) josephs are matched. (I know they're not the "correct" match, but the goal was to find the most similar ones.)
Summary
Since the user-written function from #36 correctly retrieves the IDs for the most similar matches, the problem seems to lie in how `getMatches()` constructs the deduplicated data.frame, not in the matching process itself.
Let me know if I can provide any additional info / code to help you fix this issue--if there is indeed an issue and I'm not doing something wrong here.
Thanks again for your work!
Best,
Felix