sfirke opened this issue 7 years ago
I could probably create such a function, but it would likely make inefficient calls to fuzzyjoin and be quite slow. Someone else could likely write it more elegantly and make it run faster. I only work with small data, so that's less of a problem for me.
I think you have the wrong drob. I'm @danlovesproofs on twitter. (He seems cool though.)
ha I didn't realize that pasting a tweet would tag people - doh. Thanks for the heads up.
Coming up with something functional was quicker than I expected. Check out: https://gist.github.com/sfirke/c0bd2b9c4d4e044b040966841e19a73b
It's missing much of the functionality described above, but it will do the job most of the time. Maybe just start by keeping it simple like this?
I will make this a Markdown file to show the output.
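The gist's contents aren't pasted here, but the core approach is roughly the following (a sketch with my own function name and defaults, not the gist's exact code):

```r
# Self-join one column with fuzzyjoin, keep pairs within max_dist,
# and drop exact self-matches and mirrored pairs.
library(dplyr)
library(fuzzyjoin)

get_fuzzy_dupes <- function(dat, col, max_dist = 2, method = "osa") {
  dat$row_id <- seq_len(nrow(dat))
  stringdist_inner_join(dat, dat,
                        by = col,
                        max_dist = max_dist,
                        method = method,
                        distance_col = "dist") %>%
    filter(row_id.x < row_id.y) %>%  # drop self-matches and mirrored pairs
    arrange(dist)
}

get_fuzzy_dupes(data.frame(name = c("Jon Smith", "John Smith", "Jane Doe")),
                col = "name", max_dist = 2)
```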
I think handling multiple columns is too complex; it hurts my head. In that case, this could just take a vector as the input. Does that meet user needs?
Edit: as I was falling asleep, I decided that maybe two columns is okay, as long as the `stringdist_join` functions can still calculate an overall distance. I'm on the fence... does anyone have real-world multi-variable use cases to share? The gold-standard use case here is first & last names stored in two variables. In that case, the user could paste the variables together, but it would be nicer to save them that step. I struggle to think of other common use cases for finding fuzzy duplicates across 2+ variables.
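For concreteness, the paste workaround might look like this (made-up data, using the single-column sketch above):

```r
# Collapse first + last name into one key, then fuzzy-match on that
# single column instead of matching on two columns at once.
people <- data.frame(first = c("Jon", "John", "Jane"),
                     last  = c("Smith", "Smith", "Doe"))
people$full_name <- paste(people$first, people$last)
get_fuzzy_dupes(people, col = "full_name")
```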
I have some really crude code related to this (solving the problem I tweeted about). It's not "fuzzy" but approaches the problem from a different angle: it looks for exact matches, one variable at a time, in a way that guards against label switching, then forms putative groups / entities by looking for agreement across the variables. That's why I'm interested in this issue and the gist.
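The shape of it is roughly this (a simplified reconstruction, not the actual code):

```r
# Label rows by exact matches within each column, then call two rows
# the same putative entity when they agree on at least min_agree columns.
putative_groups <- function(df, min_agree = 2) {
  labels <- sapply(df, function(col) as.integer(factor(col)))
  agree <- matrix(0L, nrow(df), nrow(df))
  for (j in seq_len(ncol(labels))) {
    agree <- agree + outer(labels[, j], labels[, j], "==")
  }
  agree >= min_agree  # adjacency matrix of putative same-entity pairs
}

putative_groups(data.frame(first = c("Jon", "Jon", "Jane"),
                           last  = c("Smith", "Smith", "Doe"),
                           city  = c("Ann Arbor", "Detroit", "Detroit")))
```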
Dear @jennybc,
Here's how I was thinking about this:
The more general way is to learn weights on appropriate distance measures for each column from training data, using some supervised technique, and then match the rest.
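A crude version of that, with a made-up toy set of labeled record pairs (all names and numbers invented for illustration):

```r
# Learn per-column weights via logistic regression on labeled pairs:
# one string distance per column, plus a match/non-match label.
library(stringdist)

train <- data.frame(
  d_name    = c(0, 2, 4, 1, 5, 9),
  d_address = c(1, 0, 4, 1, 3, 7),
  is_match  = c(1, 1, 1, 0, 0, 0)
)
fit <- glm(is_match ~ d_name + d_address, data = train, family = binomial())

# score a new candidate pair with the learned weights
new_pair <- data.frame(
  d_name    = stringdist("Jon Smith", "John Smith"),
  d_address = stringdist("12 Oak St", "12 Oak St.")
)
predict(fit, new_pair, type = "response")  # estimated match probability
```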
Dear @sfirke,
As you already anticipate and indirectly state, there is no one distance measure that will work everywhere; it depends on the kinds of errors you expect, etc. The idea is to generally support lots of straightforward distance measures. Common ones include various edit distance measures, but also edit distance weighted by likely typing errors (if the data is typed; different weightings apply if it comes from voice recognition, etc.). And if we autodetect dates or numbers in some columns, we could use absolute distance for those. I built some of this from the ground up for merging shapefiles with electoral returns: http://projects.iq.harvard.edu/eda/data
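For instance, mixing an edit distance on a name with an absolute distance on a year, under arbitrary toy weights:

```r
# Combine per-column distances of different types into one record
# distance: edit distance for strings, absolute difference for numbers.
# Fields and weights are illustrative only.
library(stringdist)

record_dist <- function(a, b, w_name = 1, w_year = 0.5) {
  d_name <- stringdist(a$name, b$name, method = "osa")
  d_year <- abs(a$year - b$year)
  w_name * d_name + w_year * d_year
}

record_dist(list(name = "Jon Smith",  year = 1984),
            list(name = "John Smith", year = 1985))  # 1 + 0.5 = 1.5
```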
If I ever revisit this function, note to self: consider wrapping the OpenRefine project via https://github.com/ChrisMuir/refinr
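For instance (assuming I remember refinr's API correctly):

```r
# refinr ports OpenRefine's clustering algorithms to R;
# key_collision_merge() collapses near-duplicate strings to one value.
library(refinr)

x <- c("Acme Pipe Inc.", "ACME PIPE INC", "Acme Pipe, Inc.")
key_collision_merge(x)
```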
Inspired by https://twitter.com/JennyBryan/status/777953052129054720 (as well as @almartin82 and @chrishaid):
Input: a data.frame with column(s) specified, the name of the fuzzy matching algorithm to use (like fuzzyjoin), and an (optional) allowable distance on that metric, used as a threshold for returning a record as a duplicate.

Output: returns just the duplicate records, stored together, similar to `get_dupes`. But it also contains a unique key for each set of duplicates (maybe derived from just the first of the fuzzy duplicates) in a new variable, plus the distance from the anchor duplicate [if there are more than 2 duplicates, make one of them the anchor, maybe the most common value, with the one closest to the other two as a tiebreaker].

This would be used: (1) to examine the fuzzy duplicates, and (2) to fix them by recoding with `plyr::mapvalues` or similar. (2) assumes that it matters which near-duplicate value survives, in a way that can't be automated. If it doesn't, then also provide a convenience function that swaps the anchor duplicate value into all dupes in a cluster and removes the unique key and distance columns.
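A sketch of that convenience function, with hypothetical column names (`dupe_key`, `dist_to_anchor`, and a fuzzy-matched column `value`; the anchor row is the one at distance 0):

```r
# Within each dupe cluster, overwrite every near-duplicate with the
# anchor's value (the row with the minimum distance, i.e. 0), then
# drop the bookkeeping columns. Column names are hypothetical.
library(dplyr)

collapse_fuzzy_dupes <- function(dupes) {
  dupes %>%
    group_by(dupe_key) %>%
    mutate(value = value[which.min(dist_to_anchor)]) %>%
    ungroup() %>%
    select(-dupe_key, -dist_to_anchor)
}
```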