sfirke / janitor

simple tools for data cleaning in R
http://sfirke.github.io/janitor/

Get fuzzy duplicates #72

Open sfirke opened 7 years ago

sfirke commented 7 years ago

Inspired by https://twitter.com/JennyBryan/status/777953052129054720 (as well as @almartin82 and @chrishaid):

Any good #rstats packages or screeds on finding duplicate-ish observations? I.e. agree on several vars but not all, realism re: typos, etc.

@soodoku @abresler @drob the real missing piece is serving putative replicate groups back up to the analyst for vetting and action

Input: a data.frame with column(s) specified, the name of the fuzzy matching algorithm to use (like fuzzyjoin), and (optional) an allowable distance on that metric to be used as a threshold for returning a record as a duplicate. Output: returns just the duplicate records, stored together, similar to get_dupes. But it also contains a unique key for each set of duplicates (maybe derived from just the first of the fuzzy duplicates) in a new variable, and the distance from the anchor duplicate [if there are more than 2 duplicates, make one of them the anchor - maybe the most common value, with the one closest to the other two as a tiebreaker].
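To make that concrete, here is a rough sketch of the intended behaviour for a single column. All names (get_fuzzy_dupes, dupe_key, dist_from_anchor) are placeholders, not a settled API, and it assumes the stringdist package plus single-linkage clustering to form the groups:

```r
# Placeholder sketch only, not janitor code: cluster near-duplicate values of
# one column and return those rows with a dupe_key and a distance from each
# cluster's anchor value.  Assumes the stringdist package.
library(stringdist)

get_fuzzy_dupes <- function(dat, col_name, max_dist = 1, method = "osa") {
  vals <- as.character(dat[[col_name]])

  # pairwise string distances, then single-linkage clustering so that any
  # chain of values within max_dist ends up in the same cluster
  d  <- stringdistmatrix(vals, vals, method = method)
  cl <- cutree(hclust(as.dist(d), method = "single"), h = max_dist)

  # anchor = most common value in each cluster (ties broken arbitrarily here)
  anchors <- tapply(vals, cl, function(v)
    names(sort(table(v), decreasing = TRUE))[1])

  out <- dat
  out$dupe_key         <- cl
  out$dist_from_anchor <- stringdist(vals, anchors[as.character(cl)],
                                     method = method)

  # keep only clusters that actually contain more than one row
  keep <- cl %in% as.integer(names(which(table(cl) > 1)))
  out  <- out[keep, , drop = FALSE]
  out[order(out$dupe_key), , drop = FALSE]
}
```

This ignores the tie-breaking rule described above (anchor closest to the other members) and handles only one column, but it shows the intended shape of the output.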

This would be used:

  1. Simply to check for near-duplicates
  2. To clean the duplicates: the function populates a unique key for each set, so the analyst can make a crosswalk table off of the unique keys and swap corrected values in via plyr::mapvalues or similar.

(2) assumes that it matters which near-duplicate value survives, in a way that can't be automated. If it doesn't, then also provide a convenience function that swaps the anchor duplicate value into all dupes in a cluster and removes the unique key and distance columns.
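For use (2), the cleanup step might look something like this (purely illustrative; the `name` column, the misspellings, and the surviving values are all made up):

```r
# Hypothetical cleanup for use (2): after vetting the near-duplicates, build a
# crosswalk from the "bad" spellings to the value that should survive, then
# swap the corrections in.
dat <- data.frame(
  name = c("Smith", "Smtih", "Jones", "Jonse", "Brown"),
  stringsAsFactors = FALSE
)

crosswalk <- data.frame(
  from = c("Smtih", "Jonse"),
  to   = c("Smith", "Jones"),
  stringsAsFactors = FALSE
)

# swap the corrected values in
dat$name <- plyr::mapvalues(dat$name, from = crosswalk$from, to = crosswalk$to)
```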

sfirke commented 7 years ago

I could probably create such a function, but it would likely make inefficient calls to fuzzyjoin and be quite slow. Someone else could likely write it more elegantly and make it run faster. I only work with small data, so that's less of a problem for me.

drob commented 7 years ago

I think you have the wrong drob. I'm @danlovesproofs on twitter. (He seems cool though.)

sfirke commented 7 years ago

ha I didn't realize that pasting a tweet would tag people - doh. Thanks for the heads up.

sfirke commented 7 years ago

Coming up with something functional was quicker than I thought. Check out: https://gist.github.com/sfirke/c0bd2b9c4d4e044b040966841e19a73b

It's missing much of the functionality described above, but it will do the job most of the time. Maybe we should just start by keeping it simple like this?

I will make this a Markdown file to show output.
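Roughly, the simple single-column approach looks like this (an illustrative sketch, not the gist's actual code; it assumes the fuzzyjoin package):

```r
# Illustrative sketch of the simple single-column approach (not the gist
# itself): fuzzy-join the unique values of a column against themselves and
# report the near-duplicate pairs with their distance.
library(fuzzyjoin)

fuzzy_dupe_pairs <- function(dat, col_name, max_dist = 1, method = "osa") {
  vals <- data.frame(value = unique(as.character(dat[[col_name]])),
                     stringsAsFactors = FALSE)
  pairs <- stringdist_inner_join(
    vals, vals,
    by           = "value",
    max_dist     = max_dist,
    method       = method,
    distance_col = "dist"
  )
  # keep each near-duplicate pair once and drop exact self-matches
  subset(pairs, value.x < value.y)
}

# e.g.
# df <- data.frame(name = c("Smith", "Smtih", "Jones", "Jonse", "Brown"))
# fuzzy_dupe_pairs(df, "name", max_dist = 1)
```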

sfirke commented 7 years ago

I think supporting multiple columns is too complex; it hurts my head. In that case, this could just take a vector as the input. Does that meet user needs?

Edit: as I was falling asleep, I decided that maybe two columns is okay, as long as the stringdist_join functions can still calculate an overall distance. I'm on the fence... does anyone have some real-world multi-variable use cases to share? The gold-standard use case here is first & last names, in two variables. In that case, the user could paste the variables together, but it would be nicer to save them that step. But I struggle to think of other common use cases for finding fuzzy duplicates across 2+ variables.
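For the first & last name case, the paste-the-columns workaround would look roughly like this (column names and values made up):

```r
# Two-column (first & last name) workaround: paste the columns into one
# string, then run any of the single-column approaches sketched above on it.
people <- data.frame(
  first = c("Jane", "Jnae", "John"),
  last  = c("Doe",  "Doe",  "Smith"),
  stringsAsFactors = FALSE
)
people$full_name <- paste(people$first, people$last)

# e.g. fuzzy_dupe_pairs(people, "full_name", max_dist = 2)
```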

jennybc commented 7 years ago

I have some really crude code related to this (solving the problem I tweeted about). It's not "fuzzy" but approaches the problem from a different angle: it looks for actual matches, but one variable at a time, in a way that guards against label switching, and then forms putative groups / entities by looking for agreement across the variables. That's why I'm interested in this and the gist.
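One way to sketch that "agreement across the variables" idea (not Jenny's actual code; the threshold, column handling, and use of igraph are all assumptions):

```r
# Rough sketch of the non-fuzzy angle: link rows that agree exactly on at
# least `min_agree` of the chosen columns, then treat connected groups of
# rows as putative entities.  Assumes the igraph package.
library(igraph)

putative_groups <- function(dat, cols, min_agree = 2) {
  n     <- nrow(dat)
  pairs <- t(combn(n, 2))                       # all row pairs
  agree <- apply(pairs, 1, function(p)
    sum(dat[p[1], cols] == dat[p[2], cols], na.rm = TRUE))

  g    <- make_empty_graph(n, directed = FALSE)
  keep <- pairs[agree >= min_agree, , drop = FALSE]
  g    <- add_edges(g, as.vector(t(keep)))
  components(g)$membership                      # one putative entity id per row
}
```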

soodoku commented 7 years ago

Dear @jennybc,

Here's how I was thinking about this:

  1. Specify a distance_measure_of_choice for each column, or some aggregate distance measure across columns (say, net edit distance no greater than X, in which case you just need to concatenate the columns; see the sketch after this list). Then mark whatever observations turn up in one group as duplicates. The only feasible way to solve the problem for a lot of data is a greedy approach, I think, as results will generally vary.
  2. We could also come up with a globally optimal objective function, as in clustering, which I think is a very reasonable way of thinking about this. Again the results will vary, classically when two observations are the same distance away from two other observations.
  3. We could also cluster by one variable at a time, and then look into 'spectral clustering' or 'ensemble clustering': how often two observations land in the same cluster becomes a heuristic for solving this. Was that what you were getting at?
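A bare-bones illustration of idea (1), with made-up column names, weights, and threshold (assumes the stringdist package):

```r
# Idea (1), roughly: a per-column string distance for each chosen column,
# combined into one aggregate (net) distance that can then be thresholded.
# Column names, weights, and the threshold are illustrative only.
library(stringdist)

aggregate_dist <- function(row_i, row_j, cols, weights = rep(1, length(cols))) {
  per_col <- mapply(
    function(col, w) w * stringdist(as.character(row_i[[col]]),
                                    as.character(row_j[[col]]),
                                    method = "osa"),
    cols, weights
  )
  sum(per_col)  # net weighted edit distance across the columns
}

# rows i and j would be marked as duplicates if, say,
# aggregate_dist(dat[i, ], dat[j, ], cols = c("first", "last")) <= 2
```
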
soodoku commented 7 years ago

The more general way is to learn the weights on appropriate distance measures for each column based on training data using some supervised technique and then match the rest.

soodoku commented 7 years ago

Dear @sfirke,

As you already anticipate and indirectly state, there is no one distance measure that will work everywhere; it depends on the kinds of errors you expect, etc. The idea is to support lots of straightforward distance measures. Some common ones include the various edit distance measures, but also edit distance weighted by likely typing errors (if the data are typed; different weighting if they come from voice recognition, etc.). And if we auto-detect dates or numbers in some columns, we could use absolute distance for those. I built some of this stuff from the ground up for merging a shape file with electoral returns: http://projects.iq.harvard.edu/eda/data
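A few of those measures, for flavour (the stringdist package covers most of the string-based ones; values are made up):

```r
# Different columns can get different measures; a few quick examples.
library(stringdist)

stringdist("kitten", "sitting", method = "osa")      # edit-distance family: 3
stringdist("smith",  "smyth",   method = "jw")       # Jaro-Winkler distance, in [0, 1]
stringdist("robert", "rupert",  method = "soundex")  # 0 when the soundex codes match

# for auto-detected dates or numbers, plain absolute distance works
abs(as.Date("2016-09-19") - as.Date("2016-09-21"))
```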

sfirke commented 6 years ago

If I ever revisit this function, note to self to consider wrapping refinr, which implements OpenRefine-style string clustering and merging in R: https://github.com/ChrisMuir/refinr
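For reference, refinr works directly on a character vector (a quick, made-up example):

```r
# refinr clusters and merges near-duplicate strings OpenRefine-style, via
# key collision and n-gram merging.  The values below are made up.
library(refinr)

x <- c("Acme Pipe Inc.", "ACME PIPE INC", "Acme Pipe, Inc.", "Widgets R Us")
key_collision_merge(x)
n_gram_merge(x)
```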