Open sfirke opened 8 years ago
To point out the obvious, this is different from `distinct()` from dplyr, as that would return one record of a duplicated set; I want no records.
A month later, this does not seem useful enough to make a permanent function for.
Now I think I have a need for this again. I have two ID columns and want to understand if they are 1-to-1. Does ID A ever appear without the same ID B? It would be hierarchical, as an ID shouldn't appear with multiple values of, say, `location`, but of course the same `location` will have multiple values of `ID`.
Maybe I need a function `check_one_to_one` that takes multiple variables and checks whether there is any violation of 1:1.
Check first to see if someone else has coded this?
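A minimal base-R sketch of the check described above, for illustration only. Both `maps_to_one()` and this `check_one_to_one()` are hypothetical helpers, not janitor functions, and they assume column names are passed as strings. The one-directional helper covers the hierarchical case (each `ID` has one `location`, while a `location` may have many `ID`s); the 1:1 check simply applies it in both directions.

```r
# Hypothetical helper: TRUE when each value of `from` appears with
# exactly one distinct value of `to`.
maps_to_one <- function(df, from, to) {
  pairs <- unique(df[, c(from, to), drop = FALSE])  # distinct (from, to) pairs
  anyDuplicated(pairs[[from]]) == 0
}

# Hypothetical helper: TRUE when the mapping holds in both directions.
check_one_to_one <- function(df, a, b) {
  maps_to_one(df, a, b) && maps_to_one(df, b, a)
}

ids <- data.frame(
  id_a = c(1, 1, 2, 3),
  id_b = c("x", "x", "y", "y")
)

maps_to_one(ids, "id_a", "id_b")       # TRUE: each id_a has a single id_b
maps_to_one(ids, "id_b", "id_a")       # FALSE: "y" pairs with both 2 and 3
check_one_to_one(ids, "id_a", "id_b")  # FALSE
```

Repeated rows with the same pair (the two `1`/`"x"` rows here) are not treated as violations, since only distinct pairs are compared.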
Great work on janitor! Awesome to see Ed types building real tools.
This is actually two different issues
Question 1: Are there duplicates?
Your introduction to `get_dupes` states: "This is for hunting down and examining duplicate records during data cleaning."
Hunting down is different from examining. Hunting should be a different (and faster, re #67) function.
I'd recommend an `is_id` or `has_dupes` function instead of `check_one_to_one`. It's the same idea: are these combinations unique?
The workflow is: Are there duplicates (`has_dupes`)? If yes, what do I do about them (`get_dupes`)?
Stata has an `isid` implementation that I used for this purpose, back in my Stata days. Helpfile here.
You'd be looking for a more pipe-able, NSE version of:
```r
is_id <- function(x){
  # Count the rows that repeat an earlier row
  numdups <- sum(duplicated(x))
  if (numdups > 0){
    stop(sprintf("There are %i duplicates in %s", numdups, deparse(substitute(x))))
  }
}
```
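One way the "are there duplicates?" step might look in a more pipe-friendly form, as a rough base-R sketch. `has_dupes()` and `assert_id()` are hypothetical names, not existing janitor functions, and column names are assumed to be passed as strings rather than via NSE.

```r
# Hypothetical helper: TRUE when any combination of the named columns repeats.
has_dupes <- function(df, ...) {
  cols <- c(...)
  if (length(cols) == 0) cols <- names(df)  # default: check all columns
  anyDuplicated(df[, cols, drop = FALSE]) > 0
}

# Hypothetical helper: stop on duplicates, otherwise hand the data back
# unchanged so the check can sit in the middle of a pipeline.
assert_id <- function(df, ...) {
  if (has_dupes(df, ...)) {
    stop("The specified columns do not uniquely identify rows")
  }
  invisible(df)
}

visits <- data.frame(id = c(1, 2, 2), site = c("a", "b", "c"))

has_dupes(visits, "id")          # TRUE: id 2 appears twice
has_dupes(visits, "id", "site")  # FALSE: each (id, site) pair is unique
```

`anyDuplicated()` stops at the first repeat instead of flagging every row, which fits the faster "hunting" step; `get_dupes` would then handle the examining.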
Question 2: Can I get the elements of a data frame that are never duplicated?
In the handling-duplicates workflow, I will sometimes separate the elements that are ever duplicated from the elements that are never duplicated, use a bunch of business logic to manipulate the ever-duplicated elements, then recombine them. I think a `get_nondups` function could be worthwhile.
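Here's what I use:
```r
library(dplyr)

sep_dups <- function(df, ...){
  # select() replaces the deprecated select_(); it accepts bare or quoted column names
  target <- df %>% select(...)
  # Flag every member of a duplicated set, not only the later occurrences
  dup_index <- duplicated(target) | duplicated(target, fromLast = TRUE)
  list(unique = df[!dup_index, ],
       duplicates = df[dup_index, ])
}
```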
Just a random thought: a sankey diagram would visually indicate what we are discussing here - maybe some of the code that goes into organizing that data from a plotting package could be used as reference?
Ran into this today, where we want to see if anyone w/ the same ID had specified different values for race columns. Used `get_dupes`, then looked at IDs that were not in the duplicated tables.
The use case is for cleaning data, when all records should have a duplicate. I'm not sure how to handle records where there's only one instance of the unique ID. Should it return all unique rows of the specified variables, and thus include those single-instance records? Then for this use case you'd have to start by filtering the table for records where the ID appears at least twice. Makes for a simpler function, but if you always have to pre-filter for it to be useful, maybe I should bake that in.
Let's start simple: takes a df and variable names, returns a df of the rows that didn't share those variable combinations. The opposite of `get_dupes()`, which is nice.
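A rough base-R sketch of that simple starting point. The name `get_nondupes()` and its signature are assumptions for illustration, not the function as it might eventually be implemented; column names are passed as strings, and with no names given it compares whole rows.

```r
# Hypothetical get_nondupes(): keep rows whose combination of the named
# columns appears exactly once in the data frame.
get_nondupes <- function(df, ...) {
  cols <- c(...)
  if (length(cols) == 0) cols <- names(df)  # default: compare all columns
  target <- df[, cols, drop = FALSE]
  # Flag every member of a duplicated set, not only the later occurrences
  dup_index <- duplicated(target) | duplicated(target, fromLast = TRUE)
  df[!dup_index, , drop = FALSE]
}

survey <- data.frame(
  id   = c(1, 1, 2, 3, 3),
  race = c("A", "B", "A", "C", "C")
)

get_nondupes(survey, "id")          # only the id = 2 row
get_nondupes(survey, "id", "race")  # rows 1, 2 and 3; id 3's pair repeats
```

For the race-column use case, the second call surfaces id 1, whose two records disagree, but it also returns the single-instance id 2, which is the pre-filtering question raised above.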