sfirke / janitor

simple tools for data cleaning in R
http://sfirke.github.io/janitor/

create get_not_dupes() function #18

Open sfirke opened 8 years ago

sfirke commented 8 years ago

Ran into this today, where we want to see if anyone w/ the same ID had specified different values for race columns. Used get_dupes, then looked at IDs that were not in the duplicated tables.

The use case is for cleaning data, when all records should have a duplicate. I'm not sure how to handle records where there's only one instance of the unique ID. Should it return all unique rows of the specified variables, and thus those? Then for this use case you'd have to start by filtering the table for records where the ID appears at least twice. Makes for a simpler function, but if you always have to pre-filter for it to be useful, maybe I should bake that in.

Let's start simple: takes a df and variable names, returns a df of the rows that didn't share those variable combinations. The opposite of get_dupes() which is nice.
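A minimal base R sketch of that starting point might look like the following. The name `get_not_dupes()` comes from the issue title, but the signature (a data frame plus a character vector of column names) and the sample data are assumptions made for illustration, not janitor's API.

```r
# Hypothetical sketch: keep only rows whose combination of `vars` occurs exactly once.
get_not_dupes <- function(df, vars) {
  stopifnot(is.data.frame(df), all(vars %in% names(df)))
  target <- df[, vars, drop = FALSE]
  # A row belongs to a duplicated set if it matches an earlier row or a later row
  is_dupe <- duplicated(target) | duplicated(target, fromLast = TRUE)
  df[!is_dupe, , drop = FALSE]
}

# Made-up example data
df <- data.frame(id = c(1, 1, 2, 3, 3),
                 race = c("a", "b", "a", "c", "c"))

get_not_dupes(df, "id")             # only the id == 2 row
get_not_dupes(df, c("id", "race"))  # first three rows; the two id == 3 rows match each other
```

Note that a record whose ID appears only once is returned here, which is the open question raised above: the race-mismatch use case would still need a pre-filter for IDs that appear at least twice.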

sfirke commented 8 years ago

To point out the obvious, this is different than distinct() from dplyr as that would return one record of a duplicated set; I want no records.

sfirke commented 8 years ago

A month later, this does not seem useful enough to make a permanent function for.

sfirke commented 8 years ago

Now I think I have a need for this again. I have two ID columns and want to understand if they are 1-to-1. Does ID A ever appear without the same ID B? It would be hierarchical, as an ID shouldn't appear with multiple values of, say, location, but of course the same location will have multiple values of ID.

Maybe I need a function check_one_to_one that takes multiple variables and checks whether there is any violation of 1:1.

Check first to see if someone else has coded this?
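One way such a check could be sketched in base R, assuming two columns named as strings. `check_one_to_one()` and `check_many_to_one()` are hypothetical here (neither exists in janitor as far as this thread shows), and the column names and data are made up.

```r
# Hypothetical sketch: TRUE when every value of `a` pairs with exactly one
# value of `b` and every value of `b` pairs with exactly one value of `a`.
check_one_to_one <- function(df, a, b) {
  pairs <- unique(df[, c(a, b), drop = FALSE])
  # Once the pairs are de-duplicated, a repeated value on either side is a violation
  anyDuplicated(pairs[[a]]) == 0 && anyDuplicated(pairs[[b]]) == 0
}

# The hierarchical case: each `child` maps to one `parent`,
# while a parent may have many children (e.g. ID within location).
check_many_to_one <- function(df, child, parent) {
  pairs <- unique(df[, c(child, parent), drop = FALSE])
  anyDuplicated(pairs[[child]]) == 0
}

ok  <- data.frame(id_a = c(1, 1, 2, 3), id_b = c("x", "x", "y", "z"))
bad <- data.frame(id_a = c(1, 1, 2),    id_b = c("x", "y", "z"))

check_one_to_one(ok, "id_a", "id_b")    # TRUE
check_one_to_one(bad, "id_a", "id_b")   # FALSE: id_a 1 appears with both x and y

visits <- data.frame(id = c(1, 2, 3), location = c("north", "north", "south"))
check_many_to_one(visits, "id", "location")  # TRUE
check_one_to_one(visits, "id", "location")   # FALSE: "north" has two ids
```

`NA` values are treated like any other value in this sketch; whether that is the right behavior is a design choice left open.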

rgknight commented 8 years ago

Great work on janitor! Awesome to see Ed types building real tools.

This is actually two different issues:

Question 1: Are there duplicates?

Your introduction to get_dupes states:

"This is for hunting down and examining duplicate records during data cleaning"

Hunting down is different from examining. Hunting should be a different (and faster, re #67) function.

I'd recommend an is_id or has_dupes function instead of check_one_to_one. It's the same idea: are these combinations unique?

The workflow is: Are there duplicates (has_dupes)? If yes, what do I do about them (get_dupes)?

Stata has an isid implementation that I used for this purpose, back in my Stata days. Helpfile here.

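You'd be looking for a more pipe-able, NSE version of:

is_id <- function(x){
  # count the rows (or elements) that repeat an earlier one
  numdups <- sum(duplicated(x))
  if (numdups > 0){
    # deparse(substitute(x)) recovers the name of the object that was passed in
    stop(sprintf("There are %i duplicates in %s", numdups, deparse(substitute(x))))
  }
}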
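A rough sketch of what a pipe-friendly pair could look like follows. It takes column names as strings rather than using NSE, to stay in base R; the names `has_dupes()` and `assert_is_id()`, their signatures, and the sample data are all assumptions, not existing janitor functions.

```r
# Hypothetical sketch: answer "are there duplicates?" without building the full table
has_dupes <- function(df, vars = names(df)) {
  anyDuplicated(df[, vars, drop = FALSE]) > 0
}

# Hypothetical sketch: error on duplicates, otherwise pass the data through
assert_is_id <- function(df, vars = names(df)) {
  n <- sum(duplicated(df[, vars, drop = FALSE]))
  if (n > 0) {
    stop(sprintf("There are %i duplicates across: %s",
                 n, paste(vars, collapse = ", ")), call. = FALSE)
  }
  invisible(df)  # returning the input is what lets the check sit mid-pipeline
}

# Made-up example data
records <- data.frame(id = c(1, 2, 2), site = c("a", "b", "c"))

has_dupes(records, "id")             # TRUE
has_dupes(records, c("id", "site"))  # FALSE

# Native pipe needs R >= 4.1; magrittr's %>% would work the same way
n_rows <- records |> assert_is_id(c("id", "site")) |> nrow()
```

`anyDuplicated()` stops at the first repeat it finds, which fits the "hunting" step where only a yes/no answer is needed.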

Question 2: Can I get the elements of a data frame that are never duplicated?

In the handling duplicates workflow, I will sometimes separate the elements that are ever duplicated from the elements that are never duplicated, use a bunch of business logic to manipulate the ever duplicated elements, then recombine them. I think a get_nondups function could be worthwhile.

Here's what I use:

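library(dplyr)

sep_dups <- function(df, ...){
  # select_() is deprecated in dplyr; select() accepts bare names or strings
  target <- df %>% select(...)
  # flag every member of a duplicated set, not just the second and later occurrences
  dup_index <- duplicated(target) | duplicated(target, fromLast = TRUE)

  list(unique = df[!dup_index, ],
       duplicates = df[dup_index, ])
}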
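
To make the separate, fix, recombine workflow concrete, here is a self-contained base R sketch. `split_dupes()` is a hypothetical stand-in for the helper above (taking column names as strings so it runs without dplyr), and "keep the highest score per id" is only a placeholder for real business logic; the data is made up.

```r
# Hypothetical base R stand-in for the splitting helper
split_dupes <- function(df, vars) {
  target <- df[, vars, drop = FALSE]
  dup_index <- duplicated(target) | duplicated(target, fromLast = TRUE)
  list(unique = df[!dup_index, , drop = FALSE],
       duplicates = df[dup_index, , drop = FALSE])
}

df <- data.frame(id = c(1, 1, 2, 3, 3), score = c(10, 12, 7, 5, 9))
parts <- split_dupes(df, "id")

# Placeholder business logic: collapse each duplicated id to its highest score
resolved <- aggregate(score ~ id, data = parts$duplicates, FUN = max)

# Recombine the untouched rows with the resolved ones
cleaned <- rbind(parts$unique, resolved)
cleaned <- cleaned[order(cleaned$id), ]
```

After this, `cleaned` has one row per `id`, and the rows that were never duplicated pass through unchanged.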

jzadra commented 1 year ago

Just a random thought: a sankey diagram would visually indicate what we are discussing here - maybe some of the code that goes into organizing that data from a plotting package could be used as reference?