sfirke / janitor

simple tools for data cleaning in R
http://sfirke.github.io/janitor/

Proposal for dups_ family of functions #75

Closed · rgknight closed this issue 7 years ago

rgknight commented 7 years ago

It appears that janitor is evolving to have a family of duplicate-handling functions (#72, #18).

I would like to propose that we adopt a common naming convention of dups_ for this family of functions.

The core functions that we could identify right now would be:

- dups_get: return the duplicated rows
- dups_count: count occurrences of each duplicated combination
- dups_check: check for duplicates and message, warn, or stop

This reads a little more awkwardly to me than the reverse (check_dups, count_dups, etc.), but it would let users discover all available duplicate-handling functions by typing "dups_".

The workflow would be:

1. Check for duplicates with dups_check
2. Count them with dups_count
3. Pull the offending rows with dups_get

A basic implementation could be:

```r
library(dplyr)

dups_get <- function(df, ...) {
  vars <- lazyeval::lazy_dots(...)
  target <- df %>% select_(.dots = vars)
  # flag every row involved in a duplication, not just later occurrences
  dup_index <- duplicated(target) | duplicated(target, fromLast = TRUE)

  df[dup_index, ]
}

dups_count <- function(df, ..., sort = FALSE) {
  vars <- lazyeval::lazy_dots(...)

  dplyr::count_(df, vars, sort = sort) %>%
    dplyr::filter(n > 1) %>%
    dplyr::rename(.duplicates = n)
}

dups_check <- function(df, ..., level = "warn") {
  levels <- c("warn", "stop", "message", "logical")

  if (!(level %in% levels)) {
    stop(sprintf("level must be one of %s", paste0(levels, collapse = ", ")))
  }

  vars <- lazyeval::lazy_dots(...)
  target <- df %>% select_(.dots = vars)

  num_dups <- sum(duplicated(target))

  dup_message <- sprintf("There are %d duplicates", num_dups) # would be nice to parse the df name instead of getting "." when piped

  if (level == "logical") {
    return(num_dups != 0) # "logical" returns TRUE/FALSE only
  } else if (num_dups == 0) {
    return(invisible(df))
  } else if (level == "message") {
    message(dup_message) # would be nice to return the call as well
  } else if (level == "warn") {
    warning(dup_message)
  } else if (level == "stop") {
    stop(dup_message)
  }
  invisible(df)
}
```

Example workflow

```r
# Make sure no duplicate cars were added since the last time we loaded the data
mtcars %>%
  mutate(make = row.names(.)) %>%
  dups_check(make, level = "stop")

# Check whether any Species have the same average Sepal.Width
iris %>%
  group_by(Species) %>%
  summarise(ave_sepal_width = mean(Sepal.Width)) %>%
  dups_check(ave_sepal_width) %>%
  select(Species) # Silly example of pipe-ability

# Replicate current get_dupes
if (dups_check(iris, Species, level = "logical")) {
  dup_num <- dups_count(iris, Species)
  dup_all_count <- iris %>%
    dups_get(Species) %>%
    left_join(dup_num)
}
```

Thoughts?

rgknight commented 7 years ago

An alternative would be to keep the get_ syntax and develop families of check_, count_ and get_ functions.

For example, you could have:

- get_dupes
- count_dupes
- check_dupes

And so on. This may be a better fit for the package given the clean_names syntax.

In any case, I think it would be worth taking a stand one way or another on whether it's verb_object or object_verb, then being very rigorous about consistent naming, e.g., if it's verb_object, should ns_to_percents be convert_n_percent?

Let me know what you think, @sfirke, and I can submit a PR for either check_dupes or dups_check (if you think it would be helpful -- I'm finding it very useful in my work).

sfirke commented 7 years ago

Thanks for thinking through this in such detail. It'll take me a little while to fully respond, as I'm in a crunch time right now, but I wanted to acknowledge that I saw and appreciate this. On the check_dupes function: have you seen the assertr package? It does in-line data quality checks for productionized analysis.
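A minimal sketch of the kind of in-line check assertr enables (the duplicate test here is illustrative, not janitor code):

```r
library(assertr)

# verify() returns the data frame when its condition holds and raises an
# error otherwise, so a cleaning script fails fast on unexpected dupes
checked <- verify(mtcars, sum(duplicated(mtcars)) == 0)
```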

rgknight commented 7 years ago

I understand, also in a crunch time here.

Re: naming conventions

Having played with this a bit, verb_object now seems to me like the more natural syntax: readability-wise it's six of one, half a dozen of the other, and verb_object mostly matches the existing janitor conventions as well as the conventions of packages like devtools (e.g., install_[location]).

It may still be worth thinking about what verbs you would want janitor to have and which of the proposed functions you would want to include.

Re: checking functions

I have taken a look at assertr. It looks helpful. I have started using loggr in my work, which I like because you specify the level of the alerts and it works with errors generated by any R function.

In any case, per the mission of janitor, the goal of any check functions would not be to create new features unavailable elsewhere, but rather to wrap the more general purpose tools to make a few of the most common checking predicates highly accessible.

sfirke commented 7 years ago

I agree that verb_object is the way to go. I'm open to check_dupes(), per your argument that janitor makes other package functionality highly accessible to beginners. Perhaps we even note in the documentation where they can find more check_-type functionality in other packages. It still feels a little bit like mission creep to open the can of worms of the assert-type checks, but let's take a stab at it. I know I'd use it in production immediately, which is a good test (and you note you've also found it useful).

I'm making more notes over in the PR.

sfirke commented 7 years ago

I see you've broken the dupe_count column out of get_dupes() into the dups_count function and then changed get_dupes() into dups_get. I think it's simpler with one function, rather than toggling between them (usually I want both the number of dupes and the offending rows to examine). But your implementation of dups_get above is much cleaner and likely faster than get_dupes, as long as we can add the dupe_count column back in. We could rename that column while we're making so many other changes anyway... maybe "n_dups" or "count_of_dups"?

rgknight commented 7 years ago

I will break the PR up into bite-sized pieces -- that definitely makes more sense.

I hear you on the mission creep of the style of checking I proposed. I think we should scale it back so that janitor includes only some standard predicates used in data cleaning / checking, not the assertion handling. So check_dupes would become no_dupes or is_id (similar to Stata's isid, which I used all the time back in the day) and would return only TRUE/FALSE, instead of all the level nonsense I proposed. This would let you pass it into whichever checking paradigm you want (stopifnot or assertr or assertthat or testthat, etc.).
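A minimal sketch of that predicate (the name is_id and its exact semantics are proposals, not settled API):

```r
# sketch: TRUE when the given variables uniquely identify each row,
# so it composes with stopifnot(), assertthat, testthat, and friends
is_id <- function(df, ...) {
  vars <- lazyeval::lazy_dots(...)
  target <- dplyr::select_(df, .dots = vars)
  !any(duplicated(target))
}

dat <- data.frame(id = 1:4, grp = c("a", "a", "b", "b"),
                  stringsAsFactors = FALSE)
stopifnot(is_id(dat, id)) # passes: id is unique per row
is_id(dat, grp)           # FALSE: grp values repeat
```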

Sorry for the confusion on the get_dupes name -- I thought that I had rolled back the renaming in a later commit. Would you be OK with adding the count as an option to get_dupes? Or break get_dupes into two functions, one that returns just the duplicates and one that returns duplicates adorned with the count? The simple version of get_dupes is 30x faster (#67), so I think it's worth making it accessible both with and without the count.
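A minimal sketch of the option approach, reusing dups_get from above (the count argument and dupe_count column name are placeholders, not settled API):

```r
# sketch: skip the slower counting step unless the caller asks for it
get_dupes <- function(df, ..., count = FALSE) {
  dupes <- dups_get(df, ...) # fast path: just the duplicated rows
  if (!count) return(dupes)

  vars <- lazyeval::lazy_dots(...)
  dupes %>%
    dplyr::group_by_(.dots = vars) %>%
    dplyr::mutate(dupe_count = dplyr::n()) %>%
    dplyr::ungroup()
}
```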

On the naming topic, I think you have something great in tabyl and crosstab. It would be helpful to think more about whether you want to treat these like ggplot, which you modify with geom_ or theme elements, or whether you want to handle modifications through options to tabyl and crosstab themselves. If you want to modify them using separate functions, those functions should all start with the same prefix, so that new users know this is something that primarily modifies a tabyl / crosstab (even if it could be used on any df). The prefix could be add_ or adorn_ or something else entirely, but it will be more accessible to new users if everything that modifies a tabyl / crosstab has the same prefix.
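A minimal sketch of the shared-prefix pattern (adorn_total_row is a hypothetical name for illustration, not an existing function):

```r
# sketch: every tabyl/crosstab modifier would share the adorn_ prefix so
# new users can discover them by typing "adorn_"; this name is illustrative
adorn_total_row <- function(df) {
  num_cols <- vapply(df, is.numeric, logical(1))
  total <- df[1, ]
  total[num_cols] <- lapply(df[num_cols], sum)
  total[!num_cols] <- "Total"
  rbind(df, total)
}

tab <- data.frame(gear = c("three", "four", "five"), n = c(15, 12, 5),
                  stringsAsFactors = FALSE)
adorn_total_row(tab) # appends a "Total" row summing the numeric columns
```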

rgknight commented 7 years ago

Closing this based on our new approach to planning, which supplants the conversation here.