tidyverse / funs

Collection of low-level functions for working with vctrs

Implement is_distinct()? #51

Open JBGruber opened 4 years ago

JBGruber commented 4 years ago

As mentioned in a dplyr issue, I think that an is_distinct() function would be a worthwhile addition to the tidyverse.

I used code from the current implementations of distinct.data.frame() and n_distinct() to write an example of what I mean. (From #44 I understand that you want to reimplement more efficient versions of both here; otherwise I would have opened a PR directly.)

library(tidyverse)
df <- data.frame(col1 = c(1:3, 1, 4, 4),
                 col2 = c(1, 1:3, 4, 4))
is_distinct <- function(...) {
  columns <- map(enquos(..., .named = TRUE), rlang::eval_tidy)
  data <- as_tibble(columns, .name_repair = "minimal")

  loc <- vctrs::vec_unique_loc(data)

  out <- logical(nrow(data))
  out[loc] <- TRUE
  out
}
df %>% mutate(unique_case = is_distinct(col1, col2))
#>   col1 col2 unique_case
#> 1    1    1        TRUE
#> 2    2    1        TRUE
#> 3    3    2        TRUE
#> 4    1    3        TRUE
#> 5    4    4        TRUE
#> 6    4    4       FALSE

The difference compared to just using distinct() is that the user keeps control over what happens to cases that aren't distinct. For example, one might want to analyse the non-distinct cases and see whether they have something in common or differ from the other cases. Thanks.
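A minimal sketch of that use case, using only vctrs so it stands alone: it rebuilds the first-occurrence flag from the example above and then pulls out the rows that repeat an earlier row. The names `first_seen` and `repeats` are made up for illustration.

```r
library(vctrs)

df <- data.frame(col1 = c(1:3, 1, 4, 4),
                 col2 = c(1, 1:3, 4, 4))

# TRUE for the first occurrence of each row, FALSE for later repeats
first_seen <- logical(nrow(df))
first_seen[vec_unique_loc(df)] <- TRUE

# Keep only the rows that repeat an earlier row, for closer inspection
repeats <- df[!first_seen, ]
repeats
#>   col1 col2
#> 6    4    4
```

With distinct() the repeated row would simply be dropped; with a flag it can be inspected, counted, or filtered as needed.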

lionel- commented 4 years ago

columns <- map(enquos(..., .named = TRUE), rlang::eval_tidy)

We need to add a way of collecting dots with auto-naming of elements to rlang. However, I'd capture the names separately here. If you're not data-masking, it's better to collect with list2(), because eval_tidy() always evaluates in a child environment, even if no mask is supplied (that's for technical reasons). But here you can also use tibble().
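A rough sketch of the list2() route, assuming all inputs have the same length. The function name `first_occurrence()` and the placeholder column names are hypothetical; the first-occurrence logic is the one from the original example.

```r
library(rlang)
library(vctrs)

# Hypothetical helper: collect the dots with list2() instead of
# enquos() + eval_tidy(), then flag first occurrences of each row.
first_occurrence <- function(...) {
  cols <- list2(...)

  # Placeholder names; the caller's names are not needed for comparison
  names(cols) <- paste0("col", seq_along(cols))

  # Assumes every input has the same length (no recycling is done)
  data <- new_data_frame(cols)

  out <- logical(vec_size(data))
  out[vec_unique_loc(data)] <- TRUE
  out
}

flags <- first_occurrence(c(1, 2, 3, 1, 4, 4), c(1, 1, 2, 3, 4, 4))
flags
#> [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
```

Because list2() forces ordinary function arguments, a call inside mutate() should still see the data columns: the arguments are evaluated in the caller's context, which is where the data mask lives.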

Also there is vec_duplicate_detect() in vctrs:

detect_distinct <- function(...) {
  !vctrs::vec_duplicate_detect(tibble(...))
}

We are experimenting with detect_ as the verb for vectorised predicates. The idea is that is_ functions should only return a non-missing single boolean, so they are always safe to use within if () conditions.
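To illustrate the naming split, here is a small sketch with two made-up helpers built on existing vctrs functions: an `is_` predicate that returns one non-missing boolean, and a `detect_` predicate that returns one flag per element. Both names are hypothetical.

```r
library(vctrs)

# Hypothetical scalar predicate: always a single TRUE or FALSE,
# so it is safe inside if ()
is_all_distinct <- function(x) {
  !vec_duplicate_any(x)
}

# Hypothetical vectorised predicate: one flag per element
detect_duplicated <- function(x) {
  vec_duplicate_detect(x)
}

x <- c(1, 2, 2, NA)

if (!is_all_distinct(x)) {
  message("x contains duplicates")
}

detect_duplicated(x)
#> [1] FALSE  TRUE  TRUE FALSE
```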

lionel- commented 4 years ago

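Oops, vec_duplicate_detect() is not quite right here: it flags every row that has a duplicate, including the first occurrence (is_distinct2() is the vec_duplicate_detect()-based version):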

df %>% mutate(unique_case = is_distinct2(col1, col2))
#>   col1 col2 unique_case
#> 1    1    1        TRUE
#> 2    2    1        TRUE
#> 3    3    2        TRUE
#> 4    1    3        TRUE
#> 5    4    4       FALSE
#> 6    4    4       FALSE

A distinct function should be consistent with dplyr::n_distinct():

n_distinct(df)
#> [1] 5
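One possible way to get flags whose TRUE count agrees with the number of distinct rows is to compare vec_duplicate_id() with the row positions. This is only a sketch with current vctrs functions, not the planned API, and `detect_first()` is a hypothetical name.

```r
library(vctrs)

df <- data.frame(col1 = c(1:3, 1, 4, 4),
                 col2 = c(1, 1:3, 4, 4))

# vec_duplicate_id() gives, for each row, the position of its first
# occurrence; a row is a first occurrence when that position is its own.
detect_first <- function(x) {
  vec_duplicate_id(x) == vec_seq_along(x)
}

flags <- detect_first(df)
flags
#> [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

# The number of TRUE flags matches the number of distinct rows
sum(flags) == vec_unique_count(df)
#> [1] TRUE
```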

This API will be reviewed in the next vctrs version (for instance, vec_duplicate_detect() will be renamed to vec_detect_duplicate()), and we'll probably have vec_detect_unique(), which could be used here.

I wonder if we should drop the "distinct" terminology and consistently use "unique". n_distinct() would become count_unique().