tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/
Other
4.75k stars 2.12k forks source link

Feature request: raise error for bad group_by column name #3289

Closed karldw closed 5 years ago

karldw commented 6 years ago

With the new tidyeval setup, some dplyr verbs can take strings as arguments. For example, select(mtcars, "cyl") works as expected, no extra !! or sym needed. That's not true for group_by (or many of the other dplyr verbs).
However, I would love for dplyr to raise an error when an inappropriate column name is provided. I've found this mistake is particularly easy to make with rlang's quoting/unquoting/symbol-creating.

"Inappropriate" is a little nebulous here, since group_by can accept arbitrary R code for in-memory operations. A heuristic that gets most of the way there would be to check if the group_by expression resolves to a character vector whose length is not equal to the number of rows.

What do you think?

library(dplyr)
x <- "cyl"

# This is great, raising an error as expected:
group_by(mtcars, x)
#> Error in grouped_df_impl(data, unname(vars), drop) :
#>  Column `x` is unknown

# I think this is the correct way to have dplyr interpret values:
group_by(mtcars, !!rlang::sym(x))
#> # A tibble: 32 x 11
#> # Groups:   cyl [3]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>  * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21.0  6.00   160 110    3.90  2.62  16.5  0     1.00  4.00  4.00
# ...

# Oh no, we silently added a new, constant column with a weird name!
# This is the one I would like to raise an error.
group_by(mtcars, !!x)
#> # A tibble: 32 x 12
#> # Groups:   "cyl" [1]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb `"cyl"`
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#>  1  21.0  6.00   160 110    3.90  2.62  16.5  0     1.00  4.00  4.00 cyl
# ...

# Same as previous, it'd be nice to have an error here too.
group_by(mtcars, "cyl")

# But it's hard because you currently can write code like this:
group_by(mtcars, signif(mpg, 1))            #  3 groups
group_by(mtcars, 1:32)                      # 32 groups
group_by(mtcars, c(letters, letters[1:6]))  # 26 groups

This is spiritually similar to https://github.com/tidyverse/dplyr/issues/3140.

karldw commented 6 years ago

More precisely, is this worth checking in the group_by code, or would it just be something to mention in the docs?
Thanks!

krlmlr commented 6 years ago

Thanks. I think quo() is a better way to work with column names which aren't bound to a data source yet:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
x <- rlang::quo(cyl)
group_by(mtcars, !! x)
#> # A tibble: 32 x 11
#> # Groups:   cyl [3]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>  * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21.0  6.00   160 110    3.90  2.62  16.5  0     1.00  4.00  4.00
#>  2  21.0  6.00   160 110    3.90  2.88  17.0  0     1.00  4.00  4.00
#>  3  22.8  4.00   108  93.0  3.85  2.32  18.6  1.00  1.00  4.00  1.00
#>  4  21.4  6.00   258 110    3.08  3.22  19.4  1.00  0     3.00  1.00
#>  5  18.7  8.00   360 175    3.15  3.44  17.0  0     0     3.00  2.00
#>  6  18.1  6.00   225 105    2.76  3.46  20.2  1.00  0     3.00  1.00
#>  7  14.3  8.00   360 245    3.21  3.57  15.8  0     0     3.00  4.00
#>  8  24.4  4.00   147  62.0  3.69  3.19  20.0  1.00  0     4.00  2.00
#>  9  22.8  4.00   141  95.0  3.92  3.15  22.9  1.00  0     4.00  2.00
#> 10  19.2  6.00   168 123    3.92  3.44  18.3  1.00  0     4.00  4.00
#> # ... with 22 more rows

Created on 2018-01-08 by the reprex package (v0.1.1.9000).

I agree that a warning would be useful, maybe dplyr could detect (at the C++ level) that the grouping expression is a constant? What wording would you suggest for the warning?

karldw commented 6 years ago

Thanks!

When to warn

group_by already counts the number of groups, so detecting that it's constant should be easy, right? Do you want to raise a warning in cases where the table has fewer than two rows?

Warning message

For the warning message, how about: "Only one group was defined by `{group_expression}`. This is usually a mistake."?

Overriding the warning

I'm sure there are some valid cases where people are grouping by a constant. Do you want to add a way to override the warning? If so, should the override be a global option or a flag in the group_by function?

With an override, the warning message could instead be: "Only one group was defined by `{group_expression}`. This is usually a mistake. Set warn_group_by_const = FALSE to suppress this warning."

I'm hesitant about an argument or option to suppress the warning, both because it's a more complicated UI and because it makes the data frame call different than the database call. suppressWarnings() is always an option for the user.

krlmlr commented 6 years ago

Thanks. Just counting the number of groups doesn't feel sufficient, because this may be a legitimate use case for scripts that work with different data inputs, or with filter criteria provided by the user. But we could warn if one group is returned and all grouping expressions are constants. To override, users will need to add a function call, like identity("cyl"); these uses will be rare so don't warrant an option or argument.

So the warning could be:

Only one group was defined by {group_expression}, ..., because it is a constant expression. Did you mean {unquoted_group_expression}, ...?

We should omit the second part if we can't find columns of that name.

What do you think?

karldw commented 6 years ago

That sounds great!

krlmlr commented 6 years ago

Prioritized, because the following code behaved differently in dplyr 0.7.4:

dplyr::distinct(iris, "Species")
#>   "Species"
#> 1   Species

Created on 2018-03-26 by the reprex package (v0.2.0).

The current behavior would be expected now (and was like this in dplyr 0.5.0), but needs a warning. I think we could use quo_is_symbolic() here:

library(tidyverse)
f <- function(...) {
  map_lgl(quos(...), rlang::quo_is_symbolic)
}

x <- "x"
y <- quote(y)

f("a", b, !!x, !!y)
#>                         
#> FALSE  TRUE FALSE  TRUE

Created on 2018-03-26 by the reprex package (v0.2.0).

krlmlr commented 6 years ago

Added a NEWS entry for now.

hadley commented 5 years ago

I don't think customising group_by()'s use of mutate() would be a good idea. In my experience, it is better to leave the general principles alone. Yes, they sometimes create misleading results, but adding more special cases typically leads to more problems, not fewer.

lock[bot] commented 4 years ago

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/