Closed karldw closed 5 years ago
More precisely, is this worth checking in the group_by
code, or would it just be something to mention in the docs?
Thanks!
Thanks. I think quo()
is a better way to work with column names which aren't bound to a data source yet:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
x <- rlang::quo(cyl)
group_by(mtcars, !! x)
#> # A tibble: 32 x 11
#> # Groups: cyl [3]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21.0 6.00 160 110 3.90 2.62 16.5 0 1.00 4.00 4.00
#> 2 21.0 6.00 160 110 3.90 2.88 17.0 0 1.00 4.00 4.00
#> 3 22.8 4.00 108 93.0 3.85 2.32 18.6 1.00 1.00 4.00 1.00
#> 4 21.4 6.00 258 110 3.08 3.22 19.4 1.00 0 3.00 1.00
#> 5 18.7 8.00 360 175 3.15 3.44 17.0 0 0 3.00 2.00
#> 6 18.1 6.00 225 105 2.76 3.46 20.2 1.00 0 3.00 1.00
#> 7 14.3 8.00 360 245 3.21 3.57 15.8 0 0 3.00 4.00
#> 8 24.4 4.00 147 62.0 3.69 3.19 20.0 1.00 0 4.00 2.00
#> 9 22.8 4.00 141 95.0 3.92 3.15 22.9 1.00 0 4.00 2.00
#> 10 19.2 6.00 168 123 3.92 3.44 18.3 1.00 0 4.00 4.00
#> # ... with 22 more rows
Created on 2018-01-08 by the reprex package (v0.1.1.9000).
I agree that a warning would be useful, maybe dplyr could detect (at the C++ level) that the grouping expression is a constant? What wording would you suggest for the warning?
Thanks!
group_by
already counts the number of groups, so detecting that it's constant should be easy, right?
Do you want to raise a warning in cases where the table has fewer than two rows?
For the warning message, how about: "Only one group was defined by `{group_expression}`. This is usually a mistake."?
I'm sure there are some valid cases where people are grouping by a constant. Do you want to add a way to override the warning? If so, should the override be a global option or a flag in the group_by
function?
With an override, the warning message could instead be: "Only one group was defined by `{group_expression}`. This is usually a mistake. Set warn_group_by_const = FALSE to suppress this warning."
I'm hesitant about an argument or option to suppress the warning, both because it's a more complicated UI and because it makes the data frame call different than the database call. suppressWarnings()
is always an option for the user.
Thanks. Just counting the number of groups doesn't feel sufficient, because this may be a legitimate use case for scripts that work with different data inputs, or with filter criteria provided by the user. But we could warn if one group is returned and all grouping expressions are constants. To override, users will need to add a function call, like identity("cyl")
; these uses will be rare so don't warrant an option or argument.
So the warning could be:
Only one group was defined by {group_expression}
, ..., because it is a constant expression. Did you mean {unquoted_group_expression}
, ...?
We should omit the second part if we can't find columns of that name.
What do you think?
That sounds great!
Prioritized, because the following code behaved differently in dplyr 0.7.4:
dplyr::distinct(iris, "Species")
#> "Species"
#> 1 Species
Created on 2018-03-26 by the reprex package (v0.2.0).
The current behavior would be expected now (and was like this in dplyr 0.5.0), but needs a warning. I think we could use quo_is_symbolic()
here:
library(tidyverse)
f <- function(...) {
map_lgl(quos(...), rlang::quo_is_symbolic)
}
x <- "x"
y <- quote(y)
f("a", b, !!x, !!y)
#>
#> FALSE TRUE FALSE TRUE
Created on 2018-03-26 by the reprex package (v0.2.0).
Added a NEWS entry for now.
I don't think customising group_by()
's use of mutate()
would be a good idea. In my experience, it is better to leave the general principles alone. Yes, they sometimes create misleading results, but adding more special cases typically leads to more problems, not fewer.
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
With the new tidyeval setup, some dplyr verbs can take strings as arguments. For example,
select(mtcars, "cyl")
works as expected, no extra!!
orsym
needed. That's not true forgroup_by
(or many of the other dplyr verbs).However, I would love for dplyr to raise an error when an inappropriate column name is provided. I've found this mistake is particularly easy to make with rlang's quoting/unquoting/symbol-creating.
"Inappropriate" is a little nebulous here, since
group_by
can accept arbitrary R code for in-memory operations. A heuristic that gets most of the way there would be to check if thegroup_by
expression resolves to a character vector whose length is not equal to the number of rows.What do you think?
This is spiritually similar to https://github.com/tidyverse/dplyr/issues/3140.