Closed rubenarslan closed 7 years ago
Would you mind explaining exactly how you think it should behave, and why? I think you're arguing that `distinct()` with no arguments should basically ignore grouping?
Also please use the reprex package to construct reprexes that include the output. That makes it easier to read this issue without having to switch back and forth to R.
Sorry, I didn't get that reprex was a reference to a package, not just a shorthand. Added it now.
Exact behaviour:
I suggest `distinct()` without dots should behave as I understood the documentation ("If omitted, will use all variables."), i.e. use all variables, including the group variable.
This would not be ignoring grouping, because getting all distinct rows in each group is the same as getting all distinct rows including the group var.
My last example was supposed to show what I'd prefer. It's the same behaviour you get when calling `unique` on a tbl.
Ok, thanks. I understand the problem, and I'll make the change in the next week or two (this needs to wait until the tidyeval conversion is complete).
Just to be clear, you want these two calls to be equivalent, right?
iris %>% group_by(Species) %>% distinct()
iris %>% distinct() %>% group_by(Species)
So effectively, grouping has no impact on `distinct()`, except that the output will be grouped if the input is.
That's correct.
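The equivalence agreed on above can be checked with a short sketch. It assumes dplyr is installed and that the installed version implements the behaviour proposed in this thread (under the 0.5.0 behaviour described in this issue, the first call would return only the unique groups). The names `grouped_first` and `distinct_first` are made up for illustration.

```r
# Sketch only: assumes a dplyr version that implements the proposed behaviour.
library(dplyr)

# Hypothetical names for the two pipelines being compared.
grouped_first  <- iris %>% group_by(Species) %>% distinct()
distinct_first <- iris %>% distinct() %>% group_by(Species)

# Under the proposal, both row counts match base R's unique().
nrow(grouped_first)
nrow(distinct_first)
nrow(unique(iris))

# The grouping of the input is carried through to the output.
group_vars(grouped_first)
```

If the two row counts differ, the installed version still treats `distinct()` on a grouped tbl as "list unique groups".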
Before 0.5.0, `distinct()` on a df with groups had the same result as using `unique()`. Now, it shows unique groups. This is not documented clearly in the help file or the release notes. I'd prefer if the behaviour were changed back, but alternatively, it should be mentioned in the help file and release notes.
This problem came up in the comments on #1981 again (which was closed, as it was about a different bug) and there it wasn't clear to @krlmlr that this was intended. Reading previous discussion at #2107 and #1110 it seems that this change was intended to make things more clear. I think at least a few users agree that this is not more clear, as it feels like two different operations are activated by the same call and it makes pipeline order matter unduly.
Five examples
0. old behaviour?
1. returns unique values of species, 3 rows
2. distinct rows, species as a group, 149 rows
3. returns distinct values of Sepal.Length within species
4. same as 3.
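The code for these examples is not reproduced here. As a rough sketch of what examples 3 and 4 describe, the two calls below are an assumption about their shape, not the original code; `within_group` and `explicit` are made-up names. Both should yield the distinct combinations of `Species` and `Sepal.Length`.

```r
# Sketch only: an assumed reconstruction of examples 3 and 4, not the original code.
library(dplyr)

# Distinct Sepal.Length values within each Species group.
within_group <- iris %>% group_by(Species) %>% distinct(Sepal.Length)

# The same combinations, with the group variable listed explicitly.
explicit <- iris %>% distinct(Species, Sepal.Length) %>% group_by(Species)

# Both row counts should match the base R equivalent.
nrow(within_group)
nrow(explicit)
nrow(unique(iris[c("Species", "Sepal.Length")]))
```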
Desired behaviour
Desired behaviour for `distinct()` on a grouped tbl is what I'd expect based on a close reading of the docs (i.e. `...` expands to all variables, including the group variable).
Possible help file changes:
... : Optional variables to use when determining uniqueness. If there are multiple rows for a given combination of inputs, only the first row will be preserved. If omitted, will use all variables.
.keep_all: If TRUE, keep all variables in .data. If a combination of ... is not distinct, this keeps the first row of values.
Possible changes: If omitted on a tbl without groups, will use all variables. If omitted on a tbl with groups, will list unique groups, but will list unique combinations within group if specified.
.keep_all: Defaults to FALSE if ... is omitted on a non-grouped tbl.
Motivating use case to change it so that `...` expands to all variables if omitted
I think this is not the old behaviour or the behaviour that `.keep_all = TRUE` elicits; I think it can currently only be obtained by explicitly listing all variable names or by calling `unique`.
`...` expands to all variable names.
`...` defined. Example 3 and 4 are equivalent, so I'd naïvely expect 1 and 2 to also be equivalent.