tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/
Other
4.77k stars 2.12k forks source link

Distinct without arguments on a grouped df #2476

Closed rubenarslan closed 7 years ago

rubenarslan commented 7 years ago

Before 0.5.0 distinct() on a df with groups had the same result as using unique(). Now, it shows unique groups.
This is not documented clearly in the help file or the release notes. I'd prefer if the behaviour were changed back, but alternatively, it should be mentioned in the help file and release notes.


This problem came up in the comments on #1981 again (which was closed, as it was about a different bug) and there it wasn't clear to @krlmlr that this was intended. Reading previous discussion at #2107 and #1110 it seems that this change was intended to make things more clear. I think at least a few users agree that this is not more clear, as it feels like two different operations are activated by the same call and it makes pipeline order matter unduly.

Five examples

0. old behaviour?

iris %>% group_by(Species) %>% distinct(.keep_all = TRUE) 
#> Source: local data frame [3 x 5]
#> Groups: Species [3]
#> 
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#>          <dbl>       <dbl>        <dbl>       <dbl>     <fctr>
#> 1          5.1         3.5          1.4         0.2     setosa
#> 2          7.0         3.2          4.7         1.4 versicolor
#> 3          6.3         3.3          6.0         2.5  virginica
# first row for each species is kept, 3 rows

1. returns unique values of species, 3 rows

iris %>% group_by(Species) %>% distinct()  
#> Source: local data frame [3 x 1]
#> Groups: Species [3]
#> 
#>      Species
#>       <fctr>
#> 1     setosa
#> 2 versicolor
#> 3  virginica

2. distinct rows, species as a group, 149 rows

iris %>% distinct() %>% group_by(Species)
#> Source: local data frame [149 x 5]
#> Groups: Species [3]
#> 
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
#> 1           5.1         3.5          1.4         0.2  setosa
#> 2           4.9         3.0          1.4         0.2  setosa
#> 3           4.7         3.2          1.3         0.2  setosa
#> 4           4.6         3.1          1.5         0.2  setosa
#> 5           5.0         3.6          1.4         0.2  setosa
#> 6           5.4         3.9          1.7         0.4  setosa
#> 7           4.6         3.4          1.4         0.3  setosa
#> 8           5.0         3.4          1.5         0.2  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 10          4.9         3.1          1.5         0.1  setosa
#> # ... with 139 more rows

3. returns distinct values of Sepal.Length within species

iris %>% group_by(Species) %>% distinct(Sepal.Length)  
#> Source: local data frame [57 x 2]
#> Groups: Species [3]
#> 
#>    Sepal.Length Species
#>           <dbl>  <fctr>
#> 1           5.1  setosa
#> 2           4.9  setosa
#> 3           4.7  setosa
#> 4           4.6  setosa
#> 5           5.0  setosa
#> 6           5.4  setosa
#> 7           4.4  setosa
#> 8           4.8  setosa
#> 9           4.3  setosa
#> 10          5.8  setosa
#> # ... with 47 more rows

4. same as 3.

iris %>% distinct(Species, Sepal.Length) %>% group_by(Species) 
#> Source: local data frame [57 x 2]
#> Groups: Species [3]
#> 
#>    Species Sepal.Length
#>     <fctr>        <dbl>
#> 1   setosa          5.1
#> 2   setosa          4.9
#> 3   setosa          4.7
#> 4   setosa          4.6
#> 5   setosa          5.0
#> 6   setosa          5.4
#> 7   setosa          4.4
#> 8   setosa          4.8
#> 9   setosa          4.3
#> 10  setosa          5.8
#> # ... with 47 more rows

Desired behaviour

Desired behaviour for distinct() on a grouped tbl what I'd expect based on close reading of docs (i.e. ... expands to all variables, including group).

iris %>% group_by(Species) %>% distinct(Sepal.Length, Sepal.Width, Petal.Length, 
  Petal.Width)
#> Source: local data frame [149 x 5]
#> Groups: Species [3]
#> 
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
#> 1           5.1         3.5          1.4         0.2  setosa
#> 2           4.9         3.0          1.4         0.2  setosa
#> 3           4.7         3.2          1.3         0.2  setosa
#> 4           4.6         3.1          1.5         0.2  setosa
#> 5           5.0         3.6          1.4         0.2  setosa
#> 6           5.4         3.9          1.7         0.4  setosa
#> 7           4.6         3.4          1.4         0.3  setosa
#> 8           5.0         3.4          1.5         0.2  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 10          4.9         3.1          1.5         0.1  setosa
#> # ... with 139 more rows

Possible help files changes:

... : Optional variables to use when determining uniqueness. If there are multiple rows for a given combination of inputs, only the first row will be preserved. If omitted, will use all variables. .keep_all: If TRUE, keep all variables in .data. If a combination of ... is not distinct, this keeps the first row of values.

Possible changes: If omitted on a tbl without groups, will use all variables. If omitted on a tbl with groups, will list unique groups, but will list unique combinations within group if specified.

.keep_all: Defaults to FALSE if ... is omitted on a non-grouped tbl.

Motivating use case to change it so that ... expands to all variables if omitted

I think, this is not the old behaviour or the behaviour that .keep_all = TRUE elicits, I think it can currently only be obtained by explicitly listing all variable names or by calling unique.

  1. Consistency with non-grouped tbl case. This would mean that all variables are checked for uniqueness (in each group/including group as a var), because ... expands to all variable names.
  2. Consistency with call with ... defined. Example 3 and 4 are equivalent, so I'd naïvely expect 1 and 2 to also be equivalent.
hadley commented 7 years ago

Would you mind explaining exactly how you think it should behave, and why? I think you're arguing that distinct() with no arguments should basically ignoring grouping?

Also please use the reprex package to construct reprexes that include the output. That makes it easier to read this issue without having to switch back and forth to R.

rubenarslan commented 7 years ago

Sorry, I didn't get that reprex was a reference to a package, not just a shorthand. Added it now.

Exact behaviour: I suggest distinct() without dots should behave as I understood the documentation ("If omitted, will use all variables."), i.e. use all variables, including the group variable. This would not be ignoring grouping, because getting all distinct rows in each group is the same as getting all distinct rows including the group var.

My last example was supposed to show what I'd prefer. It's the same behaviour you get when calling unique on a tbl.

hadley commented 7 years ago

Ok, thanks. I understand the problem, and I'll make the change in the next week or two (this need to wait until the tidyeval conversion is complete)

hadley commented 7 years ago

Just to be clear, you want these two calls to be equivalent, right?

iris %>% group_by(Species) %>% distinct()
iris %>% distinct() %>% group_by(Species)

So effectively, grouping has no impact on distinct(), except that the output will be grouped if the input is.

rubenarslan commented 7 years ago

That's correct.