tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

Allow ungroup to specify removal of grouping variable #3760

Closed ggrothendieck closed 4 years ago

ggrothendieck commented 6 years ago

A common case is that one constructs a grouping variable in group_by but needs it only for the duration of the grouping, so afterwards one must use select to get rid of it, as in the example below. It would be pleasingly symmetric if ungroup could remove the added column just as group_by adds it, so that

ungroup(-g)

would be the same as

ungroup %>%
select(-g)

Thus in this example taken from https://stackoverflow.com/questions/51939874/referencing-previous-column-value-as-column-is-created/51940343#51940343

test <- structure(list(i = c(0, 1, 2, 3, 4, 0, 1, 2, 3, 4), chng = c(0, 
0.031, 0.005, -0.005, 0.017, 0, 0.012, 0.003, -0.013, -0.005), 
    indx = c(1, 1.031, 1.037, 1.031, 1.048, 1, 1.012, 1.015, 
    1.002, 0.997)), class = "data.frame", row.names = c(NA, -10L
))

test %>%
  group_by(g = cumsum(i == 0)) %>%
  mutate(indx = cumprod(chng + 1)) %>%
  ungroup %>%
  select(-g)

we could write one fewer statement: the last two lines of the code above are combined into the last line below.

test %>%
  group_by(g = cumsum(i == 0)) %>%
  mutate(indx = cumprod(chng + 1)) %>%
  ungroup(-g)

Note the reduced line count and improved symmetry.
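
Until something like this exists, the two steps can be wrapped in a small helper. The following is only a sketch: ungroup_drop() is a hypothetical name, not a dplyr function, and the toy data frame is made up for illustration.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): drop the grouping structure,
# then remove the column(s) given in `cols` using tidyselect syntax.
ungroup_drop <- function(.data, cols) {
  .data %>%
    ungroup() %>%
    select(-{{ cols }})
}

# Toy data (illustrative only): `i` restarts at 0 at the start of each run.
toy <- data.frame(i = c(0, 1, 2, 0, 1), chng = c(0, 0.5, 1, 0, 1))

res <- toy %>%
  group_by(g = cumsum(i == 0)) %>%
  mutate(indx = cumprod(chng + 1)) %>%
  ungroup_drop(g)
```

Here res is an ungrouped tibble with columns i, chng and indx; the temporary column g is gone. The helper saves a line per pipeline, but every project that wants it has to define it, which is part of the case for building this into ungroup.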

romainfrancois commented 6 years ago

🤔 ungroup does have a ... argument, which it does not use:

> dplyr:::ungroup.grouped_df
function(x, ...) {
  ungroup_grouped_df(x)
}
<bytecode: 0x1026547e8>
<environment: namespace:dplyr>

but I'm not sure about having ungroup also perform selection

mkoohafkan commented 6 years ago

Seems to me that incorporating this kind of logic into https://github.com/tidyverse/dplyr/issues/3721 would be the better solution for this use case.

I do think it would be neat if ungroup could selectively remove some groupings but not others, e.g.

mtcars %>% group_by(gear, carb, cyl) %>% ungroup(cyl)

would be equivalent to

mtcars %>% group_by(gear, carb, cyl) %>% group_by(gear, carb)

which is how I first interpreted the title of this issue.
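
This "remove one variable from the grouping" reading can be approximated today by regrouping on whatever grouping variables remain. A minimal sketch, assuming a hypothetical helper named ungroup_some() (not a dplyr function) that takes the variables to drop as a character vector:

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): keep every current grouping
# variable except those named in `drop` (a character vector).
ungroup_some <- function(.data, drop) {
  keep <- setdiff(group_vars(.data), drop)
  .data %>%
    ungroup() %>%
    group_by(!!!rlang::syms(keep))
}

res <- mtcars %>%
  group_by(gear, carb, cyl) %>%
  ungroup_some("cyl")

# Inspect the remaining grouping variables.
group_vars(res)
```

Unlike the column-dropping proposal in the original post, this leaves cyl in the data and changes only the grouping.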

ggrothendieck commented 6 years ago

Here is another example taken from https://stackoverflow.com/questions/52906985/merging-of-duplicate-rows-that-have-misspelled-variables/52907932#52907932

library(phonics)
library(dplyr)

# create test data
Lines <- "CAR MPG
Mazda 5
Mazzda 2
Mzda 1"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, strip.white = TRUE)

# process
DF %>% 
  group_by(key = soundex(CAR)) %>%
  summarize(CAR = toString(CAR), MPG = sum(MPG)) %>%
  ungroup %>%
  select(-key)

With the feature under discussion this would simplify to the shorter and more symmetric:

DF %>% 
  group_by(key = soundex(CAR)) %>%
  summarize(CAR = toString(CAR), MPG = sum(MPG)) %>%
  ungroup(-key)

ggrothendieck commented 6 years ago

@mkoohafkan, the way group_by currently works, if you want to incrementally add a variable you specify group_by(new_var, add = TRUE).
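
For readers unfamiliar with that argument, here is a small illustration of replacing versus adding to a grouping. It assumes dplyr 1.0.0 or later, where as far as I know the argument is spelled .add; the add = TRUE form used in this thread is the older spelling.

```r
library(dplyr)

by_gear <- mtcars %>% group_by(gear)

# Default: a second group_by() replaces the existing grouping.
replaced <- by_gear %>% group_by(cyl)

# With .add = TRUE the new variable is appended to the existing grouping.
# (.add is the spelling assumed here; older dplyr versions used add.)
added <- by_gear %>% group_by(cyl, .add = TRUE)

# Inspect the grouping variables of each result.
group_vars(replaced)
group_vars(added)
```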

I suppose there is the question of whether add=TRUE means add the variable to the group_by or really means modify the group_by and replace it with a new group_by. In this latter case it would make sense to write group_by(-cyl, add = TRUE) to remove cyl from the group_by while leaving the other group_by variables in effect rather than using ungroup for that.

Another possibility is to use ungroup(cyl, subtract = TRUE) for that analogously to group_by(new_var, add = TRUE).

One other point is that I don't think incrementally adding and removing parts of a group_by is encountered that frequently, whereas I have repeatedly encountered the ungroup %>% select(-var) sequence.

mkoohafkan commented 6 years ago

@ggrothendieck thought about this more and I agree with your statements that

  1. using e.g. ungroup(cyl) to drop the column cyl is symmetric and
  2. using group_by(-cyl) to remove a column from an existing grouping would be a bit confusing with the existing add argument. If the add argument to group_by had originally been named update this would be syntactically cleaner, e.g. group_by(cyl, update = TRUE) and group_by(-cyl, update = TRUE).

ungroup(..., subtract = TRUE) looks like a good idea at first but... what would ungroup(cyl, subtract = FALSE) mean?

yutannihilation commented 6 years ago

group_by() has mutate semantics, not select semantics (cf. https://dplyr.tidyverse.org/articles/dplyr.html#selecting-operations). I guess you already noticed this when you tried group_by(-cyl, add = TRUE) and saw that -cyl became the grouping variable.

dplyr::group_by(mtcars, -cyl)
#> # A tibble: 32 x 12
#> # Groups:   -cyl [3]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb `-cyl`
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4     -6
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4     -6
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1     -4
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1     -6
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2     -8
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1     -6
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4     -8
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2     -4
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2     -4
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4     -6
#> # ... with 22 more rows

Created on 2018-10-31 by the reprex package (v0.2.1)

So, to me, ungroup() should have mutate semantics as well for consistency (though I don't know what it means to mutate when ungrouping...). A possible solution might be to implement scoped variants of ungroup() (e.g. ungroup_at())?

ggrothendieck commented 6 years ago

Here is another case where this feature could be used, taken from https://stackoverflow.com/questions/53240324/dplyr-collapse-tail-rows-into-larger-groups/53240699#53240699. In this case we are manufacturing a sort key in order to keep the table in its original sorted order. With the feature under discussion, the select at the end of the code could be folded into the ungroup and so omitted.

Note how this keeps coming up again and again.

df <- tibble(a = as.factor(1:20), b = c(50, 20, 13, rep(2, 10), rep(1, 7)))
df %>% 
  group_by(sortkey = -b, a = paste0(if_else(b %in% 1:2, "grp", ""), b)) %>%
  summarize(b = sum(b)) %>%
  ungroup %>%
  select(-sortkey)

maxmoro commented 5 years ago

Having a selective ungroup is also very important when calculating percentages of subgroups.

library(dplyr)
library(tidyr)  # for spread()

mtcars %>%
  group_by(gear, carb, vs) %>%
  summarise(count = n()) %>%
  group_by(gear, carb) %>%  # would be better to do ungroup(vs) here
  mutate(perc = count / sum(count)) %>%
  ungroup() %>%
  spread(vs, perc, sep = "=")

    gear  carb count `vs=0` `vs=1`
   <dbl> <dbl> <int>  <dbl>  <dbl>
 1     3     1     3   NA      1  
 2     3     2     4    1     NA  
 3     3     3     3    1     NA  
 4     3     4     5    1     NA  
 5     4     1     4   NA      1  
 6     4     2     4   NA      1  
 7     4     4     2    0.5    0.5
 8     5     2     1    0.5    0.5
 9     5     4     1    1     NA  

hadley commented 4 years ago

I think it would be fine for ungroup() to have select semantics even while group_by() has action semantics. I'd suggest that df %>% ungroup() would continue to work as usual, and df %>% ungroup(x) would remove x from the grouping variables, throwing an error if df is not currently grouped by x.
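
As far as I know, dplyr 1.0.0 and later accept variables in ungroup() along these lines. The sketch below assumes such a version is installed. It checks only the resulting grouping and does not test the error behaviour described above.

```r
library(dplyr)

grouped <- mtcars %>% group_by(gear, carb, cyl)

# No arguments: all grouping is removed, as before.
plain <- grouped %>% ungroup()

# Naming a variable removes only that variable from the grouping
# (assumes dplyr >= 1.0.0); the column itself stays in the data.
partial <- grouped %>% ungroup(cyl)

# Inspect the remaining grouping variables.
group_vars(partial)
```

Note that this is the "remove from the grouping" reading, not the "drop the column" reading from the original post; a select(-x) is still needed to delete a temporary grouping column.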

lock[bot] commented 4 years ago

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/