tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

row_number() does not respect .by= in mutate #7075

Closed ggrothendieck closed 2 months ago

ggrothendieck commented 2 months ago

row_number() ignores .by= in mutate

library(dplyr)
dat <- data.frame(x = head(letters, 6), y = LETTERS[1:2])
dat %>% 
  mutate(z = first(row_number()), .by = y)
##   x y z
## 1 a A 1
## 2 b B 1
## 3 c A 1
## 4 d B 1
## 5 e A 1
## 6 f B 1

I would have expected the same output as

dat %>% 
  mutate(r = row_number()) %>%
  mutate(z = first(r), .by = y) %>%
  select(-r)
##   x y z
## 1 a A 1
## 2 b B 2
## 3 c A 1
## 4 d B 2
## 5 e A 1
## 6 f B 2

DavisVaughan commented 2 months ago

Everything looks to be working as expected here.

In your second example, the first mutate() generates 1:6 because it is ungrouped. The second mutate() then calls first() twice: once on the vector c(1, 3, 5), i.e. the A group, and once on c(2, 4, 6), i.e. the B group. So you get 1 and 2 as your results, each recycled to its group size.

dat %>% 
  mutate(r = row_number()) %>%
  mutate(z = first(r), .by = y)
#>   x y r z
#> 1 a A 1 1
#> 2 b B 2 2
#> 3 c A 3 1
#> 4 d B 4 2
#> 5 e A 5 1
#> 6 f B 6 2

In your first example, row_number() is computed twice: once for the A group of y, giving c(1, 2, 3), and once for the B group of y, again giving c(1, 2, 3) within that group. Taking first() of each of those vectors gives 1, which is why you see 1 everywhere.

dat %>% 
  mutate(z = first(row_number()), .by = y)
#>   x y z
#> 1 a A 1
#> 2 b B 1
#> 3 c A 1
#> 4 d B 1
#> 5 e A 1
#> 6 f B 1
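
The difference between the two cases can be reproduced without dplyr. The following is a minimal base R sketch, not dplyr's actual implementation: it uses ave() to apply a function per group of y, and the names rows, within_group, and global_first are made up for illustration.

```r
dat <- data.frame(x = head(letters, 6), y = LETTERS[1:2])

# Row positions in the full, ungrouped data: 1:6
rows <- seq_len(nrow(dat))

# Analogue of mutate(z = first(row_number()), .by = y):
# numbering restarts inside each group, so the first value is always 1
within_group <- ave(rows, dat$y, FUN = function(i) seq_along(i)[1])

# Analogue of computing row_number() ungrouped and then taking first() per group:
# each group keeps its original row positions, so the firsts are 1 (A) and 2 (B)
global_first <- ave(rows, dat$y, FUN = function(i) i[1])

print(within_group)
print(global_first)
```

Here seq_along(i) plays the role of a grouped row_number(), while i itself holds the ungrouped row positions.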

This doesn't have anything to do with .by; it is also how group_by() has always worked with row_number():

dat %>% 
  group_by(y) %>%
  mutate(z = row_number())
#> # A tibble: 6 × 3
#> # Groups:   y [2]
#>   x     y         z
#>   <chr> <chr> <int>
#> 1 a     A         1
#> 2 b     B         1
#> 3 c     A         2
#> 4 d     B         2
#> 5 e     A         3
#> 6 f     B         3
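
If the goal is the output from the two-step version in a single grouped mutate(), one possible approach is to ask for each group's row positions in the original data rather than a within-group row number. This is a hedged sketch, not something proposed in the thread: it assumes dplyr 1.1.0 or later (for .by) and relies on cur_group_rows(), which returns the row indices of the current group; the name res is made up for illustration.

```r
library(dplyr)

dat <- data.frame(x = head(letters, 6), y = LETTERS[1:2])

# cur_group_rows() gives the positions of the current group's rows in the
# full data (c(1, 3, 5) for A, c(2, 4, 6) for B), so first() picks 1 and 2
res <- dat %>%
  mutate(z = first(cur_group_rows()), .by = y)

print(res)
```

This avoids the temporary column r from the two-step version, at the cost of depending on a context helper that only works inside dplyr verbs.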