tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/
Other
4.78k stars 2.12k forks source link

`cur_group_id()` gives wrong result when inside `summarize(mutate(...)` #6746

Closed multimeric closed 1 year ago

multimeric commented 1 year ago

Here's a very weird bug I just encountered. The idea is that we want to run a function per group, which mutates that group to add a new column involving i, the current group ID. What should happen is that we get "setosa1" "versicolor2" "virginica3" as the ID is incremented per group, but in reality they are all fixed at 1.

It's a very contrived example but it's the simplest way I've found to reproduce this bug. dplyr::group_map is probably a more elegant way to solve this but I dare not use experimental functions.

per_group = function(i, grp){
  dplyr::mutate(grp, name=paste0(Species, i))
}

iris |>
    dplyr::group_by(Species) |>
    dplyr::summarise(per_group(dplyr::cur_group_id(), dplyr::cur_data_all())) |>
    dplyr::pull(name) |>
    unique()
#> `summarise()` has grouped output by 'Species'. You can override using the
#> `.groups` argument.
#> [1] "setosa1"     "versicolor1" "virginica1"

Created on 2023-02-20 with reprex v2.0.2

What is super interesting to me is that if we force the promise i to evaluate, it works correctly, which to me implies that the fault lies in handling promises:

per_group = function(i, grp){
  force(i)
  dplyr::mutate(grp, name=paste0(Species, i))
}

iris |>
    dplyr::group_by(Species) |>
    dplyr::summarise(per_group(dplyr::cur_group_id(), dplyr::cur_data_all())) |>
    dplyr::pull(name) |>
    unique()
#> `summarise()` has grouped output by 'Species'. You can override using the
#> `.groups` argument.
#> [1] "setosa1"     "versicolor2" "virginica3"

Created on 2023-02-20 with reprex v2.0.2

Versions:

```R > sessionInfo() R version 4.2.2 (2022-10-31) Platform: aarch64-apple-darwin20 (64-bit) Running under: macOS Monterey 12.6 Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib locale: [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] rstudioapi_0.14 knitr_1.40 magrittr_2.0.3 tidyselect_1.2.0 [5] R6_2.5.1 rlang_1.0.6 fastmap_1.1.0 fansi_1.0.3 [9] highr_0.9 dplyr_1.0.10 tcltk_4.2.2 tools_4.2.2 [13] xfun_0.34 utf8_1.2.2 cli_3.4.1 DBI_1.1.3 [17] clipr_0.8.0 withr_2.5.0 htmltools_0.5.4 yaml_2.3.6 [21] assertthat_0.2.1 digest_0.6.30 tibble_3.1.8 lifecycle_1.0.3 [25] processx_3.8.0 callr_3.7.2 ps_1.7.2 vctrs_0.5.0 [29] fs_1.5.2 glue_1.6.2 evaluate_0.17 rmarkdown_2.17 [33] reprex_2.0.2 compiler_4.2.2 pillar_1.8.1 generics_0.1.3 [37] pkgconfig_2.0.3 ```
DavisVaughan commented 1 year ago

This isn't a bug, it is just an unfortunate side effect that can occur when nesting mutate() or summarise() calls. Luckily that isn't really all that common.

When you do

per_group = function(i, grp){
  dplyr::mutate(grp, name=paste0(Species, i))
}

data %>%
  dplyr::summarise(per_group(dplyr::cur_group_id(), dplyr::cur_data_all()))

i is bound to dplyr::cur_group_id() but the value of i isn't actually finalized until it is utilized in

dplyr::mutate(grp, name=paste0(Species, i))

At that point, i resolves to dplyr::cur_group_id() and that tries to evaluate, but at that point you are inside mutate(), so it uses the cur_group_id() related to grp, not the original data.

When you force it early with

per_group = function(i, grp){
  force(i)
  dplyr::mutate(grp, name=paste0(Species, i))
}

then the evaluation of cur_group_id() happens within the summarise() call instead, so it is tied to data as you expected.

This is just part of how R works, so I think it is out of scope to try and work around this here