tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

Function using tidyeval works the first time but then errors on subsequent calls with same data frame and same environment #5775

Closed eipi10 closed 3 years ago

eipi10 commented 3 years ago

I'm seeing a bizarre problem: a tidyeval function I wrote works fine the first time I run it with a particular data frame, but usually produces an error on subsequent calls. I've provided two reprexes below to show a couple of different failure modes. This looks like a bug, but the problem may be in my function.

I posted this as a question on RStudio Community. The lone responder thought he remembered a GitHub issue on this, but I haven't been able to find one.

library(tidyverse)

fnc = function(data, value.vars, group.vars=NULL) {
  data %>% 
    group_by(across({{group.vars}})) %>% 
    summarise(n=n(), across({{value.vars}}, 
                            list(mean=~mean(.x, na.rm=TRUE),
                                 n.miss=~sum(is.na(.x))), 
                            .names="{.fn}_{.col}"))
}

mtcars %>% fnc(mpg)
#> # A tibble: 1 x 3
#>       n mean_mpg n.miss_mpg
#>   <int>    <dbl>      <int>
#> 1    32     20.1          0

iris %>% fnc(c(Petal.Width, Sepal.Width), Species)
#> # A tibble: 3 x 6
#>   Species     n mean_Petal.Width n.miss_Petal.Wi… mean_Sepal.Width
#> * <fct>   <int>            <dbl>            <int>            <dbl>
#> 1 setosa     50            0.246                0             3.43
#> 2 versic…    50            1.33                 0             2.77
#> 3 virgin…    50            2.03                 0             2.97
#> # … with 1 more variable: n.miss_Sepal.Width <int>

diamonds %>% fnc(c(x,y), c(cut, color))
#> `summarise()` has grouped output by 'cut'. You can override using the `.groups` argument.
#> # A tibble: 35 x 7
#> # Groups:   cut [5]
#>    cut   color     n mean_x n.miss_x mean_y n.miss_y
#>    <ord> <ord> <int>  <dbl>    <int>  <dbl>    <int>
#>  1 Fair  D       163   6.02        0   5.96        0
#>  2 Fair  E       224   5.91        0   5.86        0
#>  3 Fair  F       312   5.99        0   5.93        0
#>  4 Fair  G       314   6.17        0   6.11        0
#>  5 Fair  H       303   6.58        0   6.50        0
#>  6 Fair  I       175   6.56        0   6.49        0
#>  7 Fair  J       119   6.75        0   6.68        0
#>  8 Good  D       662   5.62        0   5.63        0
#>  9 Good  E       933   5.62        0   5.63        0
#> 10 Good  F       909   5.69        0   5.71        0
#> # … with 25 more rows

iris %>% fnc(c(Petal.Width, Sepal.Width), Species)
#> Error: Can't subset elements that don't exist.
#> x Location 35 doesn't exist.
#> ℹ There are only 3 elements.

diamonds %>% fnc(c(x,y))
#> Error: Problem with `summarise()` input `..2`.
#> x subscript out of bounds
#> ℹ Input `..2` is `(function (.cols = everything(), .fns = NULL, ..., .names = NULL) ...`.

Created on 2021-02-18 by the reprex package (v1.0.0)

library(tidyverse)

fnc = function(data, value.vars, group.vars=NULL) {
  data %>% 
    group_by(across({{group.vars}})) %>% 
    summarise(n=n(), across({{value.vars}}, 
                            list(mean=~mean(.x, na.rm=TRUE),
                                 n.miss=~sum(is.na(.x))), 
                            .names="{.fn}_{.col}"))
}

diamonds %>% fnc(c(x,y))
#> # A tibble: 1 x 5
#>       n mean_x n.miss_x mean_y n.miss_y
#>   <int>  <dbl>    <int>  <dbl>    <int>
#> 1 53940   5.73        0   5.73        0

mtcars %>% fnc(mpg)
#> # A tibble: 1 x 3
#>       n mean_mpg n.miss_mpg
#>   <int>    <dbl>      <int>
#> 1    32     20.1          0

iris %>% fnc(c(Petal.Width, Sepal.Width), Species)
#> # A tibble: 3 x 6
#>   Species     n mean_Petal.Width n.miss_Petal.Wi… mean_Sepal.Width
#> * <fct>   <int>            <dbl>            <int>            <dbl>
#> 1 setosa     50            0.246                0             3.43
#> 2 versic…    50            1.33                 0             2.77
#> 3 virgin…    50            2.03                 0             2.97
#> # … with 1 more variable: n.miss_Sepal.Width <int>

diamonds %>% fnc(c(x,y), c(cut, color))
#> `summarise()` has grouped output by 'cut'. You can override using the `.groups` argument.
#> # A tibble: 35 x 7
#> # Groups:   cut [5]
#>    cut   color     n mean_x n.miss_x mean_y n.miss_y
#>    <ord> <ord> <int>  <dbl>    <int>  <dbl>    <int>
#>  1 Fair  D       163   6.02        0   5.96        0
#>  2 Fair  E       224   5.91        0   5.86        0
#>  3 Fair  F       312   5.99        0   5.93        0
#>  4 Fair  G       314   6.17        0   6.11        0
#>  5 Fair  H       303   6.58        0   6.50        0
#>  6 Fair  I       175   6.56        0   6.49        0
#>  7 Fair  J       119   6.75        0   6.68        0
#>  8 Good  D       662   5.62        0   5.63        0
#>  9 Good  E       933   5.62        0   5.63        0
#> 10 Good  F       909   5.69        0   5.71        0
#> # … with 25 more rows

iris %>% fnc(c(Petal.Width, Sepal.Width), Species)
#> Error: Can't subset elements that don't exist.
#> x Location 35 doesn't exist.
#> ℹ There are only 3 elements.

mtcars %>% fnc(mpg, cyl)
#> Error: Can't subset elements that don't exist.
#> x Location 35 doesn't exist.
#> ℹ There are only 3 elements.

diamonds %>% fnc(c(x,y), color)
#> Error: Can't subset elements that don't exist.
#> x Location 35 doesn't exist.
#> ℹ There are only 7 elements.

Created on 2021-02-18 by the reprex package (v1.0.0)

eipi10 commented 3 years ago

Additional responses on RStudio Community (one by @szimmer, who filed issue #5733) indicate that my issue is likely the same problem described in #5733, #5739, and #5765. My reprex above was run with dplyr 1.0.4. After reading that the issue appears to be fixed in the development version, I installed that version and the problem went away.
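For anyone wanting to check which side of the fix they are on, here is a minimal sketch of a version check. The helper name `has_min_version` is hypothetical (it is not part of dplyr), and the 1.0.5 threshold is an assumption taken from the maintainer's reply later in this thread.

```r
# Hypothetical helper (not part of dplyr): TRUE when the installed version
# of `pkg` is at least `min_version`.
has_min_version <- function(pkg, min_version) {
  utils::packageVersion(pkg) >= package_version(min_version)
}

# A base R package is used here so the check runs without extra installs.
has_min_version("stats", "1.0.0")

# For this issue the check would be (assumes dplyr is installed):
# has_min_version("dplyr", "1.0.5")

# One way to get the development version (assumes the remotes package):
# remotes::install_github("tidyverse/dplyr")
```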

In case it helps in understanding this bug: with dplyr 1.0.4, I ran sessionInfo() before the first call to my function and again after it (see reprex below). Two additional packages, fansi and utf8, are loaded via a namespace after the function runs for the first time. The first call to sessionInfo() lists 49 packages loaded via a namespace; the second lists 51, with the two new ones at positions 35 and 47.

library(tidyverse)

fnc = function(data, value.vars, group.vars=NULL) {
  data %>% 
    group_by(across({{group.vars}})) %>% 
    summarise(n=n(), across({{value.vars}}, 
                            list(mean=~mean(.x, na.rm=TRUE),
                                 n.miss=~sum(is.na(.x))), 
                            .names="{.fn}_{.col}"))
}

sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.7
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.4     purrr_0.3.4    
#> [5] readr_1.4.0     tidyr_1.1.2     tibble_3.0.6    ggplot2_3.3.3  
#> [9] tidyverse_1.3.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.6        cellranger_1.1.0  pillar_1.4.7      compiler_4.0.3   
#>  [5] dbplyr_2.1.0      highr_0.8         tools_4.0.3       digest_0.6.27    
#>  [9] lubridate_1.7.9.2 jsonlite_1.7.2    evaluate_0.14     lifecycle_0.2.0  
#> [13] gtable_0.3.0      pkgconfig_2.0.3   rlang_0.4.10      reprex_1.0.0     
#> [17] cli_2.3.0         DBI_1.1.1         yaml_2.2.1        haven_2.3.1      
#> [21] xfun_0.20         withr_2.4.1       xml2_1.3.2        httr_1.4.2       
#> [25] styler_1.3.2      knitr_1.31        hms_1.0.0         generics_0.1.0   
#> [29] fs_1.5.0          vctrs_0.3.6       grid_4.0.3        tidyselect_1.1.0 
#> [33] glue_1.4.2        R6_2.5.0          readxl_1.3.1      rmarkdown_2.6    
#> [37] modelr_0.1.8      magrittr_2.0.1    backports_1.2.1   scales_1.1.1     
#> [41] ellipsis_0.3.1    htmltools_0.5.1.1 rvest_0.3.6       assertthat_0.2.1 
#> [45] colorspace_2.0-0  stringi_1.5.3     munsell_0.5.0     broom_0.7.4      
#> [49] crayon_1.4.0

mtcars %>% fnc(mpg)
#> # A tibble: 1 x 3
#>       n mean_mpg n.miss_mpg
#>   <int>    <dbl>      <int>
#> 1    32     20.1          0

sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.7
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.4     purrr_0.3.4    
#> [5] readr_1.4.0     tidyr_1.1.2     tibble_3.0.6    ggplot2_3.3.3  
#> [9] tidyverse_1.3.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.6        cellranger_1.1.0  pillar_1.4.7      compiler_4.0.3   
#>  [5] dbplyr_2.1.0      highr_0.8         tools_4.0.3       digest_0.6.27    
#>  [9] lubridate_1.7.9.2 jsonlite_1.7.2    evaluate_0.14     lifecycle_0.2.0  
#> [13] gtable_0.3.0      pkgconfig_2.0.3   rlang_0.4.10      reprex_1.0.0     
#> [17] cli_2.3.0         DBI_1.1.1         yaml_2.2.1        haven_2.3.1      
#> [21] xfun_0.20         withr_2.4.1       xml2_1.3.2        httr_1.4.2       
#> [25] styler_1.3.2      knitr_1.31        hms_1.0.0         generics_0.1.0   
#> [29] fs_1.5.0          vctrs_0.3.6       grid_4.0.3        tidyselect_1.1.0 
#> [33] glue_1.4.2        R6_2.5.0          fansi_0.4.2       readxl_1.3.1     
#> [37] rmarkdown_2.6     modelr_0.1.8      magrittr_2.0.1    backports_1.2.1  
#> [41] scales_1.1.1      ellipsis_0.3.1    htmltools_0.5.1.1 rvest_0.3.6      
#> [45] assertthat_0.2.1  colorspace_2.0-0  utf8_1.1.4        stringi_1.5.3    
#> [49] munsell_0.5.0     broom_0.7.4       crayon_1.4.0
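Comparing two sessionInfo() listings by eye is tedious, so here is a rough sketch of the same before/after comparison done programmatically. The helper name `newly_loaded` is hypothetical, and it relies only on base R's `loadedNamespaces()` and `setdiff()`.

```r
# Hypothetical helper (not part of dplyr): return the namespaces that were
# loaded as a side effect of evaluating `expr`.
newly_loaded <- function(expr) {
  before <- loadedNamespaces()
  force(expr)  # arguments are lazy, so the expression is evaluated here
  setdiff(loadedNamespaces(), before)
}

# Runs with base R only; the result depends on what is already loaded.
newly_loaded(requireNamespace("tools", quietly = TRUE))

# For the function in this issue (assumes tidyverse and `fnc` from above).
# The value is not auto-printed inside the helper, so print() is used to
# mimic what happens at the console:
# newly_loaded(print(mtcars %>% fnc(mpg)))
```

An expression that loads nothing, such as `newly_loaded(1 + 1)`, returns an empty character vector.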

romainfrancois commented 3 years ago

I believe this was already fixed in #5765, as part of the to-be-released 1.0.5. I'm now getting:

library(tidyverse)

fnc = function(data, value.vars, group.vars=NULL) {
  data %>% 
    group_by(across({{group.vars}})) %>% 
    summarise(n=n(), across({{value.vars}}, 
                            list(mean=~mean(.x, na.rm=TRUE),
                                 n.miss=~sum(is.na(.x))), 
                            .names="{.fn}_{.col}"))
}

mtcars %>% fnc(mpg)
#> # A tibble: 1 x 3
#>       n mean_mpg n.miss_mpg
#>   <int>    <dbl>      <int>
#> 1    32     20.1          0
iris %>% fnc(c(Petal.Width, Sepal.Width), Species)
#> # A tibble: 3 x 6
#>   Species     n mean_Petal.Width n.miss_Petal.Wi… mean_Sepal.Width
#>   <fct>   <int>            <dbl>            <int>            <dbl>
#> 1 setosa     50            0.246                0             3.43
#> 2 versic…    50            1.33                 0             2.77
#> 3 virgin…    50            2.03                 0             2.97
#> # … with 1 more variable: n.miss_Sepal.Width <int>
diamonds %>% fnc(c(x,y), c(cut, color))
#> `summarise()` has grouped output by 'cut'. You can override using the `.groups` argument.
#> # A tibble: 35 x 7
#> # Groups:   cut [5]
#>    cut   color     n mean_x n.miss_x mean_y n.miss_y
#>    <ord> <ord> <int>  <dbl>    <int>  <dbl>    <int>
#>  1 Fair  D       163   6.02        0   5.96        0
#>  2 Fair  E       224   5.91        0   5.86        0
#>  3 Fair  F       312   5.99        0   5.93        0
#>  4 Fair  G       314   6.17        0   6.11        0
#>  5 Fair  H       303   6.58        0   6.50        0
#>  6 Fair  I       175   6.56        0   6.49        0
#>  7 Fair  J       119   6.75        0   6.68        0
#>  8 Good  D       662   5.62        0   5.63        0
#>  9 Good  E       933   5.62        0   5.63        0
#> 10 Good  F       909   5.69        0   5.71        0
#> # … with 25 more rows
iris %>% fnc(c(Petal.Width, Sepal.Width), Species)
#> # A tibble: 3 x 6
#>   Species     n mean_Petal.Width n.miss_Petal.Wi… mean_Sepal.Width
#>   <fct>   <int>            <dbl>            <int>            <dbl>
#> 1 setosa     50            0.246                0             3.43
#> 2 versic…    50            1.33                 0             2.77
#> 3 virgin…    50            2.03                 0             2.97
#> # … with 1 more variable: n.miss_Sepal.Width <int>
diamonds %>% fnc(c(x,y))
#> # A tibble: 1 x 5
#>       n mean_x n.miss_x mean_y n.miss_y
#>   <int>  <dbl>    <int>  <dbl>    <int>
#> 1 53940   5.73        0   5.73        0

Created on 2021-03-04 by the reprex package (v0.3.0)