tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

v1.1.0 runtime for case_when with grouping variable is slow #6674

Closed fawda123 closed 1 year ago

fawda123 commented 1 year ago

Using case_when in a mutate call with a grouping variable is much, much slower in v1.1.0 compared to v1.0.10. The code works but it's causing a tremendous slowdown in many of the packages I maintain (see here, many examples have elapsed time >5s).

Here's a reprex for v1.1.0.

library(dplyr, warn.conflicts = F)
library(microbenchmark)

n <- 1000
dat <- data.frame(
    x = seq(1:n), 
    y = rnorm(n)
)

microbenchmark(
    dat %>% 
        group_by(x) %>% 
        mutate(
                 z = case_when(
                    y < 0 ~ '-',
                    T ~ '+', 
                 )
        ), 
    times = 100
)
#> Unit: seconds
#>                                                                        expr
#>  dat %>% group_by(x) %>% mutate(z = case_when(y < 0 ~ "-", T ~      "+", ))
#>       min       lq     mean   median       uq      max neval
#>  2.376748 2.537896 2.650869 2.625663 2.723655 3.170204   100

Created on 2023-02-01 with reprex v2.0.2

Session info

``` r
sessioninfo::session_info()
#> - Session info --------------------------------------------------------------
#>  hash: person in steamy room: medium-dark skin tone, goat, black small square
#>
#>  setting  value
#>  version  R version 4.1.3 (2022-03-10)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.1252
#>  ctype    English_United States.1252
#>  tz       America/New_York
#>  date     2023-02-01
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> - Packages -------------------------------------------------------------------
#>  package        * version date (UTC) lib source
#>  cli              3.6.0   2023-01-09 [1] CRAN (R 4.1.3)
#>  digest           0.6.31  2022-12-11 [1] CRAN (R 4.1.3)
#>  dplyr          * 1.1.0   2023-01-29 [1] CRAN (R 4.1.3)
#>  evaluate         0.20    2023-01-17 [1] CRAN (R 4.1.3)
#>  fansi            1.0.4   2023-01-22 [1] CRAN (R 4.1.3)
#>  fastmap          1.1.0   2021-01-25 [1] CRAN (R 4.1.2)
#>  fs               1.6.0   2023-01-23 [1] CRAN (R 4.1.3)
#>  generics         0.1.3   2022-07-05 [1] CRAN (R 4.1.3)
#>  glue             1.6.2   2022-02-24 [1] CRAN (R 4.1.3)
#>  htmltools        0.5.4   2022-12-07 [1] CRAN (R 4.1.3)
#>  knitr            1.42    2023-01-25 [1] CRAN (R 4.1.3)
#>  lifecycle        1.0.3   2022-10-07 [1] CRAN (R 4.1.3)
#>  magrittr         2.0.3   2022-03-30 [1] CRAN (R 4.1.3)
#>  microbenchmark * 1.4.9   2021-11-09 [1] CRAN (R 4.1.3)
#>  pillar           1.8.1   2022-08-19 [1] CRAN (R 4.1.3)
#>  pkgconfig        2.0.3   2019-09-22 [1] CRAN (R 4.1.2)
#>  purrr            1.0.1   2023-01-10 [1] CRAN (R 4.1.3)
#>  R.cache          0.15.0  2021-04-30 [1] CRAN (R 4.1.3)
#>  R.methodsS3      1.8.1   2020-08-26 [1] CRAN (R 4.1.1)
#>  R.oo             1.24.0  2020-08-26 [1] CRAN (R 4.1.1)
#>  R.utils          2.11.0  2021-09-26 [1] CRAN (R 4.1.3)
#>  R6               2.5.1   2021-08-19 [1] CRAN (R 4.1.2)
#>  reprex           2.0.2   2022-08-17 [1] CRAN (R 4.1.3)
#>  rlang            1.0.6   2022-09-24 [1] CRAN (R 4.1.3)
#>  rmarkdown        2.20    2023-01-19 [1] CRAN (R 4.1.3)
#>  rstudioapi       0.13    2020-11-12 [1] CRAN (R 4.1.2)
#>  sessioninfo      1.2.1   2021-11-02 [1] CRAN (R 4.1.2)
#>  styler           1.7.0   2022-03-13 [1] CRAN (R 4.1.3)
#>  tibble           3.1.8   2022-07-22 [1] CRAN (R 4.1.3)
#>  tidyselect       1.2.0   2022-10-10 [1] CRAN (R 4.1.3)
#>  utf8             1.2.2   2021-07-24 [1] CRAN (R 4.1.2)
#>  vctrs            0.5.2   2023-01-23 [1] CRAN (R 4.1.3)
#>  withr            2.5.0   2022-03-03 [1] CRAN (R 4.1.3)
#>  xfun             0.36    2022-12-21 [1] CRAN (R 4.1.3)
#>  yaml             2.3.7   2023-01-23 [1] CRAN (R 4.1.3)
#>
#>  [1] C:/Users/mbeck/R/win-library
#>  [2] C:/Program Files/R/R-4.1.3/library
#>
#> ------------------------------------------------------------------------------
```

And here's a reprex for v1.0.10 (note that the times for this one are in milliseconds, above was seconds).

library(dplyr, warn.conflicts = F)
library(microbenchmark)

n <- 1000
dat <- data.frame(
    x = seq(1:n), 
    y = rnorm(n)
)

microbenchmark(
    dat %>% 
        group_by(x) %>% 
        mutate(
                 z = case_when(
                    y < 0 ~ '-',
                    T ~ '+', 
                 )
        ), 
    times = 100
)
#> Unit: milliseconds
#>                                                                        expr
#>  dat %>% group_by(x) %>% mutate(z = case_when(y < 0 ~ "-", T ~      "+", ))
#>       min       lq     mean  median       uq      max neval
#>  114.9103 120.9102 126.9423 123.889 128.7439 167.7735   100

Created on 2023-02-01 with reprex v2.0.2

Session info

``` r
sessioninfo::session_info()
#> - Session info --------------------------------------------------------------
#>  hash: open mailbox with raised flag, love-you gesture: medium skin tone, snowboarder: light skin tone
#>
#>  setting  value
#>  version  R version 4.1.3 (2022-03-10)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.1252
#>  ctype    English_United States.1252
#>  tz       America/New_York
#>  date     2023-02-01
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> - Packages -------------------------------------------------------------------
#>  package        * version date (UTC) lib source
#>  assertthat       0.2.1   2019-03-21 [1] CRAN (R 4.1.2)
#>  cli              3.6.0   2023-01-09 [1] CRAN (R 4.1.3)
#>  DBI              1.1.3   2022-06-18 [1] CRAN (R 4.1.3)
#>  digest           0.6.31  2022-12-11 [1] CRAN (R 4.1.3)
#>  dplyr          * 1.0.10  2022-09-01 [1] CRAN (R 4.1.3)
#>  evaluate         0.20    2023-01-17 [1] CRAN (R 4.1.3)
#>  fansi            1.0.4   2023-01-22 [1] CRAN (R 4.1.3)
#>  fastmap          1.1.0   2021-01-25 [1] CRAN (R 4.1.2)
#>  fs               1.6.0   2023-01-23 [1] CRAN (R 4.1.3)
#>  generics         0.1.3   2022-07-05 [1] CRAN (R 4.1.3)
#>  glue             1.6.2   2022-02-24 [1] CRAN (R 4.1.3)
#>  htmltools        0.5.4   2022-12-07 [1] CRAN (R 4.1.3)
#>  knitr            1.42    2023-01-25 [1] CRAN (R 4.1.3)
#>  lifecycle        1.0.3   2022-10-07 [1] CRAN (R 4.1.3)
#>  magrittr         2.0.3   2022-03-30 [1] CRAN (R 4.1.3)
#>  microbenchmark * 1.4.9   2021-11-09 [1] CRAN (R 4.1.3)
#>  pillar           1.8.1   2022-08-19 [1] CRAN (R 4.1.3)
#>  pkgconfig        2.0.3   2019-09-22 [1] CRAN (R 4.1.2)
#>  purrr            1.0.1   2023-01-10 [1] CRAN (R 4.1.3)
#>  R.cache          0.15.0  2021-04-30 [1] CRAN (R 4.1.3)
#>  R.methodsS3      1.8.1   2020-08-26 [1] CRAN (R 4.1.1)
#>  R.oo             1.24.0  2020-08-26 [1] CRAN (R 4.1.1)
#>  R.utils          2.11.0  2021-09-26 [1] CRAN (R 4.1.3)
#>  R6               2.5.1   2021-08-19 [1] CRAN (R 4.1.2)
#>  reprex           2.0.2   2022-08-17 [1] CRAN (R 4.1.3)
#>  rlang            1.0.6   2022-09-24 [1] CRAN (R 4.1.3)
#>  rmarkdown        2.20    2023-01-19 [1] CRAN (R 4.1.3)
#>  rstudioapi       0.13    2020-11-12 [1] CRAN (R 4.1.2)
#>  sessioninfo      1.2.1   2021-11-02 [1] CRAN (R 4.1.2)
#>  styler           1.7.0   2022-03-13 [1] CRAN (R 4.1.3)
#>  tibble           3.1.8   2022-07-22 [1] CRAN (R 4.1.3)
#>  tidyselect       1.2.0   2022-10-10 [1] CRAN (R 4.1.3)
#>  utf8             1.2.2   2021-07-24 [1] CRAN (R 4.1.2)
#>  vctrs            0.5.2   2023-01-23 [1] CRAN (R 4.1.3)
#>  withr            2.5.0   2022-03-03 [1] CRAN (R 4.1.3)
#>  xfun             0.36    2022-12-21 [1] CRAN (R 4.1.3)
#>  yaml             2.3.7   2023-01-23 [1] CRAN (R 4.1.3)
#>
#>  [1] C:/Users/mbeck/R/win-library
#>  [2] C:/Program Files/R/R-4.1.3/library
#>
#> ------------------------------------------------------------------------------
```

jonspring commented 1 year ago

As Ritchie Sacramento noted in a related comment on Stack Overflow, the v1.0.10 example above is very likely not grouping the data, since the .by parameter was not yet incorporated. The comparison might be more instructive if that example were to use dat |> group_by(x) |> mutate(z = <etc>) instead.
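
For context, here is a minimal sketch of the two grouping spellings being discussed. It assumes dplyr 1.1.0 or later, where `mutate()` accepts a `.by` argument; on 1.0.10 the same `.by = x` would, as far as I understand, be treated as an ordinary name-value pair and add a column called `.by` instead of grouping. The data frame mirrors the reprex above, with a seed added for reproducibility.

```r
library(dplyr, warn.conflicts = FALSE)

set.seed(1)
dat <- data.frame(x = 1:1000, y = rnorm(1000))

# Persistent grouping: works in both 1.0.10 and 1.1.0
via_group_by <- dat |>
  group_by(x) |>
  mutate(z = case_when(y < 0 ~ "-", TRUE ~ "+")) |>
  ungroup()

# Per-operation grouping: assumes dplyr >= 1.1.0
via_by <- dat |>
  mutate(z = case_when(y < 0 ~ "-", TRUE ~ "+"), .by = x)

# Both spellings should produce the same new column
identical(via_group_by$z, via_by$z)
```

Benchmarking both versions with `group_by()` keeps the comparison like for like, because that spelling groups the data in either release.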

dpprdan commented 1 year ago

Benchmarks with group_by() still show a sizeable time difference.

v1.0.10

library(dplyr, warn.conflicts = F)
library(microbenchmark)

n <- 1000
dat <- data.frame(
  x = seq(1:n),
  y = rnorm(n)
)

microbenchmark(
  dat |>
    group_by(x) |>
    mutate(z = case_when(
      y < 0 ~ "-",
      T ~ "+",
    )),
  times = 100
)
#> Unit: milliseconds
#>                                                                  expr      min
#>  mutate(group_by(dat, x), z = case_when(y < 0 ~ "-", T ~ "+",      )) 161.7242
#>        lq     mean   median       uq     max neval
#>  176.3113 187.8724 180.7527 189.4069 371.132   100

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language en
#>  collate  German_Germany.utf8
#>  ctype    German_Germany.utf8
#>  tz       Europe/Berlin
#>  date     2023-02-01
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package        * version    date (UTC) lib source
#>  assertthat       0.2.1      2019-03-21 [1] CRAN (R 4.2.0)
#>  cli              3.6.0      2023-01-09 [1] CRAN (R 4.2.2)
#>  DBI              1.1.3      2022-06-18 [1] CRAN (R 4.2.0)
#>  digest           0.6.31     2022-12-11 [1] CRAN (R 4.2.2)
#>  dplyr          * 1.0.10     2022-09-01 [1] CRAN (R 4.2.2)
#>  evaluate         0.20       2023-01-17 [1] CRAN (R 4.2.2)
#>  fansi            1.0.4      2023-01-22 [1] CRAN (R 4.2.2)
#>  fastmap          1.1.0      2021-01-25 [1] CRAN (R 4.2.0)
#>  fs               1.6.0      2023-01-23 [1] CRAN (R 4.2.2)
#>  generics         0.1.3      2022-07-05 [1] CRAN (R 4.2.1)
#>  glue             1.6.2.9000 2023-01-16 [1] Github (tidyverse/glue@5a16502)
#>  htmltools        0.5.4      2022-12-07 [1] CRAN (R 4.2.2)
#>  knitr            1.42       2023-01-25 [1] CRAN (R 4.2.2)
#>  lifecycle        1.0.3      2022-10-07 [1] RSPM
#>  magrittr         2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  microbenchmark * 1.4.9      2021-11-09 [1] CRAN (R 4.2.2)
#>  pillar           1.8.1      2022-08-19 [1] CRAN (R 4.2.1)
#>  pkgconfig        2.0.3      2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr            1.0.1      2023-01-10 [1] CRAN (R 4.2.2)
#>  R.cache          0.16.0     2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3      1.8.2      2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo             1.25.0     2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils          2.12.2     2022-11-11 [1] CRAN (R 4.2.2)
#>  R6               2.5.1      2021-08-19 [1] CRAN (R 4.2.0)
#>  reprex           2.0.2      2022-08-17 [1] CRAN (R 4.2.1)
#>  rlang            1.0.6      2022-09-24 [1] CRAN (R 4.2.1)
#>  rmarkdown        2.20       2023-01-19 [1] CRAN (R 4.2.2)
#>  rstudioapi       0.14       2022-08-22 [1] CRAN (R 4.2.1)
#>  sessioninfo      1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  styler           1.9.0      2023-01-15 [1] CRAN (R 4.2.2)
#>  tibble           3.1.8      2022-07-22 [1] CRAN (R 4.2.1)
#>  tidyselect       1.2.0      2022-10-10 [1] RSPM
#>  utf8             1.2.3      2023-01-31 [1] CRAN (R 4.2.2)
#>  vctrs            0.5.2      2023-01-23 [1] CRAN (R 4.2.2)
#>  withr            2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun             0.36       2022-12-21 [1] CRAN (R 4.2.2)
#>  yaml             2.3.7      2023-01-23 [1] CRAN (R 4.2.2)
#>
#>  [1] C:/Users/Daniel/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.2/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

v1.1.0

library(dplyr, warn.conflicts = F)
library(microbenchmark)

n <- 1000
dat <- data.frame(
  x = seq(1:n),
  y = rnorm(n)
)

microbenchmark(
  dat |>
    group_by(x) |>
    mutate(z = case_when(
      y < 0 ~ "-",
      T ~ "+",
    )),
  times = 100
)
#> Unit: seconds
#>                                                                  expr      min
#>  mutate(group_by(dat, x), z = case_when(y < 0 ~ "-", T ~ "+",      )) 3.671095
#>        lq     mean   median       uq      max neval
#>  3.853471 3.967937 3.911133 4.017493 4.799612   100

Created on 2023-02-01 with reprex v2.0.2

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language en
#>  collate  German_Germany.utf8
#>  ctype    German_Germany.utf8
#>  tz       Europe/Berlin
#>  date     2023-02-01
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package        * version    date (UTC) lib source
#>  cli              3.6.0      2023-01-09 [1] CRAN (R 4.2.2)
#>  digest           0.6.31     2022-12-11 [1] CRAN (R 4.2.2)
#>  dplyr          * 1.1.0      2023-01-29 [1] CRAN (R 4.2.2)
#>  evaluate         0.20       2023-01-17 [1] CRAN (R 4.2.2)
#>  fansi            1.0.4      2023-01-22 [1] CRAN (R 4.2.2)
#>  fastmap          1.1.0      2021-01-25 [1] CRAN (R 4.2.0)
#>  fs               1.6.0      2023-01-23 [1] CRAN (R 4.2.2)
#>  generics         0.1.3      2022-07-05 [1] CRAN (R 4.2.1)
#>  glue             1.6.2.9000 2023-01-16 [1] Github (tidyverse/glue@5a16502)
#>  htmltools        0.5.4      2022-12-07 [1] CRAN (R 4.2.2)
#>  knitr            1.42       2023-01-25 [1] CRAN (R 4.2.2)
#>  lifecycle        1.0.3      2022-10-07 [1] RSPM
#>  magrittr         2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  microbenchmark * 1.4.9      2021-11-09 [1] CRAN (R 4.2.2)
#>  pillar           1.8.1      2022-08-19 [1] CRAN (R 4.2.1)
#>  pkgconfig        2.0.3      2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr            1.0.1      2023-01-10 [1] CRAN (R 4.2.2)
#>  R.cache          0.16.0     2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3      1.8.2      2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo             1.25.0     2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils          2.12.2     2022-11-11 [1] CRAN (R 4.2.2)
#>  R6               2.5.1      2021-08-19 [1] CRAN (R 4.2.0)
#>  reprex           2.0.2      2022-08-17 [1] CRAN (R 4.2.1)
#>  rlang            1.0.6      2022-09-24 [1] CRAN (R 4.2.1)
#>  rmarkdown        2.20       2023-01-19 [1] CRAN (R 4.2.2)
#>  rstudioapi       0.14       2022-08-22 [1] CRAN (R 4.2.1)
#>  sessioninfo      1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  styler           1.9.0      2023-01-15 [1] CRAN (R 4.2.2)
#>  tibble           3.1.8      2022-07-22 [1] CRAN (R 4.2.1)
#>  tidyselect       1.2.0      2022-10-10 [1] RSPM
#>  utf8             1.2.3      2023-01-31 [1] CRAN (R 4.2.2)
#>  vctrs            0.5.2      2023-01-23 [1] CRAN (R 4.2.2)
#>  withr            2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun             0.36       2022-12-21 [1] CRAN (R 4.2.2)
#>  yaml             2.3.7      2023-01-23 [1] CRAN (R 4.2.2)
#>
#>  [1] C:/Users/Daniel/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.2/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

fawda123 commented 1 year ago

Apologies, I updated the issue using group_by(). Similar results as @dpprdan.

hadley commented 1 year ago

When benchmarking a problem like this, you really want to separate the pieces. Is this a problem with mutate(), or is this a problem with case_when()? Your example requires case_when() to work on a single observation at a time, which is not its strength because it's designed to be vectorised. That suggests to me that a meaningful comparison would use a few vector lengths:

library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '1.1.0'
y1 <- rnorm(1)
y1e3 <- rnorm(1000)
y1e6 <- rnorm(1e6)

bench::mark(
  y1 = case_when(y1 < 0 ~ "-", T ~ "+"),
  y1e3 = case_when(y1e3 < 0 ~ "-", T ~ "+"),
  y1e6 = case_when(y1e6 < 0 ~ "-", T ~ "+"),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 y1            892µs  977.4µs     949.     1.01MB     34.0
#> 2 y1e3        934.7µs  991.7µs     939.    65.37KB     34.0
#> 3 y1e6         50.1ms   75.7ms      14.6   61.04MB     23.7

Created on 2023-02-01 with reprex v2.0.2

library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '1.0.10'
y1 <- rnorm(1)
y1e3 <- rnorm(1000)
y1e6 <- rnorm(1e6)

bench::mark(
  y1 = case_when(y1 < 0 ~ "-", T ~ "+"),
  y1e3 = case_when(y1e3 < 0 ~ "-", T ~ "+"),
  y1e6 = case_when(y1e6 < 0 ~ "-", T ~ "+"),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 y1           38.6µs   41.2µs   21814.      296KB     45.8
#> 2 y1e3         67.7µs   78.9µs   10627.     98.8KB     24.0
#> 3 y1e6         66.3ms   94.7ms      11.0    95.4MB     38.7

Created on 2023-02-01 with reprex v2.0.2

So that suggests that yes, using case_when() with a single observation has gotten significantly slower (maybe 800µs extra overhead), but it gets faster as the length of the vector increases.

I don't think your specific use case is a particularly compelling reason to re-consider case_when() performance, but the drop in speed at 1000 elements might suggest we should take a quick look to try and reduce some of the setup overhead.
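
Because the conditions in the original reprex depend only on each row's own `y`, one possible workaround is to call `case_when()` once on the whole column and group afterwards. This is a sketch and not a fix proposed in this thread; it only applies when no condition uses a per-group quantity, and the seed is added here for reproducibility.

```r
library(dplyr, warn.conflicts = FALSE)

set.seed(42)
n <- 1000
dat <- data.frame(x = seq_len(n), y = rnorm(n))

# One case_when() call per group: pays the per-call setup cost n times
per_group <- dat |>
  group_by(x) |>
  mutate(z = case_when(y < 0 ~ "-", TRUE ~ "+")) |>
  ungroup()

# One vectorised case_when() call over the whole column, then group
hoisted <- dat |>
  mutate(z = case_when(y < 0 ~ "-", TRUE ~ "+")) |>
  group_by(x)

# The row-wise result should not depend on where the call happens
identical(per_group$z, hoisted$z)
```

The grouped structure is still available for later steps, but `case_when()` is evaluated on one length-1000 vector, the size at which the per-call overhead measured above matters least.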

r2evans commented 1 year ago

edit: @hadley, I was writing this before I saw your comment, sorry for the repetition. However, I argue even with 1000-long vectors (ungrouped), the 10x decrease (by n_itr) in case_when is significant.


I think it might be helpful to isolate this as two distinct slow-downs: case_when in isolation, and case_when within mutate. I think the use of group_by()/.by= is either a red herring (exacerbating the problem) or another change in performance.

Starting with data,

set.seed(42)
n <- 1000
y <- rnorm(n)
df <- tibble(y2 = y)

we see the following comparative performance:

packageVersion("dplyr")
# [1] '1.0.10'
bench::mark(
  "dplyr-1.0.10-case_when" = case_when(y < 0 ~ "-", TRUE ~ "+"),
  "dplyr-1.0.10-mutate" = mutate(df, z = case_when(y2 < 0 ~ "-", TRUE ~ "+")),
  min_iterations=500, check=FALSE)
# # A tibble: 2 × 13
#   expression                  min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_t…¹ result memory     time       gc      
#   <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>  <bch:tm> <list> <list>     <list>     <list>  
# 1 dplyr-1.0.10-case_when   89.5µs  102.9µs     8511.    98.8KB     6.50  3928     3     462ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 2 dplyr-1.0.10-mutate      1.24ms   1.39ms      620.   100.3KB     2.49   498     2     803ms <NULL> <Rprofmem> <bench_tm> <tibble>
# # … with abbreviated variable name ¹​total_time

### different R instance, same laptop, same R
packageVersion("dplyr")
# [1] '1.1.0'
bench::mark(
  "dplyr-1.1.0-case_when" = case_when(y < 0 ~ "-", .default = "+"),
  "dplyr-1.1.0-mutate" = mutate(df, z = case_when(y2 < 0 ~ "-", .default = "+")),
  min_iterations=500, check=FALSE)
# # A tibble: 2 × 13
#   expression                 min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 dplyr-1.1.0-case_when    1.5ms   1.63ms      595.    49.6KB     9.67   492     8   827.47ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 2 dplyr-1.1.0-mutate      2.76ms   3.08ms      317.    58.5KB     8.46   487    13      1.54s <NULL> <Rprofmem> <bench_tm> <tibble>

I find it very interesting that the only code difference between the two dplyr versions is the change from TRUE ~ "+" to .default = "+", yet (a) case_when has a 10x performance difference, and (b) mutate + case_when is much less different. The n_itr is high enough that I suggest these results are credible (and I repeated each several times to make sure).

hadley commented 1 year ago

@r2evans I think mutate() is entirely a red herring. It just looks like we've gained ~800µs of overhead in case_when(), and that's impacting the run-time at smaller lengths (given the other evidence I'm pretty sure this is an additive change, not a multiplicative one). I agree it's worth looking into.
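
A back-of-envelope check of the additive reading, using only the median timings from the two bench::mark() tables earlier in the thread. The 1000 calls correspond to the 1000 single-row groups in the original reprex. Those timings come from a different machine than the grouped benchmarks, so treat this as an order-of-magnitude estimate only.

```r
# Median time of a length-1 case_when() call, in microseconds,
# copied from the bench::mark() output for each dplyr version
median_us <- c(v1.0.10 = 41.2, v1.1.0 = 977.4)

# Extra fixed cost per call, if the change is additive
overhead_us <- unname(median_us["v1.1.0"] - median_us["v1.0.10"])

# The original reprex has 1000 groups, so case_when() runs 1000 times
n_calls <- 1000
extra_seconds <- overhead_us * n_calls / 1e6

overhead_us    # roughly 936 microseconds per call
extra_seconds  # roughly 0.94 seconds added per grouped mutate()
```

A fixed cost of this size is negligible against the 75.7ms median of the one-million-element call, which fits with the large-vector case not getting slower.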

r2evans commented 1 year ago

I think the slowdown in mutate may be interesting by itself, but the initial reason for my comment (that trailed yours by moments) was to isolate what is likely the larger component. I'm hopeful that a much wider net of users (now that 1.1.0 has been formally released) will provide more context and use-cases to consider if/when/how this slowdown is approached. Thanks for the package, effort, and discourse @hadley

charliejhadley commented 1 year ago

I've just updated to {dplyr} v1.1.0 and have hit a very big slow down due to this issue. I think I have a useful demonstration issue and have presented a reprex.

I have data on the Top 100 UK songs every week from 2000 to 2023 which is 1119,000 rows of data with this format and 17,275 groups when grouped by id_title_artist.

# A tibble: 4 × 5
  date_week_start position_current position_next title                id_title_artist                status
  <date>                     <dbl>         <dbl> <chr>                          <int>                <chr>
1 1999-12-26                    49            40 1999                              89                "Re-release"
2 1999-12-26                    52            52 2 TIMES                          105                "New release"

My code was slowed by this issue because of the following bit of code:

the_data %>% 
  mutate(check_rerelease = case_when(
    date_week_start == min(date_week_start) ~ 0, # handle re-release in first week of data
    status == "Re-release" ~ 1,
    TRUE ~ NA_real_
  ))

To give some proper context to this, let's generate fake data for the top 10

library(tidyverse)
library(lubridate) # for ymd()
dates <- seq(ymd("1999-12-26"), ymd("2023-01-01"), "7 days")
n_dates <- length(dates)

fake_data <- tibble(
  date_week_start = rep(dates,10),
) %>% 
  arrange(date_week_start) %>% 
  mutate(position_current = rep(1:10, n_dates),
         position_next = sample(c(NA, 1:100), 10 * n_dates, replace = TRUE),
         id_title_artist = sample(1:17275, 10 * n_dates),
         status = sample(c(rep("Consecutive", n_dates*0.7), rep("New release", n_dates*0.15), rep("Re-release", n_dates*0.1), rep(NA, n_dates*0.05)), 10 * n_dates, replace = TRUE)) 

fake_data
## A tibble: 12,020 × 5
#date_week_start position_current position_next id_title_artist status     
#<date>                     <int>         <int>           <int> <chr>      
#  1 1999-12-26                     1            67           11930 Consecutive
#2 1999-12-26                     2            38            5950 Consecutive
#3 1999-12-26                     3            NA            4878 Consecutive
#4 1999-12-26                     4            33            4589 New release
#5 1999-12-26                     5            86           13923 New release
#6 1999-12-26                     6            42           16232 Consecutive
#7 1999-12-26                     7            13            6975 Consecutive
#8 1999-12-26                     8            81            5723 Consecutive
#9 1999-12-26                     9            58            3404 Consecutive
#10 1999-12-26                    10            50           13796 Re-release 
## … with 12,010 more rows
## ℹ Use `print(n = ...)` to see more rows

Now my code is looking for re-releases but needs to make sure that songs released in the first week of data are handled differently. As this code is then functionalised to look at different ranges of data that's particularly important:

fake_data %>% 
  arrange(date_week_start) %>% 
  group_by(id_title_artist) %>% 
  mutate(check_rerelease = case_when(
    date_week_start == min(date_week_start) ~ 0, # handle re-release in first week of data
    status == "Re-release" ~ 1,
    TRUE ~ NA_real_
  ))
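
One possible way to reduce the number of `case_when()` calls here is to compute only the group-dependent piece (the per-song minimum date) inside the grouped step, then run `case_when()` once over all rows. This is a sketch under assumptions: `toy` is an invented stand-in for `fake_data`, and `first_week` is a hypothetical helper column that does not appear in the original code.

```r
library(dplyr, warn.conflicts = FALSE)

# Small stand-in for fake_data; the values are invented for illustration
toy <- tibble(
  id_title_artist = c(1L, 1L, 2L, 2L, 3L),
  date_week_start = as.Date(c("2000-01-02", "2000-01-09",
                              "2000-01-02", "2000-01-16", "2000-01-09")),
  status = c("New release", "Re-release", "Re-release", NA, "Consecutive")
)

# Original shape: case_when() evaluated once per group
per_group <- toy |>
  group_by(id_title_artist) |>
  mutate(check_rerelease = case_when(
    date_week_start == min(date_week_start) ~ 0,
    status == "Re-release" ~ 1,
    TRUE ~ NA_real_
  )) |>
  ungroup()

# Sketch: only the minimum is computed per group (first_week is hypothetical),
# then case_when() runs once over all rows
hoisted <- toy |>
  group_by(id_title_artist) |>
  mutate(first_week = min(date_week_start)) |>
  ungroup() |>
  mutate(check_rerelease = case_when(
    date_week_start == first_week ~ 0,
    status == "Re-release" ~ 1,
    TRUE ~ NA_real_
  )) |>
  select(-first_week)

# Both versions should flag the same rows
identical(per_group$check_rerelease, hoisted$check_rerelease)
```

The grouped step still runs once per song, but it calls only `min()`, so the per-call setup cost of `case_when()` is paid once in total.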

Reprex



library(tidyverse)
library(lubridate)

dates <- seq(ymd("1999-12-26"), ymd("2023-01-01"), "7 days")
n_dates <- length(dates)

set.seed(1)
fake_data <- tibble(
  date_week_start = rep(dates,10),
) %>% 
  arrange(date_week_start) %>% 
  mutate(position_current = rep(1:10, n_dates),
         position_next = sample(c(NA, 1:100), 10 * n_dates, replace = TRUE),
         id_title_artist = sample(1:17275, 10 * n_dates),
         status = sample(c(rep("Consecutive", n_dates*0.7), rep("New release", n_dates*0.15), rep("Re-release", n_dates*0.1), rep(NA, n_dates*0.05)), 10 * n_dates, replace = TRUE)) 

fake_data %>% 
  arrange(date_week_start) %>% 
  group_by(id_title_artist) %>% 
  mutate(check_rerelease = case_when(
    date_week_start == min(date_week_start) ~ 0, # handle re-release in first week of data
    status == "Re-release" ~ 1,
    TRUE ~ NA_real_
  ))

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       macOS Monterey 12.5
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/London
#>  date     2023-02-06
#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package       * version date (UTC) lib source
#>  assertthat      0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
#>  backports       1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
#>  broom           1.0.3   2023-01-25 [1] CRAN (R 4.2.0)
#>  cellranger      1.1.0   2016-07-27 [1] CRAN (R 4.2.0)
#>  cli             3.4.1   2022-09-23 [1] CRAN (R 4.2.0)
#>  colorspace      2.0-3   2022-02-21 [1] CRAN (R 4.2.0)
#>  crayon          1.5.2   2022-09-29 [1] CRAN (R 4.2.0)
#>  DBI             1.1.3   2022-06-18 [1] CRAN (R 4.2.0)
#>  dbplyr          2.3.0   2023-01-16 [1] CRAN (R 4.2.0)
#>  digest          0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
#>  dplyr         * 1.1.0   2023-01-29 [1] CRAN (R 4.2.0)
#>  ellipsis        0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate        0.17    2022-10-07 [1] CRAN (R 4.2.0)
#>  fansi           1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap         1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  forcats       * 1.0.0   2023-01-29 [1] CRAN (R 4.2.0)
#>  fs              1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  gargle          1.2.1   2022-09-08 [1] CRAN (R 4.2.0)
#>  generics        0.1.3   2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2       * 3.4.0   2022-11-04 [1] CRAN (R 4.2.0)
#>  glue            1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  googledrive     2.0.0   2021-07-08 [1] CRAN (R 4.2.0)
#>  googlesheets4   1.0.1   2022-08-13 [1] CRAN (R 4.2.0)
#>  gtable          0.3.1   2022-09-01 [1] CRAN (R 4.2.0)
#>  haven           2.5.1   2022-08-22 [1] CRAN (R 4.2.0)
#>  highr           0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  hms             1.1.2   2022-08-19 [1] CRAN (R 4.2.0)
#>  htmltools       0.5.3   2022-07-18 [1] CRAN (R 4.2.0)
#>  httr            1.4.4   2022-08-17 [1] CRAN (R 4.2.0)
#>  jsonlite        1.8.4   2022-12-06 [1] CRAN (R 4.2.0)
#>  knitr           1.39.6  2022-08-04 [1] Github (yihui/knitr@bebf67e)
#>  lifecycle       1.0.3   2022-10-07 [1] CRAN (R 4.2.0)
#>  lubridate     * 1.9.1   2023-01-24 [1] CRAN (R 4.2.0)
#>  magrittr        2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  modelr          0.1.10  2022-11-11 [1] CRAN (R 4.2.0)
#>  munsell         0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
#>  pillar          1.8.1   2022-08-19 [1] CRAN (R 4.2.0)
#>  pkgconfig       2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         * 1.0.1   2023-01-10 [1] CRAN (R 4.2.0)
#>  R.cache         0.15.0  2021-04-30 [1] CRAN (R 4.2.0)
#>  R.methodsS3     1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo            1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils         2.12.0  2022-06-28 [1] CRAN (R 4.2.0)
#>  R6              2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  readr         * 2.1.3   2022-10-01 [1] CRAN (R 4.2.0)
#>  readxl          1.4.1   2022-08-17 [1] CRAN (R 4.2.0)
#>  reprex          2.0.2   2022-08-17 [1] CRAN (R 4.2.0)
#>  rlang           1.0.6   2022-09-24 [1] CRAN (R 4.2.0)
#>  rmarkdown       2.17    2022-10-07 [1] CRAN (R 4.2.0)
#>  rstudioapi      0.14    2022-08-22 [1] CRAN (R 4.2.0)
#>  rvest           1.0.3   2022-08-19 [1] CRAN (R 4.2.0)
#>  scales          1.2.1   2022-08-20 [1] CRAN (R 4.2.0)
#>  sessioninfo     1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi         1.7.8   2022-07-11 [1] CRAN (R 4.2.0)
#>  stringr       * 1.5.0   2022-12-02 [1] CRAN (R 4.2.0)
#>  styler          1.7.0   2022-03-13 [1] CRAN (R 4.2.0)
#>  tibble        * 3.1.8   2022-07-22 [1] CRAN (R 4.2.0)
#>  tidyr         * 1.3.0   2023-01-24 [1] CRAN (R 4.2.0)
#>  tidyselect      1.2.0   2022-10-10 [1] CRAN (R 4.2.1)
#>  tidyverse     * 1.3.2   2022-07-18 [1] CRAN (R 4.2.0)
#>  timechange      0.2.0   2023-01-11 [1] CRAN (R 4.2.0)
#>  tzdb            0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8            1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs           0.5.2   2023-01-23 [1] CRAN (R 4.2.0)
#>  withr           2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun            0.35    2022-11-16 [1] CRAN (R 4.2.0)
#>  xml2            1.3.3   2021-11-30 [1] CRAN (R 4.2.0)
#>  yaml            2.3.5   2022-02-21 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
courtiol commented 1 year ago

Although the following reprex combines different issues, it illustrates a slowdown of more than 50x between dplyr 1.0.10 and 1.1.0, which makes this simple code take more than 2 seconds to run.

d <- data.frame(grp = rep(paste(1:500), each = 2),
                x = rep(c("A", "B"), each = 500))

library(dplyr)

d |> 
  group_by(grp) |> 
  summarise(x = case_when(x[1] == "A" ~ "bar", TRUE ~ "foo"))
LiamDBailey commented 1 year ago

To expand on the reprex from @courtiol, we can compare two approaches: calling case_when() inside group_by()/summarise(), where it runs on many small vectors (less efficient), or summarising first and then calling case_when() once on a single vector (more efficient). In v1.0.10, the difference in speed between the two was modest (~9x). In v1.1.0, it is now >50x.

In v1.0.10, case_when() inside group_by()/summarise() was a less efficient but viable approach, and it was likely used quite often. The speed hit with case_when() for smaller vectors makes this approach seem no longer viable.
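
When each group is reduced to a single value from a length-one test, a plain base R `if`/`else` may be enough and avoids calling `case_when()` at all. This is a sketch of that alternative and not something benchmarked in this thread. It assumes the tested value is never `NA`, since `if` throws an error on a missing condition where `case_when()` would fall through to the next case.

```r
library(dplyr, warn.conflicts = FALSE)

d <- data.frame(grp = rep(paste(1:1000), each = 2),
                x = rep(c("A", "B"), each = 1000))

# case_when() called once per group on a length-1 condition
with_case_when <- d |>
  group_by(grp) |>
  summarise(x = case_when(x[1] == "A" ~ "bar", TRUE ~ "foo"))

# Base R if/else handles a scalar test with no missing values
with_if_else <- d |>
  group_by(grp) |>
  summarise(x = if (x[1] == "A") "bar" else "foo")

# Both summaries should contain the same values in the same order
identical(with_case_when$x, with_if_else$x)
```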

v1.0.10

d <- data.frame(grp = rep(paste(1:1000), each = 2),
                x = rep(c("A", "B"), each = 1000))

library(dplyr)
library(bench)

packageVersion("dplyr")
#> [1] '1.0.10'

mark(grouped = {d |> 
       group_by(grp) |> 
       summarise(x = case_when(x[1] == "A" ~ "bar", TRUE ~ "foo"))},
     ungrouped = {d |> 
       group_by(grp) |> 
       summarise(firstX = first(x), .groups = "drop") |>
       mutate(x = case_when(firstX == "A" ~ "bar", TRUE ~ "foo")) |>
       select(-firstX)})
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 grouped       116ms  122.5ms      7.99    2.67MB     22.0
#> 2 ungrouped      12ms   13.1ms     73.4     1.44MB     15.9

v1.1.0

d <- data.frame(grp = rep(paste(1:1000), each = 2),
                x = rep(c("A", "B"), each = 1000))

library(dplyr)
library(bench)

packageVersion("dplyr")
#> [1] '1.1.0'

mark(grouped = {d |> 
       group_by(grp) |> 
       summarise(x = case_when(x[1] == "A" ~ "bar", TRUE ~ "foo"))},
     ungrouped = {d |> 
       group_by(grp) |> 
       summarise(firstX = first(x), .groups = "drop") |>
       mutate(x = case_when(firstX == "A" ~ "bar", TRUE ~ "foo")) |>
       select(-firstX)})
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 grouped       4.32s    4.32s     0.231    5.73MB     21.7
#> 2 ungrouped   79.02ms  84.06ms    12.0      1.51MB     22.0

System info

``` r
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31 ucrt)
#>  os       Windows 10 x64 (build 16299)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_World.1252
#>  ctype    English_World.1252
#>  tz       Europe/Berlin
#>  date     2023-02-08
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> - Packages -------------------------------------------------------------------
#>  package     * version date (UTC) lib source
#>  bench       * 1.1.2   2021-11-30 [1] CRAN (R 4.2.2)
#>  cli           3.6.0   2023-01-09 [1] CRAN (R 4.2.2)
#>  digest        0.6.31  2022-12-11 [1] CRAN (R 4.2.2)
#>  dplyr       * 1.1.0   2023-01-29 [1] CRAN (R 4.2.2)
#>  evaluate      0.20    2023-01-17 [1] CRAN (R 4.2.2)
#>  fansi         1.0.4   2023-01-22 [1] CRAN (R 4.2.2)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.2)
#>  fs            1.6.0   2023-01-23 [1] CRAN (R 4.2.2)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.2.2)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.2)
#>  htmltools     0.5.4   2022-12-07 [1] CRAN (R 4.2.2)
#>  knitr         1.42    2023-01-25 [1] CRAN (R 4.2.2)
#>  lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.2.2)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.2)
#>  pillar        1.8.1   2022-08-19 [1] CRAN (R 4.2.2)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.3)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.2)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.2.2)
#>  rlang         1.0.6   2022-09-24 [1] CRAN (R 4.2.2)
#>  rmarkdown     2.20    2023-01-19 [1] CRAN (R 4.2.2)
#>  rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.2.2)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.2)
#>  tibble        3.1.8   2022-07-22 [1] CRAN (R 4.2.2)
#>  tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.2.2)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.2)
#>  vctrs         0.5.2   2023-01-23 [1] CRAN (R 4.2.2)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.2)
#>  xfun          0.36    2022-12-21 [1] CRAN (R 4.2.2)
#>  yaml          2.3.7   2023-01-23 [1] CRAN (R 4.2.2)
#>
#>  [1] C:/Users/bailey/Documents/R/win-library/4.0
#>  [2] C:/Program Files/R/R-4.2.2/library
#>
#> ------------------------------------------------------------------------------
```

hadley commented 1 year ago

Yes, we know it’s slow and we’ll work on it. No need to keep providing reprexes that don’t add new insight to the problem.

r2evans commented 1 year ago

Thanks @DavisVaughan !