tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

mutate_at() is quite slow compared to base `apply()` #4159

Closed stufield closed 5 years ago

stufield commented 5 years ago

The performance cost of tidyeval in mutate_at() appears to be known (see #2813), and from that issue I understood that improvements in rlang had fixed, or at least mitigated, it. However, the reprex below shows that the problem persists and is larger than I expected: mutate_at() is roughly 360x slower than base apply() by mean time.

I often work with big data containing 5,000 to 10,000 variables and typically want to transform a subset of them by column name. I would prefer to stay within the tidyverse, but for this use case, is there a better alternative to base R?
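For reference, the by-name transformation described above can be sketched in base R without `apply()`. This is only a minimal illustration, not a benchmark and not a recommendation from the dplyr maintainers: the small data frame `df` and the column vector `cols` are hypothetical stand-ins for the wide data in question. Unlike `apply()`, which converts a data frame to a matrix before iterating, `lapply()` works on the columns directly.

```r
# Hypothetical small data frame standing in for the wide data described above.
set.seed(101)
df <- data.frame(matrix(rnorm(30, mean = 100), nrow = 5))
names(df) <- paste0("p", seq_along(df))   # columns p1 .. p6

cols  <- c("p2", "p5")                    # columns to transform, chosen by name
ratio <- function(x) x / x[1L]            # ratio of each entry to the first entry

out <- df
out[cols] <- lapply(out[cols], ratio)     # replaces only the named columns

# Columns not listed in `cols` are left unchanged.
head(out)
```

Whether this is faster or slower than `apply()` or `mutate_at()` on a given data set is something to measure, for example with `bench::mark()` as in the reprex.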

suppressMessages(library(dplyr))
library(purrr)
library(bench)
library(tibble)
set.seed(101)

# Create tibble with `n` columns, each with 100 Gaussian mean = 100
# Name p1 -> p_n
n  <- 2500
df <- rerun(n, rnorm(100, mean = 100)) %>%
  as_tibble(.name_repair = "minimal") %>%
  set_names(paste0("p", 1:ncol(.)))

subset <- paste0("p", sample(1:n, n/2)) # random half of columns

# a function to pass to `mutate_at()`
# ratio to entry[1]
ratio <- function(x) x / x[1L]

# Use the bench pkg to compare
# base `apply()` to `mutate_at()`
bnch <- mark(
  base_apply = { df[, subset] <- apply(df[, subset], 2, ratio); df },
  dplyr_mutate_at = { mutate_at(df, subset, ratio) }
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.

# Absolute differences
bnch
#> # A tibble: 2 x 10
#>   expression         min   mean median    max `itr/sec` mem_alloc  n_gc
#>   <bch:expr>      <bch:> <bch:> <bch:> <bch:>     <dbl> <bch:byt> <dbl>
#> 1 base_apply      45.9ms 50.9ms 50.1ms 60.1ms   19.6       35.8MB    17
#> 2 dplyr_mutate_at  18.3s  18.3s  18.3s  18.3s    0.0545   214.1MB   506
#> # … with 2 more variables: n_itr <int>, total_time <bch:tm>

# Relative differences
summary(bnch, relative = TRUE)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 10
#>   expression        min  mean median   max `itr/sec` mem_alloc  n_gc n_itr
#>   <bch:expr>      <dbl> <dbl>  <dbl> <dbl>     <dbl>     <dbl> <dbl> <dbl>
#> 1 base_apply         1     1      1     1       360.      1      1      10
#> 2 dplyr_mutate_at  400.  360.   366.  305.        1       5.98  29.8     1
#> # … with 1 more variable: total_time <dbl>

Created on 2019-02-04 by the reprex package (v0.2.1)

Session info

``` r
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 3.5.2 (2018-12-20)
#>  os       macOS Mojave 10.14.3
#>  system   x86_64, darwin15.6.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Denver
#>  date     2019-02-04
#>
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version    date       lib source
#>  assertthat    0.2.0      2017-04-11 [1] CRAN (R 3.5.0)
#>  backports     1.1.3      2018-12-14 [1] CRAN (R 3.5.0)
#>  bench       * 1.0.1.9000 2019-02-01 [1] Github (r-lib/bench@3e5d63f)
#>  bindr         0.1.1      2018-03-13 [1] CRAN (R 3.5.0)
#>  bindrcpp    * 0.2.2      2018-03-29 [1] CRAN (R 3.5.0)
#>  callr         3.1.1      2018-12-21 [1] CRAN (R 3.5.0)
#>  cli           1.0.1      2018-09-25 [1] CRAN (R 3.5.0)
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.0)
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 3.5.0)
#>  devtools      2.0.1      2018-10-26 [1] CRAN (R 3.5.1)
#>  digest        0.6.18     2018-10-10 [1] CRAN (R 3.5.0)
#>  dplyr       * 0.7.8      2018-11-10 [1] CRAN (R 3.5.0)
#>  evaluate      0.12       2018-10-09 [1] CRAN (R 3.5.0)
#>  fansi         0.4.0      2018-10-05 [1] CRAN (R 3.5.0)
#>  fs            1.2.6      2018-08-23 [1] CRAN (R 3.5.0)
#>  glue          1.3.0      2018-07-17 [1] CRAN (R 3.5.0)
#>  highr         0.7        2018-06-09 [1] CRAN (R 3.5.0)
#>  htmltools     0.3.6      2017-04-28 [1] CRAN (R 3.5.0)
#>  knitr         1.21       2018-12-10 [1] CRAN (R 3.5.1)
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 3.5.0)
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 3.5.0)
#>  pillar        1.3.1      2018-12-15 [1] CRAN (R 3.5.0)
#>  pkgbuild      1.0.2      2018-10-16 [1] CRAN (R 3.5.0)
#>  pkgconfig     2.0.2      2018-08-16 [1] CRAN (R 3.5.0)
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.5.0)
#>  prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.5.0)
#>  processx      3.2.1      2018-12-05 [1] CRAN (R 3.5.0)
#>  profmem       0.5.0      2018-01-30 [1] CRAN (R 3.5.0)
#>  ps            1.3.0      2018-12-21 [1] CRAN (R 3.5.0)
#>  purrr       * 0.2.5      2018-05-29 [1] CRAN (R 3.5.0)
#>  R6            2.3.0      2018-10-04 [1] CRAN (R 3.5.0)
#>  Rcpp          1.0.0      2018-11-07 [1] CRAN (R 3.5.0)
#>  remotes       2.0.2      2018-10-30 [1] CRAN (R 3.5.0)
#>  rlang         0.3.1      2019-01-08 [1] CRAN (R 3.5.2)
#>  rmarkdown     1.11       2018-12-08 [1] CRAN (R 3.5.0)
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.5.0)
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.5.0)
#>  stringi       1.2.4      2018-07-20 [1] CRAN (R 3.5.0)
#>  stringr       1.3.1      2018-05-10 [1] CRAN (R 3.5.0)
#>  testthat      2.0.1      2018-10-13 [1] CRAN (R 3.5.0)
#>  tibble      * 2.0.1      2019-01-12 [1] CRAN (R 3.5.2)
#>  tidyselect    0.2.5      2018-10-11 [1] CRAN (R 3.5.0)
#>  usethis       1.4.0      2018-08-14 [1] CRAN (R 3.5.0)
#>  utf8          1.1.4      2018-05-24 [1] CRAN (R 3.5.0)
#>  withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.0)
#>  xfun          0.4        2018-10-23 [1] CRAN (R 3.5.0)
#>  yaml          2.2.0      2018-07-25 [1] CRAN (R 3.5.0)
#>
#> [1] /Users/sfield/r_libs
#> [2] /Library/Frameworks/R.framework/Versions/3.5/Resources/library
```

romainfrancois commented 5 years ago

I'm getting a much less dramatic difference with dev (0.8.0):

suppressMessages(library(dplyr))
library(purrr)
library(bench)
library(tibble)
#> Warning: package 'tibble' was built under R version 3.5.2
set.seed(101)

# Create tibble with `n` columns, each with 100 Gaussian mean = 100
# Name p1 -> p_n
n  <- 2500
df <- rerun(n, rnorm(100, mean = 100)) %>%
  as_tibble(.name_repair = "minimal") %>%
  set_names(paste0("p", 1:ncol(.)))

subset <- paste0("p", sample(1:n, n/2)) # random half of columns

# a function to pass to `mutate_at()`
# ratio to entry[1]
ratio <- function(x) x / x[1L]

# Use the bench pkg to compare
# base `apply()` to `mutate_at()`
bnch <- mark(
  base_apply = { df[, subset] <- apply(df[, subset], 2, ratio); df },
  dplyr_mutate_at = { mutate_at(df, subset, ratio) }
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.

# Absolute differences
bnch
#> # A tibble: 2 x 10
#>   expression     min    mean  median   max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch:t> <bch:t> <bch:t> <bch>     <dbl> <bch:byt> <dbl> <int>
#> 1 base_apply  51.4ms  57.4ms  56.7ms  67ms     17.4     35.8MB    13     9
#> 2 dplyr_mut…   173ms   201ms 176.9ms 253ms      4.97    49.9MB     6     3
#> # … with 1 more variable: total_time <bch:tm>

# Relative differences
summary(bnch, relative = TRUE)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 10
#>   expression   min  mean median   max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <dbl> <dbl>  <dbl> <dbl>     <dbl>     <dbl> <dbl> <dbl>
#> 1 base_apply  1     1      1     1         3.50      1     2.17     3
#> 2 dplyr_mut…  3.37  3.50   3.12  3.78      1         1.39  1        1
#> # … with 1 more variable: total_time <dbl>

We might look at it again later to understand where the time is being spent, but for now it looks "good enough".

stufield commented 5 years ago

Yay! Thanks @romainfrancois ... I'm sorry, I should have specified that I was using dplyr 0.7.8. Waiting with bated breath for dplyr 0.8.0!!!! Thank you.
