
dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

mutate_at is slow with dplyr version 0.5.0.9005 #2813

Closed: neelrakholia closed this issue 6 years ago

neelrakholia commented 7 years ago

The code is 10x slower with later versions of tibble and dplyr.

dplyr version 0.5.0 and tibble version 1.3.0 runtime: 12.005s
dplyr version 0.5.0.9005 and tibble version 1.3.1 runtime: 114.202s

library(tidyverse)

fun <- function(x) {
  tibble(
    col1 = x
  ) %>% 
  # On dplyr version 0.5.0 and tibble version 1.3.0
  # mutate_at(.funs = parse_double, .cols = vars(col1)) 

  # On dplyr version 0.5.0.9005 and tibble version 1.3.1
  mutate_at(.funs = parse_double, .vars = vars(col1))
}

i <- parse_character(1:10000)

start <- proc.time()
map_df(i, fun)
print(proc.time() - start)
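
For comparison, a minimal sketch (not part of the original report; names follow the reprex above) that does the same conversion with a single mutate_at() call on the full vector, so the per-call capture overhead is paid only once:

library(tidyverse)

i <- parse_character(1:10000)

# One call over all 10,000 values instead of 10,000 calls on 1-row tibbles
system.time({
  tibble(col1 = i) %>%
    mutate_at(.funs = parse_double, .vars = vars(col1))
})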
lionel- commented 7 years ago

Interesting, on my computer it's only twice as slow. Could you post your devtools::session_info(), please?

neelrakholia commented 7 years ago
package * version date source
assertthat 0.2.0 2017-04-11 CRAN (R 3.3.3)
backports 1.0.5 2017-01-18 CRAN (R 3.3.2)
bindr 0.1 2016-11-13 cran (@0.1)
bindrcpp * 0.1 2016-12-11 cran (@0.1)
broom 0.4.2 2017-02-13 CRAN (R 3.3.2)
car 2.1-4 2016-12-02 CRAN (R 3.3.2)
caret * 6.0-76 2017-04-18 CRAN (R 3.3.2)
cellranger 1.1.0 2016-07-27 CRAN (R 3.3.0)
class 7.3-14 2015-08-30 CRAN (R 3.3.3)
codetools 0.2-15 2016-10-05 CRAN (R 3.3.3)
colorspace 1.3-2 2016-12-14 CRAN (R 3.3.2)
compare * 0.2-6 2015-08-25 CRAN (R 3.3.0)
DBI * 0.6-1 2017-04-01 CRAN (R 3.3.2)
dbplyr * 0.0.0.9001 2017-05-18 Github (tidyverse/dbplyr@3258b03)
devtools 1.12.0 2016-06-24 CRAN (R 3.3.0)
digest 0.6.12 2017-01-27 CRAN (R 3.3.2)
dplyr * 0.5.0.9005 2017-05-18 Github (tidyverse/dplyr@aece1a5)
e1071 1.6-8 2017-02-02 CRAN (R 3.3.2)
evaluate 0.10 2016-10-11 CRAN (R 3.3.0)
forcats 0.2.0 2017-01-23 CRAN (R 3.3.2)
foreach 1.4.3 2015-10-13 CRAN (R 3.3.0)
foreign 0.8-67 2016-09-13 CRAN (R 3.3.3)
gbm * 2.1.3 2017-03-21 CRAN (R 3.3.2)
ggplot2 * 2.2.1.9000 2017-04-22 Github (tidyverse/ggplot2@f4398b6)
glue 1.0.0 2017-04-17 cran (@1.0.0)
gtable 0.2.0 2016-02-26 CRAN (R 3.3.0)
haven 1.0.0 2016-09-23 CRAN (R 3.3.0)
hms 0.3 2016-11-22 CRAN (R 3.3.2)
htmltools 0.3.6 2017-04-28 cran (@0.3.6)
httr 1.2.1 2016-07-03 CRAN (R 3.3.0)
iterators 1.0.8 2015-10-13 CRAN (R 3.3.0)
jsonlite 1.4 2017-04-08 CRAN (R 3.3.2)
knitr 1.15.20 2017-04-25 Github (yihui/knitr@f3a490b)
lattice * 0.20-35 2017-03-25 CRAN (R 3.3.2)
lazyeval 0.2.0 2016-06-12 CRAN (R 3.3.0)
lme4 1.1-13 2017-04-19 CRAN (R 3.3.2)
lubridate * 1.6.0 2016-09-13 CRAN (R 3.3.0)
magrittr 1.5 2014-11-22 CRAN (R 3.3.0)
MASS 7.3-47 2017-04-21 CRAN (R 3.3.3)
Matrix 1.2-8 2017-01-20 CRAN (R 3.3.3)
MatrixModels 0.4-1 2015-08-22 CRAN (R 3.3.0)
memoise 1.1.0 2017-04-21 CRAN (R 3.3.3)
mgcv 1.8-17 2017-02-08 CRAN (R 3.3.3)
minqa 1.2.4 2014-10-09 CRAN (R 3.3.0)
mnormt 1.5-5 2016-10-15 CRAN (R 3.3.0)
ModelMetrics 1.1.0 2016-08-26 CRAN (R 3.3.0)
modelr * 0.1.0 2016-08-31 CRAN (R 3.3.0)
munsell 0.4.3 2016-02-13 CRAN (R 3.3.0)
nlme 3.1-131 2017-02-06 CRAN (R 3.3.3)
nloptr 1.0.4 2014-08-04 CRAN (R 3.3.0)
nnet 7.3-12 2016-02-02 CRAN (R 3.3.3)
pbkrtest 0.4-7 2017-03-15 CRAN (R 3.3.2)
plyr 1.8.4 2016-06-08 CRAN (R 3.3.0)
psych 1.7.3.21 2017-03-22 CRAN (R 3.3.2)
purrr * 0.2.2 2016-06-18 CRAN (R 3.3.0)
quantreg 5.33 2017-04-18 CRAN (R 3.3.2)
R6 2.2.1 2017-05-10 cran (@2.2.1)
Rcpp 0.12.10 2017-03-19 CRAN (R 3.3.2)
readr * 1.1.0 2017-03-22 CRAN (R 3.3.2)
readxl 1.0.0 2017-04-18 CRAN (R 3.3.2)
reshape2 1.4.2 2016-10-22 CRAN (R 3.3.0)
rlang 0.1 2017-05-06 cran (@0.1)
rmarkdown 1.4 2017-03-24 CRAN (R 3.3.2)
RPostgreSQL * 0.4-1 2016-05-08 CRAN (R 3.3.0)
rprojroot 1.2 2017-01-16 CRAN (R 3.3.2)
rsconnect 0.7 2016-12-21 CRAN (R 3.3.2)
rvest 0.3.2 2016-06-17 CRAN (R 3.3.0)
scales 0.4.1 2016-11-09 CRAN (R 3.3.2)
SparseM 1.76 2017-03-09 CRAN (R 3.3.2)
stringi 1.1.5 2017-04-07 CRAN (R 3.3.2)
stringr 1.2.0 2017-02-18 CRAN (R 3.3.2)
survival * 2.41-3 2017-04-04 CRAN (R 3.3.2)
tibble * 1.3.1 2017-05-18 Github (tidyverse/tibble@8f30072)
tidyr * 0.6.1 2017-01-10 CRAN (R 3.3.2)
tidyverse * 1.1.1 2017-01-27 CRAN (R 3.3.2)
withr 1.0.2 2016-06-20 CRAN (R 3.3.0)
xml2 1.1.1 2017-01-24 CRAN (R 3.3.2)
yaml 2.1.14 2016-11-12 CRAN (R 3.3.2)
lionel- commented 7 years ago

I was more interested in the preamble ;)

neelrakholia commented 7 years ago

Oops

setting value
version R version 3.3.3 (2017-03-06)
system x86_64, darwin13.4.0
ui RStudio (1.0.136)
language (EN)
collate en_US.UTF-8
tz America/Los_Angeles
date 2017-05-22

krlmlr commented 7 years ago

Confirmed: 3.1s with dplyr 0.5.0, 8.2s with current dplyr.

lionel- commented 7 years ago

A non-trivial amount of time is spent capturing variables; enquo(), quos(), etc. should be rewritten in C.

But the bulk of the performance gap seems to be due to select_vars(), which loops over its inputs with eval_tidy().
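
As an aside, a rough sketch (not from the original comment; the data size and loop count are made up for illustration) of how one might profile that overhead locally with profvis:

library(dplyr)
library(profvis)

d <- tibble::tibble(col1 = as.character(1:100))

# Repeated small mutate_at() calls make the capture/select_vars() overhead
# dominate, so it shows up clearly in the flame graph.
profvis({
  for (i in seq_len(2000)) {
    mutate_at(d, .vars = vars(col1), .funs = as.numeric)
  }
})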

cturbelin commented 7 years ago

Hi! I have the same issue with summarise_at(): it is at least twice as slow, and a complex subsampling program went from 2 hours to 11 hours with everything identical except the dplyr version (0.5.0 vs 0.7.0).

lionel- commented 7 years ago

Could you post your devtools::session_info() from a fresh session, please?

lionel- commented 7 years ago

Also, do any of you see an improvement with the development version of rlang, which is now byte-compiled by default?

Edit: note that you have to install it with devtools::install(), not devtools::load_all(), because in the latter case it won't be byte-compiled.
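
A minimal sketch of that suggestion (the local path is a placeholder, and the GitHub slug is the one used elsewhere in this thread):

# From a local checkout: installs the package, which byte-compiles it
devtools::install("~/path/to/rlang")

# Or straight from GitHub; a regular install should also byte-compile it
devtools::install_github("tidyverse/rlang")

# devtools::load_all("~/path/to/rlang") skips byte-compilation, so it is not
# suitable for this timing comparison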

cturbelin commented 7 years ago

Session info ----------------------------------------------------------------------------
setting  value
version  R version 3.3.2 (2016-10-31)
system   x86_64, mingw32
ui       RStudio (1.0.136)
language (EN)
collate  French_France.1252
tz       Europe/Paris
date     2017-06-21

Packages --------------------------------------------------------------------------------
package       version date       source
assertthat    0.2.0   2017-04-11 CRAN (R 3.3.3)
base          3.3.2   2016-11-18 local
codetools     0.2-15  2016-10-05 CRAN (R 3.3.2)
datasets      3.3.2   2016-11-18 local
DBI           0.7     2017-06-18 CRAN (R 3.3.2)
devtools      1.13.2  2017-06-02 CRAN (R 3.3.3)
digest        0.6.12  2017-01-27 CRAN (R 3.3.2)
dplyr         0.5.0   2017-06-20 Github (tidyverse/dplyr@34b4be2)
foreach       1.4.3   2015-10-13 CRAN (R 3.3.1)
graphics      3.3.2   2016-11-18 local
grDevices     3.3.2   2016-11-18 local
iterators     1.0.8   2015-10-13 CRAN (R 3.3.1)
magrittr      1.5     2014-11-22 CRAN (R 3.3.2)
memoise       1.1.0   2017-04-21 CRAN (R 3.3.3)
methods       3.3.2   2016-11-18 local
R6            2.2.2   2017-06-17 CRAN (R 3.3.3)
Rcpp          0.12.11 2017-05-22 CRAN (R 3.3.3)
RevoUtils     10.0.2  2016-11-22 local
RevoUtilsMath 10.0.0  2016-06-15 local
rlang         0.1.1   2017-05-18 CRAN (R 3.3.3)
stats         3.3.2   2016-11-18 local
tibble        1.3.3   2017-05-28 CRAN (R 3.3.3)
tools         3.3.2   2016-11-18 local
utils *       3.3.2   2016-11-18 local
withr         1.0.2   2016-06-20 CRAN (R 3.3.2)

hadley commented 7 years ago

@cturbelin can you please share your code?

cturbelin commented 7 years ago

Yes, here is an extract of the most time-consuming part of the code, with random data. It isn't runnable in one go and is very quick and dirty, but I hope it helps. From a fresh session (for each part): 170 seconds with dplyr 0.7.0 and 113 seconds with 0.5.0. Relative to the session_info posted earlier, the session now has R version 3.3.3 (2017-03-06) and rlang 0.1.1.9000 (2017-06-21, Github tidyverse/rlang@d92dbde).

cols = c('a','b','c')
viro = expand.grid(med=1:10,yw=1:36, tranche=1:4, id_tel=1:5)

prob = sample.int(100, size=3)/100
for(col in cols) {
  viro[[col]] = rbinom(nrow(viro), 1, prob = prob[match(col, cols)])
}

calc_07 = function(viro) {

  for(i in seq_len(1200)) {

    prop.tel = summarise_at(
      group_by(.data=viro, yw, id_tel, tranche),
      .vars=cols, .funs=funs(total=sum(!is.na(.)), prop.tel=sum(., na.rm=T)/sum(!is.na(.)))
    )

    # Proportion of positives at the global (zone) level
    prop.zone = summarise_at(group_by(.data=viro, yw, tranche), .vars=cols, .funs=funs(prop.zone=sum(., na.rm=T)/sum(!is.na(.))))
  }
}

calc_05 = function(viro) {

  for(i in seq_len(1200)) {

    prop.tel = summarise_at(
      group_by(.data=viro, yw, id_tel, tranche),
      .cols=cols, .funs=funs(total=sum(!is.na(.)), prop.tel=sum(., na.rm=T)/sum(!is.na(.)))
    )

    # Proportion of positives at the global (zone) level
    prop.zone = summarise_at(group_by(.data=viro, yw, tranche), .cols=cols, .funs=funs(prop.zone=sum(., na.rm=T)/sum(!is.na(.))))
  }
}

library(devtools)

#devtools::install_github("tidyverse/dplyr")
install.packages("dplyr_0.7.0.zip", repos = NULL)
library(dplyr)
time = proc.time()
calc_07(viro)
time = as.numeric(proc.time() - time)
print(time)

.rs.restartR()
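# .rs.restartR() is an RStudio-only helper: it restarts the R session so the
# other dplyr version can be attached cleanly.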

install.packages("dplyr_0.5.0.zip", repos = NULL)
# dplyr 0.5.0
# devtools::install_github("tidyverse/dplyr", ref = "34b4be202e89716c4fa3161cf0b194f31ad6e72c")
library(dplyr)
time = proc.time()
calc_05(viro)
time = as.numeric(proc.time() - time)
print(time)
saurfang commented 6 years ago

@cturbelin If this is the real logic your code is stuck on, here is an (admittedly unruly) way to optimize it.

The difference is that here mutate/summarise use only hybrid operators: row_number, max, and mean. Because row_number() skips over NA values, max(row_number(a)) returns the same value as sum(!is.na(a)). Similarly, sum(., na.rm = TRUE) / sum(!is.na(.)) is equivalent to mean(., na.rm = TRUE) where hybrid evaluation is available.
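
As a quick illustration (not from the original comment; the vector a is made up and, like the simulated data in this issue, contains no NAs):

a <- c(1, 0, 1, 1, 0)
max(dplyr::row_number(a)) == sum(!is.na(a))                     # TRUE
sum(a, na.rm = TRUE) / sum(!is.na(a)) == mean(a, na.rm = TRUE)  # TRUE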

The only reason I used all the quosure machinery is that max(row_number(a), na.rm = TRUE) does not currently trigger hybrid evaluation, so I need mutate(a_n = row_number(a)) %>% summarise(a_t = max(a_n)) instead.

I hope there will be more hybrid operators in the future, or even pluggable ones where you could use Rcpp to define one inline.

P.S. I do not endorse this optimization. Very hacky 💩 💧 🙌

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)
library(rlang)
#> 
#> Attaching package: 'rlang'
#> The following objects are masked from 'package:purrr':
#> 
#>     %@%, %||%, as_function, flatten, flatten_chr, flatten_dbl,
#>     flatten_int, flatten_lgl, invoke, list_along, modify, prepend,
#>     rep_along, splice

cols = c('a','b','c')
viro = expand.grid(med=1:10,yw=1:36, tranche=1:4, id_tel=1:5)

prob = sample.int(100, size=3)/100
for(col in cols) {
  viro[[col]] = rbinom(nrow(viro), 1, prob = prob[match(col, cols)])
}

calc_07 = function(viro) {

  col_names = map(cols, as.name)
  row_number_names = paste0(".row_number_", cols)

  mutate_row_numbers = 
    set_names(
      map(col_names, ~ quo(row_number(!!.x))),
      row_number_names
    )
  summarise_cols = c(
    set_names(
      map(row_number_names, ~ quo(max(!!as.name(.x)))),
      paste0(cols, "_total")
    ),
    set_names(
      map(col_names, ~ quo(mean(!!.x, na.rm = TRUE))),
      paste0(cols, "_prop.tel")
    )
  )
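  # Note: 120 iterations here (and in calc_07_old below), vs 1200 in the
  # original example above.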

  for(i in seq_len(120)) {
    group_by(viro, yw, id_tel, tranche) %>%
      mutate(!!!mutate_row_numbers) %>%
      summarise(!!!summarise_cols)
  }
}

calc_07_old = function(viro) {

  for(i in seq_len(120)) {
    summarise_at(
      group_by(.data=viro, yw, id_tel, tranche),
      .vars=cols, .funs=funs(total=sum(!is.na(.)), prop.tel=sum(., na.rm=T)/sum(!is.na(.)))
    )
  }
}

system.time({calc_07(viro)})
#>    user  system elapsed 
#>   1.172   0.011   1.193
system.time({calc_07_old(viro)})
#>    user  system elapsed 
#>  13.578   0.089  13.899
romainfrancois commented 6 years ago

There might be some room for improvement in tidyselect:

[screenshot of profiling results, 2018-05-30]

but in a typical example, the overhead of _at is ok:

> library(tidyverse)
> d <- tibble(col1=parse_character(1:100000))
> 
> microbenchmark::microbenchmark(
+   mutate_at(d, .funs = parse_double, .vars = vars(col1)), 
+   mutate(d, col1 = parse_double(col1))
+ )
Unit: milliseconds
                                                   expr      min       lq     mean   median       uq      max neval
 mutate_at(d, .funs = parse_double, .vars = vars(col1)) 4.552803 5.146389 5.550810 5.353240 5.639810 15.71942   100
                   mutate(d, col1 = parse_double(col1)) 3.664240 4.196303 4.594159 4.324803 4.548783 12.15335   100

I'll close this now; if it is still a problem, please open an issue in the tidyselect repo.

lionel- commented 6 years ago

It seems the tidy eval performance improvements in rlang 0.2.0 paid off in this case.

romainfrancois commented 6 years ago

Yeah, I guess so. I was wondering whether it could perform better in cases where vars() is given just a symbol or an enumeration of symbols, since there's less work to do. Maybe that's overkill, though.

In any case, it's good enough as far as this issue is concerned.

lionel- commented 6 years ago

You still need to capture the environments, and symbols should already return early from the unquote-detection code. I don't think it's possible to do much better in this case.
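
For context, a tiny sketch (not from the original comment) of what capturing the environment means here, using quo() and eval_tidy() from rlang:

library(rlang)

f <- function() {
  x <- 10
  quo(x)   # a quosure: the expression `x` plus the environment it was created in
}

q <- f()
q             # prints the expression and its captured environment
eval_tidy(q)  # 10, evaluated in that environment even though x is not in scope here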

lock[bot] commented 5 years ago

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/