Interesting, on my computer it's only twice as slow. Could you post your devtools::session_info(), please?
I was more interested in the preamble ;)
Oops
setting value
version R version 3.3.3 (2017-03-06)
system x86_64, darwin13.4.0
ui RStudio (1.0.136)
language (EN)
collate en_US.UTF-8
tz America/Los_Angeles
date 2017-05-22
Confirmed: 3.1s with dplyr 0.5.0, 8.2s with current dplyr.
A non-trivial amount of time is spent capturing variables. enquo(), quos(), etc. should be rewritten in C. But the bulk of the perf gap seems to be due to select_vars(), which loops over inputs with eval_tidy().
- Rewrite eval_tidy() in C.
- Use overscope_eval_next() so we don't have to rebuild an overscope for each input.
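A rough sketch of how one might separate those two costs with microbenchmark (purely illustrative; it assumes dplyr 0.7.x, where select_vars() is still exported, and capture_only() is a made-up helper, not part of any package):
library(dplyr)
library(rlang)
d <- tibble::tibble(x = runif(1e3), y = runif(1e3), z = runif(1e3))
# capture_only() is a throwaway helper: it only captures its arguments as quosures,
# so it isolates the cost of quos()/enquo()-style capture.
capture_only <- function(...) quos(...)
q <- quo(mean(x))
microbenchmark::microbenchmark(
  capture = capture_only(x, y, z),          # quosure capture only
  eval    = eval_tidy(q, data = d),          # a single eval_tidy() call
  select  = select_vars(names(d), x, y, z)   # select_vars() looping over inputs
)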
Hi! I have the same issue with summarise_at(): twice as slow and maybe more, from 2 hours to 11 hours for a complex subsampling program (everything the same except dplyr version 0.5.0 vs 0.7.0).
Could you post your devtools::session_info() from a fresh session, please?
Also, does any of you see an improvement with the development version of rlang, which is now byte-compiled by default? Edit: note that you have to use devtools::install(), not devtools::load_all(), because in the latter case it won't be byte-compiled.
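For reference, a minimal sketch of installing the byte-compiled development version (it assumes the tidyverse/rlang GitHub repo referenced later in this thread):
# devtools::install_github() installs (and byte-compiles) the package;
# devtools::load_all() would skip that step.
devtools::install_github("tidyverse/rlang")
# Then, in a fresh session, confirm which version is loaded.
packageVersion("rlang")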
Session info ----------------------------------------------------------------------------
setting value
version R version 3.3.2 (2016-10-31)
system x86_64, mingw32
ui RStudio (1.0.136)
language (EN)
collate French_France.1252
tz Europe/Paris
date 2017-06-21
Packages --------------------------------------------------------------------------------
package version date source
assertthat 0.2.0 2017-04-11 CRAN (R 3.3.3)
base 3.3.2 2016-11-18 local
codetools 0.2-15 2016-10-05 CRAN (R 3.3.2)
datasets 3.3.2 2016-11-18 local
DBI 0.7 2017-06-18 CRAN (R 3.3.2)
devtools 1.13.2 2017-06-02 CRAN (R 3.3.3)
digest 0.6.12 2017-01-27 CRAN (R 3.3.2)
dplyr 0.5.0 2017-06-20 Github (tidyverse/dplyr@34b4be2)
foreach 1.4.3 2015-10-13 CRAN (R 3.3.1)
graphics 3.3.2 2016-11-18 local
grDevices 3.3.2 2016-11-18 local
iterators 1.0.8 2015-10-13 CRAN (R 3.3.1)
magrittr 1.5 2014-11-22 CRAN (R 3.3.2)
memoise 1.1.0 2017-04-21 CRAN (R 3.3.3)
methods 3.3.2 2016-11-18 local
R6 2.2.2 2017-06-17 CRAN (R 3.3.3)
Rcpp 0.12.11 2017-05-22 CRAN (R 3.3.3)
RevoUtils 10.0.2 2016-11-22 local
RevoUtilsMath 10.0.0 2016-06-15 local
rlang 0.1.1 2017-05-18 CRAN (R 3.3.3)
stats 3.3.2 2016-11-18 local
tibble 1.3.3 2017-05-28 CRAN (R 3.3.3)
tools 3.3.2 2016-11-18 local
utils * 3.3.2 2016-11-18 local
withr 1.0.2 2016-06-20 CRAN (R 3.3.2)
@cturbelin can you please share your code?
Yes, here is an extract of the most time-consuming part of the code, with random data. It is not runnable all at once and is very quick & dirty; I hope this can help. From a fresh session (for each part): 170 seconds with dplyr 0.7.0 and 113 seconds with the 0.5.0 version. The session_info previously posted now has: version R version 3.3.3 (2017-03-06), rlang 0.1.1.9000 2017-06-21 Github (tidyverse/rlang@d92dbde).
cols = c('a','b','c')
viro = expand.grid(med=1:10,yw=1:36, tranche=1:4, id_tel=1:5)
prob = sample.int(100, size=3)/100
for(col in cols) {
  viro[[col]] = rbinom(nrow(viro), 1, prob = prob[match(col, cols)])
}
calc_07 = function(viro) {
  for(i in seq_len(1200)) {
    prop.tel = summarise_at(
      group_by(.data=viro, yw, id_tel, tranche),
      .vars=cols, .funs=funs(total=sum(!is.na(.)), prop.tel=sum(., na.rm=T)/sum(!is.na(.)))
    )
    # Proportion of positives at the global (zone) level
    prop.zone = summarise_at(group_by(.data=viro, yw, tranche), .vars=cols, .funs=funs(prop.zone=sum(., na.rm=T)/sum(!is.na(.))))
  }
}
calc_05 = function(viro) {
  for(i in seq_len(1200)) {
    prop.tel = summarise_at(
      group_by(.data=viro, yw, id_tel, tranche),
      .cols=cols, .funs=funs(total=sum(!is.na(.)), prop.tel=sum(., na.rm=T)/sum(!is.na(.)))
    )
    # Proportion of positives at the global (zone) level
    prop.zone = summarise_at(group_by(.data=viro, yw, tranche), .cols=cols, .funs=funs(prop.zone=sum(., na.rm=T)/sum(!is.na(.))))
  }
}
library(devtools)
#devtools::install_github("tidyverse/dplyr")
install.packages("dplyr_0.7.0.zip", repos = NULL)
library(dplyr)
time = proc.time()
calc_07(viro)
time = as.numeric(proc.time() - time)
print(time)
.rs.restartR()
install.packages("dplyr_0.5.0.zip", repos = NULL)
# dplyr 0.5.0
# devtools::install_github("tidyverse/dplyr", ref = "34b4be202e89716c4fa3161cf0b194f31ad6e72c")
library(dplyr)
time = proc.time()
calc_05(viro)
time = as.numeric(proc.time() - time)
print(time)
@cturbelin If this is the real logic that your code is stuck at, here is an (unruly) way to optimize it.
The difference is that here mutate()/summarise() use hybrid operators only: row_number()/max()/mean(). Because row_number() skips over NA values, max(row_number(a)) returns the same value as sum(!is.na(a)). Similarly, sum(., na.rm = TRUE) / sum(!is.na(a)) looks equivalent to mean(., na.rm = TRUE) where hybrid evaluation is available. The only reason I used all the quosure stuff is that max(row_number(a), na.rm = TRUE) does not currently trigger hybrid evaluation, therefore I need mutate(a_n = row_number(a)) %>% summarise(a_t = max(a_n)) instead.
Hope there will be more hybrid operators in the future, or even pluggable ones where one can use Rcpp to define one inline.
p.s. I do not endorse this optimization. very hacky 💩 💧 🙌
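Before the reprex, a quick sanity check of the two equivalences claimed above, in base R only (the vector a below is made up; the reprex data itself has no NAs, so plain max() works there as well):
a <- c(1, NA, 0, 1, NA, 1)
# row_number(a) is documented as rank(a, ties.method = "first", na.last = "keep"),
# so its largest value equals the number of non-missing elements.
r <- rank(a, ties.method = "first", na.last = "keep")
max(r, na.rm = TRUE) == sum(!is.na(a))
#> [1] TRUE
# The proportion of positives is just the mean over non-missing values.
sum(a, na.rm = TRUE) / sum(!is.na(a)) == mean(a, na.rm = TRUE)
#> [1] TRUE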
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(purrr)
library(rlang)
#>
#> Attaching package: 'rlang'
#> The following objects are masked from 'package:purrr':
#>
#> %@%, %||%, as_function, flatten, flatten_chr, flatten_dbl,
#> flatten_int, flatten_lgl, invoke, list_along, modify, prepend,
#> rep_along, splice
cols = c('a','b','c')
viro = expand.grid(med=1:10,yw=1:36, tranche=1:4, id_tel=1:5)
prob = sample.int(100, size=3)/100
for(col in cols) {
  viro[[col]] = rbinom(nrow(viro), 1, prob = prob[match(col, cols)])
}
calc_07 = function(viro) {
  col_names = map(cols, as.name)
  row_number_names = paste0(".row_number_", cols)
  # One row_number() quosure per input column; these only exist so that the
  # counting step stays within hybrid-evaluated verbs.
  mutate_row_numbers =
    set_names(
      map(col_names, ~ quo(row_number(!!.x))),
      row_number_names
    )
  # Totals come from max(row_number), proportions from mean(); both are hybrid operators.
  summarise_cols = c(
    set_names(
      map(row_number_names, ~ quo(max(!!as.name(.x)))),
      paste0(cols, "_total")
    ),
    set_names(
      map(col_names, ~ quo(mean(!!.x, na.rm = TRUE))),
      paste0(cols, "_prop.tel")
    )
  )
  for(i in seq_len(120)) {
    group_by(viro, yw, id_tel, tranche) %>%
      mutate(!!!mutate_row_numbers) %>%
      summarise(!!!summarise_cols)
  }
}
# Original summarise_at() formulation, for comparison.
calc_07_old = function(viro) {
  for(i in seq_len(120)) {
    summarise_at(
      group_by(.data=viro, yw, id_tel, tranche),
      .vars=cols, .funs=funs(total=sum(!is.na(.)), prop.tel=sum(., na.rm=T)/sum(!is.na(.)))
    )
  }
}
system.time({calc_07(viro)})
#> user system elapsed
#> 1.172 0.011 1.193
system.time({calc_07_old(viro)})
#> user system elapsed
#> 13.578 0.089 13.899
There might be some room for improvement in tidyselect, but in a typical example the overhead of _at is OK:
> library(tidyverse)
> d <- tibble(col1=parse_character(1:100000))
>
> microbenchmark::microbenchmark(
+ mutate_at(d, .funs = parse_double, .vars = vars(col1)),
+ mutate(d, col1 = parse_double(col1))
+ )
Unit: milliseconds
expr min lq mean median uq max neval
mutate_at(d, .funs = parse_double, .vars = vars(col1)) 4.552803 5.146389 5.550810 5.353240 5.639810 15.71942 100
mutate(d, col1 = parse_double(col1)) 3.664240 4.196303 4.594159 4.324803 4.548783 12.15335 100
I'll close this here now. If this is still a problem, please open an issue in the tidyselect repo.
It seems the tidyeval perf improvements of rlang 0.2.0 paid off in this case.
Yeah, I guess so. I was wondering if perhaps it could perform better in cases where vars() is just given a symbol or an enumeration of symbols, because there's less work to do. Maybe overkill.
In any case, it's good enough as far as this issue is concerned.
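A rough way to test that idea (just a sketch; the data and column name are made up, and it assumes a current dplyr with microbenchmark installed): compare vars() given a bare symbol against a select helper that has to be evaluated over the column names.
library(dplyr)
d <- tibble::tibble(col1 = runif(1e5))
microbenchmark::microbenchmark(
  symbol = mutate_at(d, vars(col1), as.character),
  helper = mutate_at(d, vars(matches("^col")), as.character)
)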
You still need to capture the environments, and symbols should already return early from the unquote detection code. I don't think it is possible to do much better in this case.
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
The code is 10x slower with later versions of tibble and dplyr.
dplyr 0.5.0 and tibble 1.3.0: runtime 12.005 s
dplyr 0.5.0.9005 and tibble 1.3.1: runtime 114.202 s