Open dzhang32 opened 4 years ago
Don't create an destroy the cluster in your function, instead pass it in as an argument so you can re-use it.
Thanks a lot for the quick response and tip! This makes a lot of sense and yes, the run time comparison was unfair since the first didn't take into account cluster creation/destruction.
However, even after removal of cluster creation/destruction, this increase in run time when wrapping multidplyr code into a user-defined function still persists. Also the lag seems to be cumulatively worse with an increased number of workers. See example below for 4 workers:
library(tidyverse)
library(multidplyr)
library(stringr)
df <- tibble(index = rep(1:100000, 3),
to_concat = (rep(1:100000, 3)))
cluster <- new_cluster(4)
system.time(expr = {
df %>%
group_by(index) %>%
partition(cluster) %>%
summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>%
collect()
})
rm(cluster)
user system elapsed 0.466 0.046 2.110
user_defined_func <- function(df, cluster){
df <-
df %>%
group_by(index) %>%
partition(cluster) %>%
summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>%
collect()
return(df)
}
cluster <- new_cluster(4)
system.time(expr = {
user_defined_func(df, cluster)
})
rm(cluster)
user system elapsed 1.847 0.061 3.484
I can't replicate your results:
library(tidyverse)
library(multidplyr)
library(stringr)
df <- tibble(
index = rep(1:100000, 3),
to_concat = (rep(1:100000, 3))
)
cluster <- new_cluster(2)
system.time(expr = {
df %>%
group_by(index) %>%
partition(cluster) %>%
summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>%
collect()
})
#> user system elapsed
#> 0.614 0.058 2.807
f <- function(df, cluster) {
df %>%
group_by(index) %>%
partition(cluster) %>%
summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>%
collect()
}
system.time(f(df, cluster))
#> user system elapsed
#> 1.258 0.055 2.697
Created on 2019-11-16 by the reprex package (v0.3.0)
I think this is because the lag when using only 2 cores, given the scale of the example, is minimal.
See below where I've increased the size of input data by 5x and run across 2-5 cores to emphasise the lag time. Hopefully this should be reproducible.
PS. Thanks for introducing me to reprex
- really useful!
library(tidyverse)
library(multidplyr)
library(stringr)
df <- tibble(
index = rep(1:500000, 3),
to_concat = (rep(1:500000, 3))
)
for(i in 2:5){
print(str_c("Number of cores: ", i))
cluster <- new_cluster(i)
print("Outside f:")
print(system.time(expr = {
df %>%
group_by(index) %>%
partition(cluster) %>%
summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>%
collect()
}))
f <- function(df, cluster) {
df %>%
group_by(index) %>%
partition(cluster) %>%
summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>%
collect()
}
print("Inside f:")
print(system.time(f(df, cluster)))
}
#> [1] "Number of cores: 2"
#> [1] "Outside f:"
#> user system elapsed
#> 2.557 0.235 9.804
#> [1] "Inside f:"
#> user system elapsed
#> 5.937 0.230 12.450
#> [1] "Number of cores: 3"
#> [1] "Outside f:"
#> user system elapsed
#> 2.077 0.187 7.281
#> [1] "Inside f:"
#> user system elapsed
#> 7.429 0.231 12.218
#> [1] "Number of cores: 4"
#> [1] "Outside f:"
#> user system elapsed
#> 2.013 0.165 6.352
#> [1] "Inside f:"
#> user system elapsed
#> 9.242 0.247 13.461
#> [1] "Number of cores: 5"
#> [1] "Outside f:"
#> user system elapsed
#> 2.066 0.166 6.359
#> [1] "Inside f:"
#> user system elapsed
#> 11.060 0.278 14.354
Created on 2019-11-16 by the reprex package (v0.3.0)
Ok, now I see what you mean:
library(tidyverse)
library(multidplyr)
df <- tibble(
index = rep(1:500000, 3),
to_concat = (rep(1:500000, 3))
)
cluster <- new_cluster(5)
print(system.time({
df %>%
group_by(index) %>%
partition(cluster) %>%
summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>%
collect()
}))
#> user system elapsed
#> 3.175 0.313 8.864
f <- function(df, cluster) {
df %>%
group_by(index) %>%
partition(cluster) %>%
summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>%
collect()
}
print(system.time(f(df, cluster)))
#> user system elapsed
#> 11.328 0.352 14.678
Created on 2019-11-18 by the reprex package (v0.3.0)
The performance difference comes from within summarise.multidplyr_party_df()
, because the generated call
is 12 MB rather than 400kB. This is because the environment of the ...
contains a bunch of stuff that we don't care about (mostly importantly in the wrapped function case, a full copy of the dataset). Even in the case that's currently fast, it looks like we're still copying stuff, so this is worth some further exploration. The easiest approach would probably be to take the same approach as dbplyr and dtplyr, and walk the quosure to embed needed data (building from dtplyr:::dt_squash()
).
Has there been any progress on this issue? I'm in the process of updating a package of mine to use multidplyr, and I just ran into the same problem.
In addition, if I call the function and pass in the number of cores to use (8, in this case), only four of those display heavy activity in my CPU history. If I run the same code without running it as a function, all 8 cores display heavy activity.
I run into the same problem.
For now I need to go back and use the beta version which work out OK
remotes::install_github("tidyverse/multidplyr#10")
In addition, if I call the function and pass in the number of cores to use (8, in this case), only four of those display heavy activity in my CPU history. If I run the same code without running it as a function, all 8 cores display heavy activity.
Also having this problem, opened a new issue: #123
I would like to wrap multidplyr code into a function, since I am applying the parallelised multidplyr code to multiple datasets. However, this significantly slows the processing speed, to a point where the parallelisation of the code is not advantageous anymore in terms of run time.
Is there a fix for this? Am I doing something incorrectly? I believe that this may have something to do with the global environment vs local environment to the user-defined function.
Below is an example to illustrate the effect of wrapping multidplyr functions into a user-defined function in terms of run time.