For example, it took me a while to figure out the correct usage of mutate after partitioning the data:
# create data
set.seed(123) # For reproducibility
num_groups = 5000
num_grp_obs= 10
df <- data.frame(
id = 1:(num_groups*num_grp_obs), # parentheses needed: ":" binds tighter than "*"
group = rep(seq(num_groups), each = num_grp_obs),
x = rnorm(num_groups*num_grp_obs),
y = rnorm(num_groups*num_grp_obs)
)
df$x[c(5, 15)] <- NA # Introduce some NA values
# parallel setting
library(dplyr) # %>%, group_by(), mutate(), collect()
library(tidyr) # nest()
library(multidplyr)
cluster <- new_cluster(4)
cluster_library(cluster, c("dplyr"))
# partition
x_part = df %>% group_by(group) %>% nest() %>% partition(cluster)
This will not work:
x = x_part %>% mutate(fit = lm(y~x, data = .)) %>% collect()
Error in cluster_call():
! Remote computation failed in worker 1
Caused by error:
ℹ In argument: fit = lm(y ~ x, data = .).
ℹ In group 1: group = 1.
Caused by error:
! Native call to processx_connection_write_bytes failed
Caused by error:
! Invalid connection object @processx-connection.c:960 (processx_c_connection_write_bytes)
Run rlang::last_trace() to see where the error occurred.
This will work:
x = x_part %>% mutate(fit = purrr::map(data, ~lm(y~x, data = .))) %>% collect()
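A note on why the two calls may behave differently, offered as a sketch rather than a statement of how multidplyr works internally: after nest(), the data column is a list-column holding one data frame per group, so lm() has to be applied to each element, which is what purrr::map() does. In the failing call, `.` is the magrittr placeholder for the whole piped object (here x_part), not the per-group data. The example below uses plain dplyr on a small made-up data frame (the names small, nested, fits, and slope are illustrative and not from the original post) to show the list-column pattern on its own, without a cluster.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(123)

# Small illustrative data set: 3 groups with 10 observations each
small <- data.frame(
  group = rep(1:3, each = 10),
  x = rnorm(30),
  y = rnorm(30)
)

# One row per group; `data` is a list-column of per-group data frames
nested <- small %>% group_by(group) %>% nest()

fits <- nested %>%
  mutate(
    # map() returns a list, so each row can store a whole lm object
    fit = map(data, ~ lm(y ~ x, data = .x)),
    # pull the slope on x out of each fitted model as a plain number
    slope = map_dbl(fit, ~ coef(.x)[["x"]])
  ) %>%
  ungroup()

print(fits)
```

The mutate() call has the same shape as the working multidplyr version; with a partitioned data frame, purrr needs to be installed on the workers, since it is called there as purrr::map().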
Dear multidplyr developer,
Thank you for maintaining this package. I was wondering where we could find a more detailed tutorial for this package besides the documentation page https://multidplyr.tidyverse.org/articles/multidplyr.html?
Thanks!