tidyverse / multidplyr

A dplyr backend that partitions a data frame over multiple processes
https://multidplyr.tidyverse.org

tutorial of multidplyr #154

Open wbvguo opened 9 months ago

wbvguo commented 9 months ago

Dear multidplyr developer,

Thank you for maintaining this package. Is there a more detailed tutorial for it beyond the documentation page https://multidplyr.tidyverse.org/articles/multidplyr.html?

For example, it took me a while to figure out the correct usage of mutate after partitioning the data:

# create data
set.seed(123)  # For reproducibility

num_groups = 5000
num_grp_obs= 10

df <- data.frame(
  id = seq_len(num_groups * num_grp_obs),  # `:` binds tighter than `*`, so 1:num_groups*num_grp_obs yields only num_groups values
  group = rep(seq(num_groups), each = num_grp_obs),
  x = rnorm(num_groups*num_grp_obs),
  y = rnorm(num_groups*num_grp_obs)
)

df$x[c(5, 15)] <- NA # Introduce some NA values

# parallel setting
library(dplyr)       # %>%, group_by(), mutate(), collect()
library(tidyr)       # nest()
library(multidplyr)
cluster <- new_cluster(4)
cluster_library(cluster, c("dplyr"))

# partition
x_part = df %>% group_by(group) %>% nest() %>% partition(cluster) 

This does not work:

x = x_part %>% mutate(fit = lm(y~x, data = .)) %>% collect()

Error in cluster_call():
! Remote computation failed in worker 1
Caused by error:
ℹ In argument: fit = lm(y ~ x, data = .).
ℹ In group 1: group = 1.
Caused by error:
! Native call to processx_connection_write_bytes failed
Caused by error:
! Invalid connection object @processx-connection.c:960 (processx_c_connection_write_bytes)
Run rlang::last_trace() to see where the error occurred.

This works:

x = x_part %>% mutate(fit = purrr::map(data, ~lm(y~x, data = .))) %>% collect()
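For anyone who wants to try the pattern without starting a cluster, here is a minimal local sketch using only dplyr, tidyr, and purrr. The names `small`, `nested`, `fits`, and `slope` are made up for illustration. My understanding (an assumption, not something the documentation states) is that after nest() each row stores its group's data frame in the list-column `data`, so the model has to be fitted once per element with map(), and the result is stored as another list-column.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(123)

# Small illustrative data set: 3 groups with 10 observations each
small <- data.frame(
  group = rep(1:3, each = 10),
  x = rnorm(30),
  y = rnorm(30)
)

# One row per group; the per-group rows live in the list-column `data`
nested <- small %>% group_by(group) %>% nest()

fits <- nested %>%
  mutate(
    # Fit one model per nested data frame; map() returns a list-column
    fit = map(data, ~ lm(y ~ x, data = .x)),
    # Pull the slope out of each fitted model as a plain numeric column
    slope = map_dbl(fit, ~ coef(.x)[["x"]])
  ) %>%
  ungroup()

print(fits)
```

On a partitioned data frame, the same mutate() call would go between partition(cluster) and collect(), as in the working line above.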

Thanks!