tidyverse / multidplyr

A dplyr backend that partitions a data frame over multiple processes
https://multidplyr.tidyverse.org

Using multidplyr within a function causes unexpected behaviour #123

Closed avsdev-cw closed 2 years ago

avsdev-cw commented 2 years ago

As per title:

library(dplyr)  # provides the %>% pipe used below
cluster <- multidplyr::new_cluster(parallel::detectCores())
someFunc <- function(data, cluster) {
  data %>%
    dplyr::group_by(cyl) %>%
    multidplyr::partition(cluster) %>%
    dplyr::summarise(avg_mpg = mean(mpg)) %>%
    dplyr::collect()
}
someFunc(mtcars, cluster)

If you are lucky, two or more cores (main plus workers) get used, seemingly at random. Otherwise, only the main thread gets used.
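
One way to check which processes actually do the work is to record the process ID inside summarise(). The following is a minimal sketch, assuming dplyr and multidplyr are installed; the worker_pid column and the two-worker cluster are illustrative choices, not part of the original report. Note that partition() keeps all rows of a group on the same worker, so the three cyl groups in mtcars can occupy at most three workers regardless of cluster size, which may account for some of the idle cores.

```r
library(dplyr)       # provides %>% and the dplyr verbs
library(multidplyr)

# Assumption: a small two-worker cluster, to keep the sketch cheap to run.
cluster <- new_cluster(2)

res <- mtcars %>%
  group_by(cyl) %>%
  partition(cluster) %>%
  # Sys.getpid() is evaluated on the worker that holds each group,
  # so worker_pid (a hypothetical column name) shows where the work ran.
  summarise(avg_mpg = mean(mpg), worker_pid = Sys.getpid()) %>%
  collect()

print(res)

# Number of distinct worker processes that computed at least one group.
n_workers_used <- length(unique(res$worker_pid))
print(n_workers_used)
```

If worker_pid differs from Sys.getpid() in the main session, the summary ran on the workers rather than on the main process.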

greentheo commented 2 years ago

+1. It seems there's a lot of data copying and other things going on in the background. Would love to see this resolved.

hadley commented 2 years ago

Duplicate of #87