tidyverse / multidplyr

A dplyr backend that partitions a data frame over multiple processes
https://multidplyr.tidyverse.org

Custom or "own" functions don't work with multidplyr #113

Closed by internaut 3 years ago

internaut commented 3 years ago

In my experiments with multidplyr, I found that functions I define myself don't work when used in a pipeline on a partitioned data frame:

library(dplyr)  # needed locally for %>%, group_by(), summarise(), filter()
library(multidplyr)
cluster <- new_cluster(4)
cluster_library(cluster, "dplyr")

flight_dest <- nycflights13::flights %>% 
  group_by(dest) %>% 
  partition(cluster)

mymean <- function(x) {
  sum(x, na.rm = TRUE) / sum(!is.na(x))
}

mean_delay <- flight_dest %>% 
  summarise(delay = mymean(arr_delay), n = n()) %>% 
  filter(n > 25)
mean_delay

Using multidplyr 0.1.0, this fails with:

Error: Remote computation failed:
Problem with `summarise()` input `delay`.
x could not find function "mymean"

It looks like the definition of mymean is not known to the cluster's worker processes. Is there a way to somehow "distribute" my own functions to the workers, too?

Since multidplyr is meant for parallelizing complex functions, I think this is an important limitation that should at least be noted in the documentation.

hadley commented 3 years ago

multidplyr::cluster_copy() 😄

internaut commented 3 years ago

Sorry, overlooked that 🙃

I guess it would be helpful to provide an example with cluster_copy() in the documentation. I think it's a very common use case, but it's not shown on the project's website or in the vignette.

Edit: For others stumbling across this: Putting cluster_copy(cluster, 'mymean') after the definition of mymean in the above code solves the problem.
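For reference, here is a minimal sketch of the full script with that fix applied. It assumes the multidplyr, dplyr, and nycflights13 packages are installed. The cluster size of 2 and the final collect() call are my own choices and were not part of the original report:

```r
library(dplyr, warn.conflicts = FALSE)
library(multidplyr)

# Start two worker processes (assumption: 2 is enough for a demo)
cluster <- new_cluster(2)
cluster_library(cluster, "dplyr")

# Custom function defined only in the local session
mymean <- function(x) {
  sum(x, na.rm = TRUE) / sum(!is.na(x))
}

# Copy the local object named "mymean" to every worker
cluster_copy(cluster, "mymean")

flight_dest <- nycflights13::flights %>%
  group_by(dest) %>%
  partition(cluster)

mean_delay <- flight_dest %>%
  summarise(delay = mymean(arr_delay), n = n()) %>%
  filter(n > 25) %>%
  collect()  # bring the results back into the local session
mean_delay
```

cluster_copy() has to run after mymean is defined and before the summarise() call. If mymean is changed later, it needs to be copied again, since the workers hold their own copy. As far as I can tell, cluster_assign(cluster, mymean = mymean) should do the same thing by passing the value directly instead of its name.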

Fredo-XVII commented 3 years ago

@internaut You can probably close this issue...just an FYI.