tidyverse / multidplyr

A dplyr backend that partitions a data frame over multiple processes
https://multidplyr.tidyverse.org
Other
642 stars 74 forks source link

custom function output which can change the number of rows of the tibble does not work in partition #147

Closed skarunan closed 1 year ago

skarunan commented 1 year ago

I'm making use of the example from #93 to explain my issue.

When a simple custom function which does not change the resulting number of rows works fine in the partitions. However, when my custom function would change the resulting number of rows, I get an Error in if (loc_frame && loc_frame <= nrow(parent_trace)) { : missing value where TRUE/FALSE needed. However, the custom function works fine when I run locally and not use the partitions.

Below is a reproducible example, any insights would be great. Thank you

library(multidplyr)
library(dplyr)
library(purrr)

df <- data.frame(Grp = rep(LETTERS[1:3], each = 4), Val = rep(3:1, 4))

## Function samples n times the given data and calculates median
sampling_fun = function(x, n = 100) {
  1:n %>%
    # Sample from x with replacement - n times
    map(~ x[sample(1:length(x), replace = TRUE)]) %>%
    # calculate median for each new sample set
    map_dbl(median)
}

df %>%
  group_by(Grp) %>%
  summarise(sampled_med_Val=sampling_fun(Val))

cl <- new_cluster(3)
cluster_copy(cl, "sampling_fun")

df_clust <- df %>%
  group_by(Grp) %>%
  partition(cl)

df_clust %>%
  summarise(sampled_med_Val=sampling_fun(Val)) %>%
  collect()

## A custom function which doesn't change the row counts
cust_func <- function (x) {
  x + 1
}

cluster_copy(cl, "cust_func")

df_clust %>%
  summarise(Add1 = cust_func(Val)) %>%
  collect()
skarunan commented 1 year ago

Using the above code with the following sessionInfo and changing the sampling_fun as below fixed my issue.

sampling_fun = function(x, n = 100) {
  1:n %>%
    # Sample from x with replacement - n times
    purrr::map(~ x[sample(1:length(x), replace = TRUE)]) %>%
    # calculate median for each new sample set
    purrr::map_dbl(median)
}

R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] multidplyr_0.1.2 purrr_0.3.5      dplyr_1.0.10

loaded via a namespace (and not attached):
 [1] stringfish_0.15.7   Rcpp_1.0.9          magrittr_2.0.3
 [4] qs_0.25.5           tidyselect_1.2.0    RApiSerialize_0.1.2
 [7] R6_2.5.1            rlang_1.0.6         fansi_1.0.3
[10] tools_4.2.2         utf8_1.2.2          cli_3.4.1
[13] DBI_1.1.3           RcppParallel_5.1.5  assertthat_0.2.1
[16] tibble_3.1.8        lifecycle_1.0.3     crayon_1.5.2
[19] processx_3.8.0      callr_3.7.3         vctrs_0.5.1
[22] ps_1.7.2            glue_1.6.2          compiler_4.2.2
[25] pillar_1.8.1        generics_0.1.3      pkgconfig_2.0.3