tidyverse / multidplyr

A dplyr backend that partitions a data frame over multiple processes
https://multidplyr.tidyverse.org

Make partition work with invoke_rows #51

Closed stanstrup closed 5 years ago

stanstrup commented 7 years ago

It seems invoke_rows doesn't accept a party_df object. Supporting that would be useful.

cluster <- c(detectCores(), length(unique(mtcars$carb))/2) %>% min %>% create_cluster()
mtcars %>% partition(carb, cluster=cluster) %>% invoke_rows(.f = sum)

--> Error: .d must be a data frame

jepusto commented 7 years ago

Wrapping in do() makes the above example work:

cars_serial <- 
  mtcars %>% 
  invoke_rows(.f = sum) %>%
  unnest()

cars_parallel <- 
  mtcars %>% 
  partition(carb, cluster=cluster) %>% 
  do(invoke_rows(.f = sum, .d = .)) %>%
  collect() %>%
  unnest()

setdiff(cars_serial, cars_parallel) %>% nrow()

stanstrup commented 7 years ago

Thanks!

stanstrup commented 7 years ago

The workaround now gives me:

Warning message:
group_indices_.grouped_df ignores extra arguments 

I don't understand what is going wrong here...

R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252    LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C                   
[5] LC_TIME=Danish_Denmark.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tidyr_0.6.2.9000      purrrlyr_0.0.1.9000   multidplyr_0.0.0.9000 dplyr_0.5.0.9005     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10      digest_0.6.12     withr_1.0.2       assertthat_0.2.0  R6_2.2.1          git2r_0.18.0      magrittr_1.5     
 [8] httr_1.2.1        rlang_0.1.9000    lazyeval_0.2.0    curl_2.6          devtools_1.13.0   tools_3.3.3       glue_1.0.0       
[15] memoise_1.1.0     knitr_1.15.1      tibble_1.3.0.9006

Ax3man commented 7 years ago

Most likely because you have updated dplyr to the latest dev version, but multidplyr isn't up to date.
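
One way to check for such a mismatch is to print the installed versions side by side. This is only a sketch: the package list is taken from the sessionInfo() above, and packages that are not installed are skipped.

```r
# Packages named in this thread; any that are not installed are skipped.
pkgs <- c("dplyr", "multidplyr", "purrrlyr")

# Keep only the packages whose namespace can be loaded.
installed <- pkgs[vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

# Collect the version strings as a named character vector.
versions <- vapply(
  installed,
  function(p) as.character(utils::packageVersion(p)),
  character(1)
)

print(versions)
```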

derekpowell commented 6 years ago

Sorry to resurrect this issue, but I'm getting the same "group_indices_.grouped_df ignores extra arguments" warning. As far as I can tell it isn't causing any real problems, but I'm concerned I'm missing something. Should I be worried?

Here's a minimal example:

library(tidyverse)
library(multidplyr)

df <- data.frame(A = c(1, 2, 3, 4, 5, 6),
                 B = c(4, 5, 5, 6, 8, 4),
                 group = c(1, 1, 1, 2, 2, 2))

cluster <- create_cluster(2)
byGroup <- partition(df, group, cluster=cluster)

The resulting byGroup is a party_df that looks correct to me:

> byGroup
Source: party_df [6 x 3]
Groups: group
Shards: 2 [3--3 rows]

# S3: party_df
      A     B group
  <dbl> <dbl> <dbl>
1     1     4     1
2     2     5     1
3     3     5     1
4     4     6     2
5     5     8     2
6     6     4     2

Here are the relevant parts of my sessionInfo():

R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] multidplyr_0.0.0.9000 modelr_0.1.1          dplyr_0.7.4           purrr_0.2.4          
 [5] readr_1.1.1           tidyr_0.7.2           tibble_1.3.4          ggplot2_2.2.1        
 [9] tidyverse_1.1.1       bnlearn_4.2          

hadley commented 5 years ago

This will eventually be fixed by an implementation of group_map()/group_modify(); I don't currently have plans to add support for purrr/purrrlyr.
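
For reference, here is a minimal sketch of what group_modify() looks like in plain dplyr (version 0.8.1 or later) on an ordinary grouped data frame. It runs serially and does not use multidplyr; whether partitioned data frames support it is not confirmed in this thread. The data frame is the one from the minimal example above, and the column name total is made up for illustration.

```r
library(dplyr)

# Same toy data as the minimal example earlier in the thread.
df <- data.frame(A = c(1, 2, 3, 4, 5, 6),
                 B = c(4, 5, 5, 6, 8, 4),
                 group = c(1, 1, 1, 2, 2, 2))

# group_modify() calls the function once per group; .x holds that group's
# rows (without the grouping column), and the function must return a
# data frame. `total` is a hypothetical column name.
row_sums <- df %>%
  group_by(group) %>%
  group_modify(~ mutate(.x, total = A + B)) %>%
  ungroup()

print(row_sums)
```

The function passed to group_modify() plays the role that the do() wrapper plays in the workaround above: it receives one group's rows at a time and returns a data frame.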