tidyverse / multidplyr

A dplyr backend that partitions a data frame over multiple processes
https://multidplyr.tidyverse.org

Fixes for issues #6, #27 (or #21), #28 and #30. Added methods for rename_, tbl_vars and groups. #26

Closed: Ax3man closed this 7 years ago

Ax3man commented 8 years ago

This takes a slightly different approach from @fugufisch's pull request, which first creates a cluster and then shrinks it where needed. Here, partition_ by default creates a cluster that never has more nodes than there are shards. If the user passes a cluster that is too large, the same error as before is raised.

Test case:

library(multidplyr)
d_f <- data.frame(g = rep(LETTERS[1:3], 5),
                  v = rnorm(15))
# Partition by g, which has 3 distinct values (A, B, C)
d_f <- partition(d_f, g)

This will now initialize a cluster with 3 cores. Previously it would throw the following error (if sufficient cores were available):

Error: length(values) == length(cluster) is not TRUE
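
The sizing rule described above (never start more nodes than there are groups to distribute) could be sketched roughly as follows. This is only an illustration under stated assumptions: `n_nodes_for` and its `max_cores` argument are hypothetical names and are not part of multidplyr, and the actual change in this pull request may compute the number differently.

```r
# Hypothetical helper (not multidplyr API): pick how many nodes to start.
# Assumption: one node per distinct group value is the most that is useful.
n_nodes_for <- function(data, group_col, max_cores = parallel::detectCores()) {
  n_groups <- length(unique(data[[group_col]]))
  # detectCores() can return NA, so ignore NA when taking the minimum
  max(1L, min(n_groups, max_cores, na.rm = TRUE))
}

d_f <- data.frame(g = rep(LETTERS[1:3], 5),
                  v = rnorm(15))

n_nodes_for(d_f, "g", max_cores = 8)  # 3: capped by the number of groups
n_nodes_for(d_f, "g", max_cores = 2)  # 2: capped by the available cores
```

With a cap like this, a default cluster would match the number of per-node values, which is the condition the error message checks.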

codecov-io commented 8 years ago

Current coverage is 17.94%

Merging #26 into master will decrease coverage by 1.71%

@@           master        #26   diff @@
========================================
  Files           9          9          
  Lines         178        195    +17   
  Methods         0          0          
  Branches        0          0          
========================================
  Hits           35         35          
- Misses        143        160    +17   
  Partials        0          0          

Powered by Codecov. Last updated by f6bece5...8c12d63

fpbarthel commented 8 years ago

I'm consistently getting "Error: length(values) == length(cluster) is not TRUE" when using multidplyr with any number of cores other than 2. How do I install this version with devtools?

Ax3man commented 8 years ago

@fpbarthel devtools::install_github('Ax3man/multidplyr')

hadley commented 7 years ago

The approach of just picking the needed number of cores from an existing cluster seems better to me. But I really appreciate your effort :)
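
For illustration, picking the needed number of nodes from an existing cluster might look like the sketch below. It uses only the base `parallel` package, whose cluster objects can be subset with `[`. The helper name `take_nodes` is hypothetical, and this is not the code from either pull request.

```r
library(parallel)

# Hypothetical helper (not multidplyr API): keep at most n_groups nodes
# from a cluster that already exists.
take_nodes <- function(cluster, n_groups) {
  n <- min(n_groups, length(cluster))
  cluster[seq_len(n)]  # subsetting a parallel cluster yields a smaller cluster
}

d_f <- data.frame(g = rep(LETTERS[1:3], 5),
                  v = rnorm(15))
n_groups <- length(unique(d_f$g))  # 3 distinct groups

cl <- makeCluster(4)               # deliberately larger than needed
sub_cl <- take_nodes(cl, n_groups)
n_used <- length(sub_cl)           # number of nodes kept

# The subset can be used like any other cluster: one result per kept node
pids <- clusterCall(sub_cl, function() Sys.getpid())

stopCluster(cl)                    # shuts down all four workers
```

The unused node stays idle until the full cluster is stopped, which is the trade-off of reusing a larger cluster instead of creating one of the right size.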