tidyverse / multidplyr

A dplyr backend that partitions a data frame over multiple processes
https://multidplyr.tidyverse.org

`partition()` doesn't seem to handle an unbalanced number of groups per core #45

Closed CorradoLanera closed 5 years ago

CorradoLanera commented 7 years ago
# Partitioning 9 data-frame rows, grouped into 7 groups, over 7 cores
# (Windows Server 2012 R2, 64 bit, R 3.3.2, RStudio 1.0.136).
# `partition()` distributes the data over only 6 cores, putting more
# than one group on some cores and leaving the last one empty.

# 4 CPUs / 8 cores / Windows Server 2012 R2
#
# Note: I was not able to reproduce a similar issue on my
# 2 CPU / 4 core MacBook Pro.

library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats
library(magrittr)
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
#> The following object is masked from 'package:tidyr':
#> 
#>     extract
library(multidplyr)
df <- data_frame(
    df_to_be_modelled = map(seq_len(9),
                            ~ mtcars[seq_len(.), ] 
    )
)

# Suppose the data are very unbalanced, and that modelling a couple
# of the first data frames takes about as long as modelling one of
# the last ones: you would like to group them so that each core
# works for roughly the same amount of time
# (and so that all "max - 1" cores are used).

cluster <- create_cluster() # n - 1 =  7 by default
#> Initialising 7 core cluster.
set_default_cluster(cluster)

df %<>% mutate(group = c(1L, 2L, 2L, 1L, 3L, 4L, 5L, 6L, 7L))

df_cl <- df %>% partition(group)
df_cl
#> Source: party_df [9 x 2]
#> Groups: group
#> Shards: 6 [1--2 rows]
#> 
#> # S3: party_df
#>       df_to_be_modelled group
#>                  <list> <int>
#> 1 <data.frame [8 × 11]>     6
#> 2 <data.frame [2 × 11]>     2
#> 3 <data.frame [3 × 11]>     2
#> 4 <data.frame [5 × 11]>     3
#> 5 <data.frame [6 × 11]>     4
#> 6 <data.frame [7 × 11]>     5
#> 7 <data.frame [9 × 11]>     7
#> 8 <data.frame [1 × 11]>     1
#> 9 <data.frame [4 × 11]>     1

cluster_ls(cluster)
#> [[1]]
#> [1] "ukwoanoyti"
#> 
#> [[2]]
#> [1] "ukwoanoyti"
#> 
#> [[3]]
#> [1] "ukwoanoyti"
#> 
#> [[4]]
#> [1] "ukwoanoyti"
#> 
#> [[5]]
#> [1] "ukwoanoyti"
#> 
#> [[6]]
#> [1] "ukwoanoyti"
#> 
#> [[7]]
#> character(0)

# The first node holds two different groups;
# the last one holds no groups, i.e. it has no data.
# Note: the two observations of group 1 are both on the same
# node (i.e. node 4), as are the two of group 2 (i.e. node 6).
# Node 1 is the only one with two different groups.

actual_name <- cluster_ls(cluster)[[1]]
# cluster_eval(cluster, purrr::safely(print)(<name stored in `actual_name`>))
# Sorry, I don't know how to do this in a simple, automatic way.
Session info

``` r
devtools::session_info()
#> Session info --------------------------------------------------------------
#>  setting  value
#>  version  R version 3.3.2 (2016-10-31)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Italian_Italy.1252
#>  tz       Europe/Berlin
#>  date     2017-01-31
#> Packages ------------------------------------------------------------------
#>  package    * version    date       source
#>  assertthat   0.1        2013-12-06 CRAN (R 3.3.0)
#>  backports    1.0.5      2017-01-18 CRAN (R 3.3.2)
#>  broom        0.4.1      2016-06-24 CRAN (R 3.3.1)
#>  colorspace   1.3-2      2016-12-14 CRAN (R 3.3.2)
#>  DBI          0.5-1      2016-09-10 CRAN (R 3.3.2)
#>  devtools     1.12.0     2016-06-24 CRAN (R 3.3.2)
#>  digest       0.6.11     2017-01-03 CRAN (R 3.3.2)
#>  dplyr      * 0.5.0      2016-06-24 CRAN (R 3.3.2)
#>  evaluate     0.10       2016-10-11 CRAN (R 3.3.2)
#>  foreign      0.8-67     2016-09-13 CRAN (R 3.3.2)
#>  ggplot2    * 2.2.1      2016-12-30 CRAN (R 3.3.2)
#>  gtable       0.2.0      2016-02-26 CRAN (R 3.3.0)
#>  haven        1.0.0      2016-09-23 CRAN (R 3.3.2)
#>  hms          0.3        2016-11-22 CRAN (R 3.3.2)
#>  htmltools    0.3.5      2016-03-21 CRAN (R 3.3.2)
#>  httr         1.2.1      2016-07-03 CRAN (R 3.3.2)
#>  jsonlite     1.2        2016-12-31 CRAN (R 3.3.2)
#>  knitr        1.15.1     2016-11-22 CRAN (R 3.3.2)
#>  lattice      0.20-34    2016-09-06 CRAN (R 3.3.2)
#>  lazyeval     0.2.0      2016-06-12 CRAN (R 3.3.2)
#>  lubridate    1.6.0      2016-09-13 CRAN (R 3.3.2)
#>  magrittr   * 1.5        2014-11-22 CRAN (R 3.3.0)
#>  memoise      1.0.0      2016-01-29 CRAN (R 3.3.0)
#>  mnormt       1.5-5      2016-10-15 CRAN (R 3.3.2)
#>  modelr       0.1.0      2016-08-31 CRAN (R 3.3.2)
#>  multidplyr * 0.0.0.9000 2017-01-27 Github (hadley/multidplyr@0085ded)
#>  munsell      0.4.3      2016-02-13 CRAN (R 3.3.0)
#>  nlme         3.1-130    2017-01-24 CRAN (R 3.3.2)
#>  plyr         1.8.4      2016-06-08 CRAN (R 3.3.2)
#>  psych        1.6.12     2017-01-08 CRAN (R 3.3.2)
#>  purrr      * 0.2.2      2016-06-18 CRAN (R 3.3.2)
#>  R6           2.2.0      2016-10-05 CRAN (R 3.3.2)
#>  Rcpp         0.12.9     2017-01-14 CRAN (R 3.3.2)
#>  readr      * 1.0.0      2016-08-03 CRAN (R 3.3.2)
#>  readxl       0.1.1      2016-03-28 CRAN (R 3.3.2)
#>  reshape2     1.4.2      2016-10-22 CRAN (R 3.3.2)
#>  rmarkdown    1.3        2016-12-21 CRAN (R 3.3.2)
#>  rprojroot    1.2        2017-01-16 CRAN (R 3.3.2)
#>  rvest        0.3.2      2016-06-17 CRAN (R 3.3.2)
#>  scales       0.4.1      2016-11-09 CRAN (R 3.3.2)
#>  stringi      1.1.2      2016-10-01 CRAN (R 3.3.2)
#>  stringr      1.1.0      2016-08-19 CRAN (R 3.3.2)
#>  tibble     * 1.2        2016-08-26 CRAN (R 3.3.2)
#>  tidyr      * 0.6.1      2017-01-10 CRAN (R 3.3.2)
#>  tidyverse  * 1.1.0      2017-01-20 CRAN (R 3.3.2)
#>  withr        1.0.2      2016-06-20 CRAN (R 3.3.2)
#>  xml2         1.1.1      2017-01-24 CRAN (R 3.3.2)
#>  yaml         2.1.14     2016-11-12 CRAN (R 3.3.2)
```

hadley commented 7 years ago

Can you please use the reprex package to generate your reprex? It will fix your formatting issues.

CorradoLanera commented 7 years ago

Done. Is it all OK now? I didn't know about that package. (Note: I was not able to automate the last expression, but I think the results should still be clear.)
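
For the part of the reprex that could not be automated (printing the object whose name is only known as a string), one possible approach is to turn the string into a symbol and splice it into the call. The sketch below uses only base R; `worker_env` and the small data frame are hypothetical stand-ins for what a worker holds, not actual output from the cluster.

``` r
# Hypothetical stand-in for one worker: an environment holding an object
# whose name is only known as a string (as returned by cluster_ls()).
actual_name <- "ukwoanoyti"
worker_env <- new.env()
assign(actual_name, data.frame(group = c(6L, 2L, 2L)), envir = worker_env)

# Turn the string into a symbol and splice it into the call, so that the
# call reads print(ukwoanoyti) without typing the name by hand.
print_call <- bquote(print(.(as.name(actual_name))))

# Evaluate the call where the object lives; print() returns its argument
# invisibly, so the result can be captured as well as shown.
shard <- eval(print_call, envir = worker_env)

# Assumption (not run here): with the multidplyr version used in this
# thread, cluster_get(cluster, actual_name) may fetch the object from
# every worker directly.
```

Building the call with `bquote()` keeps the name out of the source code, which should let the reprex run unchanged even though the object name differs between sessions.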

CorradoLanera commented 7 years ago

It's not really fixed: I'm still working on it.

hadley commented 5 years ago

I've completely rewritten the algorithm and I'll have a fix pushed shortly.
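
To make the expected behaviour concrete, here is a minimal sketch of one way to spread groups over workers: a greedy "largest group first, least-loaded worker next" assignment. This is an illustration only and is not taken from the multidplyr source; `assign_groups()` is a hypothetical helper, and it balances row counts rather than modelling time.

``` r
# Hypothetical helper: map each row's group to a worker index so that
# every group stays on one worker and row counts are roughly balanced.
assign_groups <- function(group, n_workers) {
  tab <- table(group)
  sizes <- as.vector(tab)          # number of rows per group
  ids <- names(tab)                # group labels as strings

  load <- integer(n_workers)       # rows assigned to each worker so far
  worker_of_group <- integer(length(sizes))

  # Visit groups from largest to smallest, always choosing the worker
  # that currently holds the fewest rows.
  for (g in order(sizes, decreasing = TRUE)) {
    w <- which.min(load)
    worker_of_group[g] <- w
    load[w] <- load[w] + sizes[g]
  }

  names(worker_of_group) <- ids
  unname(worker_of_group[as.character(group)])
}

# The grouping from the reprex above: 9 rows, 7 groups, 7 workers.
group <- c(1L, 2L, 2L, 1L, 3L, 4L, 5L, 6L, 7L)
worker <- assign_groups(group, n_workers = 7)
```

With as many groups as workers, this scheme gives every group its own worker, which is the outcome the original report expected from `partition()`.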