tidyverse / multidplyr

A dplyr backend that partitions a data frame over multiple processes
https://multidplyr.tidyverse.org

Added spread_evenly function #40

Closed by kendonB 7 years ago

kendonB commented 8 years ago

Added spread_evenly function that robustly allocates groups of different sizes to partitions.

Inspired by http://stackoverflow.com/questions/16588669/spread-objects-evenly-over-multiple-collections.

Fixes #6 and #36, and with minor tweaking can also be used for #34 and #35.

codecov-io commented 8 years ago

Current coverage is 50.93% (diff: 86.98%)

No coverage report found for master at f6bece5.

Powered by Codecov. Last update f6bece5...ab2a668

hadley commented 7 years ago

It's hard to review this PR because it contains a whole bunch of spurious changes, and I think you've started with an overly complicated algorithm. I'd rather start with a simpler algorithm (i.e. greedily distributing each group to the smallest partition, starting with the biggest groups).

This isn't actually an example of the bin packing problem because the bins (shards) are not of fixed size: we're not trying to minimise the number of shards; we're trying to minimise the variance of shard size. An answer further down that Stack Overflow page emphasises this and suggests that the heuristic I described is never that bad. I'd be happy to review a PR that implemented that approach.
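For illustration, the greedy heuristic described in this comment could be sketched roughly as follows. This is a Python sketch, not code from the PR or from multidplyr: the function names spread_greedily and partition_totals are hypothetical, and it assumes the only input is a mapping from each group key to its row count.

```python
import heapq


def spread_greedily(group_sizes, n_partitions):
    """Assign each group to the currently smallest partition, biggest groups first.

    group_sizes: dict mapping a group key to its row count (assumed input shape).
    Returns a dict mapping each group key to a 0-based partition index.
    """
    if n_partitions < 1:
        raise ValueError("n_partitions must be at least 1")

    # Min-heap of (rows assigned so far, partition index).
    # Ties on size are broken by the lower partition index.
    heap = [(0, p) for p in range(n_partitions)]
    assignment = {}

    # Visit groups from largest to smallest.
    ordered = sorted(group_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for key, size in ordered:
        total, p = heapq.heappop(heap)           # smallest partition right now
        assignment[key] = p
        heapq.heappush(heap, (total + size, p))  # put it back with its new size
    return assignment


def partition_totals(group_sizes, assignment, n_partitions):
    """Sum the row counts that ended up in each partition."""
    totals = [0] * n_partitions
    for key, p in assignment.items():
        totals[p] += group_sizes[key]
    return totals


sizes = {"a": 10, "b": 7, "c": 5, "d": 4, "e": 2}
assignment = spread_greedily(sizes, 2)
print(assignment)
print(partition_totals(sizes, assignment, 2))
```

With these made-up sizes the two partitions end up with 14 rows each. The heap keeps the lookup of the smallest partition cheap; for a handful of partitions, a plain scan for the minimum would work just as well.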