These are my initial thoughts around data partitioning design:
Method 1: Round-Robin Target (M nodes)
Each one of M nodes reads all N files sequentially and only keeps rows whose row number modulo M equals its node_id. A single node version of this method is currently implemented by the `split_by_size` function.
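For illustration, a minimal sketch of what each node would run under Method 1, assuming newline-delimited text inputs; the function, parameter, and file names are hypothetical, not the actual `split_by_size` signature:

```python
from pathlib import Path
from typing import List

def round_robin_target(input_files: List[Path], output_file: Path,
                       num_nodes: int, node_id: int) -> None:
    """Run on each of the M nodes: scan all N inputs, keep every M-th row."""
    row_index = 0
    with open(output_file, "w") as out:
        for input_file in input_files:      # every node reads all N files
            with open(input_file) as src:
                for line in src:
                    # Keep the row only if it belongs to this node.
                    if row_index % num_nodes == node_id:
                        out.write(line)
                    row_index += 1

# Node node_id (0-based) would run, e.g.:
# round_robin_target(inputs, Path(f"part-{node_id}.txt"), num_nodes=M, node_id=node_id)
```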
Method 2: Round-Robin Source (N nodes)
Each one of N nodes reads a single input file and partitions it into M files by computing the modulo of the row number. Output files of the same modulo are then concatenated together, yielding the final M output files. A single node version of this method is currently implemented by the `split_by_count` function.
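A minimal sketch of the two phases of Method 2, under the same line-oriented assumption; all names are illustrative:

```python
from pathlib import Path
from typing import List

def round_robin_source(input_file: Path, out_dir: Path, num_outputs: int) -> List[Path]:
    """Phase 1, run on each of the N nodes: fan one input out into M modulo buckets."""
    out_dir.mkdir(parents=True, exist_ok=True)
    bucket_paths = [out_dir / f"{input_file.stem}.mod{m}" for m in range(num_outputs)]
    buckets = [open(p, "w") for p in bucket_paths]
    try:
        with open(input_file) as src:
            for row_index, line in enumerate(src):
                buckets[row_index % num_outputs].write(line)
    finally:
        for f in buckets:
            f.close()
    return bucket_paths

def concatenate_modulo(per_node_buckets: List[List[Path]], out_dir: Path) -> None:
    """Phase 2: merge the m-th bucket of every node into final output m."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for m in range(len(per_node_buckets[0])):
        with open(out_dir / f"part-{m}.txt", "w") as out:
            for node_buckets in per_node_buckets:
                out.write(node_buckets[m].read_text())
```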
Method 3: Sequential Split (1 node)
Estimate `desired_row_count_in_output_partition = N/M * rows_in_one_input_partition`.
Sequentially load each of the N input files and split off a new output file after reaching `desired_row_count_in_output_partition`.
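A single-node sketch of Method 3, assuming input partitions of roughly uniform size; the row-count estimate follows the formula above, and the last output absorbs any remainder rows. Names are illustrative:

```python
from pathlib import Path
from typing import List

def sequential_split(input_files: List[Path], out_dir: Path, num_outputs: int) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    # desired_row_count_in_output_partition = N/M * rows_in_one_input_partition
    with open(input_files[0]) as f:
        rows_in_one_input = sum(1 for _ in f)
    desired_rows = max(1, len(input_files) * rows_in_one_input // num_outputs)

    part, rows_written = 0, 0
    out = open(out_dir / f"part-{part}.txt", "w")
    for input_file in input_files:      # single sequential pass over all N inputs
        with open(input_file) as src:
            for line in src:
                out.write(line)
                rows_written += 1
                # Split off a new output once the estimate is reached;
                # the final partition stays open to collect remainder rows.
                if rows_written >= desired_rows and part < num_outputs - 1:
                    out.close()
                    part, rows_written = part + 1, 0
                    out = open(out_dir / f"part-{part}.txt", "w")
    out.close()
```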
Method 4: Parallel Splitting (N nodes)
Each one of N nodes sequentially splits its own dataset into M files, preserving the row order. Then combine the resulting N*M sub-partitions into groups of N consecutive sub-partitions (producing a total of M output files) while keeping the row order. For example, for N=3 and M=5, the first output file will consist of sub-partitions (1-1, 1-2, 1-3), the second of (1-4, 1-5, 2-1), the third of (2-2, 2-3, 2-4), the fourth of (2-5, 3-1, 3-2), and the fifth of (3-3, 3-4, 3-5).
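A sketch of the two phases of Method 4; the regrouping step reproduces the (1-1, 1-2, 1-3), (1-4, 1-5, 2-1), ... grouping from the example above. Names are illustrative, and files are read whole for brevity:

```python
from pathlib import Path
from typing import List

def split_preserving_order(input_file: Path, out_dir: Path,
                           num_subparts: int, node_id: int) -> List[Path]:
    """Phase 1, run on each of the N nodes: cut one input into M contiguous sub-partitions."""
    out_dir.mkdir(parents=True, exist_ok=True)
    lines = input_file.read_text().splitlines(keepends=True)
    chunk = -(-len(lines) // num_subparts)  # ceiling division keeps row order intact
    paths = []
    for m in range(num_subparts):
        p = out_dir / f"sub-{node_id}-{m}.txt"
        p.write_text("".join(lines[m * chunk:(m + 1) * chunk]))
        paths.append(p)
    return paths

def regroup(subpartitions: List[Path], out_dir: Path, num_nodes: int) -> None:
    """Phase 2: concatenate consecutive groups of N sub-partitions into M outputs.

    `subpartitions` must be ordered node-major: 1-1, 1-2, ..., 1-M, 2-1, ...
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    for part, start in enumerate(range(0, len(subpartitions), num_nodes)):
        with open(out_dir / f"part-{part}.txt", "w") as out:
            for sub in subpartitions[start:start + num_nodes]:
                out.write(sub.read_text())
```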
Let's discuss the assumptions behind partitioning, and which options we should cover in the data partitioning module.
The current module assumes records are independent, which will be wrong in cases where records are grouped, e.g., when all rows belonging to the same group must land in the same output partition.