These are my initial thoughts around data partitioning design:
Method 1: Round-Robin Target (M nodes)
Each one of M nodes reads all N files sequentially and only keeps rows whose row number modulo M equals its node_id. A single node version of this method is currently implemented by the `split_by_size` function.
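For illustration, a minimal sketch of what each node would run under Method 1, assuming newline-delimited text inputs; the function, parameter, and file names are hypothetical, not the actual `split_by_size` signature:

```python
from pathlib import Path
from typing import List

def round_robin_target(input_files: List[Path], output_file: Path,
                       num_nodes: int, node_id: int) -> None:
    """Run on each of the M nodes: scan all N inputs, keep every M-th row."""
    row_index = 0
    with open(output_file, "w") as out:
        for input_file in input_files:      # every node reads all N files
            with open(input_file) as src:
                for line in src:
                    # Keep the row only if it belongs to this node.
                    if row_index % num_nodes == node_id:
                        out.write(line)
                    row_index += 1

# Node node_id (0-based) would run, e.g.:
# round_robin_target(inputs, Path(f"part-{node_id}.txt"), num_nodes=M, node_id=node_id)
```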
Method 2: Round-Robin Source (N nodes)
Each one of N nodes reads a single input file and partitions it into M files by computing the modulo of the row number. Output files of the same modulo are then concatenated together, yielding the final M output files. A single node version of this method is currently implemented by the `split_by_count` function.
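A minimal sketch of the two phases of Method 2, under the same line-oriented assumption; all names are illustrative:

```python
from pathlib import Path
from typing import List

def round_robin_source(input_file: Path, out_dir: Path, num_outputs: int) -> List[Path]:
    """Phase 1, run on each of the N nodes: fan one input out into M modulo buckets."""
    out_dir.mkdir(parents=True, exist_ok=True)
    bucket_paths = [out_dir / f"{input_file.stem}.mod{m}" for m in range(num_outputs)]
    buckets = [open(p, "w") for p in bucket_paths]
    try:
        with open(input_file) as src:
            for row_index, line in enumerate(src):
                buckets[row_index % num_outputs].write(line)
    finally:
        for f in buckets:
            f.close()
    return bucket_paths

def concatenate_modulo(per_node_buckets: List[List[Path]], out_dir: Path) -> None:
    """Phase 2: merge the m-th bucket of every node into final output m."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for m in range(len(per_node_buckets[0])):
        with open(out_dir / f"part-{m}.txt", "w") as out:
            for node_buckets in per_node_buckets:
                out.write(node_buckets[m].read_text())
```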
Method 3: Sequential Split (1 node)
Estimate `desired_row_count_in_output_partition = N/M * rows_in_one_input_partition`.
Sequentially load each of the N input files and split off a new output file after reaching `desired_row_count_in_output_partition`.
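A single-node sketch of Method 3, assuming input partitions of roughly uniform size; the row-count estimate follows the formula above, and the last output absorbs any remainder rows. Names are illustrative:

```python
from pathlib import Path
from typing import List

def sequential_split(input_files: List[Path], out_dir: Path, num_outputs: int) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    # desired_row_count_in_output_partition = N/M * rows_in_one_input_partition
    with open(input_files[0]) as f:
        rows_in_one_input = sum(1 for _ in f)
    desired_rows = max(1, len(input_files) * rows_in_one_input // num_outputs)

    part, rows_written = 0, 0
    out = open(out_dir / f"part-{part}.txt", "w")
    for input_file in input_files:      # single sequential pass over all N inputs
        with open(input_file) as src:
            for line in src:
                out.write(line)
                rows_written += 1
                # Split off a new output once the estimate is reached;
                # the final partition stays open to collect remainder rows.
                if rows_written >= desired_rows and part < num_outputs - 1:
                    out.close()
                    part, rows_written = part + 1, 0
                    out = open(out_dir / f"part-{part}.txt", "w")
    out.close()
```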
Method 4: Parallel Splitting (N nodes)
Each one of N nodes sequentially splits its own dataset into M files, preserving the row order. Then combine the resulting N*M sub-partitions into groups of N consecutive sub-partitions (producing a total of M output files) while keeping the row order. For example, for N=3 and M=5, the first output file will consist of sub-partitions (1-1, 1-2, 1-3), the second of (1-4, 1-5, 2-1), the third of (2-2, 2-3, 2-4), the fourth of (2-5, 3-1, 3-2), and the fifth of (3-3, 3-4, 3-5).
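A sketch of the two phases of Method 4; the regrouping step reproduces the (1-1, 1-2, 1-3), (1-4, 1-5, 2-1), ... grouping from the example above. Names are illustrative, and files are read whole for brevity:

```python
from pathlib import Path
from typing import List

def split_preserving_order(input_file: Path, out_dir: Path,
                           num_subparts: int, node_id: int) -> List[Path]:
    """Phase 1, run on each of the N nodes: cut one input into M contiguous sub-partitions."""
    out_dir.mkdir(parents=True, exist_ok=True)
    lines = input_file.read_text().splitlines(keepends=True)
    chunk = -(-len(lines) // num_subparts)  # ceiling division keeps row order intact
    paths = []
    for m in range(num_subparts):
        p = out_dir / f"sub-{node_id}-{m}.txt"
        p.write_text("".join(lines[m * chunk:(m + 1) * chunk]))
        paths.append(p)
    return paths

def regroup(subpartitions: List[Path], out_dir: Path, num_nodes: int) -> None:
    """Phase 2: concatenate consecutive groups of N sub-partitions into M outputs.

    `subpartitions` must be ordered node-major: 1-1, 1-2, ..., 1-M, 2-1, ...
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    for part, start in enumerate(range(0, len(subpartitions), num_nodes)):
        with open(out_dir / f"part-{part}.txt", "w") as out:
            for sub in subpartitions[start:start + num_nodes]:
                out.write(sub.read_text())
```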
Let's discuss the assumptions behind partitioning, and which options we should cover in the data partitioning module.
The current module assumes records are independent, which will be wrong in cases where records are grouped, e.g., when all rows belonging to the same group must land in the same output partition.