Open pabloem opened 2 years ago
@wilsonwang371 this is the task where we parallelize the processing of data : )
Sounds good. I generally understand this. But regarding details, I will catch up with you guys by picking up and working on small tasks.
When a stage is executed (in `ray_execute_bundle`), its output can be immediately reshuffled so that its downstream processing can be parallelized. When the upstream stage performs a write to GroupByKey, then we must group before reshuffling data (data belonging to the same key must be processed in the same worker).
If the upstream stage is not performing a GBK, then we can simply reshard everything without worrying about individual keys.
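
To make the two cases concrete, here is a rough sketch in plain Python. It is not the runner's actual code: `reshard_by_key`, `reshard_round_robin`, and the `num_shards` parameter are hypothetical names used only for illustration, and a real implementation would hand each shard to a Ray task instead of returning lists.

```python
from collections import defaultdict
from typing import Any, Hashable, Iterable, List, Tuple


def reshard_by_key(
    pairs: Iterable[Tuple[Hashable, Any]], num_shards: int
) -> List[List[Tuple[Hashable, List[Any]]]]:
    """GBK case (hypothetical helper): group first, then shard whole groups."""
    # Collect every value under its key so a key is never split across shards.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)

    # Assign each complete group to one shard based on the key's hash.
    shards: List[List[Tuple[Hashable, List[Any]]]] = [[] for _ in range(num_shards)]
    for key, values in grouped.items():
        shards[hash(key) % num_shards].append((key, values))
    return shards


def reshard_round_robin(elements: Iterable[Any], num_shards: int) -> List[List[Any]]:
    """Non-GBK case (hypothetical helper): spread elements evenly, ignoring keys."""
    shards: List[List[Any]] = [[] for _ in range(num_shards)]
    for index, element in enumerate(elements):
        shards[index % num_shards].append(element)
    return shards
```

In the keyed version each key lands in exactly one shard, which is what lets a single worker see all of its values. The round-robin version only balances load, so it is safe only when no downstream step depends on key locality.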