ray-project / ray_beam_runner

Ray-based Apache Beam runner
Apache License 2.0
41 stars, 12 forks

Parallelization: 'Reshuffle' data shared between stages #23

Open pabloem opened 2 years ago

pabloem commented 2 years ago

When a stage is executed (in ray_execute_bundle), its output can be reshuffled immediately so that downstream processing can be parallelized.

If the upstream stage writes to a GroupByKey, we must group the data before reshuffling it, because all data belonging to the same key must be processed by the same worker.
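A minimal sketch of the keyed case, in plain Python rather than the runner's actual code: values are grouped by key first, then each whole group is assigned to one shard, so a key never spans two workers. The function name `shard_grouped`, the `num_shards` parameter, and the use of a CRC32 hash are assumptions made for illustration only.

```python
import zlib
from collections import defaultdict


def shard_grouped(pairs, num_shards):
    """Group (key, value) pairs by key, then place each whole group in one shard.

    Hypothetical helper for illustration; not part of ray_beam_runner.
    """
    # Step 1: group, so that all values for a key are collected together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Step 2: assign each group to exactly one shard.
    shards = [dict() for _ in range(num_shards)]
    for key, values in groups.items():
        # crc32 is stable across processes, unlike the built-in hash() for str.
        index = zlib.crc32(repr(key).encode("utf-8")) % num_shards
        shards[index][key] = values
    return shards


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
shards = shard_grouped(pairs, num_shards=3)
```

Each shard could then be handed to a separate worker. Which shard a given key lands in depends on the hash, but every key appears in exactly one shard.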

If the upstream stage does not write to a GroupByKey (GBK), we can simply reshard everything without regard to individual keys.
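For the non-GBK case, a rough sketch of key-agnostic resharding could look like the following. It spreads elements round-robin across shards; the name `reshard_round_robin` and the round-robin policy are assumptions for illustration, and the runner may well choose a different strategy (for example, fixed-size batches).

```python
def reshard_round_robin(elements, num_shards):
    """Distribute elements evenly across shards, ignoring keys.

    Hypothetical helper for illustration; not part of ray_beam_runner.
    """
    shards = [[] for _ in range(num_shards)]
    for i, element in enumerate(elements):
        # Element i goes to shard i mod num_shards.
        shards[i % num_shards].append(element)
    return shards


shards = reshard_round_robin(range(7), num_shards=3)
# shards == [[0, 3, 6], [1, 4], [2, 5]]
```

Because no key constraint applies here, shard sizes differ by at most one element, which keeps the downstream work balanced.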

pabloem commented 2 years ago

@wilsonwang371 this is the task where we parallelize the processing of data : )

wilsonwang371 commented 2 years ago

Sounds good. I generally understand this. But regarding details, I will catch up with you guys by picking up and working on small tasks.