ray-project / ray_beam_runner

Ray-based Apache Beam runner
Apache License 2.0

[batch] Create Python Ray Runner based on Beam Portability API #2

Open pdames opened 2 years ago

pdames commented 2 years ago

To support Beam pipeline graph construction, optimization, and submission for execution, we need to build a Ray Runner class based on the Beam Portability API's FnApiRunner. It should support both local (i.e., single-node) and remote (i.e., multi-node) pipeline execution.

At a high level, its run_pipeline method should take the Beam pipeline DAG to run as input and produce a PipelineResult that can describe and manage pipeline execution state as output. The Ray Runner Beam Components Doc provides additional details about the FnApiRunner's end-to-end workflow and how it relates to Ray.
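
To make the intended contract concrete, here is a minimal sketch of the "pipeline in, result handle out" shape described above. It is not the proposed implementation: `SketchRunner`, `SketchPipelineResult`, and `SketchState` are hypothetical stand-ins that use only the Python standard library, the pipeline is simplified to a linear list of callables rather than a real Beam DAG, and a background thread stands in for execution on Ray.

```python
import enum
import threading
from typing import Callable, Dict, List


class SketchState(enum.Enum):
    """Hypothetical, reduced set of pipeline states for this sketch."""
    RUNNING = "RUNNING"
    DONE = "DONE"
    FAILED = "FAILED"


class SketchPipelineResult:
    """Hypothetical stand-in for a result handle returned by run_pipeline."""

    def __init__(self, worker: threading.Thread, outcome: Dict[str, object]):
        self._worker = worker
        self._outcome = outcome

    @property
    def state(self) -> SketchState:
        # Describe execution state without blocking the caller.
        if self._worker.is_alive():
            return SketchState.RUNNING
        return SketchState.FAILED if "error" in self._outcome else SketchState.DONE

    def wait_until_finish(self) -> SketchState:
        # Block until execution ends, then report the terminal state.
        self._worker.join()
        return self.state

    def output(self) -> object:
        # Only meaningful once the state is DONE.
        return self._outcome.get("value")


class SketchRunner:
    """Hypothetical runner: accepts a (simplified) pipeline, returns a handle."""

    def run_pipeline(
        self, stages: List[Callable[[object], object]], seed: object
    ) -> SketchPipelineResult:
        outcome: Dict[str, object] = {}

        def execute() -> None:
            try:
                value = seed
                for stage in stages:  # assumption: linear chain, not a full DAG
                    value = stage(value)
                outcome["value"] = value
            except Exception as exc:  # record failure instead of losing it
                outcome["error"] = exc

        worker = threading.Thread(target=execute)
        worker.start()
        # Return immediately so the caller can poll or wait on the handle.
        return SketchPipelineResult(worker, outcome)


result = SketchRunner().run_pipeline([lambda x: x + 1, lambda x: x * 2], 3)
final_state = result.wait_until_finish()
```

The point of returning a handle right away is that submission and monitoring are decoupled, which is what lets the same interface serve both local and remote execution.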

This runner will also depend on successful implementation of the following components:

  1. Ray Work Item Scheduler: The Ray Work Item Scheduler takes work items from the batch FnApiRunner's topological scheduler as input, and submits them for execution as Ray tasks.
  2. Ray Pipeline State Manager: The Ray Pipeline State Manager is a central service that consolidates the execution state of all scheduled pipeline work items in Ray's object store.
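
As a rough illustration of how these two components could interact, the sketch below submits work items in topological order and records each item's state in a central store. Everything here is an assumption for illustration: `SketchStateManager` and `run_work_items` are hypothetical names, a `ThreadPoolExecutor` stands in for Ray tasks, an in-memory dict stands in for Ray's object store, and the standard library's `graphlib.TopologicalSorter` (Python 3.9+) stands in for the FnApiRunner's topological scheduler.

```python
import threading
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


class SketchStateManager:
    """Hypothetical central store for work item execution state."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._states: Dict[str, str] = {}

    def record(self, item: str, state: str) -> None:
        with self._lock:
            self._states[item] = state

    def snapshot(self) -> Dict[str, str]:
        with self._lock:
            return dict(self._states)


def run_work_items(
    dependencies: Dict[str, List[str]],
    work: Dict[str, Callable[[list], object]],
    manager: SketchStateManager,
    max_workers: int = 4,
) -> Dict[str, object]:
    """Submit each work item once all of its upstream items have finished."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    results: Dict[str, object] = {}
    pending = {}  # future -> work item name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every item whose dependencies are satisfied.
            for item in sorter.get_ready():
                manager.record(item, "RUNNING")
                inputs = [results[dep] for dep in dependencies.get(item, ())]
                pending[pool.submit(work[item], inputs)] = item
            # Wait for at least one in-flight item, then unblock its successors.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                item = pending.pop(future)
                try:
                    results[item] = future.result()
                except Exception:
                    manager.record(item, "FAILED")
                    raise
                manager.record(item, "DONE")
                sorter.done(item)
    return results


# Each key maps a work item to the items it depends on.
deps = {"read": [], "map": ["read"], "sum": ["map"]}
work = {
    "read": lambda inputs: [1, 2, 3],
    "map": lambda inputs: [x * 10 for x in inputs[0]],
    "sum": lambda inputs: sum(inputs[0]),
}
manager = SketchStateManager()
results = run_work_items(deps, work, manager)
```

In this sketch, independent work items become ready together and run concurrently, while the state manager gives a single place to query progress regardless of where each item ran.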