REP-001: Serve Pipeline API

jiaodong commented 2 years ago

Summary - Serve Pipeline

General Motivation

Production machine learning serving pipelines are getting longer and wider. They often consist of several, or even tens of, models that collectively produce a final prediction. Examples include image and video content classification and tagging, fraud detection pipelines with multiple policies and models, and multi-stage ranking and recommendation.

Meanwhile, model sizes are growing beyond the memory limit of a single machine as parameter counts grow exponentially, as in GPT-3 or the sparse feature embeddings in recsys models. The ability to do disaggregated and distributed inference is therefore desirable and future-proof.

We want to leverage Ray's programmable, general-purpose distributed computing and double down on its unique strengths (scheduling, communication, and shared memory) to facilitate authoring, orchestrating, scaling, and deploying complex serving pipelines under a single DAG API. A user can then program and test multiple models, or multiple shards of a single large model, dynamically, deploy them to production at scale, and upgrade them individually.

Key requirements:

- Provide the ability to author a DAG of Serve nodes that forms a complex inference graph.
- The pipeline authoring experience should be fully Python programmable, with support for dynamic selection, control flow, user business logic, etc.
- A DAG can be instantiated and executed locally using the tasks and actors APIs.
- A DAG can be deployed via a declarative and idempotent API, and individual nodes can be reconfigured and scaled independently.

Should this change be within `ray` or outside?

The main ray project. Changes are made at both the Ray Core and Ray Serve levels.

Stewardship

Required Reviewers

The proposal will be open to the public, but please suggest a few experienced Ray contributors in this technical domain whose comments will help this proposal. Ideally, the list should include Ray committers.

@ericl, @edoakes, @simon-mo, @jiaodong

Shepherd of the Proposal (should be a senior committer)

To make the review process more productive, the owner of each proposal should identify a shepherd (should be a senior Ray committer). The shepherd is responsible for working with the owner and making sure the proposal is in good shape (with necessary information) before marking it as ready for broader review.

@ericl

Design and Architecture

Example - Diagram

We want to author a simple diamond-shaped DAG in which user-provided input is sent to two models (m1, m2), each of which accesses partial or identical input. Part of the original input is also forwarded to the final ensemble stage to compute the final output.

               m1.forward(dag_input[0])
            /                          \
    dag_input ----- dag_input[2] ------ ensemble -> dag_output
            \                          /  
               m2.forward(dag_input[1])
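
As a rough illustration of the routing in the diagram, the same diamond can be written as ordinary Python functions with no Ray involved. The names m1_forward, m2_forward, and run_diamond are hypothetical stand-ins, and the multiply-by-constant models are assumed only to keep the arithmetic easy to follow.

```python
# Plain-Python sketch of the diamond data flow (no Ray, no Serve).
# All names here are illustrative stand-ins, not part of any Ray API.

def m1_forward(x):
    # Stand-in for model m1: scales its input by 1.
    return 1 * x


def m2_forward(x):
    # Stand-in for model m2: scales its input by 2.
    return 2 * x


def ensemble(a, b, c):
    # Final stage: combines both model outputs with the forwarded raw input.
    return a + b + c


def run_diamond(dag_input):
    # dag_input[0] -> m1, dag_input[1] -> m2, dag_input[2] -> ensemble directly.
    a = m1_forward(dag_input[0])
    b = m2_forward(dag_input[1])
    return ensemble(a, b, dag_input[2])
```

Calling run_diamond([1, 2, 3]) follows the three edges of the diagram: the first two elements pass through a model each, and the third bypasses both models.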

Example - Code

Classes or functions decorated with @ray.remote can be used directly when building a Ray DAG.

import ray
import starlette.requests

@ray.remote
class Model:
    def __init__(self, val):
        self.val = val

    def forward(self, input):
        # Scale the input by the value bound at construction time
        return self.val * input

@ray.remote
def ensemble(a, b, c):
    return a + b + c

async def request_to_data_int(request: starlette.requests.Request):
    data = await request.body()
    return int(data)

# Args binding, DAG building and input preprocessor definition
with ServeInputNode(preprocessor=request_to_data_int) as dag_input:
    m1 = Model.bind(1)
    m2 = Model.bind(2)
    m1_output = m1.forward.bind(dag_input[0])
    m2_output = m2.forward.bind(dag_input[1])
    ray_dag = ensemble.bind(m1_output, m2_output, dag_input[2])

A DAG authored with the Ray DAG API should be locally executable with just the Ray Core runtime.

# 1*1 + 2*2 + 3
assert ray.get(ray_dag.execute(1, 2, 3)) == 8
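
To make the bind-then-execute idea concrete, here is a toy sketch of lazy binding in plain Python: building the graph only records calls, and execute() resolves them against the inputs. It is not Ray's implementation. ToyInput, ToyNode, and scale_by are hypothetical names invented for this illustration.

```python
# Toy illustration of lazy binding: construction records a call, execute() runs it.
# This is NOT Ray's implementation; ToyInput and ToyNode are hypothetical names.

class ToyInput:
    """Placeholder for the DAG input; indexing selects one positional argument."""

    def __init__(self, index=None):
        self.index = index

    def __getitem__(self, index):
        return ToyInput(index)

    def resolve(self, inputs):
        return inputs if self.index is None else inputs[self.index]


class ToyNode:
    """Records a callable and its (possibly unresolved) arguments."""

    def __init__(self, fn, *args):
        self.fn = fn
        self.args = args

    def resolve(self, inputs):
        # Resolve upstream nodes and input placeholders first, then call fn.
        resolved = [
            a.resolve(inputs) if isinstance(a, (ToyNode, ToyInput)) else a
            for a in self.args
        ]
        return self.fn(*resolved)

    def execute(self, *inputs):
        return self.resolve(inputs)


def scale_by(val):
    # Stand-in for Model(val).forward in the example above.
    return lambda x: val * x


dag_input = ToyInput()
m1_output = ToyNode(scale_by(1), dag_input[0])
m2_output = ToyNode(scale_by(2), dag_input[1])
toy_dag = ToyNode(lambda a, b, c: a + b + c, m1_output, m2_output, dag_input[2])

result = toy_dag.execute(1, 2, 3)  # 1*1 + 2*2 + 3
```

Nothing runs while toy_dag is being assembled, so the same graph object can be executed repeatedly with different inputs.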

A Ray DAG can be built into a Serve application that contains all the nodes it needs.

# Build, configure and deploy
app = serve.pipeline.build(ray_dag)

Individual deployments in the app can be configured using the same variable names used in ray_dag.

app.m1.set_options(num_replicas=3)
app.m2.set_options(num_replicas=5)

We reserve the name ingress and generate a Serve ingress deployment that takes care of HTTP / gRPC, input schema validation, adaptation, etc. It is the Python interface for configuring the pipeline ingress.

app.ingress.set_options(num_replicas=10)

# Translates to group_deploy behind the scenes
app_handle = app.deploy()

# Serve App is locally executable
assert ray.get(app_handle.remote(1, 2, 3)) == 8

A Serve pipeline application can be exported to a YAML file for structured deployment. The Ops team can then configure it by editing the configurable fields directly, without deep knowledge of, or involvement with, the model code in the pipeline.

# Assumes app.to_yaml() returns the YAML document as a string
with open("deployment.yaml", "w") as f:
    f.write(app.to_yaml())

# Structured deployment CLI
serve deploy deployment.yaml
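
The kind of Ops-side edit described above might look like the following sketch, which changes one replica count without touching model code. The config layout (a "deployments" mapping keyed by node name) is purely an assumption, since this proposal does not fix a YAML schema, and scale_deployment is a hypothetical helper. The replica counts are the ones set in the example above.

```python
# Hypothetical shape of a parsed deployment config; the REP does not define a schema.
import copy

config = {
    "deployments": {
        "ingress": {"num_replicas": 10},
        "m1": {"num_replicas": 3},
        "m2": {"num_replicas": 5},
    }
}


def scale_deployment(cfg, name, num_replicas):
    """Return a copy of cfg with one deployment's replica count changed."""
    if name not in cfg["deployments"]:
        raise KeyError(f"unknown deployment: {name}")
    updated = copy.deepcopy(cfg)
    updated["deployments"][name]["num_replicas"] = num_replicas
    return updated


# Scale m2 only; the original config and the other deployments are left untouched.
scaled = scale_deployment(config, "m2", 8)
```

Returning a modified copy keeps the edit declarative: the new config fully describes the desired state and can be redeployed idempotently.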

Compatibility, Deprecation, and Migration Plan

An important part of the proposal is to explicitly point out any compatibility implications of the proposed change. If there are any, we should thoroughly discuss a plan to deprecate existing APIs and migrate to the new one(s).

Ray Core
- Serve Pipeline is co-designed with the Ray Unified DAG API; each DAG is always authored using the Ray DAG API first.
- The only new API introduced is the .bind() method on a Ray-decorated function or class.
Ray Serve
- A Serve Pipeline DAG is transformed from a Ray DAG: the classes used are replaced with Serve Deployments, and class instances with the deployment's RayServeHandle. This eases compatibility, deprecation, and migration.
Breaking Changes: Ray Serve
- All args and kwargs passed into a class or function in Serve Pipeline need to be JSON serializable; this is enforced when build() is called.
- We need to introduce and abstract out an Ingress component for Serve Pipeline.
Deprecation
- The existing Serve Pipeline Alpha API will be deprecated in favor of the Ray Unified DAG API and Serve Pipeline Beta.
Migration Plan: Ray Serve
- New concepts and APIs will be applied to the Serve Pipeline Beta launch first to minimize compatibility risks. We expect the existing deployment implementation to migrate to the Ingress and Serve App APIs later on.
- Existing multi-model pipelines that use the Alpha API or raw deployment handles are expected to migrate to the Pipeline Beta API over time.
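
The JSON-serializability requirement listed under Breaking Changes could be checked along these lines. This is only a sketch using the standard json module. check_json_serializable and is_json_serializable are hypothetical helpers, and how build() actually enforces the rule is not specified here.

```python
import json


def check_json_serializable(args, kwargs):
    """Raise TypeError naming the first argument that json.dumps cannot encode."""
    for position, value in enumerate(args):
        try:
            json.dumps(value)
        except (TypeError, ValueError) as exc:
            raise TypeError(
                f"positional arg {position} is not JSON serializable: {exc}"
            ) from exc
    for key, value in kwargs.items():
        try:
            json.dumps(value)
        except (TypeError, ValueError) as exc:
            raise TypeError(
                f"keyword arg {key!r} is not JSON serializable: {exc}"
            ) from exc


def is_json_serializable(args, kwargs):
    """Boolean convenience wrapper around check_json_serializable."""
    try:
        check_json_serializable(args, kwargs)
        return True
    except TypeError:
        return False
```

Under this check, Model.bind(1) would pass, while binding an arbitrary Python object or a function as a constructor argument would be rejected.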

Test Plan and Acceptance Criteria

The proposal should discuss how the change will be tested before it can be merged or enabled. It should also include other acceptance criteria including documentation and examples.

- Unit and integration tests for core components.
- Benchmarks on common multi-model inference workloads.
- Documentation with representative workloads, covered by CI.

(Optional) Follow-on Work

- Performance optimizations for multi-model inference, such as communication, multiplexing, scale-to-zero, etc.
- UX and UI improvements for a better user experience.
- Exploration of large-model distributed inference on Serve Pipeline, where each node represents a shard of a large model.

zhe-thoughts commented 2 years ago

Thanks @jiaodong ! Assigning to @edoakes @ericl @simon-mo for reviews

simon-mo commented 2 years ago

@zhe-thoughts for the process, should this be a PR as well? Or does the already merged #3 count?

zhe-thoughts commented 2 years ago

@simon-mo : We should use PRs to finalize a REP (so it should mainly be the proposer and the shepherd working on the PR). Then we merge the PR, and the REP becomes a shepherded design proposal.

Then we use the issue to comment on the design proposal. How does it sound? Do you think it's easier to comment on the proposal as a PR?

It's still early in the process and we should iterate

simon-mo commented 2 years ago

I would imagine the comment process involves proposer iterating on the content to incorporate feedback from the reviewers as well.

ericl commented 2 years ago

We've gotten feedback the separate issue thing is quite confusing. I think we should just stick to routing comments to the main PR, whether or not it's merged. I pushed a change to the README to direct readers to do that.

zhe-thoughts commented 2 years ago

Thanks @ericl , the process change sounds good to me. @simon-mo also gave that feedback. I think the only downside is that for Markdown, the experience of reviewing the PR could be a bit suboptimal. But overall I agree with making the change.
