pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

[RFC] Sequence Batching for Stateful Inference #2743

Open lxning opened 10 months ago

lxning commented 10 months ago

🚀 The feature

Author: Li Ning

Background

A stateful model captures the interdependencies between successive inference requests: it maintains a persistent state across requests, so the outcome of each request is linked to the outcomes that came before it. Typical examples are online speech recognition systems, such as models based on Long Short-Term Memory (LSTM). Serving such a model requires the model server to preserve the order of inference requests within a sequence so that each prediction can build on the previous outcomes.

TorchServe is a stateless model server: it treats each inference request as independent and does not maintain any state across inference requests. Supporting a stateful model therefore requires extending TorchServe into a stateful model server, one that can keep the requests of a sequence together and route them to the same worker so that state can be carried from one request to the next.

Within this context, TorchServe offers a mechanism known as sequence batching. With this approach, one inference request is retrieved from each sequence, and requests originating from different sequences are combined into a single batch. Each request is associated with a unique sequence ID, which custom handlers use as a key to store and retrieve values in the backend cache store, enabling efficient management of stateful inference. The client can also reuse the sequence ID when a connection resumes, as long as the sequence has not expired on the TorchServe side.
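
For illustration, with a batch size of 3 and three active sequences, a single backend batch contains at most one pending request from each sequence. The sequence IDs and payloads below are hypothetical:

# Hypothetical batch for batch size 3: one pending request per active sequence.
batch = [
    {"sequence_id": "seq-a", "input": b"chunk-7"},
    {"sequence_id": "seq-b", "input": b"chunk-2"},
    {"sequence_id": "seq-c", "input": b"chunk-11"},
]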

The following picture shows the workflow of stateful inference. A job group has a job queue which stores incoming inference requests from a stream. The maximum capacity of a job queue is defined by maxSequenceJobQueueSize. A sequence batch aggregator polls an inference request from each job group, and the resulting batch of requests is sent to the backend.

[Figure: workflow of stateful inference with job groups and the sequence batch aggregator]

Requirements Scope

To support a stateful model, the requirements for TorchServe are scoped as follows.

  1. gRPC stream
    • A sequence of inference requests is sent to TorchServe as a continuous gRPC stream.
    • The client sends each individual inference request as one gRPC request on the stream.
    • A sequence cannot be idle for more than X milliseconds.
    • The responses for a sequence of inference requests are sent back to the client as a continuous gRPC stream; the server sends each inference request's response as one gRPC response on the stream.
  2. All inference requests in a sequence are associated with the same sequence id string.
  3. A stateful model configures max_idle_milliseconds. TorchServe monitors whether a sequence of inference requests has exceeded the idle timeout.
  4. max_number_sequence is the maximum number of sequences that can be accepted; it must be equal to or larger than batch size * number of workers, since each inference request in a batch comes from a different sequence (see the sketch after this list).
  5. The user maintains the inference state in a customized handler.
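
The relationship in requirement 4 can be made concrete with a small sanity check; the variable names below follow the RFC wording and are illustrative, not necessarily the final configuration keys:

# Illustrative only: names follow the RFC wording, not final configuration keys.
batch_size = 4
num_workers = 2
max_idle_milliseconds = 60_000   # a sequence is dropped after idling this long
max_number_sequence = 8          # must be >= batch_size * num_workers

assert max_number_sequence >= batch_size * num_workers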

Design

[Figure: architecture changes to support stateful inference]

The above picture shows the architecture changes to support stateful inference.

API Layer

Streaming applications usually use HTTP, gRPC, or Kafka to transfer messages. This design only discusses gRPC streams, since SageMaker does not support HTTP request streaming at this moment, and Kafka (or a similar messaging system) requires application-level support.

A new endpoint, StreamPredictions2, is introduced for sequence batching. The sequence_id defined in PredictionsRequest and PredictionResponse is similar to a topic in a Kafka messaging system. TorchServe routes the inference requests to a specific worker based on the sequence_id.

message PredictionsRequest {
    // Name of model.
    string model_name = 1; //required

    // Version of model to run prediction on.
    string model_version = 2; //optional

    // Input data for model prediction
    map<string, bytes> input = 3; //required

    // SequenceId is required for StreamPredictions2 API.
    optional string sequence_id = 4; //optional
}

message PredictionResponse {
    // Response content for prediction
    bytes prediction = 1;

    // SequenceId is required for StreamPredictions2 API.
    optional string sequence_id = 2; //optional

    // Error information for StreamPredictions2 API.
    optional google.rpc.Status status = 3; //optional
}

service InferenceAPIsService {
    // Check health status of the TorchServe server.
    rpc Ping(google.protobuf.Empty) returns (TorchServeHealthResponse) {}

    // Predictions entry point to get inference using default model version.
    rpc Predictions(PredictionsRequest) returns (PredictionResponse) {}

    // Streaming response for an inference request.
    rpc StreamPredictions(PredictionsRequest) returns (stream PredictionResponse) {}

    // Bi-direction streaming inference and response.
    rpc StreamPredictions2(stream PredictionsRequest) returns (stream PredictionResponse) {}
}
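
For illustration, a client could drive the bidirectional StreamPredictions2 RPC roughly as in the Python sketch below. It assumes stubs generated from the proto above (the module names inference_pb2 / inference_pb2_grpc, the model name, and the local gRPC address are assumptions, not part of this proposal):

# Sketch of a StreamPredictions2 client; assumes Python stubs generated from the
# proto above (inference_pb2 / inference_pb2_grpc) and a local gRPC endpoint.
import grpc

import inference_pb2
import inference_pb2_grpc


def request_stream(sequence_id, chunks):
    # Every request in the sequence carries the same sequence_id so that
    # TorchServe can route the whole sequence to the same backend worker.
    for chunk in chunks:
        yield inference_pb2.PredictionsRequest(
            model_name="stateful_model",
            input={"data": chunk},
            sequence_id=sequence_id,
        )


def main():
    channel = grpc.insecure_channel("localhost:7070")
    stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)
    chunks = [b"chunk-1", b"chunk-2", b"chunk-3"]
    # Responses stream back on the same call, one per request in the sequence.
    for response in stub.StreamPredictions2(request_stream("seq-a", chunks)):
        print(response.sequence_id, response.prediction)


if __name__ == "__main__":
    main()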

Core Layer

In the existing TorchServe there is only one jobQueue, which stores incoming inference requests. Each worker of a model has a batcher that polls a batch of jobs from the job queue.

Stateful inference requires a sequence of inference requests to be routed to the same worker, and a single jobQueue cannot separate jobs from different sequences. This design therefore introduces a new concept, the "job group". A job group has a single job queue storing the jobs from the same sequence. A worker's batcher continuously polls a set of job groups and concurrently polls one job from each group, ensuring that each request within a batch comes from a distinct sequence. A job group (i.e., a sequence) is removed if no new job arrives in the group within max_idle_milliseconds.

[Figure: adding jobs to job groups and polling a batch across job groups]
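
The frontend is implemented in Java; the Python sketch below only illustrates the polling behavior described above, and the class and function names are hypothetical:

# Illustrative sketch of sequence batching over job groups. Names are
# hypothetical; the real logic lives in the TorchServe Java frontend.
import time
from collections import deque


class JobGroup:
    def __init__(self, sequence_id, max_queue_size):
        # Bounded per-sequence queue (maxSequenceJobQueueSize in the design).
        self.sequence_id = sequence_id
        self.queue = deque(maxlen=max_queue_size)
        self.last_active = time.monotonic()

    def add_job(self, job):
        self.queue.append(job)
        self.last_active = time.monotonic()

    def poll(self):
        return self.queue.popleft() if self.queue else None

    def is_expired(self, max_idle_ms):
        return (time.monotonic() - self.last_active) * 1000 > max_idle_ms


def poll_batch(job_groups, batch_size, max_idle_ms):
    """Build a batch with at most one job per sequence; drop idle sequences."""
    batch = []
    for group in list(job_groups.values()):
        if group.is_expired(max_idle_ms):
            del job_groups[group.sequence_id]  # sequence hit the idle timeout
            continue
        job = group.poll()
        if job is not None:
            batch.append(job)
        if len(batch) == batch_size:
            break
    return batch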

Backend Layer

The user chooses a caching solution in the customized handler to store and fetch the inference state based on the inference sequenceId. A separate TorchServe Cache RFC covers this part.
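
As a minimal illustration of that contract (not the cache RFC itself), a custom handler could key an in-process dict on the sequence ID exposed by the request context; the handler class and its toy "model" below are hypothetical:

# Hypothetical handler sketch: per-sequence state kept in an in-process dict
# keyed by the sequence ID from the request context. A real deployment would
# use the caching solution from the separate TorchServe Cache RFC.
from ts.torch_handler.base_handler import BaseHandler


class StatefulEchoHandler(BaseHandler):
    def initialize(self, context):
        self.context = context
        self.initialized = True
        self.cache = {}  # sequence_id -> accumulated per-sequence state

    def preprocess(self, data):
        inputs = []
        for idx, row in enumerate(data):
            # get_sequence_id returns the sequence ID attached to the idx-th
            # request of the batch (see context.py).
            sequence_id = self.context.get_sequence_id(idx)
            state = self.cache.setdefault(sequence_id, {"count": 0})
            state["count"] += 1
            inputs.append((sequence_id, row.get("data") or row.get("body")))
        return inputs

    def inference(self, inputs):
        # Toy "model": report how many requests each sequence has sent so far.
        return [f"{seq_id}: request #{self.cache[seq_id]['count']}"
                for seq_id, _ in inputs]

    def postprocess(self, outputs):
        return outputs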

Motivation, pitch

As described in the Background section, a stateful model maintains state across successive inference requests, and TorchServe's sequence batching supports this by retrieving one request from each sequence and combining requests from different sequences into a single batch. Each request carries a unique sequence ID, which can be extracted with the get_sequence_id function of context.py and used by custom handlers as the key for storing and retrieving state in the backend cache store. The client can also reuse the sequence ID when a connection resumes, as long as the sequence has not expired on the TorchServe side.

Alternatives

No response

Additional context

No response

bhack commented 10 months ago

Many models for video tasks also need to maintain a per-sequence state memory.

lxning commented 10 months ago

@bhack Is the backend cache (see stateful cache example) able to cover your use case "a per sequence state memory"?

bhack commented 10 months ago

Do you have any doc/markdown related to the stateful cache? I was talking about quite classical models with internal short- and long-term memories in different flavors, like: https://github.com/hkchengrex/XMem https://github.com/hkchengrex/Cutie https://github.com/yoxu515/aot-benchmark etc.

HamidShojanazeri commented 10 months ago

@bhack the current proposal is mostly focused on caching the previous frame/segment state/prediction; this is more of a serving scenario. Classic short/long-term memory, I believe, is mostly defined as part of the model. I would love to learn more about your thoughts/suggestions to see if we can accommodate that as well.

bhack commented 10 months ago

Yes, these memories are defined in the model. I don't know what kind of serving optimization is possible, but I suppose that in the worst case you need to maintain a session id between clients and the inference server so that you know which model instance to route each client's requests to, or you are going to totally lose the model state. Some of these models generally have a memory reset call, so an inference hook to reset the state and kick off a new session would certainly be useful, without wasting model unload and reload overhead. This needs to be part of the communication protocol between the client and the server.

lxning commented 10 months ago

@bhack the "sequence_id" is introduced in the "PredictionsRequest" protocol. TorchServe uses it to know which backend worker should be chosen to serve this request.

bhack commented 10 months ago

What about a hook to control the state reset? Generally these models expose a function to reset the state.

lxning commented 10 months ago

@bhack Maybe we can add a control function in handler_utils if this function can be generalized. Then it can be used in a custom handler in the backend layer. We can work together to add a video stateful inference example if you are interested.

bhack commented 10 months ago

Yes it could be useful

bhack commented 10 months ago

@bhack Maybe we can add a control function in handler_utils if this function can be generalized. Then it can be used in a custom handler in the backend layer. We can work together to add a video stateful inference example if you are interested.

/cc @hkchengrex @yoxu515 in the case they want to share with us their feedback.

I think that also @nikitakaraevv could be interested in sharing feedback on this for the new CoTracker version that could run inference on video chunks: https://github.com/facebookresearch/co-tracker/issues/37#issuecomment-1769491730

bhack commented 7 months ago

Any update on this?

bhack commented 1 month ago

New Meta model sam2 has very similar needs: https://github.com/facebookresearch/segment-anything-2

lxning commented 1 month ago

@bhack https://sam2.metademolab.com/demo is backed by TorchServe. :-)

bhack commented 1 month ago

Interesting, how is the inference_state managed? I suppose the problem is quite similar to the other models we have discussed in this thread.

lxning commented 1 month ago

That's the caching part of SAM2. Check https://github.com/facebookresearch/segment-anything-2/blob/main/notebooks/video_predictor_example.ipynb

bhack commented 1 month ago

I've seen this, but I don't understand how it is going to work in a multi-user setup with TorchServe. Is it going to exchange the session/state back and forth with every client? How are they going to call reset_state concurrently?

lxning commented 1 month ago

Both gRPC and HTTP are able to support sticky sessions. This design guarantees that each client session has a dedicated backend worker. The state of the session is handled by the model itself. You can take a look at this example: https://github.com/pytorch/serve/tree/master/examples/stateful/sequence_continuous_batching.

bhack commented 1 month ago

Interesting, but it could be harder to do with a standard exported model that is internally stateful. So it seems we need to have something similar by design.

In the meantime, do you think we could have a small example for this SAM2-like solution? I think it would be better than nothing.

lxning commented 1 month ago

@mreso could you please check if we can provide a SAM2 example for CX?

bhack commented 3 weeks ago

@mreso could you please check if we can provide a SAM2 example for CX?

It seems that users are going to have issues exporting the video mode to ONNX: https://github.com/ibaiGorordo/ONNX-SAM2-Segment-Anything/issues/6

So any example of how to serve it in TorchServe would be very useful.