hpcaitech / EnergonAI

Large-scale model inference.
Apache License 2.0

[RFC] Async engine and pipeline based on RPC #151

Closed ver217 closed 2 years ago

ver217 commented 2 years ago

Motivation

  1. Serving and the inference engine are currently coupled, which is not a good backend architecture.
  2. The HTTP server and the engine run in the same process as worker 0. Because of the GIL, they preempt each other for CPU time, which decreases throughput.
  3. Pipeline parallelism uses send() and recv(), which requires customizing communication for each model. This is not flexible.

Design

Architecture

[Architecture diagram] Here is a 2 TP + 2 PP example. Each square represents a process. The pipe is implemented with RPC and a Queue.
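Conceptually, the pipe is an input queue on each stage worker that the previous stage fills via RPC calls. Below is a minimal sketch of that idea under these assumptions; the names `recv_queue`, `enqueue`, `push_to_next_stage`, and `stage_loop` are illustrative placeholders, not EnergonAI APIs.

```python
# Minimal sketch: a pipe between pipeline stages built on RPC + Queue.
# Assumes each stage runs in its own process that has called rpc.init_rpc().
import queue
import torch
import torch.distributed.rpc as rpc

# Per-process input queue of the pipeline stage living in this worker.
recv_queue: "queue.Queue[torch.Tensor]" = queue.Queue()

def enqueue(tensor: torch.Tensor) -> None:
    """Called remotely by the previous stage to hand over activations."""
    recv_queue.put(tensor)

def push_to_next_stage(next_worker: str, tensor: torch.Tensor) -> None:
    # rpc_async returns immediately, so the caller keeps computing while the
    # tensor is being transferred (communication/computation overlap).
    rpc.rpc_async(next_worker, enqueue, args=(tensor,))

def stage_loop(next_worker: str, module: torch.nn.Module) -> None:
    while True:
        x = recv_queue.get()                 # receive from the previous stage
        y = module(x)                        # run this stage's model partition
        push_to_next_stage(next_worker, y)   # hand off to the next stage
```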

Advantages

  1. Serving and the inference engine are decoupled. You can use a Python HTTP server, gRPC, or any server you like.
  2. The engine and the workers are not in the same process, so computation and I/O (except pipe communication) do not preempt each other for CPU time.
  3. The pipe is implemented with torch.distributed.rpc and a Queue. Since PyTorch states that RPC messages are sent and received in parallel with the execution of Python code, we can assume that computation and pipe communication overlap.
  4. Since RPC calls support various input types and CUDA tensors (device-to-device), we do not need to customize communication for pipeline parallelism (see the sketch after this list).
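To illustrate point 4, here is a rough sketch of a device-to-device CUDA tensor transfer over torch.distributed.rpc using a TensorPipe device map. The worker names ("worker0", "worker1") and the `double` helper are assumptions made for the example, not EnergonAI code.

```python
import torch
import torch.distributed.rpc as rpc

def double(x: torch.Tensor) -> torch.Tensor:
    # Runs on the callee; with a device map, x arrives directly on its GPU.
    return x * 2

def run_caller():
    opts = rpc.TensorPipeRpcBackendOptions()
    # Map this process's cuda:0 onto worker1's cuda:0 so tensors move
    # device-to-device instead of being staged through host memory.
    opts.set_device_map("worker1", {0: 0})
    rpc.init_rpc("worker0", rank=0, world_size=2, rpc_backend_options=opts)

    x = torch.randn(4, 4, device="cuda:0")
    fut = rpc.rpc_async("worker1", double, args=(x,))  # non-blocking call
    y = fut.wait()  # the result comes back as a CUDA tensor as well
    rpc.shutdown()

# worker1 would call rpc.init_rpc("worker1", rank=1, world_size=2, ...)
# with a symmetric device map and simply serve incoming RPCs.
```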
dujiangsu commented 2 years ago

Great! By the way, based on practical deployment experience, the serving system should sit in front of multiple engine instances and handle functions like batching and request caching, as sketched below. We can refer to Triton or reuse it directly.
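A rough sketch of that idea: one serving process collects requests, batches them, and round-robins the batches across several engine instances. The `EngineClient.submit` API used here is a hypothetical placeholder, not an existing EnergonAI or Triton interface.

```python
# Sketch of a batching front-end over multiple engine instances (assumed API).
import asyncio
from itertools import cycle

MAX_BATCH_SIZE = 8
BATCH_TIMEOUT_S = 0.01

request_queue: asyncio.Queue = asyncio.Queue()

async def handle_request(prompt: str) -> str:
    # Called by the HTTP layer; resolves once the batcher fills the future.
    fut = asyncio.get_running_loop().create_future()
    await request_queue.put((prompt, fut))
    return await fut

async def batcher(engines):
    engine_iter = cycle(engines)  # round-robin over engine instances
    while True:
        prompt, fut = await request_queue.get()
        batch = [(prompt, fut)]
        # Collect more requests until the batch is full or the timeout expires.
        try:
            while len(batch) < MAX_BATCH_SIZE:
                item = await asyncio.wait_for(request_queue.get(), BATCH_TIMEOUT_S)
                batch.append(item)
        except asyncio.TimeoutError:
            pass
        engine = next(engine_iter)
        outputs = await engine.submit([p for p, _ in batch])  # hypothetical API
        for (_, f), out in zip(batch, outputs):
            f.set_result(out)
```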