Motivation
The serving layer and the inference engine are currently coupled, which is not a good backend architecture.
The HTTP server and the engine run in the same process as worker 0. Because of the GIL, they preempt each other on the CPU, which reduces throughput.
Pipeline parallelism uses send() and recv(), which requires customizing the communication for each model. This is not flexible.
Design
Architecture
Here is an example with 2-way tensor parallelism (TP) and 2-way pipeline parallelism (PP). Each square represents a process. The pipe is implemented with RPC and a Queue.
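As a rough illustration of the process layout only (not the actual implementation), the sketch below launches one engine process and four worker processes (2 PP stages x 2 TP ranks) that join a single torch.distributed.rpc group. The process names, rendezvous address, and spawn logic are assumptions made for the example.

```python
# Sketch: 1 engine process + 2 PP stages * 2 TP ranks = 4 workers, all in one
# RPC group. Names and launch details are illustrative assumptions.
import os

import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

WORLD_SIZE = 5  # 1 engine + 4 workers


def run(rank: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # rendezvous address (assumption)
    os.environ["MASTER_PORT"] = "29500"
    if rank == 0:
        # Engine process: only schedules work and issues RPC calls to workers.
        rpc.init_rpc("engine", rank=rank, world_size=WORLD_SIZE)
    else:
        # Map ranks 1..4 onto (pp stage, tp rank).
        pp_stage, tp_rank = divmod(rank - 1, 2)
        rpc.init_rpc(f"worker_pp{pp_stage}_tp{tp_rank}",
                     rank=rank, world_size=WORLD_SIZE)
    # Every process blocks here serving RPCs until all of them call shutdown().
    rpc.shutdown()


if __name__ == "__main__":
    mp.spawn(run, nprocs=WORLD_SIZE, join=True)
```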
Advantages
The serving layer and the inference engine are decoupled. You can use a Python HTTP server, gRPC, or any server you like.
The engine and the workers are no longer in the same process, so computation and I/O (except pipe communication) do not preempt each other on the CPU.
The pipe is implemented with torch.distributed.rpc and a Queue. Since PyTorch documents that RPC messages are sent and received in parallel with the execution of Python code, we can assume that computation and pipe communication can overlap.
Since RPC calls support various input types and CUDA tensors (device-to-device transfer), we no longer need to customize communication for pipeline parallelism (see the sketch below).
Great!
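To make the pipe idea concrete, here is a minimal sketch of how one stage could forward activations to the next stage with a non-blocking RPC call while the receiver buffers them in a Queue. The stage names and the recv_hidden helper are hypothetical, and the device-map snippet only shows how TensorPipe is told to move CUDA tensors device-to-device.

```python
# Sketch of the "pipe": the sender issues rpc_async and keeps computing; the
# receiver's RPC handler enqueues the tensor into a local Queue that its model
# loop consumes. Stage names and helpers are assumptions, not vLLM's API.
import queue

import torch
import torch.distributed.rpc as rpc

# Per-process inbox holding activations received from the previous stage.
_inbox: "queue.Queue[torch.Tensor]" = queue.Queue()


def recv_hidden(hidden: torch.Tensor) -> None:
    """Executed on the next stage via RPC: just enqueue, do no compute here."""
    _inbox.put(hidden)


def send_to_next_stage(next_stage: str, hidden: torch.Tensor) -> None:
    # rpc_async returns immediately, so this stage keeps computing while
    # TensorPipe ships the tensor in the background (compute/communication
    # overlap).
    rpc.rpc_async(next_stage, recv_hidden, args=(hidden,))


# For CUDA tensors, a device map set before init_rpc lets TensorPipe copy
# them device-to-device instead of staging through the host, e.g.:
#   opts = rpc.TensorPipeRpcBackendOptions()
#   opts.set_device_map("worker_pp1_tp0", {0: 1})  # my cuda:0 -> their cuda:1
#   rpc.init_rpc("worker_pp0_tp0", rank=1, world_size=5,
#                rpc_backend_options=opts)
```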
By the way, in practical deployments the serving system should sit in front of multiple engine instances, handling functions such as request batching and caching. We can refer to Triton Inference Server or reuse it directly.
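For illustration only, the sketch below shows one way such a front end could batch incoming requests and round-robin them across several engine instances. The Engine type and its generate() method are assumptions, not an existing API.

```python
# Sketch of a thin serving layer in front of several engine instances.
# `engine.generate` is a hypothetical async API returning one output per prompt.
import asyncio
import itertools
from typing import List


class Frontend:
    """Batches requests and dispatches them round-robin to engine instances."""

    def __init__(self, engines: List[object], max_batch: int = 8):
        self._engines = itertools.cycle(engines)   # round-robin dispatch
        self._max_batch = max_batch
        self._pending: asyncio.Queue = asyncio.Queue()

    async def submit(self, prompt: str) -> str:
        # Called by the HTTP/gRPC handler; resolves once the batch finishes.
        fut = asyncio.get_running_loop().create_future()
        await self._pending.put((prompt, fut))
        return await fut

    async def batch_loop(self) -> None:
        while True:
            batch = [await self._pending.get()]
            while len(batch) < self._max_batch and not self._pending.empty():
                batch.append(self._pending.get_nowait())
            engine = next(self._engines)
            outputs = await engine.generate([prompt for prompt, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)
```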