Motivation
The serving layer and the inference engine are currently coupled, which is not a good backend architecture.
The HTTP server and the engine run in the same process as worker 0. Because of the GIL, they preempt each other on the CPU, which reduces throughput.
Pipeline parallelism uses send() and recv(), which requires customizing the communication for each model. This is not flexible.
Design
Architecture
Here is an example with 2-way tensor parallelism (TP) and 2-way pipeline parallelism (PP). Each square represents a process. The pipe is implemented with RPC and a Queue.
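As a rough illustration of the process layout only (not the actual implementation), the sketch below launches one engine process and four worker processes (2 PP stages x 2 TP ranks) that join a single torch.distributed.rpc group. The process names, rendezvous address, and spawn logic are assumptions made for the example.

```python
# Sketch: 1 engine process + 2 PP stages * 2 TP ranks = 4 workers, all in one
# RPC group. Names and launch details are illustrative assumptions.
import os

import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

WORLD_SIZE = 5  # 1 engine + 4 workers


def run(rank: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # rendezvous address (assumption)
    os.environ["MASTER_PORT"] = "29500"
    if rank == 0:
        # Engine process: only schedules work and issues RPC calls to workers.
        rpc.init_rpc("engine", rank=rank, world_size=WORLD_SIZE)
    else:
        # Map ranks 1..4 onto (pp stage, tp rank).
        pp_stage, tp_rank = divmod(rank - 1, 2)
        rpc.init_rpc(f"worker_pp{pp_stage}_tp{tp_rank}",
                     rank=rank, world_size=WORLD_SIZE)
    # Every process blocks here serving RPCs until all of them call shutdown().
    rpc.shutdown()


if __name__ == "__main__":
    mp.spawn(run, nprocs=WORLD_SIZE, join=True)
```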
Advantages
The serving layer and the inference engine are decoupled. You can use a Python HTTP server, gRPC, or any server you like.
The engine and the workers are no longer in the same process, so computation and I/O (except pipe communication) do not preempt each other on the CPU.
The pipe is implemented with torch.distributed.rpc and a Queue. Since PyTorch documents that RPC messages are sent and received in parallel with the execution of Python code, we can assume that computation and pipe communication can overlap.
Since RPC calls support various input types and CUDA tensors (device-to-device transfer), we no longer need to customize communication for pipeline parallelism (see the sketch below).
Great!
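To make the pipe idea concrete, here is a minimal sketch of how one stage could forward activations to the next stage with a non-blocking RPC call while the receiver buffers them in a Queue. The stage names and the recv_hidden helper are hypothetical, and the device-map snippet only shows how TensorPipe is told to move CUDA tensors device-to-device.

```python
# Sketch of the "pipe": the sender issues rpc_async and keeps computing; the
# receiver's RPC handler enqueues the tensor into a local Queue that its model
# loop consumes. Stage names and helpers are assumptions, not vLLM's API.
import queue

import torch
import torch.distributed.rpc as rpc

# Per-process inbox holding activations received from the previous stage.
_inbox: "queue.Queue[torch.Tensor]" = queue.Queue()


def recv_hidden(hidden: torch.Tensor) -> None:
    """Executed on the next stage via RPC: just enqueue, do no compute here."""
    _inbox.put(hidden)


def send_to_next_stage(next_stage: str, hidden: torch.Tensor) -> None:
    # rpc_async returns immediately, so this stage keeps computing while
    # TensorPipe ships the tensor in the background (compute/communication
    # overlap).
    rpc.rpc_async(next_stage, recv_hidden, args=(hidden,))


# For CUDA tensors, a device map set before init_rpc lets TensorPipe copy
# them device-to-device instead of staging through the host, e.g.:
#   opts = rpc.TensorPipeRpcBackendOptions()
#   opts.set_device_map("worker_pp1_tp0", {0: 1})  # my cuda:0 -> their cuda:1
#   rpc.init_rpc("worker_pp0_tp0", rank=1, world_size=5,
#                rpc_backend_options=opts)
```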
By the way, in practical deployments the serving system should sit in front of multiple engine instances, handling functions such as request batching and caching. We can refer to Triton Inference Server or reuse it directly.
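For illustration only, the sketch below shows one way such a front end could batch incoming requests and round-robin them across several engine instances. The Engine type and its generate() method are assumptions, not an existing API.

```python
# Sketch of a thin serving layer in front of several engine instances.
# `engine.generate` is a hypothetical async API returning one output per prompt.
import asyncio
import itertools
from typing import List


class Frontend:
    """Batches requests and dispatches them round-robin to engine instances."""

    def __init__(self, engines: List[object], max_batch: int = 8):
        self._engines = itertools.cycle(engines)   # round-robin dispatch
        self._max_batch = max_batch
        self._pending: asyncio.Queue = asyncio.Queue()

    async def submit(self, prompt: str) -> str:
        # Called by the HTTP/gRPC handler; resolves once the batch finishes.
        fut = asyncio.get_running_loop().create_future()
        await self._pending.put((prompt, fut))
        return await fut

    async def batch_loop(self) -> None:
        while True:
            batch = [await self._pending.get()]
            while len(batch) < self._max_batch and not self._pending.empty():
                batch.append(self._pending.get_nowait())
            engine = next(self._engines)
            outputs = await engine.generate([prompt for prompt, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)
```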