[RFC]: Torchserve Large Model Inference

Problem statement

Currently, Torchserve does not have a general solution for serving large models for inference.The only available support is in HuggingFace(HF) example for serving GPT2 style model using parallelize feature from HF. Serving large models uses model parallel solutions; a model is partitioned and placed on different devices and inference input will pass through these partitions until complete the forward pass. You can read more about it here.

Asking users to manually handle model partitioning imposes a lot of complexity to users and complicates the Torchserve backend/frontend configs in terms of number of workers and gpu assignments. Ideally we need an auto partitioner to take a checkpoint from the user, partition it and place on available devices.

User API

The user API would be similar to the following, it provides the flexibility to be used with custom handlers or integrate to base_handler.

#in config.properties
parallelize = true
number_of_gpu = x

# in  handler
from XXX import auto_partitioner

model = auto_partitioner("model_chkpoints", number_of_gpus=x)

When would an Auto partitioner be available?

Pytorch Distributed team is working on this feature,the ETA would be around July 2022.

Available solutions

So far most of the serving solutions for the large scale models are very much tailored for a specific architecture, which means to partition a model into multiple devices the full architecture of the models need to be known beforehand. Examples of these solutions can be found in Triton for Faster Transformer models, HuggingFace for GPT2 and T5.

Also, approaches like DeepSpeed suggest users providing partial checkpoints and it takes care of its placements on multiple devices.

The manual model parallelization with loading partial checkpoints or parallelizing a specific model through its known configs can be done in Torchserve custom handler already as we will show later in this doc.

Missing solution

Auto partitioner is not available in Pytorch yet. Ideally, we need a solution where users bring an arbitrary checkpoint and the serving solution can load the model, automatically partition it and place it over different devices.

The auto partitioner has been supported in Sagemaker model parallel library, however it has been in use for training purpose and inference solution yet to be available. This feature does not exist in Pytorch core ATM. Not having visibility into sub-modules in the user-defined model is the main issue.

There are two potential solutions to this problem

Similar approach to Sagemaker model parallel library.
Using Torch.FX

Pytorch distributed team is working to support this feature for Torchserve.

How does the Sagemaker model parallel handle auto partitioning?

During the first training step, the model parallel library internally runs a tracing step that is meant to construct the model graph and determine the tensor and parameter shapes. After this tracing step, the library constructs a tree, which consists of the nested nn.Module objects in the model, as well as additional data gathered from tracing, such as the amount of stored nn.Parameters, and execution time for each nn.Module.

Next, the library traverses this tree from the root and runs a partitioning algorithm that assigns each nn.Module to a device, which balances computational load (measured by module execution time) and memory use (measured by the total stored nn.Parameter size and activations). If multiple nn.Modules share the same nn.Parameter, then these modules are placed on the same device to avoid maintaining multiple versions of the same parameter. Once the partitioning decision is made, the assigned modules and weights are loaded to their devices.

How does Torchserve work?

Torchserve is a model serving library that uses REST APIs to handle connection between client and server. Frontend has been implemented using Java and backend is in Python.

Backend is responsible for initializing/ loading the model, running the inference and preparing the response.

Device assignment :

Torchserve accepts a number of available gpus and workers as a config. It then uses round robin on the number of available gpus to instantiate works on them.
Each worker has a copy of the model + cuda run time.
For example if we have 2 gpus and 3 workers, it will have workers assigned to gpus as follows : worker 1 : gpu1, worker2 : gpu2 ,worker3:gpu1.

Possible scenarios for large model inference in Torchserve

Here, we will list all the scenarios for serving large models on Torchserve. With large model inference, we are targeting models that would not fit into one gpu. Hence Model Parallel paradigm would be what we are looking for in this context.

Load partial checkpoints on the available devices (similar to [DeepSpeed](https://github.com/microsoft/DeepSpeed/blob/b6f0ac97ae03e8bc71f75991eb4a8a7f28d1fd9b/deepspeed/inference/engine.py#L36) , [load_state_dict](https://github.com/microsoft/DeepSpeed/blob/b6f0ac97ae03e8bc71f75991eb4a8a7f28d1fd9b/deepspeed/runtime/state_dict_factory.py#L117)):

def load_partial_checkpoints(ckpt_list, world_size):
   num_ckpt = len(ckpt_list)
   assert world_size % num_ckpt == 0, 'Invalid checkpoints and world size for sd split'
   model = torch.load(ckpt_list[ckpt_index],
                   map_location= cuda:world_size[index])

Pros

Easy to implement in the custom handler

Cons

Users need to bring their partial checkpoints/ saved on each device while training.
Number of checkpoints should match the world size( #gpus) for inference
This might not be efficient as training vs inference requires different memory

Model specific support (similar to Triton, HF), where model parallelism is implemented using the known architecture of the model

from transformers import  AutoModelForCausalLM
#known configs for the model such as HF model
model = AutoModelForCausalLM.from_pretrained(model_dir)
model.parallelize()

Pros

Can be borrowed from libraries that support it such as HF/ already available in Tochserve.

Cons

It is specific to a model config, can not generalize
User need to do that manually and use it in the custom handler

Auto model partitioner , that would take a full checkpoint and shard it (auto paritoner) and load it on assigned devices.

# in config.properties
parallelize = true
number_of_gpu = x
# in handler 
from XXX import auto_partitioner
model = auto_partitioner("model_chkpoints", number_of_gpus=x)

Pros

It is a generalized solution, user can bring an arbitrary trained model checkpoint
Remove the burden from user to handle parallelization
Give more control to Torchserve to do perf optimization

Cons

Takes longer to develop
Might not cover models that include DAGs

Desired general solution

Auto model partitioner that support arbitrary checkpoints, from discussions with

Arbitrary checkpoints means any .pt checkpoint. Not only specific models/ pretrained/ that have a known structure through their configurations(e.g GPT, BERT).
Extracting the structure of the model is not easy from checkpoints it needs 1) load the model 2) use torch.fx/TorchScript to do tracing 3) partition into stages 4) move params to the target device and insert D2D comm ops at stage boundaries accordingly.
- Step 1 can be handled using loading on the CPU for now until the meta device gets ready.
- Step 2 is available, but with caveats. For example, HF models do not work with the default fx tracer, and HF provided their own fx module and tracer. This means for different models, we or users might have to manually choose which tracer to use.
- Step 3 can be done based on the traced execution order and submodule sizes in the initial version. But this does not necessarily guarantee the best performance, especially for models with DAG structure.
- Step 4 can borrow the idea from PiPPy, i.e., replacing stage submodules with a helper wrapper module. The difference is that, since we only need single-process multi-device support for inference, the wrapper module itself can directly move activations to the next device.

Performance considerations

Batch scheduling – pipeline parallelism might be required to handle larger batch sizes to efficiently use multiple devices.
Profiling the multi-device model forward pass to tune the batch size to meet SLA.
Deferred module initialization for model materialization, this would be very useful when a module is memory-wise too big or computationally too expensive to construct on a single machine, but needs to be inspected for various reasons such as automated sharding.Basically can load tensors without being materialized.

Open questions

Should Torchserve support multi-model serving when model parallel inference is in use? – it might be still possible if not using all gpus for parallelizing inference.
How batch scheduling should be done/ pipeline parallelism is necessary? ( measure performance hits) — it can be useful if dealing with a large batch of inputs, which might be more suitable for non real time applications.

Acknowledgements

We would like to thank @msaroufim , @chauhang, @lxning, @jamesr66a ,@pbelevich, @kwen2501 and @cbalioglu for their great support, insightful inputs and comments in authoring this RFC.

pytorch / serve