hpcaitech / EnergonAI

Large-scale model inference.
Apache License 2.0

inference of pre-trained model #125

Open Emerald01 opened 2 years ago

Emerald01 commented 2 years ago

Hi, I am very interested in the distributed inference of Colossal-AI. We have pre-trained NLP models from PyTorch and JAX, and I wonder whether it is possible, and what would need to be done, to use EnergonAI for their inference. At the inference (model production) stage, the requirement for a smaller model footprint is much more pressing than at the training stage; imagine an NLP model server producing results for clients.

From your documentation:

> For models trained by [Colossal-AI](https://github.com/hpcaitech/ColossalAI), they can be seamlessly transferred to Energon-AI. For single-device models, they require manual coding works to introduce tensor parallelism and pipeline parallelism.

I do not have a clear idea of how this relates to my question. If you have some examples, I would be eager to study them.
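For context, here is a minimal sketch of what the "manual coding to introduce tensor parallelism" step typically entails. This is framework-agnostic plain PyTorch, not EnergonAI's actual API; `ColumnParallelLinear` is an illustrative name. The idea is to shard a layer's weight across ranks and gather the partial outputs:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F

class ColumnParallelLinear(nn.Module):
    """Linear layer whose output features are sharded across ranks.

    Each rank holds out_features // world_size rows of the weight,
    computes its local shard of the output, and the shards are
    gathered back along the feature dimension.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0, "out_features must divide evenly"
        self.local_out = out_features // world_size
        self.weight = nn.Parameter(torch.empty(self.local_out, in_features))
        self.bias = nn.Parameter(torch.zeros(self.local_out))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Local partial result over this rank's shard of output features.
        local_y = F.linear(x, self.weight, self.bias)  # (..., local_out)
        # For inference a plain all_gather suffices; training would need
        # an autograd-aware gather instead.
        shards = [torch.empty_like(local_y) for _ in range(dist.get_world_size())]
        dist.all_gather(shards, local_y)
        return torch.cat(shards, dim=-1)  # (..., out_features)
```

Converting a single-device model means replacing layers like `nn.Linear` with sharded counterparts such as this and loading the corresponding slice of the pre-trained weights on each rank.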

As for Microsoft DeepSpeed, they claim:

> DeepSpeed provides a seamless inference mode for compatible transformer based models trained using DeepSpeed, Megatron, and HuggingFace, meaning that we don't require any change on the modeling side such as exporting the model or creating a different checkpoint from your trained checkpoints.

I am wondering whether Colossal-AI has a similar capability.
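For comparison, the DeepSpeed inference mode quoted above looks roughly like the sketch below. It assumes the `transformers` and `deepspeed` packages; `mp_size=2` is an illustrative choice, and parameter names follow the `deepspeed.init_inference` API of that era (newer releases have changed some argument names):

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Wraps the trained checkpoint in place; no re-export or
# checkpoint conversion is needed.
engine = deepspeed.init_inference(
    model,
    mp_size=2,                       # number of model-parallel GPUs
    dtype=torch.half,                # run in fp16
    replace_with_kernel_inject=True, # swap in optimized inference kernels
)

inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda")
outputs = engine.module.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```

A script like this would normally be launched with the `deepspeed` launcher (e.g. `deepspeed --num_gpus 2 script.py`) so that one process is spawned per GPU.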

dujiangsu commented 2 years ago

Yes, models trained by ColossalAI can be easily transferred to EnergonAI for deployment. We are currently preparing a demo that serves a GPT-3-scale model.