tpoisonooo / llama.onnx

LLaMa/RWKV onnx models, quantization and testcase
GNU General Public License v3.0

GPU Inference #25

Open tpoisonooo opened 1 year ago

tpoisonooo commented 1 year ago

llama.onnx is primarily meant for understanding LLMs and for converting them to run on NPUs.
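For readers using the exported graphs that way, a minimal sketch of inspecting one with onnxruntime is shown below; the file name `decoder.onnx` is hypothetical, not a path shipped by this repo.

```python
# Minimal sketch: load an exported ONNX graph and inspect its I/O signature
# before porting it to an NPU toolchain. "decoder.onnx" is a placeholder name.
import onnx
import onnxruntime as ort

model = onnx.load("decoder.onnx")
onnx.checker.check_model(model)          # verify the graph is well formed

session = ort.InferenceSession("decoder.onnx", providers=["CPUExecutionProvider"])
for inp in session.get_inputs():
    # Print each expected input's name, shape, and dtype.
    print(inp.name, inp.shape, inp.type)
```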

If you are looking for inference on Nvidia GPUs, we have released lmdeploy at https://github.com/InternLM/lmdeploy.

It supports:

  • Tensor parallelism
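For reference, a minimal sketch of running GPU inference through lmdeploy follows, assuming a recent release that exposes the high-level `pipeline` API; the model id is only an example.

```python
# Minimal sketch of multi-GPU inference with lmdeploy's pipeline API.
from lmdeploy import pipeline, TurbomindEngineConfig

# tp=2 asks the TurboMind backend to shard the weights across two GPUs
# (tensor parallelism); use tp=1 for a single GPU.
pipe = pipeline(
    "internlm/internlm2-chat-7b",                 # example model id
    backend_config=TurbomindEngineConfig(tp=2),
)

responses = pipe(["Explain tensor parallelism in one sentence."])
print(responses[0].text)
```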

tpoisonooo commented 1 year ago

#19 #16 #15

tpoisonooo commented 1 year ago

#22 #15

yiliu30 commented 1 year ago
  • Tensor parallelism

Nice work! Can tensor parallelism be implemented using both Torch and ONNX models?
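To make the question concrete, here is a conceptual sketch of column-parallel tensor parallelism on a single linear layer. It is written against plain PyTorch on CPU so it runs anywhere; a real deployment would place each weight shard on its own GPU and replace the `torch.cat` with a collective such as all-gather.

```python
# Conceptual sketch: split a Linear(8 -> 16) across two "devices" by
# sharding its output dimension (column parallelism).
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)            # batch of activations, hidden size 8
w = torch.randn(16, 8)           # full weight of a Linear(8 -> 16)

w0, w1 = w.chunk(2, dim=0)       # each shard owns half the output features

y0 = x @ w0.T                    # shard 0's partial result (4 x 8)
y1 = x @ w1.T                    # shard 1's partial result (4 x 8)
y = torch.cat([y0, y1], dim=1)   # gather: equivalent to the unsharded layer

assert torch.allclose(y, x @ w.T)
```

In principle the same split applies to an ONNX export as well, since each shard is just an ordinary matmul; the runtime-specific part is the cross-device gather, which is why multi-GPU ONNX deployments typically export one partitioned graph per rank rather than a single graph.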