In this repository, we show how to efficiently package and deploy Llama2 with NVIDIA Triton Inference Server, making it production-ready in no time.
We cover three different deployment approaches: concurrent model execution, dynamic batching, and the vLLM backend.
By exploiting Triton’s concurrent model execution feature, we gained a 1.5x increase in throughput by deploying two parallel instances of the Llama2 7B model quantized to 8-bit.
*(Benchmark table: execution time and throughput, one vs. two model instances.)*
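A minimal sketch of what this looks like in a Triton `config.pbtxt` is shown below; the model name, backend, and batch size are illustrative assumptions rather than the exact configuration shipped in this repository.

```
# config.pbtxt -- illustrative sketch, not the exact file from this repository.
# Two instances of the 8-bit Llama2 7B model are loaded on the same GPU, so
# Triton can execute two inference requests concurrently.
name: "llama2-7b"            # assumed model name
backend: "python"            # assumed backend
max_batch_size: 8

instance_group [
  {
    count: 2                 # two parallel model instances -> concurrent execution
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```

Running two copies roughly doubles the model’s memory footprint, which is where the 8-bit quantization mentioned above helps.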
Implementing dynamic batching added a further 5x increase in the model’s throughput.
*(Benchmark table: execution time and throughput with dynamic batching enabled.)*
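Dynamic batching is likewise enabled in `config.pbtxt`; the sketch below uses illustrative values for the preferred batch sizes and queue delay, not necessarily the ones behind these measurements.

```
# Added to config.pbtxt -- illustrative values, not the exact ones used here.
# Triton's dynamic batcher groups individual incoming requests into a single
# server-side batch, waiting up to max_queue_delay_microseconds for more
# requests to arrive before launching execution.
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

Larger batches amortize the per-step kernel launches over more requests, which is what drives the throughput gain; the queue delay bounds the extra latency a request can pick up while waiting to be batched.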
Incorporating the vLLM framework outperformed the dynamic batching results, with a 6x increase in throughput.
*(Benchmark table: execution time and throughput with the vLLM backend.)*
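With the vLLM backend, Triton delegates batching and KV-cache management to the vLLM engine. A hedged sketch of the corresponding `config.pbtxt` follows; the model name is an assumption, and the vLLM engine itself (model id, GPU memory utilization, etc.) is configured in a separate `model.json` in the model’s version directory.

```
# config.pbtxt -- hedged sketch for serving the model through Triton's vLLM backend.
name: "llama2-7b-vllm"       # assumed model name
backend: "vllm"

# vLLM performs its own continuous batching, so Triton's dynamic batcher is not
# needed; a single KIND_MODEL instance lets the engine manage the GPU itself.
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
```

Continuous batching schedules new requests into the running batch at each generation step instead of waiting for the whole batch to finish, which is largely why it edges out Triton’s request-level dynamic batching here.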
- Deploying Llama2 with NVIDIA Triton Inference Server blog post.
- NVIDIA Triton Inference Server official documentation.