In this repository, we show how to efficiently package and deploy Llama2 with NVIDIA Triton Inference Server, making it production-ready in no time.
We cover three different deployment approaches: concurrent model execution, dynamic batching, and the vLLM backend.
By exploiting Triton’s concurrent model execution feature, we gained a 1.5x increase in throughput by deploying two parallel instances of the Llama2 7B model quantized to 8-bit.
*(Benchmark table: execution time and throughput, one vs. two model instances.)*
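A minimal sketch of what this looks like in a Triton `config.pbtxt` is shown below; the model name, backend, and batch size are illustrative assumptions rather than the exact configuration shipped in this repository.

```
# config.pbtxt -- illustrative sketch, not the exact file from this repository.
# Two instances of the 8-bit Llama2 7B model are loaded on the same GPU, so
# Triton can execute two inference requests concurrently.
name: "llama2-7b"            # assumed model name
backend: "python"            # assumed backend
max_batch_size: 8

instance_group [
  {
    count: 2                 # two parallel model instances -> concurrent execution
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```

Running two copies roughly doubles the model’s memory footprint, which is where the 8-bit quantization mentioned above helps.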
Implementing dynamic batching added a further 5x increase in the model’s throughput.
*(Benchmark table: execution time and throughput with dynamic batching enabled.)*
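Dynamic batching is likewise enabled in `config.pbtxt`; the sketch below uses illustrative values for the preferred batch sizes and queue delay, not necessarily the ones behind these measurements.

```
# Added to config.pbtxt -- illustrative values, not the exact ones used here.
# Triton's dynamic batcher groups individual incoming requests into a single
# server-side batch, waiting up to max_queue_delay_microseconds for more
# requests to arrive before launching execution.
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

Larger batches amortize the per-step kernel launches over more requests, which is what drives the throughput gain; the queue delay bounds the extra latency a request can pick up while waiting to be batched.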
Incorporating the vLLM framework outperformed the dynamic batching results, with a 6x increase in throughput.
*(Benchmark table: execution time and throughput with the vLLM backend.)*
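With the vLLM backend, Triton delegates batching and KV-cache management to the vLLM engine. A hedged sketch of the corresponding `config.pbtxt` follows; the model name is an assumption, and the vLLM engine itself (model id, GPU memory utilization, etc.) is configured in a separate `model.json` in the model’s version directory.

```
# config.pbtxt -- hedged sketch for serving the model through Triton's vLLM backend.
name: "llama2-7b-vllm"       # assumed model name
backend: "vllm"

# vLLM performs its own continuous batching, so Triton's dynamic batcher is not
# needed; a single KIND_MODEL instance lets the engine manage the GPU itself.
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
```

Continuous batching schedules new requests into the running batch at each generation step instead of waiting for the whole batch to finish, which is largely why it edges out Triton’s request-level dynamic batching here.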
- Deploying Llama2 with NVIDIA Triton Inference Server blog post.
- NVIDIA Triton Inference Server official documentation.