Model | Batch | Hardware | ttft (ms) | t/s/u | Target t/s/u | t/s | Release |
---|---|---|---|---|---|---|---|
Falcon7B-decode | 32 | e150 | | 4.2 | 4.4 | 134.4 | |
Falcon7B | 32 | n150 | 71 | 17.6 | 26 | 563.2 | v0.53.0-rc44 |
Mistral-7B | 32 | n150 | | 9.9 | 25 | 316.8 | v0.51.0-rc28 |
Mamba-2.8B | 32 | n150 | 48 | 12.3 | 41 | 393.6 | v0.51.0-rc26 |
LLaMA-3.1-8B | 1 | n150 | 209 | 23.7 | 23 | 23.7 | v0.53.0-rc44 |
LLaMA-3.2-1B | 1 | n150 | 72 | 86.4 | 160 | 86.4 | v0.53.0-rc44 |
LLaMA-3.2-3B | 1 | n150 | 123 | 44.7 | 60 | 44.7 | v0.53.0-rc44 |
Falcon7B (DP=8) | 256 | QuietBox | 97 | 14.6 | 26 | 3737.6 | v0.53.0-rc44 |
LLaMA-3.1-70B (TP=8) | 32 | QuietBox | 190 | 15.1 | 20 | 483.2 | v0.53.0-rc36 |
Falcon40B (TP=8) | 32 | QuietBox | | 5.3 | 36 | 169.6 | v0.53.0-rc39 |
Mixtral7Bx8 (TP=8) | 32 | QuietBox | 230 | 14.6 | 33 | 467.2 | v0.53.0-rc44 |
Falcon7B (DP=32) | 1024 | Galaxy | 242 | 4.4 | 26 | 4505.6 | v0.53.0-rc33 |
LLaMA-3.1-70B (DP=4, TP=8) | 128 | Galaxy | 190 | 14.3 | 20 | 1835.5 | v0.52.0-rc31 |
Last Update: November 18, 2024
Notes:
- TP = Tensor Parallel, DP = Data Parallel; these define the parallelization factors across multiple devices.
- The reported LLM performance is for an input sequence length (number of rows filled in the KV cache) of 128 for all models except Mamba (which can accept any sequence length).
- The t/s/u reported is the throughput of the first token generated after prefill, i.e. 1 / inter-token latency.
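
The relationship between the throughput columns can be checked with a short sketch. It uses the note's definition of t/s/u as the reciprocal of inter-token latency, and it assumes that aggregate t/s equals t/s/u multiplied by the batch size, which is inferred from the Falcon7B rows above rather than stated in the notes. The function names are illustrative and not part of any Tenstorrent API.

```python
def inter_token_latency_ms(tokens_per_sec_per_user: float) -> float:
    """Inter-token latency in milliseconds, from t/s/u = 1 / inter-token latency."""
    return 1000.0 / tokens_per_sec_per_user


def aggregate_tokens_per_sec(tokens_per_sec_per_user: float, batch: int) -> float:
    """Aggregate t/s, assuming every user in the batch decodes at the same rate."""
    return tokens_per_sec_per_user * batch


# Falcon7B on n150, taken from the table: batch 32, 17.6 t/s/u.
latency = inter_token_latency_ms(17.6)
total = aggregate_tokens_per_sec(17.6, 32)
print(f"{latency:.1f} ms per token, {total:.1f} t/s")
```

For the Falcon7B n150 row this gives roughly 56.8 ms between tokens and 563.2 t/s, which agrees with the t/s column.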
Model | Batch | Hardware | fps | Target fps | Release |
---|---|---|---|---|---|
ResNet-50 (224x224) | 20 | e150 | 5,100 | 10,000 | |
ResNet-50 (224x224) | 16 | n150 | 4,670 | 7,000 | |
ResNet-50 (224x224) (DP=2) | 32 | n300 | 8,200 | 14,000 | |
ResNet-50 (224x224) (DP=8) | 128 | QuietBox | 32,250 | 56,000 | |
ResNet-50 (224x224) (DP=32) | 512 | Galaxy | 95,900 | 224,000 | |
ResNet-50 (224x224) (DP=64) | 1024 | Two Galaxies | 145,000 | 448,000 | |
ViT (224x224) | 9 | e150 | 1,360 | 2,000 | |
ViT (224x224) | 8 | n150 | 912 | 1,600 | |
Stable Diffusion 1.4 (512x512) | 1 | n150 | 0.167 | 0.3 | |
Yolo V4 (320x320) | 1 | n150 | 95 | 300 | |
Segformer Semantic Segmentation (512x512) | 1 | n150 | 90 | 300 | |
Model | Batch | Hardware | sen/sec | Target sen/sec | Release |
---|---|---|---|---|---|
BERT-Large | 12 | e150 | 370 | 410 | |
BERT-Large | 8 | n150 | 270 | 400 | |
T5 small | | e150 | 140 | | |
Bloom | | e150 | 70 | | |
For the latest model updates and features, please see MODEL_UPDATES.md
Get started with simple kernels.