HaiShaw opened this issue 3 hours ago
This is observed on H100 as well as MI300X. I expect some design changes will be needed (you may assign this to me).
On H100:
# python3 -m sglang.bench_latency --batch-size 32 --input 1024 --output 256 --model amd/Meta-Llama-3.1-70B-Instruct-FP8-KV --tp 8 --quantization fp8
Benchmark ...
Prefill. latency: 1.15925 s, throughput: 28266.67 token/s
Decode. latency: 0.01402 s, throughput: 2281.72 token/s
Decode. latency: 0.01353 s, throughput: 2365.70 token/s
Decode. latency: 0.01350 s, throughput: 2369.66 token/s
Decode. latency: 0.01346 s, throughput: 2377.09 token/s
Decode. latency: 0.01354 s, throughput: 2363.53 token/s
Decode. median latency: 0.01364 s, median throughput: 2346.63 token/s
Total. latency: 4.614 s, throughput: 8876.73 token/s
# python3 -m sglang.bench_latency --batch-size 32 --input 1024 --output 256 --model amd/Meta-Llama-3.1-70B-Instruct-FP8-KV --tp 8 --quantization fp8 --kv-cache-dtype fp8_e5m2
Benchmark ...
Prefill. latency: 1.16278 s, throughput: 28180.77 token/s
Decode. latency: 0.01554 s, throughput: 2059.15 token/s
Decode. latency: 0.01456 s, throughput: 2197.55 token/s
Decode. latency: 0.01453 s, throughput: 2202.10 token/s
Decode. latency: 0.01452 s, throughput: 2204.34 token/s
Decode. latency: 0.01453 s, throughput: 2202.13 token/s
Decode. median latency: 0.01471 s, median throughput: 2175.89 token/s
Total. latency: 4.886 s, throughput: 8383.15 token/s
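Comparing the two runs: median decode latency goes from 13.64 ms to 14.71 ms with the fp8_e5m2 KV cache, i.e. roughly 8% slower per decode step (median decode throughput drops from 2346.63 to 2175.89 token/s, and total throughput from 8876.73 to 8383.15 token/s, about 5.6% lower end to end).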
Motivation
Decode latency is notably slower with `--kv-cache-dtype fp8_e5m2`, due to the design choice of `torch.view(dtype=)` for accessing the KV-cache buffers.
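For reference, the access pattern in question looks roughly like the sketch below. This is a minimal sketch, assuming the pool allocates its buffers in a raw storage dtype and reinterprets them on access; the names, shapes, and the `set_kv`/`get_kv` helpers are illustrative, not the actual sglang memory-pool code:

```python
import torch

# Illustrative dtypes: the KV cache is requested as fp8_e5m2, but the
# underlying buffer is allocated in a raw byte dtype.
dtype = torch.float8_e5m2   # logical dtype from --kv-cache-dtype fp8_e5m2
store_dtype = torch.uint8   # raw dtype the buffer is actually allocated with

# One KV buffer (per layer in practice); [num_tokens, num_heads, head_dim].
k_buffer = torch.zeros((4096, 8, 128), dtype=store_dtype, device="cuda")

def set_kv(loc: torch.Tensor, cache_k: torch.Tensor) -> None:
    # Write path: cast activations down to fp8, then reinterpret the bytes
    # as the storage dtype so they can be scattered into the uint8 buffer.
    k_buffer[loc] = cache_k.to(dtype).view(store_dtype)

def get_kv() -> torch.Tensor:
    # Read path: reinterpret the raw bytes back as fp8 before the attention
    # kernel consumes them. This view(dtype=) is re-issued on every access.
    return k_buffer.view(dtype)
```

`view(dtype=)` itself is copy-free, but re-issuing it (plus the fp8 cast on the write path) for every layer on every decode step adds host-side overhead, which would be consistent with the per-step decode latency gap shown above.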
Related resources
No response