
[Feature, Performance] kv cache performance improvement #2087

Open HaiShaw opened 3 hours ago

HaiShaw commented 3 hours ago

Motivation

Decode latency is notably slower with --kv-cache-dtype fp8_e5m2, due to the design choice of using torch.view(dtype=) on the KV cache.
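
For context, a minimal sketch of the pattern in question, assuming the cache is physically stored in a byte dtype and reinterpreted as fp8 on each read (names and shapes here are hypothetical, not SGLang's actual code):

import torch

# Hypothetical KV-cache layout: store raw bytes, reinterpret on every access.
store_dtype = torch.uint8            # physical storage dtype (assumption)
logical_dtype = torch.float8_e5m2    # dtype requested via --kv-cache-dtype

k_buffer = torch.zeros(4096, 8, 128, dtype=store_dtype)

def get_key_buffer():
    # view(dtype=...) is zero-copy, but it constructs a new tensor object
    # on every call; repeated per layer per decode step, that CPU-side
    # overhead adds up on the decode hot path.
    return k_buffer.view(logical_dtype)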

HaiShaw commented 3 hours ago

This is observed on H100 as well as on MI300X. Expect some design changes (you may assign this to me).
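
One possible direction, assuming the per-access view is the main cost: keep the buffer in the logical fp8 dtype so reads need no reinterpretation (a hypothetical sketch, not a concrete proposal for the SGLang code):

import torch

# Allocate the cache directly in the logical dtype; the hot path then
# returns the buffer as-is instead of calling view(dtype=...) per access.
k_buffer = torch.zeros(4096, 8, 128, dtype=torch.float8_e5m2)

def get_key_buffer():
    return k_buffer  # no per-call reinterpretation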

HaiShaw commented 3 hours ago

On H100:

# python3 -m sglang.bench_latency --batch-size 32 --input 1024 --output 256 --model amd/Meta-Llama-3.1-70B-Instruct-FP8-KV --tp 8 --quantization fp8
Benchmark ...
Prefill. latency: 1.15925 s, throughput:  28266.67 token/s
Decode.  latency: 0.01402 s, throughput:   2281.72 token/s
Decode.  latency: 0.01353 s, throughput:   2365.70 token/s
Decode.  latency: 0.01350 s, throughput:   2369.66 token/s
Decode.  latency: 0.01346 s, throughput:   2377.09 token/s
Decode.  latency: 0.01354 s, throughput:   2363.53 token/s
Decode.  median latency: 0.01364 s, median throughput:   2346.63 token/s
Total. latency:  4.614 s, throughput:   8876.73 token/s

# python3 -m sglang.bench_latency --batch-size 32 --input 1024 --output 256 --model amd/Meta-Llama-3.1-70B-Instruct-FP8-KV --tp 8 --quantization fp8 --kv-cache-dtype fp8_e5m2
Benchmark ...
Prefill. latency: 1.16278 s, throughput:  28180.77 token/s
Decode.  latency: 0.01554 s, throughput:   2059.15 token/s
Decode.  latency: 0.01456 s, throughput:   2197.55 token/s
Decode.  latency: 0.01453 s, throughput:   2202.10 token/s
Decode.  latency: 0.01452 s, throughput:   2204.34 token/s
Decode.  latency: 0.01453 s, throughput:   2202.13 token/s
Decode.  median latency: 0.01471 s, median throughput:   2175.89 token/s
Total. latency:  4.886 s, throughput:   8383.15 token/s
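
Comparing the two runs: median decode latency regresses from 0.01364 s to 0.01471 s (about 7.8%), and total throughput drops from 8876.73 to 8383.15 token/s (about 5.6%). A rough way to sanity-check how much CPU-side time the reinterpreting view alone can cost is to time it in isolation (buffer size and iteration count are arbitrary):

import time
import torch

buf = torch.zeros(2**20, dtype=torch.uint8)

n = 100_000
t0 = time.perf_counter()
for _ in range(n):
    buf.view(torch.float8_e5m2)  # zero-copy, but a new tensor header each call
t1 = time.perf_counter()
print(f"view(dtype=): {(t1 - t0) / n * 1e6:.3f} us per call")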