sgl-project/sglang

SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
Apache License 2.0

Add sglang.bench_latency for offline benchmark #564

Closed: merrymercy closed this 6 days ago

merrymercy commented 6 days ago

Usage (latency test):

$ python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-8B-Instruct --load-format dummy --tp 2

Prefill. latency:  5.021 ms, throughput:    203.93 token/s
Decode.  latency:  0.331 ms, throughput:      3.02 token/s
Decode.  latency:  0.010 ms, throughput:    102.95 token/s
Decode.  latency:  0.009 ms, throughput:    107.55 token/s
Decode.  latency:  0.009 ms, throughput:    107.70 token/s
Prefill. latency:  0.019 ms, throughput:  52553.25 token/s
Decode.  latency:  0.009 ms, throughput:    108.23 token/s
Decode.  latency:  0.012 ms, throughput:     84.28 token/s
Decode.  latency:  0.009 ms, throughput:    108.52 token/s
Decode.  latency:  0.009 ms, throughput:    108.31 token/s
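
The report above follows the standard offline-benchmark pattern: time one prefill pass over the full prompt, then time each single-token decode step. A minimal sketch of that timing loop, with a stand-in `forward` function (the real benchmark drives the SGLang model runner on dummy weights instead):

```python
import time

def forward(num_tokens):
    # Stand-in for a model forward pass; the real benchmark runs
    # the SGLang model runner, so these timings are illustrative only.
    time.sleep(0.0001 * num_tokens)

def bench(prompt_len, decode_steps):
    # Prefill: one forward pass over the whole prompt.
    t0 = time.perf_counter()
    forward(prompt_len)
    prefill = time.perf_counter() - t0
    print(f"Prefill. latency: {prefill * 1e3:7.3f} ms, "
          f"throughput: {prompt_len / prefill:10.2f} token/s")
    # Decode: one forward pass per generated token.
    steps = []
    for _ in range(decode_steps):
        t0 = time.perf_counter()
        forward(1)
        steps.append(time.perf_counter() - t0)
        print(f"Decode.  latency: {steps[-1] * 1e3:7.3f} ms, "
              f"throughput: {1 / steps[-1]:10.2f} token/s")
    return prefill, steps

prefill, steps = bench(prompt_len=128, decode_steps=4)
```

Note the second `Prefill.` line in the output above: a second request arrives after the first few decode steps and is much faster because its KV cache work overlaps with an already-warm runner.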

Usage (correctness test):

$ python -m sglang.bench_latency --model-path TinyLlama/TinyLlama-1.1B-Chat-v0.4 --correct

prefill logits (first half) tensor([[-10.0312,  -9.5000,   0.8936,  ...,  -4.9414,  -3.2402,  -3.3633],
        [-10.0312,  -9.5000,   0.8936,  ...,  -4.9414,  -3.2402,  -3.3633],
        [ -9.1875, -10.2500,   2.7109,  ...,  -4.3359,  -4.0664,  -4.1328]],
       device='cuda:0', dtype=torch.float16)
prefill logits (final) tensor([[-8.3203, -7.1211,  3.3379,  ..., -4.9570, -4.1328, -3.4141],
        [-8.9062, -9.0156,  4.1445,  ..., -4.9922, -4.4961, -4.0742],
        [-9.6328, -9.0547,  4.0117,  ..., -5.3047, -4.7148, -4.4609]],
       device='cuda:0', dtype=torch.float16)
<s> The capital of France is.
The capital of the United States
<s> The capital of the United Kindom is.
The capital of the United Kingdom
<s> Today is a sunny day and I like go for a walk in the park.
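
The `--correct` mode prints the prefill logits twice (computed over a batch split in half, then over the full batch) so the two can be checked for agreement, and then decodes a few tokens from fixed prompts. A check of this kind can be automated by comparing logit rows within a float16 tolerance; a minimal sketch with hypothetical values (not taken from the run above):

```python
def max_abs_diff(a, b):
    # Largest elementwise absolute difference between two logit rows.
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical fp16 logit rows for the same token from two batchings;
# small differences are expected from reduced-precision kernels.
row_split = [-10.0312, -9.5000, 0.8936, -4.9414, -3.2402, -3.3633]
row_full  = [-10.0312, -9.5078, 0.8926, -4.9414, -3.2383, -3.3633]

diff = max_abs_diff(row_split, row_full)
print(f"max abs diff: {diff:.4f}")
assert diff < 1e-1  # loose tolerance appropriate for float16
```

If the two batchings produced meaningfully different logits, or the decoded continuations diverged from the expected completions, that would flag a correctness bug in the batching or attention path.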