Closed: zhyncs closed this 1 month ago
Update: it does still run; it's just that the speed is incredibly slow.
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 1000
Benchmark duration (s): 1472.60
Total input tokens: 215196
Total generated tokens: 198343
Total generated tokens (retokenized): 197285
Request throughput (req/s): 0.68
Input token throughput (tok/s): 146.13
Output token throughput (tok/s): 134.69
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 1086277.13
Median E2E Latency (ms): 1086581.10
---------------Time to First Token----------------
Mean TTFT (ms): 463420.11
Median TTFT (ms): 490054.93
P99 TTFT (ms): 763179.96
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 10909.16
Median TPOT (ms): 3128.24
P99 TPOT (ms): 120089.61
---------------Inter-token Latency----------------
Mean ITL (ms): 3173.50
Median ITL (ms): 1764.77
P99 ITL (ms): 3636.58
==================================================
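As a sanity check, the headline throughput figures follow directly from the totals in the table above (a minimal recomputation, using only numbers from the report):

```python
# Recompute the benchmark's throughput numbers from its reported totals.
duration_s = 1472.60      # Benchmark duration (s)
num_requests = 1000       # Successful requests
input_tokens = 215196     # Total input tokens
output_tokens = 198343    # Total generated tokens

req_per_s = num_requests / duration_s      # reported: 0.68 req/s
in_tok_per_s = input_tokens / duration_s   # reported: 146.13 tok/s
out_tok_per_s = output_tokens / duration_s # reported: 134.69 tok/s

print(round(req_per_s, 2), round(in_tok_per_s, 2), round(out_tok_per_s, 2))
```

The derived values match the report, so the numbers are internally consistent; the problem is the absolute magnitude (mean E2E latency over 18 minutes per request), not a reporting bug.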
It's possible that the server is running under a hypervisor, which makes the links between the GPUs very slow and seriously hurts performance. I ran into a similar situation with multiple L40S + vLLM on a server using KVM. Hope this helps.
@billvsme Thanks for the info. What parameters did you adjust to solve this problem?
Switch to a machine that doesn't use KVM and the speed will be normal.
Note: when I changed machines, the GPU interconnect type went from PHB to SYS.
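The PHB/SYS labels above come from the GPU topology matrix printed by `nvidia-smi topo -m`. A small sketch of how one might parse that matrix and flag link types that traverse the host bridge or SMP interconnect (the sample matrix and the `slow_pairs` helper are illustrative, not taken from the reporter's machine):

```python
# Sketch: flag GPU-GPU links in `nvidia-smi topo -m` style output that
# traverse a PCIe host bridge (PHB), a NUMA node (NODE), or the SMP
# interconnect (SYS) -- the link types that tend to bottleneck tensor
# parallelism. Sample matrix below is illustrative only.
SAMPLE_TOPO = """\
\tGPU0\tGPU1
GPU0\tX\tSYS
GPU1\tSYS\tX
"""

# Roughly fastest to slowest per the nvidia-smi legend:
# NV# (NVLink) > PIX > PXB > PHB > NODE > SYS
SLOW_LINKS = {"PHB", "NODE", "SYS"}

def slow_pairs(topo: str):
    """Return (row_gpu, col_gpu, link_type) tuples for slow links."""
    lines = [line for line in topo.splitlines() if line.strip()]
    headers = lines[0].split()          # column GPU names
    pairs = []
    for line in lines[1:]:
        cells = line.split()
        row, links = cells[0], cells[1:]
        for col, link in zip(headers, links):
            if link in SLOW_LINKS:
                pairs.append((row, col, link))
    return pairs

print(slow_pairs(SAMPLE_TOPO))
# -> [('GPU0', 'GPU1', 'SYS'), ('GPU1', 'GPU0', 'SYS')]
```

On a KVM guest the virtualized PCIe topology reported here may also understate how slow the links really are, so comparing actual bandwidth (e.g. with NCCL's bandwidth tests) on the two machines is the more direct check.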
Thanks!
Checklist
Describe the bug
When benchmarking with ShareGPT at 1k requests, no results can be obtained.
With 10 requests it runs normally, but after changing to 1000 it keeps getting stuck.
Reproduction
Environment