sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

[Bug] Llama3 70B A100 PCIE TP4 slow speed #1137

Closed: zhyncs closed this issue 1 month ago

zhyncs commented 2 months ago


Describe the bug

When benchmarking with 1k ShareGPT prompts, no results come back.

With 10 prompts everything runs normally, but after changing to 1000 the benchmark appears to get stuck.

Initial test run completed. Starting main benchmark run...
  0%|                                              | 0/1000 [00:00<?, ?it/s]
```
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     10
Benchmark duration (s):                  42.87
Total input tokens:                      1369
Total generated tokens:                  2278
Total generated tokens (retokenized):    2268
Request throughput (req/s):              0.23
Input token throughput (tok/s):          31.93
Output token throughput (tok/s):         53.14
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   16760.78
Median E2E Latency (ms):                 11625.66
---------------Time to First Token----------------
Mean TTFT (ms):                          4175.83
Median TTFT (ms):                        4582.61
P99 TTFT (ms):                           4774.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          66.01
Median TPOT (ms):                        64.35
P99 TPOT (ms):                           100.07
---------------Inter-token Latency----------------
Mean ITL (ms):                           55.78
Median ITL (ms):                         50.80
P99 ITL (ms):                            106.39
==================================================
```

Reproduction

# server
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-70B-Instruct --tp 4 --disable-radix-cache --enable-p2p-check

# client
python -m sglang.bench_serving --backend sglang --num-prompts 10

python -m sglang.bench_serving --backend sglang --num-prompts 1000

Environment

Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA A100 80GB PCIe
GPU 0,1,2,3 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 545.23.08
PyTorch: 2.4.0+cu121
sglang: 0.2.13
flashinfer: 0.1.5+cu121torch2.4
triton: 3.0.0
transformers: 4.44.0
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.3
aiohttp: 3.10.3
fastapi: 0.112.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.5
interegular: 0.3.3
packaging: 23.2
PIL: 10.2.0
psutil: 5.9.8
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 24.0.1
vllm: 0.5.4
multipart: 0.0.9
openai: 1.41.0
anthropic: 0.34.0
NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     PHB     PHB     0-251   0               N/A
GPU1    PHB      X      PHB     PHB     0-251   0               N/A
GPU2    PHB     PHB      X      PHB     0-251   0               N/A
GPU3    PHB     PHB     PHB      X      0-251   0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1048576
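
Side note: the matrix above shows only PHB links and no NVLink, so all tensor-parallel traffic has to cross a PCIe host bridge. Below is a minimal sketch for spot-checking raw device-to-device copy bandwidth, using only the PyTorch build listed above; the helper is illustrative and not part of sglang.

```python
# p2p_bandwidth_check.py -- illustrative helper, not part of sglang
import time
import torch

def copy_bandwidth_gib_s(src: int, dst: int, size_mib: int = 256, iters: int = 20) -> float:
    """Time repeated device-to-device copies and return an approximate GiB/s figure."""
    buf = torch.empty(size_mib * 1024 * 1024, dtype=torch.uint8, device=f"cuda:{src}")
    torch.cuda.synchronize(src)
    start = time.perf_counter()
    for _ in range(iters):
        _ = buf.to(f"cuda:{dst}")
    torch.cuda.synchronize(dst)
    elapsed = time.perf_counter() - start
    return size_mib * iters / 1024 / elapsed

if __name__ == "__main__":
    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                p2p = torch.cuda.can_device_access_peer(i, j)
                print(f"GPU{i} -> GPU{j}: P2P={p2p}, ~{copy_bandwidth_gib_s(i, j):.1f} GiB/s")
```

If the reported numbers are far below what the PCIe link should deliver, the interconnect (rather than sglang) is the likely bottleneck.
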
zhyncs commented 2 months ago

Update: it is not actually stuck; it can still run to completion, it's just that the speed is incredibly slow.

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     1000
Benchmark duration (s):                  1472.60
Total input tokens:                      215196
Total generated tokens:                  198343
Total generated tokens (retokenized):    197285
Request throughput (req/s):              0.68
Input token throughput (tok/s):          146.13
Output token throughput (tok/s):         134.69
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1086277.13
Median E2E Latency (ms):                 1086581.10
---------------Time to First Token----------------
Mean TTFT (ms):                          463420.11
Median TTFT (ms):                        490054.93
P99 TTFT (ms):                           763179.96
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10909.16
Median TPOT (ms):                        3128.24
P99 TPOT (ms):                           120089.61
---------------Inter-token Latency----------------
Mean ITL (ms):                           3173.50
Median ITL (ms):                         1764.77
P99 ITL (ms):                            3636.58
==================================================
billvsme commented 2 months ago

It's possible that the server is running under some kind of hypervisor, which makes the links between the GPUs very slow and can seriously hurt performance. I ran into a similar situation with multiple L40S GPUs and vLLM on a server that uses KVM. Hope this helps.
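
A quick way to check this hypothesis on your own box is to look at the standard Linux virtualization hints before digging into sglang itself. A rough sketch (the paths are standard Linux interfaces; nothing here is specific to sglang or vLLM):

```python
# virt_check.py -- rough heuristic for "is this host a VM?" (Linux only)
from pathlib import Path

def looks_virtualized() -> bool:
    """Check the 'hypervisor' CPU flag and the DMI vendor string for VM signatures."""
    flags = Path("/proc/cpuinfo").read_text()
    if "hypervisor" in flags:
        return True
    vendor = Path("/sys/class/dmi/id/sys_vendor")
    return vendor.exists() and any(
        name in vendor.read_text() for name in ("QEMU", "KVM", "VMware", "Xen", "Microsoft")
    )

if __name__ == "__main__":
    print("virtualized:", looks_virtualized())
```

On systemd machines, `systemd-detect-virt` gives the same answer in one command.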

zhyncs commented 2 months ago

@billvsme Thanks for the info. What parameters did you adjust to solve this problem?

billvsme commented 2 months ago

Switching to a machine that doesn't use KVM brings the speed back to normal.

Note: when I changed machines, the reported interconnect went from PHB to SYS.
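
For anyone comparing machines the same way: the topology matrix quoted in the environment section comes from `nvidia-smi topo -m`, so it is easy to re-check after moving hosts. A tiny illustrative wrapper (assumes nvidia-smi is on PATH):

```python
# topo_check.py -- print the GPU interconnect matrix (same command that produced the table above)
import subprocess

if __name__ == "__main__":
    result = subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True)
    print(result.stdout)
```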

zhyncs commented 2 months ago

Thanks!