sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sglang.readthedocs.io/en/latest/

[Bug] SGLang v0.3 breaks TP8 Llama 3.1 405B FP8 on 8xH100 #1362

Open · jischein opened this issue 1 week ago

jischein commented 1 week ago

Describe the bug

Running SGLang v0.3 with the following parameters

[17:25:03] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', tokenizer_path='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', kv_cache_dtype='auto', trust_remote_code=False, context_length=None, quantization=None, served_model_name='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', chat_template=None, is_embedding=False, host='127.0.0.1', port=8001, additional_ports=[8101, 8201, 8301, 8401], mem_fraction_static=0.9, max_running_requests=None, max_num_reqs=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=8, stream_interval=1, random_seed=662910014, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, enable_mixed_chunk=False, enable_torch_compile=False, enable_p2p_check=False, enable_mla=False, triton_attention_reduce_in_fp32=False, nccl_init_addr=None, nnodes=1, node_rank=None)

leads to the following exception:

  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 180, in capture
    self.register_graph_buffers()
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 222, in register_graph_buffers
    handles, offsets = self._gather_ipc_meta((bytes(handle), offset))
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 200, in _gather_ipc_meta
    dist.broadcast_object_list(all_data[i],
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2901, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.124.52]:58593

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 885, in run_tp_server
    model_server = ModelTpServer(
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 99, in __init__
    self.model_runner = ModelRunner(
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 118, in __init__
    self.init_cuda_graphs()
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 513, in init_cuda_graphs
    raise Exception(
Exception: Capture cuda graph failed: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.124.52]:58593
Possible solutions:
1. disable cuda graph by --disable-cuda-graph
2. set --mem-fraction-static to a smaller value
3. disable torch compile by not using --enable-torch-compile
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose

If I revert the package back to v0.2.13 with identical ServerArgs, the server starts as usual.
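For reference, the two mitigations suggested in the exception message can also be scripted. The sketch below is not from the original report; the wrapper function and its default values are illustrative, and only the CLI flags themselves appear in this issue. It simply shells out to sglang.launch_server with CUDA graphs disabled or a lower static memory fraction.

```python
# Illustrative helper (not part of the original report): relaunch the server
# with the mitigations suggested in the exception message. Only the CLI flags
# come from this issue; the wrapper and defaults are assumptions.
import subprocess

def launch_with_mitigations(disable_cuda_graph: bool = True,
                            mem_fraction_static: float = 0.87) -> subprocess.Popen:
    cmd = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
        "--host", "127.0.0.1", "--port", "8001",
        "--tp", "8",
        "--chunked-prefill-size", "8192",
        "--mem-fraction-static", str(mem_fraction_static),  # leave headroom for CUDA graphs
    ]
    if disable_cuda_graph:
        cmd.append("--disable-cuda-graph")  # skip CUDA graph capture entirely
    return subprocess.Popen(cmd)

if __name__ == "__main__":
    launch_with_mitigations().wait()
```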

Reproduction

python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --port 8001 --host 127.0.0.1 --tp 8 --mem-fraction-static 0.9 --chunked-prefill-size 8192 --additional-ports 8101 8201 8301 8401

(This command produces the ServerArgs shown above.)

Environment

See the python3 -m sglang.check_env output posted below.

zhyncs commented 1 week ago

I will take a look at this issue asap.

zhyncs commented 1 week ago
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-FP8 --tp 8 --disable-radix --mem-frac 0.87

It works well for me on 8xH100 SXM.

zhyncs commented 1 week ago

@jischein Could you provide the env info with python3 -m sglang.check_env?

zhyncs commented 1 week ago
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    16.0
Successful requests:                     1000
Benchmark duration (s):                  148.44
Total input tokens:                      224442
Total generated tokens:                  190594
Total generated tokens (retokenized):    188521
Request throughput (req/s):              6.74
Input token throughput (tok/s):          1512.00
Output token throughput (tok/s):         1283.98
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   39661.89
Median E2E Latency (ms):                 36512.17
---------------Time to First Token----------------
Mean TTFT (ms):                          3623.75
Median TTFT (ms):                        913.59
P99 TTFT (ms):                           20027.42
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          276.89
Median TPOT (ms):                        230.55
P99 TPOT (ms):                           773.36
---------------Inter-token Latency----------------
Mean ITL (ms):                           197.70
Median ITL (ms):                         112.90
P99 ITL (ms):                            1414.24
==================================================
merrymercy commented 1 week ago

@jischein You need to reduce --mem-fraction-static below 0.9. CUDA graphs take some memory, so we need to leave room for them.
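For context, the arithmetic behind this advice looks roughly like the sketch below; the 80 GiB figure matches the H100s in this environment, but the split itself is illustrative, not measured in this thread.

```python
# Rough budget sketch (illustrative, not measured in this thread).
# mem_fraction_static reserves a fixed share of GPU memory for weights plus
# the KV cache pool; whatever is left must cover CUDA graph capture,
# communication buffers, activations, and other runtime overhead.

GPU_MEM_GIB = 80.0  # NVIDIA H100 80GB HBM3, as in the environment above

def non_static_headroom(mem_fraction_static: float) -> float:
    """GiB left outside the static pool on a single GPU."""
    return GPU_MEM_GIB * (1.0 - mem_fraction_static)

for frac in (0.90, 0.87):
    print(f"--mem-fraction-static {frac:.2f} -> "
          f"{non_static_headroom(frac):.1f} GiB headroom per GPU")
# --mem-fraction-static 0.90 -> 8.0 GiB headroom per GPU
# --mem-fraction-static 0.87 -> 10.4 GiB headroom per GPU
```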

jischein commented 6 days ago
/opt/vllm-foundry/env/lib/python3.10/site-packages/pydantic/_internal/_config.py:341: UserWarning: Valid config keys have changed in V2:
* 'underscore_attrs_are_private' has been removed
  warnings.warn(message, UserWarning)
Python: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.2, V12.2.91
CUDA Driver Version: 535.183.01
PyTorch: 2.4.0+cu121
sglang: 0.3.0
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.114.0
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.9.1
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.5
multipart: 0.0.9
openai: 1.44.1
anthropic: 0.34.2
NVIDIA Topology:
    GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  NV18    NV18    NV18    NV18    NV18    NV18    NV18    PHB PHB PHB PHB PHB PHB PHB PHB 0-191   0       N/A
GPU1    NV18     X  NV18    NV18    NV18    NV18    NV18    NV18    PHB PHB PHB PHB PHB PHB PHB PHB 0-191   0       N/A
GPU2    NV18    NV18     X  NV18    NV18    NV18    NV18    NV18    PHB PHB PHB PHB PHB PHB PHB PHB 0-191   0       N/A
GPU3    NV18    NV18    NV18     X  NV18    NV18    NV18    NV18    PHB PHB PHB PHB PHB PHB PHB PHB 0-191   0       N/A
GPU4    NV18    NV18    NV18    NV18     X  NV18    NV18    NV18    PHB PHB PHB PHB PHB PHB PHB PHB 0-191   0       N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X  NV18    NV18    PHB PHB PHB PHB PHB PHB PHB PHB 0-191   0       N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X  NV18    PHB PHB PHB PHB PHB PHB PHB PHB 0-191   0       N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X  PHB PHB PHB PHB PHB PHB PHB PHB 0-191   0       N/A
NIC0    PHB PHB PHB PHB PHB PHB PHB PHB  X  PHB PHB PHB PHB PHB PHB PHB
NIC1    PHB PHB PHB PHB PHB PHB PHB PHB PHB  X  PHB PHB PHB PHB PHB PHB
NIC2    PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB  X  PHB PHB PHB PHB PHB
NIC3    PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB  X  PHB PHB PHB PHB
NIC4    PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB  X  PHB PHB PHB
NIC5    PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB  X  PHB PHB
NIC6    PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB  X  PHB
NIC7    PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB  X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7

Hypervisor vendor: KVM
ulimit soft: 1024
jischein commented 6 days ago

@jischein You need to reduce --mem-fraction-static below 0.9. CUDA graphs take some memory, so we need to leave room for them.

@merrymercy, this was it! Setting --mem-fraction-static 0.87 worked. Thank you both.

Still worth noting: my server was starting fine with --mem-fraction-static 0.9 on version 0.2.13, so I imagine there is now increased memory overhead and/or a change in memory management that led to this. Nothing else changed in my environment or config except the version upgrade (and downgrading, while keeping --mem-fraction-static 0.9, fixed things).
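If anyone wants to quantify that overhead difference, one generic way (not something run in this thread) is to record device-level free memory once the server is up on each version and compare:

```python
# Generic check (not from this thread): run while the server is idle on
# v0.2.13 and again on v0.3.0, then compare how much free memory each
# version leaves on every GPU. cudaMemGetInfo is device-wide, so this can
# be run from a separate process.
import torch

def report_free_memory() -> None:
    for dev in range(torch.cuda.device_count()):
        free_b, total_b = torch.cuda.mem_get_info(dev)
        print(f"cuda:{dev}: {free_b / 2**30:.1f} GiB free of {total_b / 2**30:.1f} GiB")

if __name__ == "__main__":
    report_free_memory()
```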