jischein opened this issue 1 week ago
I will take a look at this issue asap.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-FP8 --tp 8 --disable-radix --mem-frac 0.87
It works well for me on 8xH100 SXM.
@jischein Could you provide the env info with python3 -m sglang.check_env?
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: 16.0
Successful requests: 1000
Benchmark duration (s): 148.44
Total input tokens: 224442
Total generated tokens: 190594
Total generated tokens (retokenized): 188521
Request throughput (req/s): 6.74
Input token throughput (tok/s): 1512.00
Output token throughput (tok/s): 1283.98
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 39661.89
Median E2E Latency (ms): 36512.17
---------------Time to First Token----------------
Mean TTFT (ms): 3623.75
Median TTFT (ms): 913.59
P99 TTFT (ms): 20027.42
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 276.89
Median TPOT (ms): 230.55
P99 TPOT (ms): 773.36
---------------Inter-token Latency----------------
Mean ITL (ms): 197.70
Median ITL (ms): 112.90
P99 ITL (ms): 1414.24
==================================================
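For reference, output in this format comes from sglang's serving benchmark script; a command along these lines should produce a comparable run (the exact flags are my assumption based on sglang.bench_serving, not quoted from this thread):

python3 -m sglang.bench_serving --backend sglang --num-prompts 1000 --request-rate 16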
@jischein You need to reduce --mem-fraction-static from 0.9. CUDA graphs take some memory, so we need to leave room for them.
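For example, either of the following should leave enough headroom for the CUDA graph buffers (0.87 is the value suggested above; --disable-cuda-graph instead skips graph capture entirely, and it appears as disable_cuda_graph in the ServerArgs dump later in this issue):

python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8 --mem-fraction-static 0.87
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8 --mem-fraction-static 0.9 --disable-cuda-graph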
/opt/vllm-foundry/env/lib/python3.10/site-packages/pydantic/_internal/_config.py:341: UserWarning: Valid config keys have changed in V2:
* 'underscore_attrs_are_private' has been removed
warnings.warn(message, UserWarning)
Python: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.2, V12.2.91
CUDA Driver Version: 535.183.01
PyTorch: 2.4.0+cu121
sglang: 0.3.0
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.114.0
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.9.1
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.5
multipart: 0.0.9
openai: 1.44.1
anthropic: 0.34.2
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PHB PHB PHB PHB PHB PHB PHB PHB 0-191 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 PHB PHB PHB PHB PHB PHB PHB PHB 0-191 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 PHB PHB PHB PHB PHB PHB PHB PHB 0-191 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 PHB PHB PHB PHB PHB PHB PHB PHB 0-191 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 PHB PHB PHB PHB PHB PHB PHB PHB 0-191 0 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 PHB PHB PHB PHB PHB PHB PHB PHB 0-191 0 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 PHB PHB PHB PHB PHB PHB PHB PHB 0-191 0 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X PHB PHB PHB PHB PHB PHB PHB PHB 0-191 0 N/A
NIC0 PHB PHB PHB PHB PHB PHB PHB PHB X PHB PHB PHB PHB PHB PHB PHB
NIC1 PHB PHB PHB PHB PHB PHB PHB PHB PHB X PHB PHB PHB PHB PHB PHB
NIC2 PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB X PHB PHB PHB PHB PHB
NIC3 PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB X PHB PHB PHB PHB
NIC4 PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB X PHB PHB PHB
NIC5 PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB X PHB PHB
NIC6 PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB X PHB
NIC7 PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
Hypervisor vendor: KVM
ulimit soft: 1024
@merrymercy this was it! Setting --mem-fraction-static 0.87 worked. Thank you both.
Still worth noting: my server started fine with --mem-fraction-static 0.9 on version 0.2.13. I imagine increased memory overhead and/or changes to memory management introduced since then led to this? Nothing else in my environment or config changed besides the version upgrade (and downgrading, while keeping --mem-fraction-static 0.9, fixed things).
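A rough back-of-the-envelope of why the old value could stop fitting; this is a sketch with assumed numbers, not measurements:

# Illustrative only: rough per-GPU memory budget on an 80 GB H100.
total_gb = 80.0
for frac in (0.90, 0.87):
    static_gb = frac * total_gb          # reserved for weights + KV-cache pool
    headroom_gb = total_gb - static_gb   # left for activations, NCCL buffers, CUDA graphs, ...
    print(f"mem-fraction-static={frac}: static={static_gb:.1f} GB, headroom={headroom_gb:.1f} GB")
# 0.90 leaves 8.0 GB of headroom; 0.87 leaves 10.4 GB. If a newer version's
# CUDA graphs need somewhat more memory than before, 8 GB can stop being
# enough while 10.4 GB still fits.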
Describe the bug
Running SGLang with the parameters below on the latest release leads to an exception at startup. If I revert the package to v0.2.13 with identical parameters set in ServerArgs, the server starts as usual.
Reproduction
server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', tokenizer_path='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', kv_cache_dtype='auto', trust_remote_code=False, context_length=None, quantization=None, served_model_name='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', chat_template=None, is_embedding=False, host='127.0.0.1', port=8001, additional_ports=[8101, 8201, 8301, 8401], mem_fraction_static=0.9, max_running_requests=None, max_num_reqs=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=8, stream_interval=1, random_seed=662910014, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, enable_mixed_chunk=False, enable_torch_compile=False, enable_p2p_check=False, enable_mla=False, triton_attention_reduce_in_fp32=False, nccl_init_addr=None, nnodes=1, node_rank=None)
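The revert to v0.2.13 mentioned in the description restored the old behavior; a sketch of that downgrade (package name and extras assumed from sglang's install instructions):

pip install "sglang[all]==0.2.13"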
Environment
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --port 8001 --host 127.0.0.1 --tp 8 --mem-fraction-static 0.9 --chunked-prefill-size 8192 --additional-ports 8101 8201 8301 8401