Open · josephydu opened this issue 1 month ago
cc @yzh119 @zhyncs
We will take a look soon. In the meantime, you can try to increase this value: https://github.com/sgl-project/sglang/blob/c33d82a2111434f02159cd97e02f3cb6657595a4/python/sglang/global_config.py#L26
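For anyone hitting this before a proper fix lands, here is a minimal sketch of what that edit might look like, assuming the config still matches the linked `global_config.py` (the exact line number and surrounding fields may differ between commits):

```python
# python/sglang/global_config.py (sketch of the suggested edit, not the full class)
class GlobalConfig:
    def __init__(self):
        # ... other fields ...

        # Scratch buffer handed to the FlashInfer batch prefill/decode wrappers.
        # The snippet quoted later in this thread shows a default of
        # 384 * 1024 * 1024 (384 MiB); doubling it is the workaround that
        # reportedly makes the benchmark pass.
        self.flashinfer_workspace_size = 384 * 1024 * 1024 * 2  # 768 MiB
```

The server has to be restarted after the edit, since the workspace buffer is allocated at startup.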
Ok I’ll take a look asap
@josephydu Can you try it again with sglang v0.3.1.post3?
I run the same command on 8xH100 and did not find any issues.
Same here. I am using 2x A100, sglang v0.3.1.post3, with CUDA graph disabled.
I still get the problem on 8x A100. But when I increase `flashinfer_workspace_size` to 384 * 1024 * 1024 * 2, it works.
However, I still don't understand why the default value of `flashinfer_workspace_size` only needed to be 192 * 1024 * 1024 in the old version, but in the new version it needs to be 384 * 1024 * 1024.
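For reference, a quick conversion of the three values being compared here (nothing sglang-specific, just unit arithmetic):

```python
# Convert the workspace sizes mentioned above from bytes to MiB.
MiB = 1024 * 1024

sizes = {
    "old default":   192 * MiB,      # default in the older version
    "new default":   384 * MiB,      # default in the newer version
    "working value": 384 * MiB * 2,  # value that made the benchmark pass
}

for name, val in sizes.items():
    print(f"{name}: {val} bytes = {val // MiB} MiB")
# old default: 201326592 bytes = 192 MiB
# new default: 402653184 bytes = 384 MiB
# working value: 805306368 bytes = 768 MiB
```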
@merrymercy I am also getting the same issue when running Llama 405B FP8 from neuralmagic on 8x H100s.
This is how I launch the server: python3 -m sglang.launch_server --model /models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8 --disable-radix
and this is the error I get:
"RuntimeError: Failed to allocate memory for batch_prefill_tmp_v with size 550502400 and alignment 16 in AlignedAllocator"
I get the same error with the following command variations as well
python3 -m sglang.launch_server --model /models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8 --disable-radix --disable-mla
python3 -m sglang.launch_server --model /models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8 --disable-radix --disable-mla --disable-cuda-graph
python3 -m sglang.launch_server --model /projects/xlab/ZLM/models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8 --disable-radix --mem-fraction-static 0.7
I checked and this problem does not happen in 0.2.7, but from 0.2.14 onwards it does. I am not sure about the versions in between 0.2.7 and 0.2.14.
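The number in that error lines up with the workspace limit discussed above. A small sketch of the comparison; the only assumption is that `batch_prefill_tmp_v` is carved out of the `flashinfer_workspace_size` buffer, which is what the AlignedAllocator message suggests:

```python
# Compare the failed allocation from the RuntimeError with the
# flashinfer_workspace_size defaults mentioned in this thread.
MiB = 1024 * 1024

requested   = 550502400   # batch_prefill_tmp_v size from the error message
old_default = 192 * MiB   # 201326592 bytes
new_default = 384 * MiB   # 402653184 bytes

print(requested / MiB)          # 525.0 MiB
print(requested > old_default)  # True
print(requested > new_default)  # True -> the request cannot fit in the
                                #         default workspace, hence the failure
```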
Maybe you can try to increase `flashinfer_workspace_size`. It can temporarily work around the problem, but the root cause is still unknown.
sglang/python/sglang/global_config.py
self.flashinfer_workspace_size = 384 * 1024 * 1024
@josephydu I think I found a pattern, which may help you in debugging this.
On 0.3.0 and up, if I remove `--disable-radix-cache`, I do not get the error.
I.e., if I run this:
python3 -m sglang.launch_server --model /projects/xlab/ZLM/models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8
instead of
python3 -m sglang.launch_server --model /projects/xlab/ZLM/models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8 --disable-radix-cache
Changing the size of `flashinfer_workspace_size` gave me a different issue.
Checklist
Describe the bug
I ran the same benchmark script on the following two commits: old: cb99ba4fc6194e4feffa0fbb22223ab0119e5e36, new: c33d82a2111434f02159cd97e02f3cb6657595a4. It failed on the new commit but succeeded on the old commit. I got the following error output:
Reproduction
server: python3 -m sglang.launch_server --model-path Qwen/Qwen2-7B --host 127.0.0.1 --port 8080 --mem-fraction-static 0.8 --dp-size 2 --load-balance-method round_robin
benchmark: python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8080 --dataset-name random --tokenizer Qwen/Qwen2-7B --model Qwen/Qwen2-7B --random-output-len 1024 --random-input-len 4096 --random-range-ratio 0.5 --seed 1234 --request-rate 15.7 --num-prompts 200
Environment
I ran the script on 8x A100 40G GPUs.