sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

[Bug] RuntimeError: Failed to allocate memory for batch_prefill_tmp_v with size 458752000 and alignment 16 in AlignedAllocator #1405

Open josephydu opened 1 month ago

josephydu commented 1 month ago

Describe the bug

I ran the same benchmark script on the following two commits:

old: cb99ba4fc6194e4feffa0fbb22223ab0119e5e36
new: c33d82a2111434f02159cd97e02f3cb6657595a4

It failed on the new commit but succeeded on the old one. I get the following error output:

[error screenshot]

Reproduction

server:
python3 -m sglang.launch_server --model-path Qwen/Qwen2-7B --host 127.0.0.1 --port 8080 --mem-fraction-static 0.8 --dp-size 2 --load-balance-method round_robin

benchmark:
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8080 --dataset-name random --tokenizer Qwen/Qwen2-7B --model Qwen/Qwen2-7B --random-output-len 1024 --random-input-len 4096 --random-range-ratio 0.5 --seed 1234 --request-rate 15.7 --num-prompts 200

Environment

I ran the script on 8x A100 40GB GPUs.

merrymercy commented 1 month ago

cc @yzh119 @zhyncs

merrymercy commented 1 month ago

We will take a look soon. In the meantime, you can try to increase this value: https://github.com/sgl-project/sglang/blob/c33d82a2111434f02159cd97e02f3cb6657595a4/python/sglang/global_config.py#L26
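The linked value is the FlashInfer workspace size set in GlobalConfig. A rough sketch of the edit (not the exact file contents at that commit; the numbers are illustrative):

```python
# python/sglang/global_config.py -- sketch only, surrounding fields omitted.
class GlobalConfig:
    def __init__(self):
        # Workspace buffer used by the FlashInfer attention wrappers.
        # The default at this commit is 192 * 1024 * 1024; increasing it
        # (e.g. doubling) is the workaround discussed later in this thread.
        self.flashinfer_workspace_size = 384 * 1024 * 1024
```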

zhyncs commented 1 month ago

Ok I’ll take a look asap

merrymercy commented 1 month ago

@josephydu Can you try it again with sglang v0.3.1.post3?

I run the same command on 8xH100 and did not find any issues.

York-Cheung commented 1 month ago

Same here. I'm using 2x A100 with sglang v0.3.1.post3 and CUDA graph disabled.

josephydu commented 1 month ago

> @josephydu Can you try it again with sglang v0.3.1.post3?
>
> I run the same command on 8xH100 and did not find any issues.

I still got the problem on 8x A100. But when I increase flashinfer_workspace_size to 384 * 1024 * 1024 * 2, it works. However, I still don't understand why the default flashinfer_workspace_size only needed to be 192 * 1024 * 1024 in the old version, while the new version needs 384 * 1024 * 1024.
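For scale, here is plain arithmetic on the sizes already reported in this thread (the variable names are just labels for this sketch):

```python
# Compare the failed allocation with the workspace sizes mentioned above.
requested_tmp_v = 458_752_000           # batch_prefill_tmp_v request from the error, in bytes
old_default_ws = 192 * 1024 * 1024      # old default flashinfer_workspace_size
doubled_new_ws = 384 * 1024 * 1024 * 2  # the value that worked for me

print(requested_tmp_v / 2**20)  # 437.5 MiB -- already larger than the 192 MiB default by itself
print(old_default_ws / 2**20)   # 192.0 MiB
print(doubled_new_ws / 2**20)   # 768.0 MiB
```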

dmakhervaks commented 1 month ago

@merrymercy I am also getting the same issue when running Llama 405B FP8 from neuralmagic on 8x H100s.

This is how I launch the server:

python3 -m sglang.launch_server --model /models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8 --disable-radix

and this is the error I get:

"RuntimeError: Failed to allocate memory for batch_prefill_tmp_v with size 550502400 and alignment 16 in AlignedAllocator"

I get the same error with the following command variations as well:

python3 -m sglang.launch_server --model /models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8 --disable-radix --disable-mla

python3 -m sglang.launch_server --model /models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8 --disable-radix --disable-mla --disable-cuda-graph

python3 -m sglang.launch_server --model /projects/xlab/ZLM/models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8 --disable-radix --mem-fraction-static 0.7

I checked and this problem does not happen in 0.2.7, but it does from 0.2.14 onwards.

I'm not sure about the versions in between 0.2.7 and 0.2.14.

josephydu commented 1 month ago

> @merrymercy I am also getting the same issue when running Llama 405B FP8 from neuralmagic on 8x H100s. [...]

Maybe you can try to increase flashinfer_workspace_size. It can temporarily work around the problem, but the root cause is still unknown. In sglang/python/sglang/global_config.py:

self.flashinfer_workspace_size = 384 * 1024 * 1024
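If you drive sglang from Python rather than the launch_server CLI, the same workaround can be applied at runtime. A minimal sketch, assuming global_config.py exposes a module-level global_config instance as in the file linked earlier in this thread:

```python
# Sketch: apply the workaround without editing the installed package.
# Assumes python/sglang/global_config.py exposes a module-level
# `global_config` object; adjust if the attribute lives elsewhere.
from sglang.global_config import global_config

# Raise the FlashInfer workspace before any runtime/server objects are built.
global_config.flashinfer_workspace_size = 384 * 1024 * 1024
```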

dmakhervaks commented 1 month ago

@josephydu I think I found a pattern, which may help you in debugging this.

On 0.3.0 and up, if I remove "--disable-radix-cache", I do not get the error.

i.e. if I run this:

python3 -m sglang.launch_server --model /projects/xlab/ZLM/models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8

instead of

python3 -m sglang.launch_server --model /projects/xlab/ZLM/models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8 --disable-radix-cache

Changing the size of flashinfer_workspace_size gave me a different issue.