sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

[Bug] OOM for concurrent long requests #1030

Closed: hahmad2008 closed this issue 1 month ago

hahmad2008 commented 2 months ago


Describe the bug

I am trying to benchmark inference of Llama-3-8B with long requests. I send 20 concurrent requests, each about 1k tokens long, with stream set to True and max_tokens set to 1024.

This is how I start the server: python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 8000 --context-length 4096 --dtype bfloat16 --chat-template llama-3

I added the llama-3 template to conversation.py, mirroring its entry in FastChat's conversation file.

Note: when I send the same workload to a vLLM endpoint, it completes without an OOM error.
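
The benchmark client itself is not shown in the issue; below is a minimal sketch of the kind of load described above, assuming the OpenAI-compatible /v1/chat/completions endpoint on port 8000 and the openai Python client from the environment dump. The prompt construction and request count are illustrative, not the original script.

    # Hypothetical load generator approximating the reported workload:
    # 20 concurrent streaming chat requests, ~1k-token prompts, max_tokens=1024.
    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

    async def one_request(prompt: str) -> None:
        stream = await client.chat.completions.create(
            model="NousResearch/Meta-Llama-3-8B-Instruct",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1024,
            stream=True,
        )
        async for _chunk in stream:
            pass  # consume the streamed tokens

    async def main() -> None:
        prompt = "word " * 1000  # roughly 1k tokens
        await asyncio.gather(*(one_request(prompt) for _ in range(20)))

    asyncio.run(main())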

Error:

INFO:   - "POST /v1/chat/completions HTTP/1.1" 200 OK
[gpu=0] Prefill batch. #new-seq: 17, #new-token: 13938, #cached-token: 3260, cache hit rate: 17.86%, #running-req: 1, #queue-req: 2
Exception in ModelTpServer:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 219, in exposed_step
    self.forward_step()
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 235, in forward_step
    self.forward_prefill_batch(new_batch)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 545, in forward_prefill_batch
    output = self.model_runner.forward(batch, ForwardMode.EXTEND)
  File "/home/ubuntu/sglang/python/sglang/srt/model_executor/model_runner.py", line 388, in forward
    return self.forward_extend(batch)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/model_executor/model_runner.py", line 356, in forward_extend
    return self.model.forward(
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/models/llama2.py", line 314, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/models/llama2.py", line 281, in forward
    hidden_states, residual = layer(
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/models/llama2.py", line 239, in forward
    hidden_states = self.mlp(hidden_states)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/models/llama2.py", line 79, in forward
    gate_up, _ = self.gate_up_proj(x)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/vllm/model_executor/layers/linear.py", line 330, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/vllm/model_executor/layers/linear.py", line 122, in apply
    return F.linear(x, layer.weight, bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 764.00 MiB. GPU 

Reproduction

Same as in the "Describe the bug" section above.

Environment

$ python -m sglang.check_env

Python: 3.9.19 (main, May  6 2024, 19:43:03) [GCC 11.2.0]
CUDA available: True
GPU 0: NVIDIA A10G
CUDA_HOME: None
PyTorch: 2.3.1+cu121
sglang: 0.2.9.post1
flashinfer: 0.1.3+cu121torch2.3
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.0
fastapi: 0.112.0
hf_transfer: 0.1.8
huggingface_hub: 0.24.5
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.5
uvloop: 0.19.0
zmq: 26.1.0
vllm: 0.5.3.post1
multipart: 0.0.9
openai: 1.38.0
anthropic: 0.32.0
NVIDIA Topology: 
    GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  0-15    0       N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1024
merrymercy commented 2 months ago

Thanks for reporting the errors.

  1. You do not need to specify the chat template. The model you are using already ships with one: https://huggingface.co/NousResearch/Meta-Llama-3-8B-Instruct/blob/main/tokenizer_config.json#L2053
  2. Could you try adding --max-prefill 8192 when you launch the server? i.e.
    python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 8000 --context-length 4096 --dtype bfloat16 --max-prefill 8192

If it still OOMs, try adding --max-prefill 4096 or --mem-fraction-static 0.83 instead.
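
Spelled out, the fallback launch line looks like this (a sketch assembled from the commands earlier in the thread; the --mem-fraction-static variant from the same comment is noted as an alternative):

    # Fallback suggested above: cap the prefill batch at 4096 tokens.
    # Alternative from the same comment: keep the original prefill cap and
    # pass --mem-fraction-static 0.83 instead to shrink the static memory pool.
    python -m sglang.launch_server \
      --model-path NousResearch/Meta-Llama-3-8B-Instruct \
      --host 0.0.0.0 --port 8000 \
      --context-length 4096 --dtype bfloat16 \
      --max-prefill 4096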

hahmad2008 commented 2 months ago

@merrymercy thanks for the prompt response. It works with --max-prefill 4096. By the way, is the backend vLLM? What are the available backends?

For the tokenizer: what should the behavior be if I don't pass --chat-template? The model is loaded without a chat template, as shown below (chat_template=None):

server_args=ServerArgs(model_path='NousResearch/Meta-Llama-3-8B-Instruct', tokenizer_path='NousResearch/Meta-Llama-3-8B-Instruct', tokenizer_mode='auto', load_format='auto', dtype='bfloat16', trust_remote_code=False, context_length=4096, quantization=None, served_model_name='NousResearch/Meta-Llama-3-8B-Instruct', chat_template=None, host='0.0.0.0', port=8000, additional_ports=[8001, 8002, 8003, 8004], mem_fraction_static=0.88, max_prefill_tokens=4096, max_running_requests=None, max_num_reqs=None, max_total_tokens=None, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=746231706, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key='', file_storage_pth='SGlang_storage', dp_size=1, load_balance_method='round_robin', chunked_prefill_size=None, disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_disk_cache=False, enable_torch_compile=False, enable_p2p_check=False, attention_reduce_in_fp32=False, efficient_weight_load=False, nccl_init_addr=None, nnodes=1, node_rank=None)
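
On the chat-template question: the earlier reply points out that the template ships with the model's tokenizer_config.json, so chat_template=None on the server side does not mean no template is available. A quick local check with transformers (an illustrative sketch, not part of the thread):

    # Sketch: confirm a chat template is bundled with the HF tokenizer.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3-8B-Instruct")
    print(tok.chat_template is not None)  # True if a template is bundled
    print(tok.apply_chat_template(
        [{"role": "user", "content": "Hello"}],
        tokenize=False,
        add_generation_prompt=True,
    ))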
hahmad2008 commented 2 months ago

By the way, are there any other factors influencing concurrent requests that I should check?

merrymercy commented 2 months ago
qeternity commented 2 months ago

There is definitely a significant regression around OOM in recent versions. Workloads and configurations that previously ran at high concurrency for long periods without issue are now OOM'ing, even with a reduced static memory pool fraction and lowered concurrency.

smallblue12138 commented 2 months ago

> There is definitely a significant regression around OOM in recent versions. Workloads and configurations that previously ran at high concurrency for long periods without issue are now OOM'ing, even with a reduced static memory pool fraction and lowered concurrency.

Hello, is there any solution for this yet?

qeternity commented 2 months ago

This commit seems to have improved things (too early to say if resolved, but early testing is working better): https://github.com/sgl-project/sglang/commit/df191254abc002b3284560d9c4b94214a4656265

merrymercy commented 1 month ago

This should be fixed in the latest version. See also https://github.com/sgl-project/sglang/blob/39bb49d156f2319d2aec67c458c2db980bb0f4c3/README.md?plain=1#L221-L224

https://github.com/sgl-project/sglang/blob/39bb49d156f2319d2aec67c458c2db980bb0f4c3/docs/en/hyperparameter_tuning.md?plain=1#L28-L32
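
The linked tuning notes are not reproduced here, but the relevant knobs also appear as fields in the ServerArgs dump earlier in this thread (mem_fraction_static, max_running_requests, chunked_prefill_size). An illustrative combination for reducing memory pressure follows; the flag spellings are inferred from those field names and the values are placeholders, not maintainer recommendations.

    # Illustrative only: lower the KV-cache pool fraction, cap concurrently
    # running requests, and chunk long prefills. Values are placeholders.
    python -m sglang.launch_server \
      --model-path NousResearch/Meta-Llama-3-8B-Instruct \
      --context-length 4096 --dtype bfloat16 \
      --mem-fraction-static 0.83 \
      --max-running-requests 16 \
      --chunked-prefill-size 2048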