sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

[Bug] OOM for concurrent long requests #1030

Closed: hahmad2008 closed this issue 1 month ago

hahmad2008 commented 2 months ago


Describe the bug

I am trying to benchmark inference of Llama-3-8B with long requests. I send 20 concurrent requests, each about 1k tokens long, with stream set to True and max_tokens set to 1024.

This is how I start the server: python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 8000 --context-length 4096 --dtype bfloat16 --chat-template llama-3

I added the llama-3 template to conversation.py, mirroring its entry in FastChat's conversation file.

Note: when I send the same workload to a vLLM endpoint, it completes without an OOM error.
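
The benchmark client itself is not shown in the issue; below is a minimal sketch of the kind of load described above, assuming the OpenAI-compatible /v1/chat/completions endpoint on port 8000 and the openai Python client from the environment dump. The prompt construction and request count are illustrative, not the original script.

    # Hypothetical load generator approximating the reported workload:
    # 20 concurrent streaming chat requests, ~1k-token prompts, max_tokens=1024.
    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

    async def one_request(prompt: str) -> None:
        stream = await client.chat.completions.create(
            model="NousResearch/Meta-Llama-3-8B-Instruct",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1024,
            stream=True,
        )
        async for _chunk in stream:
            pass  # consume the streamed tokens

    async def main() -> None:
        prompt = "word " * 1000  # roughly 1k tokens
        await asyncio.gather(*(one_request(prompt) for _ in range(20)))

    asyncio.run(main())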

Error:

INFO:   - "POST /v1/chat/completions HTTP/1.1" 200 OK
[gpu=0] Prefill batch. #new-seq: 17, #new-token: 13938, #cached-token: 3260, cache hit rate: 17.86%, #running-req: 1, #queue-req: 2
Exception in ModelTpServer:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 219, in exposed_step
    self.forward_step()
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 235, in forward_step
    self.forward_prefill_batch(new_batch)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 545, in forward_prefill_batch
    output = self.model_runner.forward(batch, ForwardMode.EXTEND)
  File "/home/ubuntu/sglang/python/sglang/srt/model_executor/model_runner.py", line 388, in forward
    return self.forward_extend(batch)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/model_executor/model_runner.py", line 356, in forward_extend
    return self.model.forward(
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/models/llama2.py", line 314, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/models/llama2.py", line 281, in forward
    hidden_states, residual = layer(
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/models/llama2.py", line 239, in forward
    hidden_states = self.mlp(hidden_states)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/models/llama2.py", line 79, in forward
    gate_up, _ = self.gate_up_proj(x)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/vllm/model_executor/layers/linear.py", line 330, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
  File "/home/ubuntu/miniconda3/envs/sglang-env/lib/python3.9/site-packages/vllm/model_executor/layers/linear.py", line 122, in apply
    return F.linear(x, layer.weight, bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 764.00 MiB. GPU 

Reproduction

Same as in the "Describe the bug" section above.

Environment

$ python -m sglang.check_env

Python: 3.9.19 (main, May  6 2024, 19:43:03) [GCC 11.2.0]
CUDA available: True
GPU 0: NVIDIA A10G
CUDA_HOME: None
PyTorch: 2.3.1+cu121
sglang: 0.2.9.post1
flashinfer: 0.1.3+cu121torch2.3
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.0
fastapi: 0.112.0
hf_transfer: 0.1.8
huggingface_hub: 0.24.5
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.5
uvloop: 0.19.0
zmq: 26.1.0
vllm: 0.5.3.post1
multipart: 0.0.9
openai: 1.38.0
anthropic: 0.32.0
NVIDIA Topology: 
    GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  0-15    0       N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1024
merrymercy commented 2 months ago

Thanks for reporting the errors.

  1. You do not need to specify the chat template. The model you are using already ships with one: https://huggingface.co/NousResearch/Meta-Llama-3-8B-Instruct/blob/main/tokenizer_config.json#L2053
  2. Could you try adding --max-prefill 8192 when you launch the server? i.e.
    python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 8000 --context-length 4096 --dtype bfloat16 --max-prefill 8192

If it still OOMs, try adding --max-prefill 4096 or --mem-fraction-static 0.83 instead.
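
Spelled out, the fallback launch line looks like this (a sketch assembled from the commands earlier in the thread; the --mem-fraction-static variant from the same comment is noted as an alternative):

    # Fallback suggested above: cap the prefill batch at 4096 tokens.
    # Alternative from the same comment: keep the original prefill cap and
    # pass --mem-fraction-static 0.83 instead to shrink the static memory pool.
    python -m sglang.launch_server \
      --model-path NousResearch/Meta-Llama-3-8B-Instruct \
      --host 0.0.0.0 --port 8000 \
      --context-length 4096 --dtype bfloat16 \
      --max-prefill 4096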

hahmad2008 commented 2 months ago

@merrymercy thanks for the prompt response. It works with --max-prefill 4096. By the way, is the backend vLLM? What are the available backends?

For the tokenizer: what should the behavior be if I don't pass --chat-template? The model is loaded without a chat template, as shown below (chat_template=None):

server_args=ServerArgs(model_path='NousResearch/Meta-Llama-3-8B-Instruct', tokenizer_path='NousResearch/Meta-Llama-3-8B-Instruct', tokenizer_mode='auto', load_format='auto', dtype='bfloat16', trust_remote_code=False, context_length=4096, quantization=None, served_model_name='NousResearch/Meta-Llama-3-8B-Instruct', chat_template=None, host='0.0.0.0', port=8000, additional_ports=[8001, 8002, 8003, 8004], mem_fraction_static=0.88, max_prefill_tokens=4096, max_running_requests=None, max_num_reqs=None, max_total_tokens=None, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=746231706, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key='', file_storage_pth='SGlang_storage', dp_size=1, load_balance_method='round_robin', chunked_prefill_size=None, disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_disk_cache=False, enable_torch_compile=False, enable_p2p_check=False, attention_reduce_in_fp32=False, efficient_weight_load=False, nccl_init_addr=None, nnodes=1, node_rank=None)
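
On the chat-template question: the earlier reply points out that the template ships with the model's tokenizer_config.json, so chat_template=None on the server side does not mean no template is available. A quick local check with transformers (an illustrative sketch, not part of the thread):

    # Sketch: confirm a chat template is bundled with the HF tokenizer.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3-8B-Instruct")
    print(tok.chat_template is not None)  # True if a template is bundled
    print(tok.apply_chat_template(
        [{"role": "user", "content": "Hello"}],
        tokenize=False,
        add_generation_prompt=True,
    ))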
hahmad2008 commented 2 months ago

By the way, are there any other factors influencing concurrent requests that I should check?

merrymercy commented 2 months ago
qeternity commented 2 months ago

There is definitely a significant regression around OOM in recent versions. Workloads and configurations that previously ran at high concurrency for long periods without issue are now OOM'ing, even with a reduced static memory pool fraction and lowered concurrency.

smallblue12138 commented 2 months ago

> There is definitely a significant regression around OOM in recent versions. Workloads and configurations that previously ran at high concurrency for long periods without issue are now OOM'ing, even with a reduced static memory pool fraction and lowered concurrency.

Hello, is there any solution for this yet?

qeternity commented 2 months ago

This commit seems to have improved things (too early to say if resolved, but early testing is working better): https://github.com/sgl-project/sglang/commit/df191254abc002b3284560d9c4b94214a4656265

merrymercy commented 1 month ago

This should be fixed in the latest version. See also https://github.com/sgl-project/sglang/blob/39bb49d156f2319d2aec67c458c2db980bb0f4c3/README.md?plain=1#L221-L224

https://github.com/sgl-project/sglang/blob/39bb49d156f2319d2aec67c458c2db980bb0f4c3/docs/en/hyperparameter_tuning.md?plain=1#L28-L32
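
The linked tuning notes are not reproduced here, but the relevant knobs also appear as fields in the ServerArgs dump earlier in this thread (mem_fraction_static, max_running_requests, chunked_prefill_size). An illustrative combination for reducing memory pressure follows; the flag spellings are inferred from those field names and the values are placeholders, not maintainer recommendations.

    # Illustrative only: lower the KV-cache pool fraction, cap concurrently
    # running requests, and chunk long prefills. Values are placeholders.
    python -m sglang.launch_server \
      --model-path NousResearch/Meta-Llama-3-8B-Instruct \
      --context-length 4096 --dtype bfloat16 \
      --mem-fraction-static 0.83 \
      --max-running-requests 16 \
      --chunked-prefill-size 2048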