sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/

[Bug] GPU Memory Not Releasing After Testing and "OOM" During Stress Testing #951

Closed · hxer7963 closed this issue 3 months ago

hxer7963 commented 3 months ago

Describe the bug

Description: While testing the sglang LLM inference framework, I observed that GPU memory usage increased from 73.16 GiB to 77.06 GiB. After the tests completed, GPU memory usage did not return to its initial level.
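A minimal way to track this, assuming nothing beyond a stock nvidia-smi install (the one-second interval and the log file name below are arbitrary choices for illustration):

```bash
# Poll GPU 0 memory usage once per second; run alongside the benchmark
# and compare the values before, during, and after the test.
while true; do
    nvidia-smi --query-gpu=timestamp,memory.used,memory.total \
               --format=csv,noheader,nounits -i 0 >> gpu_mem.log
    sleep 1
done
```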

Following this, I started the stress-testing script again and the sglang server reported the following error:

[gpu_id=0] Prefill batch. #new-seq: 1, #new-token: 2147, #cached-token: 0, cache hit rate: 21.58%, #running-req: 0, #queue-req: 0
[gpu_id=0] Prefill batch. #new-seq: 3, #new-token: 6417, #cached-token: 0, cache hit rate: 21.42%, #running-req: 1, #queue-req: 0
[gpu_id=0] Prefill batch. #new-seq: 6, #new-token: 6418, #cached-token: 6495, cache hit rate: 21.85%, #running-req: 4, #queue-req: 0
Exception in ModelTpServer:
Traceback (most recent call last):
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/controller/tp_worker.py", line 186, in exposed_step
    self.forward_step()
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/controller/tp_worker.py", line 202, in forward_step
    self.forward_prefill_batch(new_batch)
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/controller/tp_worker.py", line 443, in forward_prefill_batch
    output = self.model_runner.forward(batch, ForwardMode.EXTEND)
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/controller/model_runner.py", line 336, in forward
    return self.forward_extend(batch)
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/controller/model_runner.py", line 293, in forward_extend
    input_metadata = InputMetadata.create(
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/controller/infer_batch.py", line 770, in create
    init_flashinfer_args(
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/controller/infer_batch.py", line 910, in init_flashinfer_args
    model_runner.flashinfer_prefill_wrapper_paged.begin_forward(
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/flashinfer/prefill.py", line 778, in begin_forward
    self._wrapper.begin_forward(
RuntimeError: RuntimeError: Out of workspace memory in AlignedAlloactor

Expected Behavior:

GPU memory usage returns to its pre-test level once the benchmark finishes, and re-running the stress test succeeds.

Actual Behavior:

GPU memory usage stays at the elevated level (77.06 GiB), and re-running the stress test fails with the "Out of workspace memory in AlignedAllocator" error shown above.

Reproduction

- server
```
[gpu_id=0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.86 GB
[gpu_id=0] Memory pool end. avail mem=7.61 GB
[gpu_id=0] Capture cuda graph begin. This can take up to several minutes.
[gpu_id=0] max_total_num_tokens=452003, max_prefill_tokens=16384, max_running_requests=2047, context_len=131072
INFO: Started server process [130608]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
INFO: 127.0.0.1:40952 - "GET /get_model_info HTTP/1.1" 200 OK
[gpu_id=0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
INFO: 127.0.0.1:40954 - "POST /generate HTTP/1.1" 200 OK
The server is fired up and ready to roll!
```
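The launch command itself was not captured above; judging from the log (Llama-3.1-8B weights, port 30000) it was presumably along these lines, with any additional flags being an assumption rather than a record of the actual setup:

```bash
# Presumed launch command; model path and port are inferred from the log above.
# Any other flags used in the original run are unknown.
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --port 30000
```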


- benchmark
```bash
python3 bench_serving.py --backend srt --port 30000 --tokenizer meta-llama/Meta-Llama-3.1-8B --num-prompt 300 --request-rate 16 --input-len 2048 --output-len 128
Namespace(backend='srt', host='http://localhost', port=30000, dataset=None, input_len=2048, output_len=128, range_ratio=1.0, tokenizer='meta-llama/Meta-Llama-3.1-8B', best_of=1, use_beam_search=False, num_prompts=300, request_rate=16.0, seed=0, trust_remote_code=False)
100%|█████████████████████████████████████████████████████████████| 300/300 [00:36<00:00,  8.28it/s]
Total time: 55.23 s
Request throughput: 5.43 requests/s
Decoding throughput: 695.25 token/s
Average latency: 35.28 s
Average latency per token: 0.02 s
Average latency per output token: 0.28 s
```

Then I started the stress-testing script again and encountered the OOM error from the sglang server.

Environment

Python: 3.10.0 (default, Mar  3 2022, 09:58:08) [GCC 7.5.0]
CUDA available: True
GPU 0: NVIDIA A800-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.3, V12.3.107
CUDA Driver Version: 470.182.03
PyTorch: 2.3.1+cu121
sglang: 0.2.5
flashinfer: 0.1.1+cu121torch2.3
requests: 2.32.3
tqdm: 4.66.4
numpy: 1.26.4
aiohttp: 3.9.5
fastapi: 0.111.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.2
interegular: 0.3.3
packaging: 24.1
pillow: Module Not Found
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.3
uvloop: 0.19.0
zmq: 26.0.3
vllm: 0.5.3.post1
openai: 1.37.1
anthropic: 0.31.2
NVIDIA Topology: 
        GPU0    CPU Affinity    NUMA Affinity
GPU0     X      0-115   0

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1048576
zhyncs commented 3 months ago

Try this https://github.com/sgl-project/sglang/blob/main/test/killall_sglang.sh
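The script's exact contents aren't reproduced here; roughly, it kills any leftover SGLang server processes so the driver releases their GPU memory. A sketch of that kind of cleanup (not the script itself):

```bash
# Rough sketch only -- not the actual contents of killall_sglang.sh.
# Kill leftover sglang server processes so their GPU memory is freed,
# then confirm that reported usage has dropped back to the baseline.
pkill -9 -f sglang.launch_server
nvidia-smi --query-gpu=memory.used --format=csv
```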

zhyncs commented 3 months ago

@hxer7963 Could you try the latest version?

zhyncs commented 3 months ago

I tested with v0.2.10 and it works well.

```bash
# server
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B

# run twice
python3 -m sglang.bench_serving --backend sglang --port 30000 --tokenizer meta-llama/Meta-Llama-3.1-8B --num-prompts 300 --request-rate 16 --random-input-len 2048 --random-output-len 128 --random-range-ratio 1 --dataset-name random
```

hxer7963 commented 3 months ago

@hxer7963 Could you try the latest version?

The latest version appears to have resolved the issue: GPU memory usage stayed stable during testing. Awesome, thanks @zhyncs.