sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/

[Bug] GPU Memory Not Releasing After Testing and "OOM" During Stress Testing #951

Closed · hxer7963 closed this issue 3 months ago

hxer7963 commented 3 months ago

Describe the bug

Description: While testing the sglang LLM inference framework, I observed that GPU memory usage increased from 73.16 GiB to 77.06 GiB. After the tests completed, GPU memory usage did not return to its initial level.
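A minimal way to track this, assuming nothing beyond a stock nvidia-smi install (the one-second interval and the log file name below are arbitrary choices for illustration):

```bash
# Poll GPU 0 memory usage once per second; run alongside the benchmark
# and compare the values before, during, and after the test.
while true; do
    nvidia-smi --query-gpu=timestamp,memory.used,memory.total \
               --format=csv,noheader,nounits -i 0 >> gpu_mem.log
    sleep 1
done
```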

Following this, I started the stress-testing script again and the sglang server reported the following error:

[gpu_id=0] Prefill batch. #new-seq: 1, #new-token: 2147, #cached-token: 0, cache hit rate: 21.58%, #running-req: 0, #queue-req: 0
[gpu_id=0] Prefill batch. #new-seq: 3, #new-token: 6417, #cached-token: 0, cache hit rate: 21.42%, #running-req: 1, #queue-req: 0
[gpu_id=0] Prefill batch. #new-seq: 6, #new-token: 6418, #cached-token: 6495, cache hit rate: 21.85%, #running-req: 4, #queue-req: 0
Exception in ModelTpServer:
Traceback (most recent call last):
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/controller/tp_worker.py", line 186, in exposed_step
    self.forward_step()
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/controller/tp_worker.py", line 202, in forward_step
    self.forward_prefill_batch(new_batch)
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/controller/tp_worker.py", line 443, in forward_prefill_batch
    output = self.model_runner.forward(batch, ForwardMode.EXTEND)
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/controller/model_runner.py", line 336, in forward
    return self.forward_extend(batch)
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/controller/model_runner.py", line 293, in forward_extend
    input_metadata = InputMetadata.create(
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/controller/infer_batch.py", line 770, in create
    init_flashinfer_args(
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/controller/infer_batch.py", line 910, in init_flashinfer_args
    model_runner.flashinfer_prefill_wrapper_paged.begin_forward(
  File "/mnt/llm_dataset/willhe/miniconda3/envs/sglang/lib/python3.10/site-packages/flashinfer/prefill.py", line 778, in begin_forward
    self._wrapper.begin_forward(
RuntimeError: RuntimeError: Out of workspace memory in AlignedAlloactor

Expected Behavior:

GPU memory usage returns to its pre-test level once the benchmark finishes, and re-running the stress test succeeds.

Actual Behavior:

GPU memory usage stays at the elevated level (77.06 GiB), and re-running the stress test fails with the "Out of workspace memory in AlignedAllocator" error shown above.

Reproduction

- server
```
[gpu_id=0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.86 GB
[gpu_id=0] Memory pool end. avail mem=7.61 GB
[gpu_id=0] Capture cuda graph begin. This can take up to several minutes.
[gpu_id=0] max_total_num_tokens=452003, max_prefill_tokens=16384, max_running_requests=2047, context_len=131072
INFO: Started server process [130608]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
INFO: 127.0.0.1:40952 - "GET /get_model_info HTTP/1.1" 200 OK
[gpu_id=0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
INFO: 127.0.0.1:40954 - "POST /generate HTTP/1.1" 200 OK
The server is fired up and ready to roll!
```
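The launch command itself was not captured above; judging from the log (Llama-3.1-8B weights, port 30000) it was presumably along these lines, with any additional flags being an assumption rather than a record of the actual setup:

```bash
# Presumed launch command; model path and port are inferred from the log above.
# Any other flags used in the original run are unknown.
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --port 30000
```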


- benchmark
```bash
python3 bench_serving.py --backend srt --port 30000 --tokenizer meta-llama/Meta-Llama-3.1-8B --num-prompt 300 --request-rate 16 --input-len 2048 --output-len 128
Namespace(backend='srt', host='http://localhost', port=30000, dataset=None, input_len=2048, output_len=128, range_ratio=1.0, tokenizer='meta-llama/Meta-Llama-3.1-8B', best_of=1, use_beam_search=False, num_prompts=300, request_rate=16.0, seed=0, trust_remote_code=False)
100%|█████████████████████████████████████████████████████████████| 300/300 [00:36<00:00,  8.28it/s]
Total time: 55.23 s
Request throughput: 5.43 requests/s
Decoding throughput: 695.25 token/s
Average latency: 35.28 s
Average latency per token: 0.02 s
Average latency per output token: 0.28 s
```

Then I started the stress-testing script again and encountered the OOM error from the sglang server.

Environment

Python: 3.10.0 (default, Mar  3 2022, 09:58:08) [GCC 7.5.0]
CUDA available: True
GPU 0: NVIDIA A800-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.3, V12.3.107
CUDA Driver Version: 470.182.03
PyTorch: 2.3.1+cu121
sglang: 0.2.5
flashinfer: 0.1.1+cu121torch2.3
requests: 2.32.3
tqdm: 4.66.4
numpy: 1.26.4
aiohttp: 3.9.5
fastapi: 0.111.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.2
interegular: 0.3.3
packaging: 24.1
pillow: Module Not Found
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.3
uvloop: 0.19.0
zmq: 26.0.3
vllm: 0.5.3.post1
openai: 1.37.1
anthropic: 0.31.2
NVIDIA Topology: 
        GPU0    CPU Affinity    NUMA Affinity
GPU0     X      0-115   0

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1048576
zhyncs commented 3 months ago

Try this https://github.com/sgl-project/sglang/blob/main/test/killall_sglang.sh
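The script's exact contents aren't reproduced here; roughly, it kills any leftover SGLang server processes so the driver releases their GPU memory. A sketch of that kind of cleanup (not the script itself):

```bash
# Rough sketch only -- not the actual contents of killall_sglang.sh.
# Kill leftover sglang server processes so their GPU memory is freed,
# then confirm that reported usage has dropped back to the baseline.
pkill -9 -f sglang.launch_server
nvidia-smi --query-gpu=memory.used --format=csv
```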

zhyncs commented 3 months ago

@hxer7963 Could you try the latest version?

zhyncs commented 3 months ago

I tested with v0.2.10 and it works well.

```bash
# server
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B

# run twice
python3 -m sglang.bench_serving --backend sglang --port 30000 --tokenizer meta-llama/Meta-Llama-3.1-8B --num-prompts 300 --request-rate 16 --random-input-len 2048 --random-output-len 128 --random-range-ratio 1 --dataset-name random
```

hxer7963 commented 3 months ago

@hxer7963 Could you try the latest version?

The latest version appears to have resolved the issue: GPU memory usage stayed stable during testing. Awesome, thanks @zhyncs.