Closed. hxer7963 closed this issue 3 months ago.
@hxer7963 Could you try the latest version?
I tested with v0.2.10 and it works well.
```bash
# server
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B

# run twice
python3 -m sglang.bench_serving --backend sglang --port 30000 --tokenizer meta-llama/Meta-Llama-3.1-8B --num-prompts 300 --request-rate 16 --random-input-len 2048 --random-output-len 128 --random-range-ratio 1 --dataset-name random
```
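If it helps, the two back-to-back benchmark runs can also be scripted. A rough sketch (it assumes sglang is installed and the server above is already listening on port 30000):

```python
# Rough sketch: run sglang.bench_serving twice back-to-back, mirroring the
# commands above. Flags are copied from this comment; adjust port/model as needed.
import subprocess

BENCH_CMD = [
    "python3", "-m", "sglang.bench_serving",
    "--backend", "sglang",
    "--port", "30000",
    "--tokenizer", "meta-llama/Meta-Llama-3.1-8B",
    "--num-prompts", "300",
    "--request-rate", "16",
    "--random-input-len", "2048",
    "--random-output-len", "128",
    "--random-range-ratio", "1",
    "--dataset-name", "random",
]

for run in (1, 2):
    print(f"--- benchmark run {run} ---")
    subprocess.run(BENCH_CMD, check=True)
```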
> @hxer7963 Could you try the latest version?
The latest version appears to have resolved the issue; GPU memory usage was back to normal during testing. Awesome, thanks @zhyncs.
## Checklist

## Describe the bug
Description: While stress-testing the sglang LLM inference framework, I observed GPU memory usage grow from 73.16 GiB to 77.06 GiB. After the tests completed, the GPU memory did not return to its initial level.
When I then started the stress-testing script again, the sglang server raised an out-of-memory (OOM) error.
Expected Behavior: GPU memory returns to its initial level once the tests finish, and a second stress-test run completes without errors.
Actual Behavior: GPU memory stays at the elevated level after the tests, and rerunning the stress-testing script triggers an OOM error in the server.
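For reference, a rough sketch of how the before/after memory readings could be collected, assuming the pynvml package is installed and the server uses GPU 0 (both assumptions, not part of the original report):

```python
# Record GPU memory before and after a stress-test run to check whether it
# returns to the initial level. Assumes pynvml and a single-GPU (index 0) setup.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def used_gib() -> float:
    # nvmlDeviceGetMemoryInfo reports bytes; convert to GiB.
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**3

baseline = used_gib()
print(f"before test: {baseline:.2f} GiB")

# ... run the stress-testing script here ...

after = used_gib()
print(f"after test:  {after:.2f} GiB (delta {after - baseline:+.2f} GiB)")

pynvml.nvmlShutdown()
```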
## Reproduction
```
[gpu_id=0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.86 GB
[gpu_id=0] Memory pool end. avail mem=7.61 GB
[gpu_id=0] Capture cuda graph begin. This can take up to several minutes.
[gpu_id=0] max_total_num_tokens=452003, max_prefill_tokens=16384, max_running_requests=2047, context_len=131072
INFO: Started server process [130608]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
INFO: 127.0.0.1:40952 - "GET /get_model_info HTTP/1.1" 200 OK
[gpu_id=0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
INFO: 127.0.0.1:40954 - "POST /generate HTTP/1.1" 200 OK
The server is fired up and ready to roll!
```
I started the stress-testing script again and encountered an OOM error from the sglang server.
## Environment