vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: GPU memory usage is inconsistent with gpu_memory_utilization settings #5305

Open yecphaha opened 4 months ago

yecphaha commented 4 months ago

Your current environment

Deployed via LLaMA Factory:

```bash
CUDA_VISIBLE_DEVICES=0 API_PORT=9092 python src/api_demo.py --model_name_or_path /save_model/qwen1_5_7b_pcb_merge --template qwen --infer_backend vllm --max_new_tokens 32768 --vllm_maxlen 32768 --vllm_enforce_eager --vllm_gpu_util 0.95
```

Inference environment: Python 3.10.14, CUDA 12.2, single A100 80 GB
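
For reference, a minimal sketch of the equivalent direct vLLM setup (the parameter names are from vLLM's Python `LLM` API; the model path and values are copied from the LLaMA Factory command above, so this is an illustration of the configuration rather than the reporter's exact code):

```python
from vllm import LLM

# Roughly mirrors the flags above: --vllm_gpu_util 0.95, --vllm_maxlen 32768,
# --vllm_enforce_eager, on a single visible GPU.
llm = LLM(
    model="/save_model/qwen1_5_7b_pcb_merge",
    gpu_memory_utilization=0.95,  # fraction of total GPU memory vLLM may reserve
    max_model_len=32768,          # maximum context length (prompt + generated tokens)
    enforce_eager=True,           # disable CUDA graph capture
    tensor_parallel_size=1,
)
```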

🐛 Describe the bug

No other process is using the GPU during deployment. With gpu_memory_utilization set to 0.95 on an 80 GB A100, the expected usage is about 76 GB (0.95 × 80 GB), but the actual usage is only 55 GB. What is the reason?
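
For reference, a rough sketch of how the expected figure can be sanity-checked against the device totals (assumes PyTorch is installed; the 0.95 fraction is the `--vllm_gpu_util` value from the report above):

```python
import torch

# (free, total) device memory in bytes for GPU 0.
free_bytes, total_bytes = torch.cuda.mem_get_info(0)

gpu_memory_utilization = 0.95  # value passed via --vllm_gpu_util
expected_bytes = total_bytes * gpu_memory_utilization

print(f"total:    {total_bytes / 1024**3:.1f} GiB")
print(f"expected: {expected_bytes / 1024**3:.1f} GiB")  # ~76 GiB on an 80 GB A100
print(f"in use:   {(total_bytes - free_bytes) / 1024**3:.1f} GiB")
```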

vrdn-23 commented 4 months ago

Might be related to https://github.com/vllm-project/vllm/pull/5158

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!