yecphaha opened this issue 4 months ago
Might be related to https://github.com/vllm-project/vllm/pull/5158
Your current environment
Deployed with LLaMA-Factory:
CUDA_VISIBLE_DEVICES=0 API_PORT=9092 python src/api_demo.py --model_name_or_path /save_model/qwen1_5_7b_pcb_merge --template qwen --infer_backend vllm --max_new_tokens 32768 --vllm_maxlen 32768 --vllm_enforce_eager --vllm_gpu_util 0.95
Inference environment: Python 3.10.14, CUDA 12.2, single A100 80GB
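For reference, here is a minimal sketch of what I understand the launch above to be doing through vLLM's offline Python API, assuming the LLaMA-Factory flags --vllm_gpu_util, --vllm_maxlen, and --vllm_enforce_eager map onto vLLM's gpu_memory_utilization, max_model_len, and enforce_eager parameters (the model path is the one from my command; the prompt is just an illustration):

```python
# Hypothetical vLLM-side equivalent of the LLaMA-Factory launch above.
# gpu_memory_utilization=0.95 asks vLLM to budget roughly 95% of the card's
# memory (about 76GB on an 80GB A100) for weights, activations, and the KV cache.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/save_model/qwen1_5_7b_pcb_merge",  # merged Qwen1.5-7B checkpoint
    gpu_memory_utilization=0.95,               # corresponds to --vllm_gpu_util 0.95
    max_model_len=32768,                       # corresponds to --vllm_maxlen 32768
    enforce_eager=True,                        # corresponds to --vllm_enforce_eager
)

outputs = llm.generate(
    ["Hello, how are you?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```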
🐛 Describe the bug
No other processes are using the GPU during deployment. With --vllm_gpu_util 0.95 on an 80GB A100, I expect roughly 76GB of GPU memory to be used, but the actual usage is only 55GB. What is the reason?
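To quantify the gap, here is a small sketch (assuming PyTorch is available in the same environment) that compares the reservation I would expect at gpu_memory_utilization=0.95 with what the driver actually reports on device 0:

```python
# Rough check of expected vs. actual GPU memory usage on device 0.
# torch.cuda.mem_get_info() reports (free, total) bytes as seen by the driver,
# so total - free is everything currently allocated on the card.
import torch

GIB = 1024 ** 3
free, total = torch.cuda.mem_get_info(0)

expected = 0.95 * total   # what --vllm_gpu_util 0.95 should roughly reserve
actual = total - free     # what is actually in use right now

print(f"total:    {total / GIB:.1f} GiB")
print(f"expected: {expected / GIB:.1f} GiB (0.95 * total)")
print(f"actual:   {actual / GIB:.1f} GiB")
print(f"gap:      {(expected - actual) / GIB:.1f} GiB")
```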