hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

vLLM deployment: GPU memory usage does not match the vllm_gpu_util setting #4056

Closed. yecphaha closed this issue 4 months ago.

yecphaha commented 4 months ago

Reminder

System Info

CUDA_VISIBLE_DEVICES=0 API_PORT=9092 python src/api_demo.py \
    --model_name_or_path /save_model/qwen1_5_7b_pcb_merge \
    --template qwen \
    --infer_backend vllm \
    --max_new_tokens 32768 \
    --vllm_maxlen 32768 \
    --vllm_enforce_eager \
    --vllm_gpu_util 0.95
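For reference, a minimal sketch of how these flags presumably map onto a vLLM engine. The flag-to-argument mapping is an assumption on my part rather than taken from the LLaMA-Factory source; the model path and values are copied from the command above:

from vllm import LLM, SamplingParams

# Assumed mapping of the launch flags onto vLLM engine arguments:
#   --vllm_gpu_util 0.95   -> gpu_memory_utilization=0.95
#   --vllm_maxlen 32768    -> max_model_len=32768
#   --vllm_enforce_eager   -> enforce_eager=True
llm = LLM(
    model="/save_model/qwen1_5_7b_pcb_merge",
    gpu_memory_utilization=0.95,  # fraction of total GPU memory vLLM may budget
    max_model_len=32768,
    enforce_eager=True,           # disable CUDA graph capture
)
outputs = llm.generate(["你好"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)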

Reproduction

Inference environment: Python 3.10.14, CUDA 12.2, a single A100 80 GB GPU.

Expected behavior

No other processes were using the GPU at deployment time. With vllm_gpu_util set to 0.95 I expected roughly 76 GB of GPU memory to be occupied, but actual usage is only about 55 GB. What could be the reason?
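For context, a rough sketch of the arithmetic and one way to check device-wide usage. It assumes vllm_gpu_util is forwarded to vLLM's gpu_memory_utilization, which budgets weights plus KV cache as a fraction of total device memory; torch is only used here for cudaMemGetInfo:

import torch

# 0.95 * 80 GB on an A100-80G gives the expected ~76 GB budget.
free_bytes, total_bytes = torch.cuda.mem_get_info(0)  # device-wide, like nvidia-smi
budget_gib = 0.95 * total_bytes / 1024**3
used_gib = (total_bytes - free_bytes) / 1024**3
print(f"budget ≈ {budget_gib:.1f} GiB, currently in use ≈ {used_gib:.1f} GiB")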

Others

No response

hiyouga commented 4 months ago

This is not an issue with llamafactory.