zhentingqi opened 2 months ago
The root cause should be:

```
[rank0]: RuntimeError: CUDART error: CUDA-capable device(s) is/are busy or unavailable
```

Please contact your admin, or try rebooting the machine.
We encountered the same problem when several processes try to access the same GPU (as happens in the vLLM peer-to-peer check) on a cluster whose NVIDIA configuration allows only one process per GPU. @teojgo found a workaround: skip the peer-to-peer check entirely with `export VLLM_SKIP_P2P_CHECK=1`.
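For anyone hitting this, a minimal sketch of applying the workaround before launching vLLM (the launch command and model name below are illustrative, not from the thread):

```shell
# Tell vLLM to skip its peer-to-peer capability probe, so no extra
# process tries to attach to a GPU during startup (workaround above).
export VLLM_SKIP_P2P_CHECK=1

# Then launch as usual; this example command is an assumption, adjust
# to your own entrypoint and parallelism settings:
# python -m vllm.entrypoints.openai.api_server \
#     --model Qwen/Qwen1.5-72B-Chat --tensor-parallel-size 4
```

Note this only disables the startup check; it does not change how tensor-parallel workers communicate afterwards.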
Another option is to enable CUDA MPS.
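A rough sketch of enabling CUDA MPS on a node (the directory paths are assumptions; starting the daemon typically needs admin rights, and exclusive-process GPUs must be configured to allow the MPS server):

```shell
# CUDA MPS funnels work from multiple client processes through a single
# server process per GPU, which sidesteps the one-process-per-GPU limit.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps   # IPC pipe location (assumed path)
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log    # daemon log location (assumed path)

nvidia-cuda-mps-control -d    # start the MPS control daemon

# Later, to shut it down:
# echo quit | nvidia-cuda-mps-control
```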
I have also encountered the same problem on my cluster; looking forward to a solution.
Your current environment
How would you like to use vllm
I want to run inference of "Qwen/Qwen1.5-72B-Chat".
code:
error: