I'm attempting to run a multi-node, multi-GPU inference setup using vLLM with pipeline parallelism.
However, I'm encountering an error related to the number of available GPUs.
(llama_env) kogans@vinaka-vaka-levu:~$ nvidia-smi
Mon Oct 7 13:19:31 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2060 ... On | 00000000:01:00.0 Off | N/A |
| N/A 36C P8 2W / 65W | 8MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2766 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
My Environment Information:
vLLM version: 0.6.1.dev238+ge2c6e0a82
PyTorch version: 2.4.0+cu121
CUDA version: 12.0.140
OS: Ubuntu 24.04.1 LTS (x86_64)
GPU: NVIDIA GeForce RTX 2060 with Max-Q Design
Nvidia driver version: 535.183.06
How to Reproduce the Error:
1. Download the Hugging Face model
2. Set up the Docker environment with Ray and run the start script on both machines, one as the head node and the other as a worker (a rough sketch of the commands I mean is below, after step 3)
3. Run this command: vllm serve Path/to/model/ --tensor-parallel-size 1 --pipeline-parallel-size 2
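For reference, the cluster bring-up in step 2 is roughly the standard ray start flow shown below; the port and head-node IP are placeholders, not my actual values, and my real setup wraps these in a script inside the Docker containers:

# On the head machine: start the Ray head node
ray start --head --port=6379

# On the worker machine: join the cluster by pointing at the head node
ray start --address=<head-node-ip>:6379

# Then, from the head machine: launch the vLLM server with pipeline parallelism across the two nodes
vllm serve Path/to/model/ --tensor-parallel-size 1 --pipeline-parallel-size 2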
Error Log:
INFO 10-04 16:11:19 config.py:1652] Downcasting torch.float32 to torch.float16.
INFO 10-04 16:11:19 config.py:899] Defaulting to use ray for distributed inference
WARNING 10-04 16:11:19 config.py:370] Async output processing can not be enabled with pipeline parallel
2024-10-04 16:11:20,458 INFO worker.py:1786 -- Started a local Ray instance.
Traceback (most recent call last):
File "/home/kogans/llama_project/llama_env/bin/vllm", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/scripts.py", line 165, in main
args.dispatch_function(args)
File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/scripts.py", line 37, in serve
uvloop.run(run_server(args))
File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
engine_client = build_engine()
^^^^^^^^^^^^^^
File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 573, in from_engine_args
initialize_ray_cluster(engine_config.parallel_config)
File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/executor/ray_utils.py", line 270, in initialize_ray_cluster
raise ValueError(
ValueError: The number of required GPUs exceeds the total number of available GPUs in the placement group.
Expected Behavior:
The command should successfully start a vLLM server with the specified pipeline parallel configuration.
Attempts made to fix the error:
I ran ray status to see the number of GPUs available, and it confirmed I have 2 GPUs in the cluster.
I checked the CUDA status and discovered that on one laptop CUDA isn't discoverable with the current setup.
I made a couple of attempts to get CUDA working before going for a fresh CUDA install. (The checks I used are sketched below.)
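The checks I ran on each node looked roughly like this; this is a sketch of the verification steps, not a fix:

# Confirm Ray sees both nodes and how many GPUs it has registered
ray status

# Confirm the NVIDIA driver sees the GPU on this node
nvidia-smi

# Confirm PyTorch can actually use CUDA from Python on this node
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

On the problem laptop, the last check is where CUDA turns out not to be discoverable, which presumably is why the Ray placement group ends up with fewer usable GPUs than the pipeline-parallel size requires.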
How would you like to use vllm
I want to run inference on a local Hugging Face model, and I am having issues integrating the model with vLLM and running it on multiple GPUs across multiple nodes.
Before submitting a new issue...
[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.