I'm attempting to run a multi-node, multi-GPU inference setup using vLLM with pipeline parallelism.
However, I'm encountering an error related to the number of available GPUs.
(llama_env) kogans@vinaka-vaka-levu:~$ nvidia-smi
Mon Oct 7 13:19:31 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2060 ... On | 00000000:01:00.0 Off | N/A |
| N/A 36C P8 2W / 65W | 8MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2766 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
My Environment Information:
vLLM version: 0.6.1.dev238+ge2c6e0a82
PyTorch version: 2.4.0+cu121
CUDA version: 12.0.140
OS: Ubuntu 24.04.1 LTS (x86_64)
GPU: NVIDIA GeForce RTX 2060 with Max-Q Design
Nvidia driver version: 535.183.06
How to Reproduce the Error:
1. Download the Hugging Face model
2. Set up the Docker environment with Ray and run the start script on both machines, one as the head node and the other as a worker (a rough sketch of the commands I mean is below, after step 3)
3. Run this command: vllm serve Path/to/model/ --tensor-parallel-size 1 --pipeline-parallel-size 2
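For reference, the cluster bring-up in step 2 is roughly the standard ray start flow shown below; the port and head-node IP are placeholders, not my actual values, and my real setup wraps these in a script inside the Docker containers:

# On the head machine: start the Ray head node
ray start --head --port=6379

# On the worker machine: join the cluster by pointing at the head node
ray start --address=<head-node-ip>:6379

# Then, from the head machine: launch the vLLM server with pipeline parallelism across the two nodes
vllm serve Path/to/model/ --tensor-parallel-size 1 --pipeline-parallel-size 2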
Error Log:
INFO 10-04 16:11:19 config.py:1652] Downcasting torch.float32 to torch.float16.
INFO 10-04 16:11:19 config.py:899] Defaulting to use ray for distributed inference
WARNING 10-04 16:11:19 config.py:370] Async output processing can not be enabled with pipeline parallel
2024-10-04 16:11:20,458 INFO worker.py:1786 -- Started a local Ray instance.
Traceback (most recent call last):
File "/home/kogans/llama_project/llama_env/bin/vllm", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/scripts.py", line 165, in main
args.dispatch_function(args)
File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/scripts.py", line 37, in serve
uvloop.run(run_server(args))
File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
engine_client = build_engine()
^^^^^^^^^^^^^^
File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 573, in from_engine_args
initialize_ray_cluster(engine_config.parallel_config)
File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/executor/ray_utils.py", line 270, in initialize_ray_cluster
raise ValueError(
ValueError: The number of required GPUs exceeds the total number of available GPUs in the placement group.
Expected Behavior:
The command should successfully start a vLLM server with the specified pipeline parallel configuration.
Attempts made to fix the error:
I ran ray status to see the number of GPUs available, and it confirmed I have 2 GPUs in the cluster.
I checked the CUDA status and discovered that on one laptop CUDA isn't discoverable with the current setup.
I made a couple of attempts to get CUDA working before going for a fresh CUDA install. (The checks I used are sketched below.)
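The checks I ran on each node looked roughly like this; this is a sketch of the verification steps, not a fix:

# Confirm Ray sees both nodes and how many GPUs it has registered
ray status

# Confirm the NVIDIA driver sees the GPU on this node
nvidia-smi

# Confirm PyTorch can actually use CUDA from Python on this node
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

On the problem laptop, the last check is where CUDA turns out not to be discoverable, which presumably is why the Ray placement group ends up with fewer usable GPUs than the pipeline-parallel size requires.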
How would you like to use vllm
I want to run inference on a local Hugging Face model, and I am having issues integrating the model with vLLM and running it on multiple GPUs across multiple nodes.
Before submitting a new issue...
[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.