vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Issue Running Inference on a Model with Multiple Nodes and GPUs #9134

Open kogans1107 opened 6 days ago

kogans1107 commented 6 days ago

Your current environment

I'm attempting to run a multi-node, multi-GPU inference setup using vLLM with pipeline parallelism. However, I'm encountering an error related to the number of available GPUs.

(llama_env) kogans@vinaka-vaka-levu:~$ nvidia-smi
Mon Oct  7 13:19:31 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2060 ...    On  | 00000000:01:00.0 Off |                  N/A |
| N/A   36C    P8               2W /  65W |      8MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2766      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

My Environment Information:

vLLM version: 0.6.1.dev238+ge2c6e0a82
PyTorch version: 2.4.0+cu121
CUDA version: 12.0.140
OS: Ubuntu 24.04.1 LTS (x86_64)
GPU: NVIDIA GeForce RTX 2060 with Max-Q Design
Nvidia driver version: 535.183.06

How to Reproduce the Error:

1. Download the Hugging Face model.
2. Set up the Docker environment with Ray and run the start script on both machines, one as the head node and the other as a worker (see the sketch after this list).
3. Run this command: vllm serve Path/to/model/ --tensor-parallel-size 1 --pipeline-parallel-size 2
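
For context, a minimal sketch of the underlying Ray commands the start script is expected to run (outside Docker; the head-node IP and port below are placeholders):

# On the head node:
ray start --head --port=6379

# On the worker node, pointing at the head node's address:
ray start --address='<head-node-ip>:6379'

# Verify that both nodes and their GPUs are registered:
ray status

# Then launch the server from the head node:
vllm serve Path/to/model/ --tensor-parallel-size 1 --pipeline-parallel-size 2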

Error Log:

INFO 10-04 16:11:19 config.py:1652] Downcasting torch.float32 to torch.float16.
INFO 10-04 16:11:19 config.py:899] Defaulting to use ray for distributed inference
WARNING 10-04 16:11:19 config.py:370] Async output processing can not be enabled with pipeline parallel
2024-10-04 16:11:20,458 INFO worker.py:1786 -- Started a local Ray instance.
Traceback (most recent call last):
  File "/home/kogans/llama_project/llama_env/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/scripts.py", line 165, in main
    args.dispatch_function(args)
  File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/scripts.py", line 37, in serve
    uvloop.run(run_server(args))
  File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
    engine_client = build_engine()
                    ^^^^^^^^^^^^^^
  File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 573, in from_engine_args
    initialize_ray_cluster(engine_config.parallel_config)
  File "/home/kogans/llama_project/llama_env/lib/python3.12/site-packages/vllm/executor/ray_utils.py", line 270, in initialize_ray_cluster
    raise ValueError(
ValueError: The number of required GPUs exceeds the total number of available GPUs in the placement group.

Expected Behavior:

The command should successfully start a vLLM server with the specified pipeline parallel configuration.
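
If the server did come up, a request like the following (assuming vLLM's default port 8000 and the served model name matching the path passed to vllm serve) would be the expected way to exercise it:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Path/to/model/", "prompt": "Hello, my name is", "max_tokens": 32}'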

Attempts made to fix the error:

I ran ray status to check the number of available GPUs, and it confirmed that I have 2 GPUs available.

I checked the CUDA status and discovered that on one laptop CUDA isn't discoverable with the current setup.
I made a couple of attempts to get CUDA recognized before going for a fresh install of CUDA.
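
For reference, the kind of checks involved (assuming PyTorch and Ray are installed in the same environment on both nodes):

# On each node, confirm the GPU is visible to the driver and to PyTorch:
nvidia-smi
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

# From either node, confirm Ray's view of the cluster; it should report 2 GPUs in total:
ray status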

How would you like to use vllm

I want to run inference on a local Hugging Face model, and I am having issues integrating the model with vLLM and running it on multiple GPUs and multiple nodes.
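
For reference, one way to download the model locally and point vLLM at it (the repository name and local directory below are placeholders; huggingface-cli is provided by the huggingface_hub package):

# Download the model to a local directory (placeholder repo id):
huggingface-cli download <org>/<model-name> --local-dir ./local-model

# Serve it across the two nodes with pipeline parallelism:
vllm serve ./local-model --tensor-parallel-size 1 --pipeline-parallel-size 2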

andoorve commented 4 days ago

Can you confirm how you started the Ray environment?