vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Unable to run distributed inference on ray with tensor parallel size > 1 #3190

Closed pravingadakh closed 3 months ago

pravingadakh commented 6 months ago

I am referring to the https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_distributed.py example. The example suggests not setting num_gpus when tensor parallelism is used. However, with that I ran into the following issue:

Ray does not allocate any GPUs on the driver node. Consider adjusting the Ray placement group or running the driver on a GPU node.

I assume this is because the head node does not have any GPUs configured. I then tried a Ray placement group, as recommended. Here is the code snippet for it:

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

resource_bundles = [{"GPU": 1, "CPU": 1} for _ in range(8)]
pg = placement_group(resource_bundles, strategy="PACK")
ready, unready = ray.wait([pg.ready()], timeout=60)
predictions = dataset.map_batches(VLLMPredictor, batch_format="numpy", batch_size=100, concurrency=4, scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg))

With this, however, the job got stuck in the model loading phase and eventually timed out. I then tried setting GPU to 2 in the bundle, since I had set tensor_parallel_size to 2, but then I started getting the following error:

Placement group bundle cannot have more than 1 GPU.

Is there a reason vLLM restricts a placement group bundle to a single GPU?
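For reference, while the job was stuck I checked whether the placement group had actually been scheduled and what resources were left. Roughly this (a debugging sketch on my side, reusing pg and unready from the snippet above):

import ray

# Debugging sketch: was the placement group actually scheduled, and what
# does the cluster have left while the job appears stuck?
if unready:  # `ready, unready` from the ray.wait() call above
    print("placement group not ready:", ray.util.placement_group_table(pg))
print("cluster resources:  ", ray.cluster_resources())
print("available resources:", ray.available_resources())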

pravingadakh commented 6 months ago

@c21 I see you worked on the original distributed example; would you be able to help me figure out what I am missing here?

c21 commented 6 months ago

Hi @pravingadakh, could you provide the full details of your environment?

stikkireddy commented 5 months ago

@c21 Any solution to this? I am running into a similar issue: quite simply, I am trying to run vLLM on a Ray cluster and have it use 2 GPUs for Llama 70B Chat.

If I use the tensor parallelism settings on a single node it works, but when using Ray with multiple GPUs I get:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacty of 79.15 GiB of which 369.25 MiB is free. Process 922399 has 78.78 GiB memory in use. Of the allocated memory 78.15 GiB is allocated by PyTorch, and 1.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

When I execute the function with map_batches and return the number of GPUs reported by torch, I get this:

{'is_cuda': 'True', 'gpus': '2'}
ds = ds.map_batches(
    Predictor,
    # Set the concurrency to the number of LLM instances.
    concurrency=1,
    # Specify the number of GPUs required per LLM instance.
    # NOTE: Do NOT set `num_gpus` when using vLLM with tensor-parallelism
    # (i.e., `tensor_parallel_size`).
    num_gpus=2,
    # Specify the batch size for inference.
    batch_size=32,
)
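One thing I plan to check is whether another process already occupies GPU 0, or whether both tensor-parallel workers landed on the same device. Calling something like this at the top of the predictor's __init__, before constructing LLM, should show it (debugging sketch only; mem_get_info needs a reasonably recent PyTorch):

import ray
import torch

def log_gpu_state(tag: str = "") -> None:
    # Which GPU IDs did Ray assign to this worker process?
    print(tag, "ray_gpu_ids:", ray.get_gpu_ids())
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)
        print(f"{tag} cuda:{i} {free / 1024**3:.1f} GiB free "
              f"of {total / 1024**3:.1f} GiB")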
pravingadakh commented 5 months ago

@c21 Apologies for the delayed response; I got occupied with other work. Our Ray cluster has 6 worker nodes, each with 2 A100 80 GB GPUs (and 18 CPUs), plus a head node with 10 CPUs (no GPU). We want to run the Llama 2 13B model with tensor_parallel_size=2. I initially tried the following code:

import ray
import pandas as pd
from typing import Dict
import numpy as np

dataset = ray.data.read_csv("/genai/inference_nums.csv")

class VLLMPredictor:
    def __init__(self):
        from vllm import LLM, SamplingParams
        model_path = "/genai/llms/llama2-13b-chat-bin"
        self.llm = LLM(model=model_path, tensor_parallel_size=2)
        self.sampling_params = SamplingParams(top_p=0.99, top_k=1, temperature=0.01, max_tokens=100)

    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]:
        predictions = self.llm.generate(list(batch["input"]), self.sampling_params)
        batch["output"] = [preds.outputs[0].text for preds in predictions]
        return batch

predictions = dataset.map_batches(VLLMPredictor, batch_format="numpy", batch_size=100, concurrency=2)
print(predictions.count())

The above code fails with this error: ValueError: Ray does not allocate any GPUs on the driver node. Consider adjusting the Ray placement group or running the driver on a GPU node.

Adding num_gpus to map_batches also fails with the same error. If we use a placement group (with 1-GPU bundles), the job fails to load the model and times out. And vLLM does not allow a placement group bundle with 2 GPUs; I am not sure why.
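What I would like to express, if it is supported, is two 1-GPU bundles per LLM instance rather than one 2-GPU bundle, roughly like this (an untested sketch on my side; STRICT_PACK and the child-task capture flag are guesses):

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

TENSOR_PARALLEL_SIZE = 2

# One placement group per LLM instance: one 1-GPU bundle per tensor-parallel
# worker, packed onto the same node so the two workers share a machine.
pg = placement_group(
    [{"GPU": 1, "CPU": 1}] * TENSOR_PARALLEL_SIZE,
    strategy="STRICT_PACK",
)
ray.get(pg.ready())

# Capture child tasks so the Ray workers that vLLM spawns internally are
# scheduled into the same placement group as the actor itself.
strategy = PlacementGroupSchedulingStrategy(
    placement_group=pg,
    placement_group_capture_child_tasks=True,
)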

sam-h-bean commented 4 months ago

@c21 @pravingadakh This seems like a pretty fundamental issue right now. When TP > 1, the parallel config sets worker_use_ray to True; otherwise it is False. This is what causes problems when you try to run inside a Ray batch job. That code needs to check whether Ray is already running and skip the init when it is. Something like that.
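For concreteness, the kind of guard I mean (only a sketch of the idea, not the actual vLLM code path):

from typing import Optional

import ray

def ensure_ray(address: Optional[str] = None) -> None:
    # If we are already inside a Ray job (e.g. a Ray Data worker), reuse the
    # existing session instead of initializing a second one.
    if not ray.is_initialized():
        ray.init(address=address)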

su-park commented 4 months ago

Hello,

This seems to be related to the issue above, so I am asking about it here as well.

We are currently running inference with the Mistral 7B model on 8x V100 16 GB GPUs. Because of the model size, more than one GPU is required to launch each LLM instance, so it seems we need to use tensor parallelism and data parallelism at the same time.

In this case, is it possible to configure appropriate parameters for vLLM and ds.map_batches? The kind of split I have in mind is sketched below.
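Whether this is the right way to combine tensor parallelism (inside each LLM) with data parallelism (across map_batches workers) is exactly my question; the numbers, column names, and model path below are just my assumptions:

from vllm import LLM, SamplingParams

TENSOR_PARALLEL_SIZE = 2  # GPUs per LLM instance (Mistral 7B in fp16 is ~14 GB, too big for one 16 GB V100)
NUM_INSTANCES = 4         # data-parallel replicas: 4 instances x 2 GPUs = 8 GPUs

class MistralPredictor:
    def __init__(self):
        self.llm = LLM(
            model="mistralai/Mistral-7B-v0.1",  # example model path
            tensor_parallel_size=TENSOR_PARALLEL_SIZE,
        )
        self.sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

    def __call__(self, batch):
        outputs = self.llm.generate(list(batch["text"]), self.sampling_params)
        batch["generated"] = [o.outputs[0].text for o in outputs]
        return batch

# Intended wiring: ds.map_batches(MistralPredictor, concurrency=NUM_INSTANCES, ...)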

DarkLight1337 commented 3 months ago

We have added documentation for this situation in #5430. Please take a look.
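To spell out the gist as I understand it: for tensor_parallel_size > 1, set num_gpus=0 and give each map_batches worker its own placement group via ray_remote_args_fn, roughly as below (paraphrased sketch; see the linked docs and the offline_inference_distributed.py example for the exact code, and note that ray_remote_args_fn requires a recent Ray release):

import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

tensor_parallel_size = 2
num_instances = 4

def scheduling_strategy_fn():
    # One 1-GPU bundle per tensor-parallel worker, all on the same node.
    pg = ray.util.placement_group(
        [{"GPU": 1, "CPU": 1}] * tensor_parallel_size,
        strategy="STRICT_PACK",
    )
    return dict(scheduling_strategy=PlacementGroupSchedulingStrategy(
        pg, placement_group_capture_child_tasks=True))

resources_kwarg = {}
if tensor_parallel_size == 1:
    resources_kwarg["num_gpus"] = 1
else:
    # Let the placement group, not num_gpus, reserve the GPUs.
    resources_kwarg["num_gpus"] = 0
    resources_kwarg["ray_remote_args_fn"] = scheduling_strategy_fn

ds = ds.map_batches(
    VLLMPredictor,
    concurrency=num_instances,
    batch_size=32,
    **resources_kwarg,
)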