vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: distributed model example with num_gpus does not use all gpus provided by the ray actor #3847

Open · stikkireddy opened this issue 7 months ago

stikkireddy commented 7 months ago

Your current environment

The output of `python collect_env.py`
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] torch==2.0.1+cu118
[pip3] torchvision==0.15.2+cu118
[pip3] triton==2.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  NV12    NODE    NODE    0-23    0       N/A
GPU1    NV12     X  SYS SYS 24-47   1       N/A
NIC0    NODE    SYS  X  NODE                
NIC1    NODE    SYS NODE     X              

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

🐛 Describe the bug

I am running Llama 70B and I want to deploy multiple model instances based on the number of Ray worker nodes, but I am hitting this issue. I am using the example provided in the Ray repo; it works with 1 GPU, but with 2 x A100s it does not. Can you please assist? It does not seem to be using both GPUs, and I confirmed with the commented-out code below that two GPUs are actually visible to the actor.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacty of 79.15 GiB of which 369.25 MiB is free. Process 86114 has 78.78 GiB memory in use. Of the allocated memory 78.15 GiB is allocated by PyTorch, and 1.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

import numpy as np
import ray
import pandas as pd

from typing import Dict
from vllm import LLM, SamplingParams
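
# NOTE: `model_path`, `tokenizer_path`, and `questions` are assumed to be
# defined earlier in the notebook/script; they are not shown in this snippet.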

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1500)

# Create a class to do batch inference.
class LLMPredictor:

    def __init__(self):
        # Create an LLM.
        self._model_path = model_path
        self._tokenizer_path = tokenizer_path
        self.llm = None

    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]:
        if self.llm is None:
            self.llm = LLM(model=self._model_path, tokenizer=self._tokenizer_path)

        # Generate texts from the prompts.
        # The output is a list of RequestOutput objects that contain the prompt,
        # generated text, and other information.
        outputs = self.llm.generate(batch["text"], sampling_params)
        prompt = []
        generated_text = []
        for output in outputs:
            prompt.append(output.prompt)
            generated_text.append(' '.join([o.text for o in output.outputs]))
        # import torch
        # import sys
        # is_cuda = torch.cuda.is_available()
        # gpus = torch.cuda.device_count()
        return {
            # "which_python": [str(sys.executable)],
            # "is_cuda": [str(is_cuda)],
            # "gpus": [str(gpus)]
            "prompt": prompt,
            "generated_text": generated_text,
        }

num_concurrency = 1

ds = ray.data.from_pandas(pd.DataFrame({
  "text": pd.Series(questions)
})).repartition(num_concurrency)

ds = ds.map_batches(
    LLMPredictor,
    # Set the concurrency to the number of LLM instances.
    concurrency=num_concurrency,
    # Specify the number of GPUs required per LLM instance.
    # NOTE: Do NOT set `num_gpus` when using vLLM with tensor-parallelism
    # (i.e., `tensor_parallel_size`).
    num_gpus=2,
    # Specify the batch size for inference.
    batch_size=4096,
)

print(ds.take(10))
Imss27 commented 7 months ago

I had a similar issue before; you can try setting gpu_memory_utilization to a lower value such as 0.5 (the default is 0.9).
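
For reference, a minimal sketch of that setting (the model and tokenizer paths are placeholders, not from the original report):

from vllm import LLM

# Sketch of the suggestion above: cap the fraction of GPU memory that vLLM
# pre-allocates for weights and KV cache (the default is 0.9).
llm = LLM(
    model="/path/to/model",          # placeholder path
    tokenizer="/path/to/tokenizer",  # placeholder path
    gpu_memory_utilization=0.5,
)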

stikkireddy commented 7 months ago

The model requires 2 GPUs to run Llama 70B in FP16, so I need the actors to be able to shard it across the two GPUs. The problem is not really the OOM; the problem is that two GPUs are available and it is not using both of them.

I am unsure why gpu_memory_utilization would help; when running on a single node it works just fine with tensor-parallel-size.
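
For reference, a sketch of what passing tensor-parallel-size inside the actor would look like, so the 70B weights are sharded across both GPUs that Ray reserves for it (paths are placeholders; this is an illustration, not a verified fix for this exact Ray Data setup):

from vllm import LLM

# Sketch: construct the engine with tensor_parallel_size matching the GPUs
# reserved for the actor, so the weights are sharded across both A100s
# instead of being loaded onto a single device. Paths are placeholders.
llm = LLM(
    model="/path/to/llama-70b",      # placeholder path
    tokenizer="/path/to/tokenizer",  # placeholder path
    tensor_parallel_size=2,
)

Note the Ray example's own comment about not combining num_gpus with tensor parallelism; how the two GPUs get assigned to the vLLM workers (for example via placement groups) may also need to change, so treat this only as a sketch of where the argument goes.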

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!