vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Tensor parallelism on ray cluster #1566

Closed baojunliu closed 4 months ago

baojunliu commented 11 months ago

I am using vLLM on a Ray cluster with multiple nodes and 4 GPUs on each node. I am trying to load a Llama model onto more than one GPU by setting tensor_parallel_size=2, but the model won't load. It works fine on a single instance when I don't use a Ray cluster; on the Ray cluster I can only set tensor_parallel_size=1. Is there a way to use tensor parallelism on a Ray cluster?
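
For reference, a minimal sketch of the kind of invocation described above (the model name is illustrative). With tensor_parallel_size > 1, vLLM starts or attaches to a Ray cluster and launches one worker per GPU shard:

from vllm import LLM

# Illustrative model; the point is the tensor_parallel_size argument.
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=2)
outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)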

hughesadam87 commented 11 months ago

Solved: my particular issue (not necessarily that of OP) was that the Ray cluster that runs locally on the single node (because we're not doing distributed inference) didn't have enough memory.

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mycontainer
    image: myimage
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm
  volumes:
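  # In-memory emptyDir mounted at /dev/shm: Ray keeps its shared-memory object
  # store there, so an undersized /dev/shm can get worker processes killed.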
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: "10.24Gi"

I don't know Ray well enough to understand what this does, lol.


I am unaffiliated with OP, but I believe we are having the same issue. We're using Kubernetes to deploy a model on a single g4.12xlarge instance (4 GPUs). We cannot use a newer model class for various reasons. To troubleshoot, I've chosen a small model that runs easily on a single GPU.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      nodeSelector:
        workload-type: gpu
      containers:
        - name: vllm-container
          image: lmisam/vllm-openai:latest
          imagePullPolicy: Always
          ports:
            - containerPort: 8000
              name: api
          resources:
            limits:
              nvidia.com/gpu: "4"
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
            - name: CUDA_VISIBLE_DEVICES
              value: "0,1,2,3" # ensures the container app can see all 4 GPUs

          args: [ "--model", "TheBloke/OpenHermes-2-Mistral-7B-AWQ",
                  "--quantization", "awq",
                  "--dtype", "float16",
                  "--max-model-len", "4096",
                  "--tensor-parallel-size", "1"]`   #<---- Important

This is overkill, but as you see we're making 4 GPUs available to the container, despite only running on one of them. I've also confirmed from shelling into the container and running pytorch commands that it does have 4 GPUs accessible.

When --tensor-parallel-size is 1 or the flag is not included, the model works just fine.

When the flag is set to 2 or more, we get a long stack trace; the relevant portion is shown below.

Do I have to manually start the Ray cluster, or set any other environment variables, so that it is up and healthy when the Docker container starts? Or does vLLM abstract this entirely and there's nothing else to do on our end?
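
For reference, a hedged sketch of the manual route in case you do want to manage the Ray cluster yourself (commands and addresses are illustrative; if no cluster is running, vLLM simply starts a local Ray instance on its own):

# Start Ray before launching vLLM, e.g.
#   head node:    ray start --head --port=6379
#   worker nodes: ray start --address=<head-ip>:6379
# then verify that the expected GPUs are visible to the cluster.
import ray

ray.init(address="auto", ignore_reinit_error=True)
print(ray.cluster_resources())  # should list the expected number of GPUs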

    self._init_workers_ray(placement_group)
  File "/workspace/vllm/engine/llm_engine.py", line 181, in _init_workers_ray
    self._run_workers(
  File "/workspace/vllm/engine/llm_engine.py", line 704, in _run_workers
    all_outputs = ray.get(all_outputs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2565, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: RayWorker
    actor_id: 1f4e0ddffc9942a1e34140a601000000
    pid: 1914
    namespace: 0c0e14ce-0a35-4190-ad47-f0a1959e7fe4
    ip: 172.31.15.88
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-11-06 15:44:56,238    WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffedad1d4643afe66a344639bf01000000 Worker ID: 9cd611fa721b93a5d424da4383573ab345580a0a19f9301512aa384f Node ID: 27fa470afbb4822adc44dc5bbad10b5584e94e8f7b1d843bbcb63485 Worker IP address: 172.31.15.88 Worker port: 38939 Worker PID: 1916 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-11-06 15:44:56,272    WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff45041ffe54799af4388ab20401000000 Worker ID: 708f3407d01c3576315da0569ada9fcd984f754e43417718e04ca93b Node ID: 27fa470afbb4822adc44dc5bbad10b5584e94e8f7b1d843bbcb63485 Worker IP address: 172.31.15.88 Worker port: 46579 Worker PID: 1915 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-11-06 15:44:56,316    WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff1252d098d6462eb8cd537bd601000000 Worker ID: acea4bee761db0a4bff1adb3cc43d1e3ba1793d3017c3fc22e18e6d7 Node ID: 27fa470afbb4822adc44dc5bbad10b5584e94e8f7b1d843bbcb63485 Worker IP address: 172.31.15.88 Worker port: 45813 Worker PID: 1913 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(RayWorker pid=1916) *** SIGBUS received at time=1699285494 on cpu 2 *** [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayWorker pid=1916) PC: @     0x7fcfb18c6c3a  (unknown)  (unknown) [repeated 3x across cluster]
(RayWorker pid=1916)     @     0x7fcfb1759520       3392  (unknown) [repeated 3x across cluster]
(RayWorker pid=1916)     @ 0x32702d6c63636e2f  (unknown)  (unknown) [repeated 3x across cluster]
JenniePing commented 10 months ago

I am using vLLM on a Ray cluster with multiple nodes and 4 GPUs on each node. I am trying to load a Llama model onto more than one GPU by setting tensor_parallel_size=2, but the model won't load. It works fine on a single instance when I don't use a Ray cluster; on the Ray cluster I can only set tensor_parallel_size=1. Is there a way to use tensor parallelism on a Ray cluster?

Same, any solution please?

qizzzh commented 10 months ago

Hit the same issue

qizzzh commented 10 months ago

https://github.com/vllm-project/vllm/issues/1058#issuecomment-1846600442 could be related

baojunliu commented 9 months ago

Here is the finding for my case:

When I submit the remote job, it claims GPUs. With the following code, it takes 1 GPU:

import ray

@ray.remote(num_gpus=1)
class MyClass:
    ...

When vLLM runs tensor parallelism, it creates its own GPU workers through Ray. However, the GPUs are unavailable (already claimed by the remote job), so it eventually times out.

Is there a way to use the GPUs already assigned to the remote job?

qizzzh commented 9 months ago

It’s the same as my finding in https://github.com/vllm-project/vllm/issues/1058#issuecomment-1846600442. I used custom resources to work around it. Ideally vLLM should have a way to pass in already assigned logical resources.
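
A hedged sketch of what a custom-resource workaround could look like (the resource name and model are illustrative; the actual details are in the linked issue). The idea is to schedule your own driver actor on a custom resource rather than on a GPU, so the physical GPUs stay free for the Ray workers that vLLM creates for tensor parallelism:

# Start Ray on the driver node with a custom resource, e.g.
#   ray start --head --resources='{"vllm_driver": 1}'
import ray
from vllm import LLM

@ray.remote(resources={"vllm_driver": 0.1})  # claims no GPUs for the actor itself
class DriverActor:
    def __init__(self):
        # vLLM asks Ray for its own GPU workers, one per tensor-parallel rank.
        self.llm = LLM(model="meta-llama/Llama-2-13b-hf", tensor_parallel_size=2)

    def generate(self, prompt: str) -> str:
        return self.llm.generate([prompt])[0].outputs[0].text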

ernestol0817 commented 8 months ago

I am also running into the same issue on Red Hat OpenShift.

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
2

llm = VLLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    trust_remote_code=True,
    max_new_tokens=50,
    temperature=0.1,
    tensor_parallel_size=2,
)

Startup hangs here:

INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265

ernestol0817 commented 8 months ago

I found issue 31897 over on the Ray repo that looks similar: https://github.com/ray-project/ray/issues/31897

ernestol0817 commented 8 months ago

I wanted to update this thread as I've found a resolution to this issue, and it might be good to include it in the vLLM documentation. I'm running on a very large OpenShift cluster with a high number of CPUs on the nodes, and after digging really deep into Ray I found that the issue is not with vLLM but rather with how Ray works; it simply needed two things:

1) I modified .../site-packages/vllm/engine/ray_utils.py. Look for line 83:

    ray.init(address=ray_address, ignore_reinit_error=True)

and modify it to:

    ray.init(address=ray_address, ignore_reinit_error=True, num_gpus=2, num_cpus=2)

2) Pay very special attention to your environment, including the Python version and installed libraries. Here is what I'm running now, and it's working:

Package Version


adal 1.2.7 aiohttp 3.8.6 aiohttp-cors 0.7.0 aioprometheus 23.12.0 aiorwlock 1.3.0 aiosignal 1.3.1 annotated-types 0.6.0 anyio 3.7.1 applicationinsights 0.11.10 archspec 0.2.1 argcomplete 1.12.3 async-timeout 4.0.3 attrs 21.4.0 azure-cli-core 2.40.0 azure-cli-telemetry 1.0.8 azure-common 1.1.28 azure-core 1.29.6 azure-identity 1.10.0 azure-mgmt-compute 23.1.0 azure-mgmt-core 1.4.0 azure-mgmt-network 19.0.0 azure-mgmt-resource 20.0.0 backoff 1.10.0 bcrypt 4.1.2 blessed 1.20.0 boltons 23.0.0 boto3 1.26.76 botocore 1.29.165 Brotli 1.0.9 cachetools 5.3.2 certifi 2023.11.17 cffi 1.16.0 charset-normalizer 3.3.2 click 8.1.7 cloudpickle 2.2.0 colorful 0.5.5 commonmark 0.9.1 conda 23.11.0 conda-content-trust 0.2.0 conda-libmamba-solver 23.12.0 conda-package-handling 2.2.0 conda_package_streaming 0.9.0 cryptography 38.0.1 Cython 0.29.32 dataclasses-json 0.6.3 distlib 0.3.7 distro 1.8.0 dm-tree 0.1.8 exceptiongroup 1.2.0 Farama-Notifications 0.0.4 fastapi 0.104.0 filelock 3.13.1 flatbuffers 23.5.26 frozenlist 1.4.0 fsspec 2023.5.0 google-api-core 2.14.0 google-api-python-client 1.7.8 google-auth 2.23.4 google-auth-httplib2 0.2.0 google-oauth 1.0.1 googleapis-common-protos 1.61.0 gptcache 0.1.43 gpustat 1.1.1 greenlet 3.0.3 grpcio 1.59.3 gymnasium 0.28.1 h11 0.14.0 httpcore 1.0.2 httplib2 0.22.0 httptools 0.6.1 httpx 0.26.0 huggingface-hub 0.20.3 humanfriendly 10.0 idna 3.6 imageio 2.31.1 isodate 0.6.1 jax-jumpy 1.0.0 Jinja2 3.1.3 jmespath 1.0.1 jsonpatch 1.33 jsonpointer 2.1 jsonschema 4.17.3 knack 0.10.1 langchain 0.1.4 langchain-community 0.0.16 langchain-core 0.1.16 langsmith 0.0.83 lazy_loader 0.3 libmambapy 1.5.3 lz4 4.3.2 MarkupSafe 2.1.4 marshmallow 3.20.2 menuinst 2.0.1 mpmath 1.3.0 msal 1.18.0b1 msal-extensions 1.0.0 msgpack 1.0.7 msrest 0.7.1 msrestazure 0.6.4 multidict 6.0.4 mypy-extensions 1.0.0 networkx 3.1 ninja 1.11.1.1 numpy 1.26.3 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.535.133 nvidia-nccl-cu12 2.18.1 nvidia-nvjitlink-cu12 12.3.101 nvidia-nvtx-cu12 12.1.105 oauthlib 3.2.2 openai 1.10.0 opencensus 0.11.3 opencensus-context 0.1.3 opentelemetry-api 1.1.0 opentelemetry-exporter-otlp 1.1.0 opentelemetry-exporter-otlp-proto-grpc 1.1.0 opentelemetry-proto 1.1.0 opentelemetry-sdk 1.1.0 opentelemetry-semantic-conventions 0.20b0 orjson 3.9.12 packaging 23.2 pandas 1.5.3 paramiko 2.12.0 Pillow 9.2.0 pip 23.3.1 pkginfo 1.9.6 platformdirs 3.11.0 pluggy 1.0.0 portalocker 2.8.2 prometheus-client 0.19.0 protobuf 3.19.6 psutil 5.9.6 py-spy 0.3.14 pyarrow 12.0.1 pyasn1 0.5.1 pyasn1-modules 0.3.0 pycosat 0.6.6 pycparser 2.21 pydantic 1.10.13 pydantic_core 2.14.1 Pygments 2.13.0 PyJWT 2.8.0 PyNaCl 1.5.0 pyOpenSSL 22.1.0 pyparsing 3.1.1 pyrsistent 0.20.0 PySocks 1.7.1 python-dateutil 2.8.2 python-dotenv 1.0.0 pytz 2022.7.1 PyWavelets 1.4.1 PyYAML 6.0.1 quantile-python 1.1 ray 2.9.1 ray-cpp 2.9.1 redis 3.5.3 regex 2023.12.25 requests 2.31.0 requests-oauthlib 1.3.1 rich 12.6.0 rsa 4.7.2 ruamel.yaml 0.17.21 ruamel.yaml.clib 0.2.6 s3transfer 0.6.2 safetensors 0.4.2 scikit-image 0.21.0 scipy 1.10.1 sentencepiece 0.1.99 setuptools 68.2.2 six 1.16.0 smart-open 6.2.0 sniffio 1.3.0 SQLAlchemy 2.0.25 starlette 0.27.0 sympy 1.12 tabulate 0.9.0 tenacity 8.2.3 tensorboardX 2.6 tifffile 2023.7.10 tokenizers 0.15.1 torch 2.1.2 tqdm 4.65.0 transformers 4.37.1 triton 2.1.0 
truststore 0.8.0 typer 0.9.0 typing_extensions 4.8.0 typing-inspect 0.9.0 uritemplate 3.0.1 urllib3 1.26.18 uvicorn 0.22.0 uvloop 0.19.0 virtualenv 20.21.0 vllm 0.2.7 watchfiles 0.19.0 wcwidth 0.2.12 websockets 11.0.3 wheel 0.41.2 xformers 0.0.23.post1 yarl 1.9.3 zstandard 0.19.0

DN-Dev00 commented 8 months ago

same issue.

Jeffwan commented 6 months ago

I highly suggest you use KubeRay: launch a Ray cluster and submit the vLLM workload to it. That's the easiest way I've found, and KubeRay will reduce your chances of running into cluster issues.
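
A hedged, minimal sketch of the KubeRay route (assumes the KubeRay operator is already installed; the image tags, resource sizes, and group names are illustrative, not a tested configuration):

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: vllm-raycluster
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray-ml:2.9.1-gpu   # illustrative image tag
            resources:
              limits:
                cpu: "4"
                memory: 16Gi
  workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 1
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray-ml:2.9.1-gpu   # needs vLLM installed as well
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  cpu: "16"
                  memory: 64Gi

The vLLM workload can then be submitted against the head node's job endpoint, for example with ray job submit --address http://<head-service>:8265 -- python -m vllm.entrypoints.openai.api_server --model <model> --tensor-parallel-size 4 (again illustrative).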

RomanKoshkin commented 5 months ago

I highly suggest you use KubeRay: launch a Ray cluster and submit the vLLM workload to it. That's the easiest way I've found, and KubeRay will reduce your chances of running into cluster issues.

How? Can you post a minimal example, please?

refill-dn commented 5 months ago

It was a problem with a bug in the NVIDIA CUDA 545 driver. I upgraded to the 550-beta driver (currently 550) and it was solved. (I am DN-Dev00.)

nelsonspbr commented 4 months ago

I wanted to update this thread as I've found a resolution to this issue, and it might be good to include it in the vLLM documentation. I'm running on a very large OpenShift cluster with a high number of CPUs on the nodes, and after digging really deep into Ray I found that the issue is not with vLLM but rather with how Ray works; it simply needed two things:

  1. I modified .../site-packages/vllm/engine/ray_utils.py. Look for line 83:

    ray.init(address=ray_address, ignore_reinit_error=True)

and modify it to:

    ray.init(address=ray_address, ignore_reinit_error=True, num_gpus=2, num_cpus=2)

This actually solved my problem: running vLLM with TP via Ray within a container provisioned via OpenShift. I can share more details if needed. Thanks @ernestol0817!

youkaichao commented 4 months ago

Can you please reply to https://github.com/vllm-project/vllm/issues/4462 and https://github.com/vllm-project/vllm/issues/5360? Some OpenShift users are struggling with this there.

rcarrata commented 3 months ago

@nelsonspbr can you please post your example of vLLM with TP via Ray within a container provisioned via OpenShift? I'm really interested! Also, does this create a Ray cluster with the vLLM image in one of the Ray workers in the workerGroupSpecs?

jbohnslav commented 3 days ago

Would appreciate a working example. I'm having difficulties running more than one tensor parallel Ray Serve application. I suspect it has something to do with vLLM initializing Ray / altering placement groups within each application.
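
A hedged, minimal sketch of wrapping vLLM in a Ray Serve deployment (the model name and settings are illustrative; it does not resolve the placement-group interaction described above, where each application's engine requests its own Ray workers):

from ray import serve
from vllm import LLM, SamplingParams

@serve.deployment
class VLLMDeployment:
    def __init__(self):
        # The engine itself asks Ray for one GPU worker per tensor-parallel rank.
        self.llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
        self.params = SamplingParams(max_tokens=64)

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        return self.llm.generate([prompt], self.params)[0].outputs[0].text

app = VLLMDeployment.bind()
# serve.run(app)  # then POST {"prompt": "..."} to http://127.0.0.1:8000/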