
[Usage]: the docker image v0.4.3 cannot work #5283

Closed BUJIDAOVS closed 2 months ago

BUJIDAOVS commented 3 months ago

Your current environment

vllm-openai_1 | (RayWorkerWrapper pid=3487) ERROR 06-05 16:30:08 worker_base.py:148] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW

How would you like to use vllm

Everything works fine with the v0.3.3 Docker image. What should I do to get v0.4.3 working?

mgoin commented 3 months ago

@BUJIDAOVS can you please share the commands and hardware used to trigger this error?

jefffortune commented 3 months ago

@mgoin Good morning. I can also confirm that image v0.4.3 is not working; v0.4.2 works as expected.

nvidia-smi

Thu Jun  6 08:59:09 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:41:00.0 Off |                  Off |
|  0%   33C    P8              30W / 480W |    191MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  | 00000000:61:00.0 Off |                  Off |
|  0%   32C    P8              34W / 480W |     11MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1764      G   /usr/lib/xorg/Xorg                          167MiB |
|    0   N/A  N/A      1842      G   /usr/bin/gnome-shell                         14MiB |
|    1   N/A  N/A      1764      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
docker run --gpus all \
    -v ./data:/Data \
    -p 8000:8000 \
    --ipc=host \
    --env "HUGGING_FACE_HUB_TOKEN=hf_token" \
    vllm/vllm-openai:v0.4.3 \
    --model /Data/Meta-Llama-3-8B-Instruct-32k-instruct-quantized-4bit-128qg-AWQ \
    --max-model-len 32000 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization .9 \
    --api-key sk-1234567890abcdef \
    --served-model-name model_name \
    --quantization awq \
    --dtype half
WARNING 06-06 14:09:52 config.py:213] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-06-06 14:09:54,151 INFO worker.py:1749 -- Started a local Ray instance.
INFO 06-06 14:09:54 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/Data/Meta-Llama-3-8B-Instruct-32k-instruct-quantized-4bit-128qg-AWQ', speculative_config=None, tokenizer='/Data/Meta-Llama-3-8B-Instruct-32k-instruct-quantized-4bit-128qg-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=model_name)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
ERROR 06-06 14:09:57 worker_base.py:148] Error executing method init_worker. This might cause deadlock in distributed execution.
ERROR 06-06 14:09:57 worker_base.py:148] Traceback (most recent call last):
ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 140, in execute_method
ERROR 06-06 14:09:57 worker_base.py:148]     return executor(*args, **kwargs)
ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 134, in init_worker
ERROR 06-06 14:09:57 worker_base.py:148]     self.worker = worker_class(*args, **kwargs)
ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 74, in __init__
ERROR 06-06 14:09:57 worker_base.py:148]     self.model_runner = ModelRunnerClass(
ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 115, in __init__
ERROR 06-06 14:09:57 worker_base.py:148]     self.attn_backend = get_attn_backend(
ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 43, in get_attn_backend
ERROR 06-06 14:09:57 worker_base.py:148]     backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 118, in which_attn_to_use
ERROR 06-06 14:09:57 worker_base.py:148]     if torch.cuda.get_device_capability()[0] < 8:
ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
ERROR 06-06 14:09:57 worker_base.py:148]     prop = get_device_properties(device)
ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
ERROR 06-06 14:09:57 worker_base.py:148]     _lazy_init()  # will define _get_device_properties
ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 293, in _lazy_init
ERROR 06-06 14:09:57 worker_base.py:148]     torch._C._cuda_init()
ERROR 06-06 14:09:57 worker_base.py:148] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 186, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 340, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 222, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 317, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 40, in _init_executor
    self._init_workers_ray(placement_group)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 169, in _init_workers_ray
    self._run_workers("init_worker", all_kwargs=init_worker_all_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
    driver_worker_output = self.driver_worker.execute_method(
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 140, in execute_method
    return executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 134, in init_worker
    self.worker = worker_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 74, in __init__
    self.model_runner = ModelRunnerClass(
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 115, in __init__
    self.attn_backend = get_attn_backend(
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 43, in get_attn_backend
    backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 118, in which_attn_to_use
    if torch.cuda.get_device_capability()[0] < 8:
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
    prop = get_device_properties(device)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 293, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] Error executing method init_worker. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] Traceback (most recent call last):
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 140, in execute_method
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     return method(self, *_args, **_kwargs)
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 134, in init_worker
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     self.worker = worker_class(*args, **kwargs)
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 74, in __init__
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     self.model_runner = ModelRunnerClass(
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 115, in __init__
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     self.attn_backend = get_attn_backend(
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 43, in get_attn_backend
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 118, in which_attn_to_use
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     if torch.cuda.get_device_capability()[0] < 8:
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     prop = get_device_properties(device)
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     _lazy_init()  # will define _get_device_properties
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 293, in _lazy_init
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     torch._C._cuda_init()
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
VMinB12 commented 3 months ago

I am observing the same behaviour. v0.4.2 works fine, but v0.4.3 gives the error "RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW". I am also running this on an RTX 4090.

ValeryNo commented 3 months ago

Same here. I tried ~7 configurations that work on 0.4.2 against 0.4.3; all of them failed with the same error:

...
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 293, in _lazy_init
torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
jefffortune commented 3 months ago

I was given a fix for this issue. You can find it here: https://github.com/vllm-project/vllm/issues/5510
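
For context, Error 804 ("forward compatibility was attempted on non supported HW") usually means the CUDA forward-compatibility libraries shipped inside the newer image are being loaded against a host driver/GPU combination that does not support forward compatibility (GeForce cards such as the RTX 4090 do not). A rough sketch of how to check for this and work around it, assuming the image keeps the compat libraries in the standard /usr/local/cuda/compat location of NVIDIA CUDA base images (an assumption, not confirmed in this thread):

# Check whether forward-compat libraries are present and on the loader path.
docker run --rm --gpus all --entrypoint /bin/bash vllm/vllm-openai:v0.4.3 -c '
  echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
  ls /usr/local/cuda/compat 2>/dev/null || echo "no compat libs found"
'

# If they are present, one possible workaround (unverified here; upgrading the host
# driver is the cleaner fix) is to drop them from LD_LIBRARY_PATH before launching:
docker run --gpus all -p 8000:8000 --ipc=host -v ./data:/Data \
    --entrypoint /bin/bash vllm/vllm-openai:v0.4.3 -c '
  export LD_LIBRARY_PATH=$(echo "$LD_LIBRARY_PATH" | tr ":" "\n" | grep -v compat | paste -sd:)
  exec python3 -m vllm.entrypoints.openai.api_server \
      --model /Data/Meta-Llama-3-8B-Instruct-32k-instruct-quantized-4bit-128qg-AWQ \
      --tensor-parallel-size 2 --quantization awq --dtype half
'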

ChengjieLi28 commented 2 months ago

Hi team, same error here.

docker run -it --gpus all --ipc=host --entrypoint /bin/bash  vllm/vllm-openai:v0.4.3

I entered the Docker container and ran python3:

import torch
torch.cuda.is_available()

It returns False.

I tried v0.4.3, v0.5.0, and latest; all of them failed. v0.4.2 correctly returns True.
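
For reference, the same probe can be run non-interactively against each tag, which makes it easy to compare versions:

# Same torch.cuda.is_available() check as above, run directly against an image tag.
docker run --rm --gpus all --entrypoint python3 vllm/vllm-openai:v0.4.3 \
    -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"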

Information about my host machine:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0
ValeryNo commented 2 months ago

Upgrading the host's NVIDIA drivers solves this. Version 550 works; I'm not sure what the minimum would be.
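
A quick way to see what the host is currently running before pulling the image:

# Print only the host driver version; in this thread 535.x failed with Error 804
# while 550.x worked (not necessarily the true minimum).
nvidia-smi --query-gpu=driver_version --format=csv,noheader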

ChengjieLi28 commented 2 months ago

Upgrading the host's NVIDIA drivers solves this. Version 550 works; I'm not sure what the minimum would be.

Thanks, it works. Previously, the NVIDIA driver version on my host machine was 535.