Closed: BUJIDAOVS closed this issue 2 months ago.
@BUJIDAOVS can you please share the commands and hardware used to trigger this error?
@mgoin Good morning, I can also confirm that the v0.4.3 image is not working. With v0.4.2 it works as expected.
nvidia-smi
Thu Jun 6 08:59:09 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:41:00.0 Off | Off |
| 0% 33C P8 30W / 480W | 191MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 On | 00000000:61:00.0 Off | Off |
| 0% 32C P8 34W / 480W | 11MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1764 G /usr/lib/xorg/Xorg 167MiB |
| 0 N/A N/A 1842 G /usr/bin/gnome-shell 14MiB |
| 1 N/A N/A 1764 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
docker run --gpus all \
-v ./data:/Data \
-p 8000:8000 \
--ipc=host \
--env "HUGGING_FACE_HUB_TOKEN=hf_token" \
vllm/vllm-openai:v0.4.3 \
--model /Data/Meta-Llama-3-8B-Instruct-32k-instruct-quantized-4bit-128qg-AWQ \
--max-model-len 32000 \
--tensor-parallel-size 2 \
--gpu-memory-utilization .9 \
--api-key sk-1234567890abcdef \
--served-model-name model_name \
--quantization awq \
--dtype half
WARNING 06-06 14:09:52 config.py:213] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-06-06 14:09:54,151 INFO worker.py:1749 -- Started a local Ray instance.
INFO 06-06 14:09:54 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/Data/Meta-Llama-3-8B-Instruct-32k-instruct-quantized-4bit-128qg-AWQ', speculative_config=None, tokenizer='/Data/Meta-Llama-3-8B-Instruct-32k-instruct-quantized-4bit-128qg-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=model_name)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
ERROR 06-06 14:09:57 worker_base.py:148] Error executing method init_worker. This might cause deadlock in distributed execution.
ERROR 06-06 14:09:57 worker_base.py:148] Traceback (most recent call last):
ERROR 06-06 14:09:57 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 140, in execute_method
ERROR 06-06 14:09:57 worker_base.py:148] return executor(*args, **kwargs)
ERROR 06-06 14:09:57 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 134, in init_worker
ERROR 06-06 14:09:57 worker_base.py:148] self.worker = worker_class(*args, **kwargs)
ERROR 06-06 14:09:57 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 74, in __init__
ERROR 06-06 14:09:57 worker_base.py:148] self.model_runner = ModelRunnerClass(
ERROR 06-06 14:09:57 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 115, in __init__
ERROR 06-06 14:09:57 worker_base.py:148] self.attn_backend = get_attn_backend(
ERROR 06-06 14:09:57 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 43, in get_attn_backend
ERROR 06-06 14:09:57 worker_base.py:148] backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
ERROR 06-06 14:09:57 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 118, in which_attn_to_use
ERROR 06-06 14:09:57 worker_base.py:148] if torch.cuda.get_device_capability()[0] < 8:
ERROR 06-06 14:09:57 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
ERROR 06-06 14:09:57 worker_base.py:148] prop = get_device_properties(device)
ERROR 06-06 14:09:57 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
ERROR 06-06 14:09:57 worker_base.py:148] _lazy_init() # will define _get_device_properties
ERROR 06-06 14:09:57 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 293, in _lazy_init
ERROR 06-06 14:09:57 worker_base.py:148] torch._C._cuda_init()
ERROR 06-06 14:09:57 worker_base.py:148] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 186, in <module>
engine = AsyncLLMEngine.from_engine_args(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
engine = cls(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 340, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 222, in __init__
self.model_executor = executor_class(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 317, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
self._init_executor()
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 40, in _init_executor
self._init_workers_ray(placement_group)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 169, in _init_workers_ray
self._run_workers("init_worker", all_kwargs=init_worker_all_kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
driver_worker_output = self.driver_worker.execute_method(
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 140, in execute_method
return executor(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 134, in init_worker
self.worker = worker_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 74, in __init__
self.model_runner = ModelRunnerClass(
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 115, in __init__
self.attn_backend = get_attn_backend(
File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 43, in get_attn_backend
backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 118, in which_attn_to_use
if torch.cuda.get_device_capability()[0] < 8:
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
prop = get_device_properties(device)
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
_lazy_init() # will define _get_device_properties
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 293, in _lazy_init
torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] Error executing method init_worker. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] Traceback (most recent call last):
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 140, in execute_method
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] return executor(*args, **kwargs)
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] return method(self, *_args, **_kwargs)
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 134, in init_worker
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] self.worker = worker_class(*args, **kwargs)
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 74, in __init__
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] self.model_runner = ModelRunnerClass(
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 115, in __init__
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] self.attn_backend = get_attn_backend(
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 43, in get_attn_backend
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 118, in which_attn_to_use
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] if torch.cuda.get_device_capability()[0] < 8:
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] prop = get_device_properties(device)
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] _lazy_init() # will define _get_device_properties
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 293, in _lazy_init
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] torch._C._cuda_init()
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
I am observing the same behaviour. v0.4.2 works fine, but v0.4.3 gives the error:
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
I am also running this on an RTX 4090.
Same here. I tried ~7 configurations that work on v0.4.2 against v0.4.3; all failed with the same error:
...
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 293, in _lazy_init
torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
I was given a fix for this issue. You can find it in this issue: https://github.com/vllm-project/vllm/issues/5510
Hi team, same error here.
docker run -it --gpus all --ipc=host --entrypoint /bin/bash vllm/vllm-openai:v0.4.3
I entered the Docker container and ran python3:

import torch
torch.cuda.is_available()

It returns False. I tried v0.4.3, v0.5.0, and latest; all of them fail. v0.4.2 correctly returns True.
Information about my host machine:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0
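A slightly fuller diagnostic can narrow this down without launching vLLM at all. The sketch below is only an illustration (not part of the original reports); it uses standard torch APIs and can be run inside a container started with --entrypoint /bin/bash as above:

# diagnostic_sketch.py -- compare the CUDA build of the bundled PyTorch with
# what the host driver actually exposes (illustrative only).
import torch

print("torch version:     ", torch.__version__)
print("built against CUDA:", torch.version.cuda)        # CUDA version this torch wheel targets
print("cuda available:    ", torch.cuda.is_available())

try:
    torch.cuda.init()   # forces the same lazy CUDA init that vLLM hits during worker startup
    print("device count:      ", torch.cuda.device_count())
except RuntimeError as e:
    # On an affected host this reproduces "Error 804: forward compatibility
    # was attempted on non supported HW" directly.
    print("CUDA init failed:", e)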
Upgrading the host's NVIDIA drivers solves this. The 550 series works; I'm not sure what the minimal required version would be.
> Upgrading the host's NVIDIA drivers solves this. The 550 series works; I'm not sure what the minimal required version would be.
Thanks, it works. Previously, the NVIDIA driver version on my host machine was 535.
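For anyone else hitting this, a quick check of the host driver before pulling a newer image may save time. This is a sketch under the assumptions that nvidia-smi is on PATH and that the 550-series threshold comes from the workaround reported above, not from any official vLLM requirement:

# check_driver.py -- warn if the host NVIDIA driver is older than the 550 series
# reported to work in this thread (the threshold is from this thread, not official docs).
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
driver = out.stdout.strip().splitlines()[0]
major = int(driver.split(".")[0])
print(f"host driver: {driver}")
if major < 550:
    print("driver is older than the 550 series reported to work; "
          "expect Error 804 with the v0.4.3+ images on this hardware")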
Your current environment
vllm-openai_1 | (RayWorkerWrapper pid=3487) ERROR 06-05 16:30:08 worker_base.py:148] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
How would you like to use vllm
Everything works fine with the Docker image for v0.3.3; what should I do to be able to use v0.4.3?