vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: vLLM fails to run with the latest NVIDIA driver 555.85 #5035

Open gaye746560359 opened 1 month ago

gaye746560359 commented 1 month ago

2024-05-24 23:49:38 WARNING 05-24 15:49:38 utils.py:327] Not found nvcc in /usr/local/cuda. Skip cuda version check!
2024-05-24 23:49:38 INFO 05-24 15:49:38 config.py:379] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. But it may cause slight accuracy drop without scaling factors. FP8_E5M2 (without scaling) is only supported on cuda version greater than 11.8. On ROCm (AMD GPU), FP8_E4M3 is instead supported for common inference criteria.
2024-05-24 23:49:38 WARNING 05-24 15:49:38 config.py:405] Possibly too large swap space. 4.00 GiB out of the 9.71 GiB total CPU memory is allocated for the swap space.
2024-05-24 23:49:38 INFO 05-24 15:49:38 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='shenzhi-wang/Llama3-8B-Chinese-Chat', speculative_config=None, tokenizer='shenzhi-wang/Llama3-8B-Chinese-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=gpt-3.5-turbo)
2024-05-24 23:49:39 Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-24 23:49:39 INFO 05-24 15:49:39 utils.py:660] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
2024-05-24 23:49:39 WARNING 05-24 15:49:39 utils.py:465] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
2024-05-24 23:49:39 Traceback (most recent call last):
2024-05-24 23:49:39   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2024-05-24 23:49:39     return _run_code(code, main_globals, None,
2024-05-24 23:49:39   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
2024-05-24 23:49:39     exec(code, run_globals)
2024-05-24 23:49:39   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 168, in <module>
2024-05-24 23:49:39     engine = AsyncLLMEngine.from_engine_args(
2024-05-24 23:49:39   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 366, in from_engine_args
2024-05-24 23:49:39     engine = cls(
2024-05-24 23:49:39   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 324, in __init__
2024-05-24 23:49:39     self.engine = self._init_engine(*args, **kwargs)
2024-05-24 23:49:39   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
2024-05-24 23:49:39     return engine_class(*args, **kwargs)
2024-05-24 23:49:39   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 160, in __init__
2024-05-24 23:49:39     self.model_executor = executor_class(
2024-05-24 23:49:39   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
2024-05-24 23:49:39     self._init_executor()
2024-05-24 23:49:39   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 23, in _init_executor
2024-05-24 23:49:39     self._init_non_spec_worker()
2024-05-24 23:49:39   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 67, in _init_non_spec_worker
2024-05-24 23:49:39     self.driver_worker = self._create_worker()
2024-05-24 23:49:39   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 59, in _create_worker
2024-05-24 23:49:39     wrapper.init_worker(**self._get_worker_kwargs(local_rank, rank,
2024-05-24 23:49:39   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 131, in init_worker
2024-05-24 23:49:39     self.worker = worker_class(*args, **kwargs)
2024-05-24 23:49:39   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 73, in __init__
2024-05-24 23:49:39     self.model_runner = ModelRunner(
2024-05-24 23:49:39   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 145, in __init__
2024-05-24 23:49:39     self.attn_backend = get_attn_backend(
2024-05-24 23:49:39   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 25, in get_attn_backend
2024-05-24 23:49:39     backend = _which_attn_to_use(dtype)
2024-05-24 23:49:39   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 67, in _which_attn_to_use
2024-05-24 23:49:39     if torch.cuda.get_device_capability()[0] < 8:
2024-05-24 23:49:39   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
2024-05-24 23:49:39     prop = get_device_properties(device)
2024-05-24 23:49:39   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
2024-05-24 23:49:39     _lazy_init()  # will define _get_device_properties
2024-05-24 23:49:39   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 293, in _lazy_init
2024-05-24 23:49:39     torch._C._cuda_init()
2024-05-24 23:49:39 RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found
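The failure happens inside PyTorch's CUDA initialization (torch.cuda.get_device_capability() -> torch._C._cuda_init()) rather than in vLLM itself, so it can be reproduced without vLLM. A minimal sketch, assuming the same container and driver setup:

```python
# Minimal check, independent of vLLM: the traceback above fails inside
# torch.cuda.get_device_capability(), so a plain PyTorch call should hit
# the same driver/container problem.
import torch

try:
    print("CUDA available:", torch.cuda.is_available())
    print("Device count:", torch.cuda.device_count())
    print("Capability:", torch.cuda.get_device_capability(0))
except RuntimeError as err:
    # On driver 555.85 under WSL2/Docker this surfaces as
    # "Error 500: named symbol not found" from cudaGetDeviceCount().
    print("CUDA init failed:", err)
```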

HelloCard commented 1 month ago

WSL plus a cudaGetDeviceCount error... I ran into a similar failure once. The cause was that the Sunlogin (向日葵) remote-desktop tool was installed outside WSL, and the virtual display adapter it created interfered with vLLM inside WSL.

gaye746560359 commented 1 month ago

> WSL plus a cudaGetDeviceCount error... I ran into a similar failure once. The cause was that the Sunlogin (向日葵) remote-desktop tool was installed outside WSL, and the virtual display adapter it created interfered with vLLM inside WSL.

I uninstalled Sunlogin, but it still fails with the same error.

MarioLiebisch commented 1 month ago

Stumbled over this issue while looking around to see if there have been any fixes.

I just checked the Nvidia driver feedback thread and it's actually a listed known issue:

PyTorch-CUDA Docker not compatible with CUDA 12.5/GRD 555.85 [4668302]
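For anyone checking whether they are on the affected combination (GRD 555.85 ships with CUDA 12.5), a quick way to print the installed driver version and the CUDA version PyTorch was built against is below. A sketch, assuming nvidia-smi is on PATH inside WSL or the container:

```python
# Compare the local setup with the known-issue entry (driver 555.85 / CUDA 12.5).
# Assumes nvidia-smi is available inside WSL / the container.
import subprocess
import torch

print("PyTorch built with CUDA:", torch.version.cuda)
driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
)
print("Driver version:", driver.stdout.strip() or driver.stderr.strip())
```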

cliffwoolley commented 1 month ago

Please see https://github.com/NVIDIA/nvidia-container-toolkit/issues/520.

gaye746560359 commented 1 month ago

> Please see NVIDIA/nvidia-container-toolkit#520

If this symptom is hit under Docker Desktop, it means the fix (upgrading the bundled nvidia-container-toolkit) is in progress; when is that update expected to ship?

cliffwoolley commented 1 month ago

> Please see NVIDIA/nvidia-container-toolkit#520

> If this symptom is hit under Docker Desktop, it means the fix (upgrading the bundled nvidia-container-toolkit) is in progress; when is that update expected to ship?

We are moving as quickly with it as we can, but I don't have an ETA yet, which is why I didn't list one in the other issue.

cliffwoolley commented 1 month ago

Docker Desktop 4.31 was released yesterday and includes NVIDIA Container Toolkit 1.15.0, which resolves this issue.
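After upgrading, a quick sanity check that GPU passthrough works again is to run nvidia-smi in a throwaway CUDA container before restarting vLLM. A sketch; the image tag is only an example and any recent nvidia/cuda base image should do:

```python
# Sanity check after upgrading Docker Desktop / the NVIDIA Container Toolkit:
# run nvidia-smi inside a fresh CUDA container. The image tag below is just
# an example, not a requirement.
import subprocess

subprocess.run(
    ["docker", "run", "--rm", "--gpus", "all",
     "nvidia/cuda:12.4.1-base-ubuntu22.04", "nvidia-smi"],
    check=True,
)
```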