vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

RuntimeError: CUDA error: no kernel image is available for execution on the device #629

Closed rookielyb closed 6 months ago

rookielyb commented 1 year ago

Error:

```
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

nvcc -V:

```
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
```

conda list:

```
cudatoolkit-dev  11.7.0
cudatoolkit      11.7.0
torch            2.0.1+cu117
```

nvidia-smi (A100 80G):

```
NVIDIA-SMI 470.141.03    Driver Version: 470.141.03    CUDA Version: 11.4
```

How can I solve this problem? Thanks!
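
(For anyone debugging this: "no kernel image is available" usually means the installed binaries were not compiled for this GPU's compute capability. A minimal diagnostic sketch, using only standard torch APIs, to compare the device with the architectures the installed PyTorch build actually ships kernels for:)

```bash
python - <<'EOF'
import torch
print("torch built with CUDA:", torch.version.cuda)                       # e.g. 11.7
print("device compute capability:", torch.cuda.get_device_capability(0))  # A100 -> (8, 0)
print("kernels compiled for:", torch.cuda.get_arch_list())                # should include 'sm_80' for A100
EOF
```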

rookielyb commented 1 year ago

lsb_release -a:

```
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.1 LTS
Release:        20.04
Codename:       focal
```

cat /proc/version:

```
Linux version 5.4.0-126-generic (buildd@lcy02-amd64-072) (gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)) #142-Ubuntu SMP Fri Aug 26 12:12:57 UTC 2022
```

LiuXiaoxuanPKU commented 1 year ago

Could you check whether the problem still exists after rebuilding the repo (`pip install -e .`)?
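
(A minimal sketch of what that rebuild can look like, assuming a local source checkout of vllm on the machine with the target GPU; the `rm -rf` line just clears extensions that may have been compiled for a different GPU:)

```bash
pip uninstall -y vllm
cd vllm                       # root of the vllm source checkout
rm -rf build/ *.egg-info      # drop build artifacts from a previous GPU/CUDA combination
pip install -e .              # recompile the CUDA kernels on this machine
```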

rookielyb commented 1 year ago

> Could you check whether the problem still exists after rebuilding the repo (`pip install -e .`)?

`pip install -e .`:

```
Building wheels for collected packages: vllm
  Building editable for vllm (pyproject.toml) ... done
  Created wheel for vllm: filename=vllm-0.1.2-0.editable-cp310-cp310-linux_x86_64.whl size=8465 sha256=8154890edc8a5b3b0100d83308d973ec25ecee72b02d479023542312fee2fd1d
  Stored in directory: /tmp/pip-ephem-wheel-cache-vg3isffo/wheels/33/fc/d6/f27b3ac96c14477426ab8fd6d5573e139cf29c857e206d16a3
Successfully built vllm
Installing collected packages: vllm
  Attempting uninstall: vllm
    Found existing installation: vllm 0.1.2
    Uninstalling vllm-0.1.2:
      Successfully uninstalled vllm-0.1.2
Successfully installed vllm-0.1.2
```

CUDA_VISIBLE_DEVICES=0 python offline_inference.py:

```
  File "/home/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/cg/vllm/vllm/model_executor/models/opt.py", line 102, in forward
    output, _ = self.out_proj(attn_output)
  File "/home/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/cg/vllm/vllm/model_executor/parallel_utils/tensor_parallel/layers.py", line 443, in forward
    output = output_ + self.bias if self.bias is not None else output_
RuntimeError: CUDA error: no kernel image is available for execution on the device
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

Processed prompts: 0%| | 0/4 [00:00<?, ?it/s]

still not resolved

YHPeter commented 1 year ago

I have the same issue (same nvcc/driver, both 11.7) when running:

```
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf
```

The solution quoted below works for me:

> Could you check whether the problem still exists after rebuilding the repo (`pip install -e .`)?

LuJunru commented 1 year ago

I met the same issue. I can run vLLM on a V100 with CUDA 11.3, but not on an A100 with CUDA 12.0, using exactly the same code and Docker image; only the CUDA version differs.

Gitwangpin commented 1 year ago

> I have the same issue (same nvcc/driver, both 11.7) when running `python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf`. The solution quoted below works for me:
>
> > Could you check whether the problem still exists after rebuilding the repo (`pip install -e .`)?

I use A100, A40, and T4 GPUs and integrate vLLM through LangChain; I hit this problem with CUDA 11.7, but everything works on RTX 4090 and RTX 3090. Here is the exception output:

```
INFO 08-25 10:41:25 llm_engine.py:70] Initializing an LLM engine with config: model='meta-llama/Llama-2-7b-chat-hf', tokenizer='meta-llama/Llama-2-7b-chat-hf', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 08-25 10:41:25 tokenizer.py:29] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/api_server.py", line 78, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 232, in from_engine_args
    engine = cls(engine_args.worker_use_ray,
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 55, in __init__
    self.engine = engine_class(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 104, in __init__
    self._init_cache()
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 182, in _init_cache
    num_blocks = self._run_workers(
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 470, in _run_workers
    output = executor(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/worker/worker.py", line 108, in profile_num_available_blocks
    self.model(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 253, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 220, in forward
    hidden_states = layer(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 172, in forward
    hidden_states = self.self_attn(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 132, in forward
    qkv, _ = self.qkv_proj(hidden_states)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/tensor_parallel/layers.py", line 309, in forward
    output_parallel = F.linear(input_parallel, self.weight, bias)
RuntimeError: CUDA error: no kernel image is available for execution on the device
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

Gitwangpin commented 1 year ago

Same here, with the model meta-llama/Llama-2-7b-chat-hf.

lonngxiang commented 1 year ago

How to fix this?

File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/vllm/model_executor/parallel_utils/tensor_parallel/layers.py", line 309, in forward output_parallel = F.linear(input_parallel, self.weight, bias) RuntimeError: CUDA error: no kernel image is available for execution on the device CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions. root@fa4b127ca2bf:/workspace# python

chenchenygu commented 1 year ago

Also running into this same issue. Has anyone found a fix?

Symbolk commented 1 year ago

Same issue +1! Any updates or fixes? Has anyone tried updating the driver from 470.141.03 to 515?

Symbolk commented 1 year ago

> Same issue +1! Any updates or fixes? Has anyone tried updating the driver from 470.141.03 to 515?

Solved by recompiling and reinstalling the library when deploying on the V100; it had previously been compiled on an A100.

YHPeter commented 1 year ago

I need to switch between several GPUs (A100, V100, RTX 8000) and CUDA versions; whenever the GPU or CUDA version changes, I have to reinstall vLLM from source. It's a short-term solution, but it works for now!
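
(One way to avoid recompiling the same checkout every time the hardware changes is to build a wheel once per GPU type and reuse it; a rough sketch, where the `dist/a100` and `dist/v100` directories are just example names:)

```bash
cd vllm                                  # source checkout
pip wheel . --no-deps -w dist/a100/      # run once on an A100 machine
pip wheel . --no-deps -w dist/v100/      # run once on a V100 machine

# Later, on a given machine, install the wheel that was built on that GPU type:
pip install dist/a100/vllm-*.whl
```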

shuai-dian commented 1 year ago

same issue

```
root@server-1:~/vllm# python3 -m vllm.entrypoints.api_server --model /workspace/Qwen-7B/weights/Qwen-7B-Chat/ --trust-remote-code
2023-09-25 06:17:03,382 INFO worker.py:1642 -- Started a local Ray instance.
INFO 09-25 06:17:04 llm_engine.py:72] Initializing an LLM engine with config: model='/workspace/Qwen-7B/weights/Qwen-7B-Chat/', tokenizer='/workspace/Qwen-7B/weights/Qwen-7B-Chat/', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.float16, download_dir=None, load_format=auto, tensor_parallel_size=4, seed=0)
WARNING 09-25 06:17:05 tokenizer.py:66] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/vllm/vllm/entrypoints/api_server.py", line 177, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 442, in from_engine_args
    engine = cls(engine_args.worker_use_ray,
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 250, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 279, in _init_engine
    return engine_class(*args, **kwargs)
  File "/root/vllm/vllm/engine/llm_engine.py", line 105, in __init__
    self._init_cache()
  File "/root/vllm/vllm/engine/llm_engine.py", line 185, in _init_cache
    num_blocks = self._run_workers(
  File "/root/vllm/vllm/engine/llm_engine.py", line 687, in _run_workers
    all_outputs = ray.get(all_outputs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2547, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::RayWorker.execute_method() (pid=13558, ip=192.168.1.3, actor_id=60a52a454709a59765dbb35e01000000, repr=<vllm.engine.ray_utils.RayWorker object at 0x7f3ad7c33370>)
  File "/root/vllm/vllm/engine/ray_utils.py", line 29, in execute_method
    return executor(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/vllm/vllm/worker/worker.py", line 108, in profile_num_available_blocks
    self.model(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/vllm/vllm/model_executor/models/qwen.py", line 240, in forward
    hidden_states = self.transformer(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/vllm/vllm/model_executor/models/qwen.py", line 205, in forward
    hidden_states = layer(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/vllm/vllm/model_executor/models/qwen.py", line 171, in forward
    hidden_states = residual + hidden_states
RuntimeError: CUDA error: no kernel image is available for execution on the device
```

I use a CUDA 11.8 Docker image that was built on an A800. The same image was copied to an H100 GPU machine; after restarting there, this issue appeared, and it persists after rebuilding.
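
(If the same image has to serve both GPU types, one option is to compile the vLLM kernels for both architectures at build time. This is only a sketch and assumes your vLLM version's setup.py honors the TORCH_CUDA_ARCH_LIST environment variable; older releases compiled only for the GPUs visible during the build, so check setup.py first:)

```bash
# A800 is compute capability 8.0, H100 is 9.0; build kernels for both inside the image.
export TORCH_CUDA_ARCH_LIST="8.0 9.0"
pip install -e .
```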

schnurro-bits commented 1 year ago

I've been having the same issue for the last week.

GPU: Titan Xp (driver reports CUDA 12.0); `nvcc -V` == 11.8; `torch.version.cuda` == 11.8

I've tried reinstalling PyTorch, which did not resolve the problem.

Fr4nk1inCs commented 1 year ago

> I've been having the same issue for the last week.
>
> GPU: Titan Xp (driver reports CUDA 12.0); `nvcc -V` == 11.8; `torch.version.cuda` == 11.8
>
> I've tried reinstalling PyTorch, which did not resolve the problem.

@schnurromafia I don't think the Titan Xp is supported, since its compute capability is 6.1. One of vLLM's requirements is a GPU with compute capability 7.0 or higher.
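
(A quick way to check the compute capability of the GPU you are on, as a sketch using standard torch APIs:)

```bash
python -c "import torch; print(torch.cuda.get_device_capability(0))"
# Titan Xp prints (6, 1); vLLM's stated requirement here is 7.0 or higher.
```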

schnurro-bits commented 1 year ago

@Fr4nk1inCs thanks for the message!

There is no technical limitation to running vLLM with compute capability < 7.0 (see https://github.com/vllm-project/vllm/issues/963#issuecomment-1714100911), apart from not being able to load some models like Falcon. The workaround is to build from source and comment out a couple of lines: https://github.com/vllm-project/vllm/issues/463#issuecomment-1636070685

I had been able to run vLLM without this issue for weeks, and reverting to old commits does not resolve it, which probably means vLLM itself is not responsible for this error. Just curious whether anyone else has had this happen to them...

Fr4nk1inCs commented 1 year ago

@schnurromafia Thanks for the message! I was also trying to run vLLM on Pascal GPUs. I'll build it from source and see if it works.

ChChwang commented 10 months ago

Is PyTorch 1.10.1+cu111 OK?

pseudotensor commented 6 months ago

@hmellor What resolved this issue?

hmellor commented 6 months ago

> Solved by recompiling and reinstalling the library when deploying on the V100; it had previously been compiled on an A100.

I was going through stale issues (no activity in over 3 months) and this one looked to be resolved, based on comments such as the one quoted above.

Since vLLM changes so fast, and nobody has encountered this issue in over 3 months, it seemed reasonable to close it. If somebody encounters it again, they can open a new issue using the new issue templates, which make bug reports more actionable.

@pseudotensor are you currently experiencing this issue?

cccx3 commented 6 months ago

I am also experiencing this using both T4 and V100 GPUs on Colab.

hmellor commented 6 months ago

@cccx3 in that case, could you please open a new issue using the new templates so we can better understand the cause of your issue?

chan-98 commented 6 months ago

> I am also experiencing this using both T4 and V100 GPUs on Colab.

@cccx3 Could you tell me how you resolved it, if you did? I'm also trying to run it on Colab.

kingljl commented 6 months ago

> @cccx3 in that case, could you please open a new issue using the new templates so we can better understand the cause of your issue?

I also encountered the same problem

joebreaker commented 5 months ago

```
CUDA kernel failed : no kernel image is available for execution on the device
void prescan_small(int *, int *, int, int, CUstream_st *) at L:126 in C:\Users\reall\Softwares\Miniconda3\envs\Wonder3D_Projects\torchmcubes\cxx\pscan.cu
```

Having this when working with 3D in ComfyUI.

wujohns commented 4 months ago

same error

DarkLight1337 commented 4 months ago

> same error

Please open a new issue and provide your own error trace. This one is very old and might not have the same cause.

Tuanshu commented 4 months ago

Same error: it works on a V100 but fails on a P100.

ZiruiYan commented 4 months ago

> Same issue +1! Any updates or fixes? Has anyone tried updating the driver from 470.141.03 to 515?
>
> Solved by recompiling and reinstalling the library when deploying on the V100; it had previously been compiled on an A100.

I think this is the current solution to this error. Recompiling and reinstalling vLLM works for me.