vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: CUDA error: invalid argument #3743

Open qingjiaozyn opened 4 months ago

qingjiaozyn commented 4 months ago

Your current environment

On an A800 machine with 2 GPUs (80 GB of memory each), I am starting two qwen-14B models, one model per GPU. The first model starts normally, but the error below occurs when starting the second one. The vLLM version is 0.3.3.

🐛 Describe the bug

WARNING 03-29 18:28:18 tokenizer.py:64] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 03-29 18:28:50 llm_engine.py:357] # GPU blocks: 3531, # CPU blocks: 327
2024-03-29 18:28:51,319 xinference.core.worker 75 ERROR Failed to load model merge_qwen_ccb-1-0
Traceback (most recent call last):
  File "/app/xinference/xinference/core/worker.py", line 569, in launch_builtin_model
    await model_ref.load()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 524, in xoscar.core._BaseActor.__on_receive__
    result = func(*args, **kwargs)
  File "/app/xinference/xinference/core/model.py", line 239, in load
    self._model.load()
  File "/app/xinference/xinference/model/llm/vllm/core.py", line 147, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 628, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 321, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 369, in _init_engine
    return engine_class(*args, **kwargs)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 131, in __init__
    self._init_cache()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 377, in _init_cache
    self._run_workers("init_cache_engine", cache_config=self.cache_config)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1041, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 150, in init_cache_engine
    self.cache_engine = CacheEngine(self.cache_config, self.model_config,
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 52, in __init__
    self.cpu_cache = self.allocate_cpu_cache()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 106, in allocate_cpu_cache
    key_blocks = torch.empty(
RuntimeError: [address=0.0.0.0:43266, pid=897] CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

2024-03-29 18:28:51,875 xinference.api.restful_api 8 ERROR [address=0.0.0.0:43266, pid=897] CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
  File "/app/xinference/xinference/api/restful_api.py", line 793, in launch_model
    model_uid = await (
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/app/xinference/xinference/core/supervisor.py", line 803, in launch_builtin_model
    await _launch_model()
  File "/app/xinference/xinference/core/supervisor.py", line 767, in _launch_model
    await _launch_one_model(rep_model_uid)
  File "/app/xinference/xinference/core/supervisor.py", line 748, in _launch_one_model
    await worker_ref.launch_builtin_model(
  File "xoscar/core.pyx", line 284, in pyx_actor_method_wrapper
    async with lock:
  File "xoscar/core.pyx", line 287, in xoscar.core.pyx_actor_method_wrapper
    result = await result
  File "/app/xinference/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
  File "/app/xinference/xinference/core/worker.py", line 569, in launch_builtin_model
    await model_ref.load()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 524, in xoscar.core._BaseActor.__on_receive__
    result = func(*args, **kwargs)
  File "/app/xinference/xinference/core/model.py", line 239, in load
    self._model.load()
  File "/app/xinference/xinference/model/llm/vllm/core.py", line 147, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 628, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 321, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 369, in _init_engine
    return engine_class(*args, **kwargs)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 131, in __init__
    self._init_cache()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 377, in _init_cache
    self._run_workers("init_cache_engine", cache_config=self.cache_config)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1041, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 150, in init_cache_engine
    self.cache_engine = CacheEngine(self.cache_config, self.model_config,
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 52, in __init__
    self.cpu_cache = self.allocate_cpu_cache()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 106, in allocate_cpu_cache
    key_blocks = torch.empty(
RuntimeError: [address=0.0.0.0:43266, pid=897] CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

self._lock None, inspect.iscoroutinefunction(fn) True, inspect.isgenerator(ret) False
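For context on where the error surfaces: the last frame of the traceback is the CPU KV-cache allocation in allocate_cpu_cache, which requests pinned host memory. Pinned allocations go through the CUDA driver even though the tensor lives on the CPU, so an inconsistent CUDA context in the process can show up there as "CUDA error: invalid argument". A minimal sketch of that kind of call (block count, shape, and dtype are illustrative assumptions, not vLLM's actual values):

```python
import torch

# Illustrative values only; vLLM derives the real block shape from the model config.
num_cpu_blocks = 327            # matches the "# CPU blocks: 327" log line above
key_block_shape = (40, 16, 128)

# Pinned (page-locked) host memory is allocated via the CUDA driver, so this
# CPU-side call can still raise "CUDA error: invalid argument" if the CUDA
# context or CUDA_VISIBLE_DEVICES is misconfigured in this process.
key_blocks = torch.empty(
    size=(num_cpu_blocks, *key_block_shape),
    dtype=torch.float16,
    pin_memory=True,
)
print(key_blocks.is_pinned())
```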

youkaichao commented 4 months ago

Usually this is because some code sets CUDA_VISIBLE_DEVICES incorrectly. I would suggest editing /opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/cache_engine.py around line 106 and printing torch.cuda.device_count() and os.environ["CUDA_VISIBLE_DEVICES"] to see what's wrong.
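A minimal sketch of the suggested diagnostics, placed just above the failing torch.empty(...) call in cache_engine.py (the exact line number may differ in your installed version):

```python
import os
import torch

# Temporary debug output before the CPU cache allocation:
print("torch.cuda.device_count():", torch.cuda.device_count())
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))
```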

In addition, it seems you are using vLLM through xinference, so this might be a problem on their side as well. Things get even more complicated because you want to start two vLLM engines, which requires careful handling of CUDA_VISIBLE_DEVICES.
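One common way to handle this (a sketch, not an xinference recipe) is to pin each engine process to its own GPU by setting CUDA_VISIBLE_DEVICES before torch or vllm initializes CUDA in that process; the model identifier here is an assumption, substitute your local path:

```python
import os

# Must be set before torch or vllm initializes CUDA in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # use "1" in the second process

from vllm import LLM, SamplingParams

# "Qwen/Qwen-14B-Chat" is an assumed model id; Qwen models need trust_remote_code.
llm = LLM(model="Qwen/Qwen-14B-Chat", trust_remote_code=True)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```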

Novak-22 commented 3 months ago

Hello, may I ask whether you solved this in the end? @qingjiaozyn