Usually this is because some code sets `CUDA_VISIBLE_DEVICES` incorrectly. I would suggest you edit `/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/cache_engine.py` (around line 106) and print `torch.cuda.device_count()` and `os.environ["CUDA_VISIBLE_DEVICES"]` to see what's wrong.
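A minimal sketch of what that temporary debug edit might look like; the surrounding method body is abbreviated with `...` and may differ between vLLM versions:

```python
# Hypothetical debug patch near the top of CacheEngine.allocate_cpu_cache
# in vllm/worker/cache_engine.py; the rest of the method is left unchanged.
import os
import torch

def allocate_cpu_cache(self):
    # Print what this worker process actually sees before the failing torch.empty call.
    print("torch.cuda.device_count():", torch.cuda.device_count())
    print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
    ...  # original allocation code (the torch.empty call at line 106) follows
```

If the printed device count or visible-devices string is not what you expect for the second model, the problem is in whatever set the environment variable, not in the allocation itself.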
In addition, it seems you are using `vllm` through `xinference`, so this might be their problem, too. Things get even more complicated because you want to start two `vllm` engines, which needs careful handling of `CUDA_VISIBLE_DEVICES`.
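To illustrate that last point (this is not how xinference launches models internally, just a sketch of the general rule, with a placeholder model path): `CUDA_VISIBLE_DEVICES` is read when a process first initializes CUDA, so each engine process must get its own value before `torch`/`vllm` are imported.

```python
# Illustrative sketch only: one vLLM engine per GPU, each in its own process.
import multiprocessing as mp
import os

def run_engine(gpu_id: int, model_path: str) -> None:
    # Must be set before this process creates a CUDA context.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from vllm import LLM  # imported after the env var so only one GPU is visible

    llm = LLM(model=model_path)  # this process sees its GPU as cuda:0
    print(gpu_id, llm.generate(["Hello"])[0].outputs[0].text)

if __name__ == "__main__":
    # "spawn" avoids inheriting an already-initialized CUDA context from the parent.
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=run_engine, args=(i, "/path/to/qwen-14b")) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

If both engines end up sharing one process, or the variable is overwritten after CUDA has already initialized, the second engine can see an inconsistent device list, which is the kind of situation an `invalid argument` error like the one below can come from.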
Hello, may I ask whether you managed to solve this in the end? @qingjiaozyn
Your current environment
On a machine with two A800 GPUs (80 GB of memory each), I start two Qwen-14B models, one model per GPU. The first model starts normally, but the error below occurs when starting the second model. The vLLM version is 0.3.3.
🐛 Describe the bug
```
WARNING 03-29 18:28:18 tokenizer.py:64] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 03-29 18:28:50 llm_engine.py:357] # GPU blocks: 3531, # CPU blocks: 327
2024-03-29 18:28:51,319 xinference.core.worker 75 ERROR    Failed to load model merge_qwen_ccb-1-0
Traceback (most recent call last):
  File "/app/xinference/xinference/core/worker.py", line 569, in launch_builtin_model
    await model_ref.load()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/api.py", line 384, in on_receive
    return await super().on_receive(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 524, in xoscar.core._BaseActor.__on_receive__
    result = func(*args, **kwargs)
  File "/app/xinference/xinference/core/model.py", line 239, in load
    self._model.load()
  File "/app/xinference/xinference/model/llm/vllm/core.py", line 147, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 628, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 321, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 369, in _init_engine
    return engine_class(*args, **kwargs)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 131, in __init__
    self._init_cache()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 377, in _init_cache
    self._run_workers("init_cache_engine", cache_config=self.cache_config)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1041, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 150, in init_cache_engine
    self.cache_engine = CacheEngine(self.cache_config, self.model_config,
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 52, in __init__
    self.cpu_cache = self.allocate_cpu_cache()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 106, in allocate_cpu_cache
    key_blocks = torch.empty(
RuntimeError: [address=0.0.0.0:43266, pid=897] CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-03-29 18:28:51,875 xinference.api.restful_api 8 ERROR    [address=0.0.0.0:43266, pid=897] CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
  File "/app/xinference/xinference/api/restful_api.py", line 793, in launch_model
    model_uid = await (
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/api.py", line 384, in on_receive
    return await super().on_receive(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/app/xinference/xinference/core/supervisor.py", line 803, in launch_builtin_model
    await _launch_model()
  File "/app/xinference/xinference/core/supervisor.py", line 767, in _launch_model
    await _launch_one_model(rep_model_uid)
  File "/app/xinference/xinference/core/supervisor.py", line 748, in _launch_one_model
    await worker_ref.launch_builtin_model(
  File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
    async with lock:
  File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
    result = await result
  File "/app/xinference/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
  File "/app/xinference/xinference/core/worker.py", line 569, in launch_builtin_model
    await model_ref.load()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/api.py", line 384, in on_receive
    return await super().on_receive(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 524, in xoscar.core._BaseActor.__on_receive__
    result = func(*args, **kwargs)
  File "/app/xinference/xinference/core/model.py", line 239, in load
    self._model.load()
  File "/app/xinference/xinference/model/llm/vllm/core.py", line 147, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 628, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 321, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 369, in _init_engine
    return engine_class(*args, **kwargs)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 131, in __init__
    self._init_cache()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 377, in _init_cache
    self._run_workers("init_cache_engine", cache_config=self.cache_config)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1041, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 150, in init_cache_engine
    self.cache_engine = CacheEngine(self.cache_config, self.model_config,
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 52, in __init__
    self.cpu_cache = self.allocate_cpu_cache()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 106, in allocate_cpu_cache
    key_blocks = torch.empty(
RuntimeError: [address=0.0.0.0:43266, pid=897] CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
self._lock None, inspect.iscoroutinefunction(fn) True inspect.isgenerator(ret) False
```
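Following the hint in the error message itself, one way to get a more reliable stack trace is to make CUDA launches synchronous before the engine starts. A minimal sketch; the variable must be set before CUDA is initialized in that process:

```python
# Sketch: surface CUDA errors at the failing call instead of asynchronously.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA initializes

import torch  # imported after setting the variable

print("visible devices:", os.environ.get("CUDA_VISIBLE_DEVICES"),
      "count:", torch.cuda.device_count())
```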