wangye360 opened this issue 2 months ago

First, you should avoid the following OOM error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 494.00 MiB. GPU

It ran successfully and responded correctly, but later hit CUDA out of memory. Is there a memory leak?

This is typically caused by insufficient GPU memory.
Setting --gpu-memory-utilization to a lower value (such as 0.8) may solve the problem.
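
For illustration, a minimal sketch of that suggestion (the model path and any other flags are placeholders for your own launch command):

```bash
# Cap the fraction of GPU memory vLLM pre-allocates (mostly KV cache),
# leaving more headroom for transient allocations during decoding.
python3 -m vllm.entrypoints.openai.api_server \
    --model your-model \
    --gpu-memory-utilization 0.8
```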

I have set gpu-memory-utilization to 0.65, but the error still happens.

I mean that lowering GPU memory utilization would solve the CUDA OOM problem.

Oh, yes, you're right! My error is not OOM; it just reports:
CUDA error: an illegal memory access was encountered

Could you provide more details about this error, including the environment information and error traceback?
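
For reference, one way to gather both (a sketch, assuming the collect_env.py script from the vLLM repository is available locally; CUDA_LAUNCH_BLOCKING is a standard PyTorch/CUDA debug switch that slows everything down, so use it only while reproducing):

```bash
# Environment report for the bug template.
python collect_env.py

# Re-run the server with synchronous CUDA kernel launches so that an
# "illegal memory access" is reported at the kernel that actually faulted,
# rather than at a later, unrelated allocation.
CUDA_LAUNCH_BLOCKING=1 python3 -m vllm.entrypoints.openai.api_server \
    --model model \
    --tensor-parallel-size 2    # plus the rest of your usual flags
```
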
Your current environment
The output of `python collect_env.py`
```text
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.31

Python version: 3.10.14 (main, Apr 6 2024, 18:45:05) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-167-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7763 64-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 3091.521
CPU max MHz: 2450.0000
CPU min MHz: 1500.0000
BogoMIPS: 4890.87
Virtualization: AMD-V
L1d cache: 4 MiB
L1i cache: 4 MiB
L2 cache: 64 MiB
L3 cache: 512 MiB
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-63
NUMA node2 CPU(s): 64-95
NUMA node3 CPU(s): 96-127
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca sme sev sev_es

Versions of relevant libraries:
[pip3] flashinfer==0.0.9+cu121torch2.3
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pyzmq==26.0.3
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] transformers==4.42.4
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3@
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    SYS     PXB     SYS     SYS     0-31            0               N/A
GPU1    NV12     X      SYS     PXB     SYS     SYS     0-31            0               N/A
NIC0    SYS     SYS      X      SYS     SYS     SYS
NIC1    PXB     PXB     SYS      X      SYS     SYS
NIC2    SYS     SYS     SYS     SYS      X      PIX
NIC3    SYS     SYS     SYS     SYS     PIX      X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:
  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
```

🐛 Describe the bug
The code cannot be shared for corporate-confidentiality reasons. The problem: with qwen2-72b-gptq-int4 as the main model and qwen2-7b-gptq-int8 as the draft model, speculative decoding crashes after a short run under high concurrency. The weird part is that it is random: sometimes it crashes, sometimes it doesn't.
The server launch script:

```bash
current_dir=`pwd`
cd /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils
if [ ! -f marlin_utils.py_bak ]; then
    mv marlin_utils.py marlin_utils.py_bak
fi
sed 's/GPTQ_MARLIN_MIN_THREAD_K = 128/GPTQ_MARLIN_MIN_THREAD_K = 64/g' marlin_utils.py_bak | tee marlin_utils.py > /dev/null
cd $current_dir

python3 -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8080 \
    --served-model-name Qwen2-72B \
    --model model \
    --disable-log-requests \
    --tensor-parallel-size 2 \
    --kv-cache-dtype auto \
    --gpu-memory-utilization 0.95 \
    --enable-prefix-caching \
    --use-v2-block-manager \
    --speculative-model speculative_model \
    --num-speculative-tokens 5 \
    --max-model-len 20000
```
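
Since the client code cannot be shared, the load pattern can be roughly approximated with concurrent requests against the OpenAI-compatible endpoint started above (purely illustrative; the prompt and the concurrency level of 32 are placeholders):

```bash
# Fire 32 chat-completion requests at once against the server launched above.
for i in $(seq 1 32); do
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen2-72B",
         "messages": [{"role": "user", "content": "placeholder prompt"}],
         "max_tokens": 256}' > /dev/null &
done
wait
```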
The log is:

```text
INFO: 127.0.0.1:41774 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:41818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:41840 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:41854 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:41874 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:41894 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:41814 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:41828 - "POST /v1/chat/completions HTTP/1.1" 200 OK
DEBUG 08-15 03:15:44 async_llm_engine.py:606] Waiting for new requests...
DEBUG 08-15 03:15:44 async_llm_engine.py:620] Got new requests!
INFO 08-15 03:15:51 metrics.py:396] Avg prompt throughput: 1366.2 tokens/s, Avg generation throughput: 1.5 tokens/s, Running: 3 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.2%, CPU KV cache usage: 0.0%.
ERROR 08-15 03:15:55 async_llm_engine.py:56] Engine background task failed
ERROR 08-15 03:15:55 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return_value = task.result()
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 637, in run_engine_loop
ERROR 08-15 03:15:55 async_llm_engine.py:56]     result = task.result()
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 580, in engine_step
ERROR 08-15 03:15:55 async_llm_engine.py:56]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 253, in step_async
ERROR 08-15 03:15:55 async_llm_engine.py:56]     output = await self.model_executor.execute_model_async(
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 175, in execute_model_async
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return await self._driver_execute_model_async(execute_model_req)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 210, in _driver_execute_model_async
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return await self.driver_exec_model(execute_model_req)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 08-15 03:15:55 async_llm_engine.py:56]     result = self.fn(*self.args, **self.kwargs)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return func(*args, **kwargs)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/spec_decode/spec_decode_worker.py", line 373, in execute_model
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return self._run_no_spec(execute_model_req,
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/lib/python3.10/contextlib.py", line 79, in inner
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return func(*args, **kwds)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/spec_decode/spec_decode_worker.py", line 456, in _run_no_spec
ERROR 08-15 03:15:55 async_llm_engine.py:56]     sampler_output = self.scorer_worker.execute_model(execute_model_req)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 272, in execute_model
ERROR 08-15 03:15:55 async_llm_engine.py:56]     output = self.model_runner.execute_model(
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return func(*args, **kwargs)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1314, in execute_model
ERROR 08-15 03:15:55 async_llm_engine.py:56]     hidden_or_intermediate_states = model_executable(
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return self._call_impl(*args, **kwargs)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return forward_call(*args, **kwargs)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 336, in forward
ERROR 08-15 03:15:55 async_llm_engine.py:56]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return self._call_impl(*args, **kwargs)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return forward_call(*args, **kwargs)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 257, in forward
ERROR 08-15 03:15:55 async_llm_engine.py:56]     hidden_states, residual = layer(
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return self._call_impl(*args, **kwargs)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return forward_call(*args, **kwargs)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 219, in forward
ERROR 08-15 03:15:55 async_llm_engine.py:56]     hidden_states = self.mlp(hidden_states)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return self._call_impl(*args, **kwargs)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return forward_call(*args, **kwargs)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 80, in forward
ERROR 08-15 03:15:55 async_llm_engine.py:56]     x, _ = self.down_proj(x)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return self._call_impl(*args, **kwargs)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return forward_call(*args, **kwargs)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 783, in forward
ERROR 08-15 03:15:55 async_llm_engine.py:56]     output_parallel = self.quant_method.apply(self,
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 313, in apply
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return apply_gptq_marlin_linear(
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 251, in apply_gptq_marlin_linear
ERROR 08-15 03:15:55 async_llm_engine.py:56]     output = ops.gptq_marlin_gemm(reshaped_x,
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 34, in wrapper
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return fn(*args, **kwargs)
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 291, in gptq_marlin_gemm
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return torch.ops._C.gptq_marlin_gemm(a, b_q_weight, b_scales, b_zeros,
ERROR 08-15 03:15:55 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 854, in __call__
ERROR 08-15 03:15:55 async_llm_engine.py:56]     return self._op(*args, **(kwargs or {}))
ERROR 08-15 03:15:55 async_llm_engine.py:56] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 494.00 MiB. GPU
Exception in callback functools.partial(<function _log_task_completion at 0x7f7df55ad2d0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f7ddce9c190>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7f7df55ad2d0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f7ddce9c190>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 637, in run_engine_loop
    result = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 580, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 253, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 175, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 210, in _driver_execute_model_async
    return await self.driver_exec_model(execute_model_req)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/spec_decode/spec_decode_worker.py", line 373, in execute_model
    return self._run_no_spec(execute_model_req,
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.10/dist-packages/vllm/spec_decode/spec_decode_worker.py", line 456, in _run_no_spec
    sampler_output = self.scorer_worker.execute_model(execute_model_req)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 272, in execute_model
    output = self.model_runner.execute_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1314, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 336, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 257, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 219, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 80, in forward
    x, _ = self.down_proj(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 783, in forward
    output_parallel = self.quant_method.apply(self,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 313, in apply
    return apply_gptq_marlin_linear(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 251, in apply_gptq_marlin_linear
    output = ops.gptq_marlin_gemm(reshaped_x,
  File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 34, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 291, in gptq_marlin_gemm
    return torch.ops._C.gptq_marlin_gemm(a, b_q_weight, b_scales, b_zeros,
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 854, in __call__
    return self._op(*args, **(kwargs or {}))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 494.00 MiB. GPU

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO: 127.0.0.1:41774 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
```