IEI-mjx opened this issue 1 month ago
@comaniac FYI
Will take a look when I get time. Meanwhile, I have 2 questions:
- Could you make this script self-contained (i.e., not requiring other files such as a prompt input)?
- Out of curiosity, Mixtral-8x7B should only support context lengths up to 32k. You set it to 512k (and I assume the result quality is still acceptable to you) because of RoPE?
For Q1: Sure, you can substitute any other prompt input for this one (the prompt length is relative to max_model_len). For Q2: For now we don't care about quality; we will use something like RoPE to improve it. I just want to test whether FP8-KV-cache models can be extended to a longer context length (maybe twice the length of the non-quantized models?).
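The "can an fp8 KV cache roughly double the usable context" intuition can be sanity-checked with back-of-envelope math. A sketch below; the 160 GiB budget is a made-up illustrative figure (not a measurement from this issue), while the Mixtral shape values come from its HF config:

```python
# Back-of-envelope KV-cache capacity, NOT vLLM's exact allocator math.
# Mixtral-8x7B shapes from its HF config: 32 layers, 8 KV heads, head_dim 128.
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    # K and V are both cached, hence the factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

MIXTRAL = dict(num_layers=32, num_kv_heads=8, head_dim=128)

bf16 = kv_bytes_per_token(**MIXTRAL, dtype_bytes=2)  # 131072 B = 128 KiB/token
fp8 = kv_bytes_per_token(**MIXTRAL, dtype_bytes=1)   # 65536 B  =  64 KiB/token

budget = 160 * 2**30  # hypothetical 160 GiB of KV-cache memory across 8 GPUs
print(budget // bf16)  # 1310720 tokens fit in bf16
print(budget // fp8)   # 2621440 tokens fit in fp8 -> twice the context
```

So under this simplification, halving the KV element size does double the number of cacheable tokens, which matches the "twice as long" expectation for the same memory budget.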
By the way, I am testing v0.5.2 on this issue. I will update this issue if I get new results!
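A self-contained test script in the spirit of the request above might look like the following sketch. The model path, prompt builder, and lengths here are placeholders for illustration, not the exact script used in this issue:

```python
# Hypothetical self-contained repro sketch (model path, prompt builder, and
# lengths are placeholders -- not the exact script used in this issue).
def make_prompt(n_words: int) -> str:
    # Crude synthetic prompt whose token count scales with n_words,
    # so no external prompt file is needed.
    return " ".join(["hello"] * n_words)

def main() -> None:
    # Requires a multi-GPU machine with vllm installed.
    from vllm import LLM, SamplingParams
    llm = LLM(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder checkpoint
        tensor_parallel_size=8,
        kv_cache_dtype="fp8",
        max_model_len=512 * 1024,  # the 512K setting under test
        max_num_seqs=1,
        enforce_eager=False,
    )
    out = llm.generate([make_prompt(400_000)],
                       SamplingParams(max_tokens=16))
    print(out[0].outputs[0].text)

# main()  # uncomment on a multi-GPU host
```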
I also tested Llama3-8B-FP8-KV (generated using AutoFP8, https://github.com/neuralmagic/AutoFP8) with context length = 512K and encountered the same error. When I tested Llama3-8B or Mixtral-8x7B with context length = 512K, everything was OK, but it failed with the FP8-KV quantized versions. I guess maybe the FP8 kernel does not support long text?
Thanks for the investigation. If that's the case then the problem is actually the paged attention kernel instead of the MoE kernel.
With v0.5.2, FP8 models can't be loaded successfully.
See the traceback here:
rank0: Traceback (most recent call last):
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/longtext_Mixtral_KV.py", line 37, in <module>
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/longtext_Mixtral_KV.py", line 9, in run_Mixtral
rank0:     llm = LLM(model="/mnt/beegfs2/maojunxiong/Mixtral-8x7B-Instruct-v0.1-FP8-KV2", tensor_parallel_size=8, disable_custom_all_reduce=True, max_num_seqs=1, enforce_eager=False, kv_cache_dtype='fp8')
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/entrypoints/llm.py", line 150, in __init__
rank0:     self.llm_engine = LLMEngine.from_engine_args(
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/engine/llm_engine.py", line 421, in from_engine_args
rank0:     engine = cls(
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/engine/llm_engine.py", line 249, in __init__
rank0:     self.model_executor = executor_class(
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
rank0:     super().__init__(*args, **kwargs)
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/executor/executor_base.py", line 46, in __init__
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/executor/multiproc_gpu_executor.py", line 84, in _init_executor
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/executor/multiproc_gpu_executor.py", line 135, in _run_workers
rank0:     driver_worker_output = driver_worker_method(*args, **kwargs)
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/worker/worker.py", line 139, in load_model
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/worker/model_runner.py", line 256, in load_model
rank0:     self.model = get_model(model_config=self.model_config,
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
rank0:     return loader.load_model(model_config=model_config,
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/model_executor/model_loader/loader.py", line 270, in load_model
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/model_executor/models/mixtral.py", line 436, in load_weights
rank0:     weight_loader(param, loaded_weight)
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/model_executor/model_loader/weight_utils.py", line 439, in default_weight_loader
rank0:     assert param.size() == loaded_weight.size()
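The failing assert is the generic shape check in default_weight_loader. A toy sketch of the idea (this is NOT vLLM's actual code): loading aborts as soon as a checkpoint tensor's shape disagrees with the parameter the model definition created, which is what a checkpoint serialized in an unexpected format triggers.

```python
# Toy sketch of the shape check that fails in vLLM's default_weight_loader
# (weight_utils.py); simplified illustration, not the real implementation.
class FakeTensor:
    def __init__(self, shape):
        self.shape = tuple(shape)
        self.data = None

    def size(self):
        return self.shape

def default_weight_loader(param, loaded_weight):
    # The real loader copies data; this assert is what the traceback shows.
    assert param.size() == loaded_weight.size(), (
        f"model expects {param.size()}, "
        f"checkpoint provides {loaded_weight.size()}")
    param.data = loaded_weight

param = FakeTensor((4096, 4096))       # shape built by the model definition
default_weight_loader(param, FakeTensor((4096, 4096)))  # matching shape: fine

try:
    # A tensor serialized with an unexpected shape (for example a scalar
    # scale landing on a full weight) trips the assert.
    default_weight_loader(param, FakeTensor((1,)))
except AssertionError as e:
    print("load failed:", e)
```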
Does that work without fp8 kv-cache?
Yes. Only the fp8 kv_cache versions can't be loaded. I have tested the fp8 kv_cache versions of Llama3 and Mixtral; both failed!
Looks like a checkpoint format issue? cc @mgoin
We identified this issue today, unfortunately. I have resolved it on main in this PR: https://github.com/vllm-project/vllm/pull/6081. However, it does seem the 0.5.2 release was affected; cc @robertgshaw2-neuralmagic for visibility. I would recommend using 0.5.1 or building main from source for kv cache checkpoints at the moment.
Sorry guys, I broke this trying to get DeepSeek working. We should get a model with kv scales into the CI.
I tested the main branch (built from source); fp8 kv_scale models could be loaded successfully! However, the issue I mentioned here isn't solved yet: the same error traceback occurs when I extend the context length to 512K or longer (384K is OK) on Mixtral-8x7B-FP8-KV.
vLLM version: 0.5.3.post1
Model: Mixtral-8x7B, Mixtral-8x7B-FP8 KV_cache
Testing script: same as before
Result: For Mixtral-8x7B, when I set the context length to 1024K, vLLM gets stuck while processing the prompt. As far as I know, when the context length exceeds the supported range, vLLM should OOM or report something like "The model's max seq len (2097152) is larger than the maximum number of tokens that can be stored in KV cache (1475936)". On the vLLM v0.5.2 main version, this behaves normally.
For Mixtral-8x7B-FP8 KV_cache, inference with a 4K context length is OK, but it fails when I set it to 256K or 512K.
See traceback here:
Both worker processes (pid 15583 and 15584) raised the same error:

(VllmWorkerProcess pid=15583) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: at 114:24:
                off_v = (
                    bn[:, None] * stride_v_cache_bs +
                    cur_kv_head * stride_v_cache_h +
                    offs_d[None, :] * stride_v_cache_d +
                    (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
                k = tl.load(K_cache + off_k,
                            mask=dim_mask[:, None] &
                            ((start_n + offs_n[None, :]) < cur_batch_ctx_len),
                            other=0.0)  # [D,N]

                qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)  # [M,N]
                qk += tl.dot(q, k)
                      ^
AssertionError('Both operands must be same type. First operand (bf16) and second operand (uint8)'), Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1222, in ast_to_ttir
    generator.visit(fn.parse())
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
    ret = super().visit(node)
  File "/usr/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 303, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/usr/lib/python3.10/ast.py", line 426, in generic_visit
    self.visit(item)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
    ret = super().visit(node)
  File "/usr/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 376, in visit_FunctionDef
    self.visit_compound_statement(node.body)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 298, in visit_compound_statement
    ret_type = self.visit(stmt)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
    ret = super().visit(node)
  File "/usr/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 885, in visit_For
    self.visit_compound_statement(node.body)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 298, in visit_compound_statement
    ret_type = self.visit(stmt)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
    ret = super().visit(node)
  File "/usr/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 448, in visit_AugAssign
    self.visit(assign)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
    ret = super().visit(node)
  File "/usr/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 428, in visit_Assign
    values = self.visit(node.value)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
    ret = super().visit(node)
  File "/usr/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 477, in visit_BinOp
    rhs = self.visit(node.right)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
    ret = super().visit(node)
  File "/usr/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1027, in visit_Call
    return fn(*args, **extra_kwargs, **kws)
  File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 1018, in dot
    return semantic.dot(input, other, acc, allow_tf32, max_num_imprecise_acc, out_dtype, _builder)
  File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1207, in dot
    assert_dtypes_valid(lhs.dtype, rhs.dtype, builder.options)
  File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1190, in assert_dtypes_valid
    assert lhs_dtype == rhs_dtype, f"Both operands must be same type. First operand ({lhs_dtype}) and second operand ({rhs_dtype})"
AssertionError: Both operands must be same type. First operand (bf16) and second operand (uint8)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/vllm/vllm/worker/worker_base.py", line 65, in start_worker_execution_loop
    output = self.execute_model(execute_model_req=None)
  File "/home/vllm/vllm/worker/worker_base.py", line 272, in execute_model
    output = self.model_runner.execute_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/vllm/vllm/worker/model_runner.py", line 1314, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/vllm/vllm/model_executor/models/mixtral.py", line 374, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/vllm/vllm/model_executor/models/mixtral.py", line 296, in forward
    hidden_states, residual = layer(positions, hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/model_executor/models/mixtral.py", line 296, in forward
(VllmWorkerProcess pid=15583) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return self._call_impl(args, kwargs)
(VllmWorkerProcess pid=15583) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=15583) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return forward_call(*args, kwargs)
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] hidden_states, residual = layer(positions, hidden_states,
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=15583) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/model_executor/models/mixtral.py", line 233, in forward
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return self._call_impl(*args, *kwargs)
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return forward_call(args, kwargs)
(VllmWorkerProcess pid=15583) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] hidden_states = self.self_attn(
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/model_executor/models/mixtral.py", line 233, in forward
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] hidden_states = self.self_attn(
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=15583) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return self._call_impl(*args, kwargs)
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=15583) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return self._call_impl(*args, *kwargs)
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return forward_call(args, kwargs)
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/model_executor/models/mixtral.py", line 179, in forward
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
(VllmWorkerProcess pid=15583) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=15583) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return forward_call(*args, kwargs)
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return self._call_impl(*args, *kwargs)
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=15583) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/model_executor/models/mixtral.py", line 179, in forward
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return forward_call(args, kwargs)
(VllmWorkerProcess pid=15583) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/attention/layer.py", line 97, in forward
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return self.impl.forward(query,
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/attention/backends/xformers.py", line 598, in forward
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] out = PagedAttention.forward_prefix(
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/attention/ops/paged_attn.py", line 205, in forward_prefix
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] context_attention_fwd(
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return func(*args, kwargs)
(VllmWorkerProcess pid=15583) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=15583) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return self._call_impl(*args, *kwargs)
(VllmWorkerProcess pid=15583) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=15583) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return forward_call(args, kwargs)
(VllmWorkerProcess pid=15583) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/attention/layer.py", line 97, in forward
(VllmWorkerProcess pid=15583) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return self.impl.forward(query,
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/attention/ops/prefix_prefill.py", line 765, in context_attention_fwd
(VllmWorkerProcess pid=15583) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/attention/backends/xformers.py", line 598, in forward
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] _fwd_kernel[grid](
(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 167, in
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
rank0:     ret = super().visit(node)
rank0:   File "/usr/lib/python3.10/ast.py", line 418, in visit
rank0:     return visitor(node)
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 303, in visit_Module
rank0:     ast.NodeVisitor.generic_visit(self, node)
rank0:   File "/usr/lib/python3.10/ast.py", line 426, in generic_visit
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
rank0:     ret = super().visit(node)
rank0:   File "/usr/lib/python3.10/ast.py", line 418, in visit
rank0:     return visitor(node)
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 376, in visit_FunctionDef
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 298, in visit_compound_statement
rank0:     ret_type = self.visit(stmt)
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
rank0:     ret = super().visit(node)
rank0:   File "/usr/lib/python3.10/ast.py", line 418, in visit
rank0:     return visitor(node)
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 885, in visit_For
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 298, in visit_compound_statement
rank0:     ret_type = self.visit(stmt)
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
rank0:     ret = super().visit(node)
rank0:   File "/usr/lib/python3.10/ast.py", line 418, in visit
rank0:     return visitor(node)
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 448, in visit_AugAssign
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
rank0:     ret = super().visit(node)
rank0:   File "/usr/lib/python3.10/ast.py", line 418, in visit
rank0:     return visitor(node)
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 428, in visit_Assign
rank0:     values = self.visit(node.value)
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
rank0:     ret = super().visit(node)
rank0:   File "/usr/lib/python3.10/ast.py", line 418, in visit
rank0:     return visitor(node)
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 477, in visit_BinOp
rank0:     rhs = self.visit(node.right)
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
rank0:     ret = super().visit(node)
rank0:   File "/usr/lib/python3.10/ast.py", line 418, in visit
rank0:     return visitor(node)
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1027, in visit_Call
rank0:     return fn(*args, **extra_kwargs, **kws)
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 27, in wrapper
rank0:     return fn(*args, **kwargs)
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 1018, in dot
rank0:     return semantic.dot(input, other, acc, allow_tf32, max_num_imprecise_acc, out_dtype, _builder)
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1207, in dot
rank0:     assert_dtypes_valid(lhs.dtype, rhs.dtype, builder.options)
rank0:   File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1190, in assert_dtypes_valid
rank0:     assert lhs_dtype == rhs_dtype, f"Both operands must be same type. First operand ({lhs_dtype}) and second operand ({rhs_dtype})"
rank0: AssertionError: Both operands must be same type. First operand (bf16) and second operand (uint8)
rank0: The above exception was the direct cause of the following exception:
rank0: Traceback (most recent call last):
rank0: File "/home/vllm/longtext_Mixtral_KV.py", line 37, in
rank0:   File "/home/vllm/longtext_Mixtral_KV.py", line 22, in run_Mixtral
rank0:     outputs = llm.generate(prompts, sampling_params)
rank0:   File "/home/vllm/vllm/utils.py", line 838, in inner
rank0:     return fn(*args, **kwargs)
rank0:   File "/home/vllm/vllm/entrypoints/llm.py", line 316, in generate
rank0:     outputs = self._run_engine(use_tqdm=use_tqdm)
rank0:   File "/home/vllm/vllm/entrypoints/llm.py", line 569, in _run_engine
rank0:     step_outputs = self.llm_engine.step()
rank0:   File "/home/vllm/vllm/engine/llm_engine.py", line 911, in step
rank0:     output = self.model_executor.execute_model(
rank0:   File "/home/vllm/vllm/executor/distributed_gpu_executor.py", line 76, in execute_model
rank0:     driver_outputs = self._driver_execute_model(execute_model_req)
rank0:   File "/home/vllm/vllm/executor/multiproc_gpu_executor.py", line 141, in _driver_execute_model
rank0:     return self.driver_worker.execute_model(execute_model_req)
rank0:   File "/home/vllm/vllm/worker/worker_base.py", line 272, in execute_model
rank0:     output = self.model_runner.execute_model(
rank0:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0:     return func(*args, **kwargs)
rank0:   File "/home/vllm/vllm/worker/model_runner.py", line 1314, in execute_model
rank0:     hidden_or_intermediate_states = model_executable(
rank0:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/home/vllm/vllm/model_executor/models/mixtral.py", line 374, in forward
rank0:     hidden_states = self.model(input_ids, positions, kv_caches,
rank0:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/home/vllm/vllm/model_executor/models/mixtral.py", line 296, in forward
rank0:     hidden_states, residual = layer(positions, hidden_states,
rank0:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/home/vllm/vllm/model_executor/models/mixtral.py", line 233, in forward
rank0:     hidden_states = self.self_attn(
rank0:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/home/vllm/vllm/model_executor/models/mixtral.py", line 179, in forward
rank0:     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
rank0:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/home/vllm/vllm/attention/layer.py", line 97, in forward
rank0:     return self.impl.forward(query,
rank0:   File "/home/vllm/vllm/attention/backends/xformers.py", line 598, in forward
rank0:     out = PagedAttention.forward_prefix(
rank0:   File "/home/vllm/vllm/attention/ops/paged_attn.py", line 205, in forward_prefix
rank0:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0:     return func(*args, **kwargs)
rank0:   File "/home/vllm/vllm/attention/ops/prefix_prefill.py", line 765, in context_attention_fwd
rank0: File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 167, in
rank0:     qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)  # [M, N]
rank0:     qk += tl.dot(q, k)
rank0: AssertionError('Both operands must be same type. First operand (bf16) and second operand (uint8)')
Hello, I had the same issue with the neuralmagic/Mistral-Nemo-Instruct-2407-FP8 model, and I found out that vLLM enables chunked prefill by default when max_model_len exceeds 32k. The issue seems to come from this option. Try running without it by setting the parameter enable_chunked_prefill to False.
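For reference, a sketch of how to pass that flag when serving (the flag form is the one vLLM's own warning suggests; model name and context length are just the ones used in this thread):

```shell
# Disable chunked prefill explicitly; vLLM turns it on by default
# when max_model_len > 32k.
python -m vllm.entrypoints.openai.api_server \
    --model neuralmagic/Mistral-Nemo-Instruct-2407-FP8 \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --max-model-len 61440 \
    --enable-chunked-prefill=False
```

For the offline `LLM(...)` API, the equivalent is passing `enable_chunked_prefill=False` to the constructor.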
@florianbaud are you running on v0.5.3.post1?
We resolved a difficult bug in cutlass with this commit: https://github.com/vllm-project/vllm/pull/6852, which will be in the next release
@robertgshaw2-neuralmagic, I'm running a Docker image built from the latest commit (c8a7e93273ff4338d6f89f8a63ff16426ac240b8, Wed Jul 31 23:51:09 2024).
Can you share the error message?
Yes, the error message is:
INFO 08-01 13:49:58 api_server.py:308] vLLM API server version 0.5.3.post1
INFO 08-01 13:49:58 api_server.py:309] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, model='neuralmagic/Mistral-Nemo-Instruct-2407-FP8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='fp8', quantization_param_path=None, max_model_len=61440, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=8, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=4, max_logprobs=20, disable_log_stats=False, quantization='fp8', rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=61440, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, 
speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 08-01 13:49:58 config.py:482] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
WARNING 08-01 13:49:58 arg_utils.py:766] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 08-01 13:49:58 config.py:818] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 08-01 13:49:58 llm_engine.py:175] Initializing an LLM engine (v0.5.3.post1) with config: model='neuralmagic/Mistral-Nemo-Instruct-2407-FP8', speculative_config=None, tokenizer='neuralmagic/Mistral-Nemo-Instruct-2407-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=61440, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=neuralmagic/Mistral-Nemo-Instruct-2407-FP8, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-01 13:49:58 selector.py:161] Cannot use FlashAttention-2 backend for FP8 KV cache.
INFO 08-01 13:49:58 selector.py:54] Using XFormers backend.
INFO 08-01 13:49:59 model_runner.py:719] Starting to load model neuralmagic/Mistral-Nemo-Instruct-2407-FP8...
WARNING 08-01 13:49:59 fp8.py:43] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
INFO 08-01 13:49:59 selector.py:161] Cannot use FlashAttention-2 backend for FP8 KV cache.
INFO 08-01 13:49:59 selector.py:54] Using XFormers backend.
INFO 08-01 13:50:00 weight_utils.py:225] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:00<00:01, 1.99it/s]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:01<00:00, 1.88it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00, 2.07it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00, 2.03it/s]
WARNING 08-01 13:50:01 utils.py:563] Using KV cache scaling factor 1.0 for fp8_e4m3. This may cause accuracy issues. Please make sure k/v_scale scaling factors are available in the fp8 checkpoint.
INFO 08-01 13:50:02 model_runner.py:731] Loading model weights took 12.9013 GB
INFO 08-01 13:50:02 gpu_executor.py:102] # GPU blocks: 7683, # CPU blocks: 6553
INFO 08-01 13:50:05 model_runner.py:1018] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-01 13:50:05 model_runner.py:1022] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-01 13:50:05 model_runner.py:1219] Graph capturing finished in 1 secs.
WARNING 08-01 13:50:05 serving_embedding.py:170] embedding_mode is False. Embedding API will not work.
INFO 08-01 13:50:05 api_server.py:284] Available routes are:
INFO 08-01 13:50:05 api_server.py:289] Route: /openapi.json, Methods: HEAD, GET
INFO 08-01 13:50:05 api_server.py:289] Route: /docs, Methods: HEAD, GET
INFO 08-01 13:50:05 api_server.py:289] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 08-01 13:50:05 api_server.py:289] Route: /redoc, Methods: HEAD, GET
INFO 08-01 13:50:05 api_server.py:289] Route: /health, Methods: GET
INFO 08-01 13:50:05 api_server.py:289] Route: /tokenize, Methods: POST
INFO 08-01 13:50:05 api_server.py:289] Route: /detokenize, Methods: POST
INFO 08-01 13:50:05 api_server.py:289] Route: /v1/models, Methods: GET
INFO 08-01 13:50:05 api_server.py:289] Route: /version, Methods: GET
INFO 08-01 13:50:05 api_server.py:289] Route: /v1/chat/completions, Methods: POST
INFO 08-01 13:50:05 api_server.py:289] Route: /v1/completions, Methods: POST
INFO 08-01 13:50:05 api_server.py:289] Route: /v1/embeddings, Methods: POST
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 08-01 13:50:15 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-01 13:50:25 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-01 13:50:27 logger.py:36] Received request cmpl-d125a3a3f6eb46e694abf0b763a200ff-0: prompt: '<s>[INST]Hello[/INST]', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16384, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1, 3, 22177, 4], lora_request: None, prompt_adapter_request: None.
INFO 08-01 13:50:27 async_llm_engine.py:173] Added request cmpl-d125a3a3f6eb46e694abf0b763a200ff-0.
INFO: 172.11.2.3:46498 - "POST /v1/completions HTTP/1.1" 200 OK
ERROR 08-01 13:50:28 async_llm_engine.py:56] Engine background task failed
ERROR 08-01 13:50:28 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1222, in ast_to_ttir
ERROR 08-01 13:50:28 async_llm_engine.py:56]     generator.visit(fn.parse())
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     ret = super().visit(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/lib/python3.10/ast.py", line 418, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     return visitor(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 303, in visit_Module
ERROR 08-01 13:50:28 async_llm_engine.py:56]     ast.NodeVisitor.generic_visit(self, node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/lib/python3.10/ast.py", line 426, in generic_visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     self.visit(item)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     ret = super().visit(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/lib/python3.10/ast.py", line 418, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     return visitor(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 376, in visit_FunctionDef
ERROR 08-01 13:50:28 async_llm_engine.py:56]     self.visit_compound_statement(node.body)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 298, in visit_compound_statement
ERROR 08-01 13:50:28 async_llm_engine.py:56]     ret_type = self.visit(stmt)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     ret = super().visit(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/lib/python3.10/ast.py", line 418, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     return visitor(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 885, in visit_For
ERROR 08-01 13:50:28 async_llm_engine.py:56]     self.visit_compound_statement(node.body)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 298, in visit_compound_statement
ERROR 08-01 13:50:28 async_llm_engine.py:56]     ret_type = self.visit(stmt)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     ret = super().visit(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/lib/python3.10/ast.py", line 418, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     return visitor(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 448, in visit_AugAssign
ERROR 08-01 13:50:28 async_llm_engine.py:56]     self.visit(assign)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     ret = super().visit(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/lib/python3.10/ast.py", line 418, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     return visitor(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 428, in visit_Assign
ERROR 08-01 13:50:28 async_llm_engine.py:56]     values = self.visit(node.value)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     ret = super().visit(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/lib/python3.10/ast.py", line 418, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     return visitor(node)
Exception in callback _log_task_completion(error_callback=<bound method...7fa8468bbfd0>>)(<Task finishe...d (uint8)')")>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:36
handle: <Handle _log_task_completion(error_callback=<bound method...7fa8468bbfd0>>)(<Task finishe...d (uint8)')")>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:36>
Traceback (most recent call last):
return visitor(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 477, in visit_BinOp
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 477, in visit_BinOp
ERROR 08-01 13:50:28 async_llm_engine.py:56] rhs = self.visit(node.right)
rhs = self.visit(node.right)
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56] ret = super().visit(node)
ret = super().visit(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/lib/python3.10/ast.py", line 418, in visit
File "/usr/lib/python3.10/ast.py", line 418, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56] return visitor(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1027, in visit_Call
return visitor(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56] return fn(*args, **extra_kwargs, **kws)
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 27, in wrapper
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1027, in visit_Call
ERROR 08-01 13:50:28 async_llm_engine.py:56] return fn(*args, **kwargs)
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 1018, in dot
return fn(*args, **extra_kwargs, **kws)
ERROR 08-01 13:50:28 async_llm_engine.py:56] return semantic.dot(input, other, acc, allow_tf32, max_num_imprecise_acc, out_dtype, _builder)
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1207, in dot
File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 27, in wrapper
ERROR 08-01 13:50:28 async_llm_engine.py:56] assert_dtypes_valid(lhs.dtype, rhs.dtype, builder.options)
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1190, in assert_dtypes_valid
return fn(*args, **kwargs)
ERROR 08-01 13:50:28 async_llm_engine.py:56] assert lhs_dtype == rhs_dtype, f"Both operands must be same type. First operand ({lhs_dtype}) and second operand ({rhs_dtype})"
File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 1018, in dot
ERROR 08-01 13:50:28 async_llm_engine.py:56] AssertionError: Both operands must be same type. First operand (bf16) and second operand (uint8)
return semantic.dot(input, other, acc, allow_tf32, max_num_imprecise_acc, out_dtype, _builder)
ERROR 08-01 13:50:28 async_llm_engine.py:56]
File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1207, in dot
ERROR 08-01 13:50:28 async_llm_engine.py:56] The above exception was the direct cause of the following exception:
ERROR 08-01 13:50:28 async_llm_engine.py:56]
assert_dtypes_valid(lhs.dtype, rhs.dtype, builder.options)
ERROR 08-01 13:50:28 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1190, in assert_dtypes_valid
ERROR 08-01 13:50:28 async_llm_engine.py:56] return_value = task.result()
assert lhs_dtype == rhs_dtype, f"Both operands must be same type. First operand ({lhs_dtype}) and second operand ({rhs_dtype})"
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 641, in run_engine_loop
AssertionError: Both operands must be same type. First operand (bf16) and second operand (uint8)
ERROR 08-01 13:50:28 async_llm_engine.py:56] result = task.result()
The above exception was the direct cause of the following exception:
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 584, in engine_step
Traceback (most recent call last):
ERROR 08-01 13:50:28 async_llm_engine.py:56] request_outputs = await self.engine.step_async(virtual_engine)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 253, in step_async
return_value = task.result()
ERROR 08-01 13:50:28 async_llm_engine.py:56] output = await self.model_executor.execute_model_async(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 641, in run_engine_loop
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 159, in execute_model_async
result = task.result()
ERROR 08-01 13:50:28 async_llm_engine.py:56] output = await make_async(self.driver_worker.execute_model
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 584, in engine_step
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
request_outputs = await self.engine.step_async(virtual_engine)
ERROR 08-01 13:50:28 async_llm_engine.py:56] result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 253, in step_async
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 273, in execute_model
output = await self.model_executor.execute_model_async(
ERROR 08-01 13:50:28 async_llm_engine.py:56] output = self.model_runner.execute_model(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 159, in execute_model_async
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
output = await make_async(self.driver_worker.execute_model
ERROR 08-01 13:50:28 async_llm_engine.py:56] return func(*args, **kwargs)
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1357, in execute_model
result = self.fn(*self.args, **self.kwargs)
ERROR 08-01 13:50:28 async_llm_engine.py:56] hidden_or_intermediate_states = model_executable(
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 273, in execute_model
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
output = self.model_runner.execute_model(
ERROR 08-01 13:50:28 async_llm_engine.py:56] return self._call_impl(*args, **kwargs)
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 08-01 13:50:28 async_llm_engine.py:56] return forward_call(*args, **kwargs)
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 422, in forward
return func(*args, **kwargs)
ERROR 08-01 13:50:28 async_llm_engine.py:56] model_output = self.model(input_ids, positions, kv_caches,
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1357, in execute_model
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-01 13:50:28 async_llm_engine.py:56] return self._call_impl(*args, **kwargs)
hidden_or_intermediate_states = model_executable(
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-01 13:50:28 async_llm_engine.py:56] return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 322, in forward
ERROR 08-01 13:50:28 async_llm_engine.py:56] hidden_states, residual = layer(
return self._call_impl(*args, **kwargs)
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-01 13:50:28 async_llm_engine.py:56] return self._call_impl(*args, **kwargs)
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-01 13:50:28 async_llm_engine.py:56] return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 245, in forward
ERROR 08-01 13:50:28 async_llm_engine.py:56] hidden_states = self.self_attn(
return forward_call(*args, **kwargs)
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 422, in forward
ERROR 08-01 13:50:28 async_llm_engine.py:56] return self._call_impl(*args, **kwargs)
model_output = self.model(input_ids, positions, kv_caches,
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-01 13:50:28 async_llm_engine.py:56] return forward_call(*args, **kwargs)
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 175, in forward
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 322, in forward
ERROR 08-01 13:50:28 async_llm_engine.py:56] attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-01 13:50:28 async_llm_engine.py:56] return self._call_impl(*args, **kwargs)
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 245, in forward
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
hidden_states = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-01 13:50:28 async_llm_engine.py:56] return forward_call(*args, **kwargs)
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 97, in forward
return forward_call(*args, **kwargs)
ERROR 08-01 13:50:28 async_llm_engine.py:56] return self.impl.forward(query,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 175, in forward
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/xformers.py", line 598, in forward
ERROR 08-01 13:50:28 async_llm_engine.py:56] out = PagedAttention.forward_prefix(
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/ops/paged_attn.py", line 208, in forward_prefix
attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 08-01 13:50:28 async_llm_engine.py:56] context_attention_fwd(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 08-01 13:50:28 async_llm_engine.py:56] return func(*args, **kwargs)
return self._call_impl(*args, **kwargs)
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/ops/prefix_prefill.py", line 765, in context_attention_fwd
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-01 13:50:28 async_llm_engine.py:56] _fwd_kernel[grid](
return forward_call(*args, **kwargs)
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 167, in <lambda>
File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 97, in forward
ERROR 08-01 13:50:28 async_llm_engine.py:56] return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
return self.impl.forward(query,
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 416, in run
File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/xformers.py", line 598, in forward
ERROR 08-01 13:50:28 async_llm_engine.py:56] self.cache[device][key] = compile(
out = PagedAttention.forward_prefix(
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 191, in compile
File "/usr/local/lib/python3.10/dist-packages/vllm/attention/ops/paged_attn.py", line 208, in forward_prefix
ERROR 08-01 13:50:28 async_llm_engine.py:56] module = src.make_ir(options)
context_attention_fwd(
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 117, in make_ir
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 08-01 13:50:28 async_llm_engine.py:56] return ast_to_ttir(self.fn, self, options=options)
return func(*args, **kwargs)
ERROR 08-01 13:50:28 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1231, in ast_to_ttir
File "/usr/local/lib/python3.10/dist-packages/vllm/attention/ops/prefix_prefill.py", line 765, in context_attention_fwd
ERROR 08-01 13:50:28 async_llm_engine.py:56] raise CompilationError(fn.src, node, repr(e)) from e
ERROR 08-01 13:50:28 async_llm_engine.py:56] triton.compiler.errors.CompilationError: at 114:24: off_v = (
_fwd_kernel[grid](
ERROR 08-01 13:50:28 async_llm_engine.py:56] bn[:, None] * stride_v_cache_bs +
File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 167, in <lambda>
ERROR 08-01 13:50:28 async_llm_engine.py:56] cur_kv_head * stride_v_cache_h +
ERROR 08-01 13:50:28 async_llm_engine.py:56] offs_d[None, :] * stride_v_cache_d +
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
ERROR 08-01 13:50:28 async_llm_engine.py:56] (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
ERROR 08-01 13:50:28 async_llm_engine.py:56] k = tl.load(K_cache + off_k,
File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 416, in run
ERROR 08-01 13:50:28 async_llm_engine.py:56] mask=dim_mask[:, None] &
self.cache[device][key] = compile(
ERROR 08-01 13:50:28 async_llm_engine.py:56] ((start_n + offs_n[None, :]) < cur_batch_ctx_len),
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 191, in compile
module = src.make_ir(options)
ERROR 08-01 13:50:28 async_llm_engine.py:56] other=0.0) # [D,N]
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 117, in make_ir
return ast_to_ttir(self.fn, self, options=options)
ERROR 08-01 13:50:28 async_llm_engine.py:56]
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1231, in ast_to_ttir
raise CompilationError(fn.src, node, repr(e)) from e
ERROR 08-01 13:50:28 async_llm_engine.py:56] qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32) # [M,N]
triton.compiler.errors.CompilationError: at 114:24: off_v = (
ERROR 08-01 13:50:28 async_llm_engine.py:56] qk += tl.dot(q, k)
bn[:, None] * stride_v_cache_bs +
cur_kv_head * stride_v_cache_h +
ERROR 08-01 13:50:28 async_llm_engine.py:56] ^
offs_d[None, :] * stride_v_cache_d +
(start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
ERROR 08-01 13:50:28 async_llm_engine.py:56] AssertionError('Both operands must be same type. First operand (bf16) and second operand (uint8)')
k = tl.load(K_cache + off_k,
mask=dim_mask[:, None] &
INFO 08-01 13:50:28 async_llm_engine.py:180] Aborted request cmpl-d125a3a3f6eb46e694abf0b763a200ff-0.
((start_n + offs_n[None, :]) < cur_batch_ctx_len),
other=0.0) # [D,N]
qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32) # [M,N]
qk += tl.dot(q, k)
^
AssertionError('Both operands must be same type. First operand (bf16) and second operand (uint8)')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1222, in ast_to_ttir
generator.visit(fn.parse())
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ret = super().visit(node)
File "/usr/lib/python3.10/ast.py", line 418, in visit
return visitor(node)
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 303, in visit_Module
ast.NodeVisitor.generic_visit(self, node)
File "/usr/lib/python3.10/ast.py", line 426, in generic_visit
self.visit(item)
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ret = super().visit(node)
File "/usr/lib/python3.10/ast.py", line 418, in visit
return visitor(node)
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 376, in visit_FunctionDef
self.visit_compound_statement(node.body)
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 298, in visit_compound_statement
ret_type = self.visit(stmt)
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ret = super().visit(node)
File "/usr/lib/python3.10/ast.py", line 418, in visit
return visitor(node)
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 885, in visit_For
self.visit_compound_statement(node.body)
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 298, in visit_compound_statement
ret_type = self.visit(stmt)
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ret = super().visit(node)
File "/usr/lib/python3.10/ast.py", line 418, in visit
return visitor(node)
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 448, in visit_AugAssign
self.visit(assign)
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ret = super().visit(node)
File "/usr/lib/python3.10/ast.py", line 418, in visit
return visitor(node)
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 428, in visit_Assign
values = self.visit(node.value)
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ret = super().visit(node)
File "/usr/lib/python3.10/ast.py", line 418, in visit
return visitor(node)
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 477, in visit_BinOp
rhs = self.visit(node.right)
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ret = super().visit(node)
File "/usr/lib/python3.10/ast.py", line 418, in visit
return visitor(node)
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1027, in visit_Call
return fn(*args, **extra_kwargs, **kws)
File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 27, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 1018, in dot
return semantic.dot(input, other, acc, allow_tf32, max_num_imprecise_acc, out_dtype, _builder)
File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1207, in dot
assert_dtypes_valid(lhs.dtype, rhs.dtype, builder.options)
File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1190, in assert_dtypes_valid
assert lhs_dtype == rhs_dtype, f"Both operands must be same type. First operand ({lhs_dtype}) and second operand ({rhs_dtype})"
AssertionError: Both operands must be same type. First operand (bf16) and second operand (uint8)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 189, in __call__
with collapse_excgroups():
File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/usr/local/lib/python3.10/dist-packages/starlette/_utils.py", line 93, in collapse_excgroups
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 250, in stream_response
async for chunk in self.body_iterator:
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 229, in completion_stream_generator
async for prompt_idx, res in result_generator:
File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 333, in consumer
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 324, in consumer
raise item
File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 308, in producer
async for item in iterator:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 776, in generate
async for output in self._process_request(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 892, in _process_request
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 888, in _process_request
async for request_output in stream:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 93, in __anext__
raise result
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
return_value = task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 641, in run_engine_loop
result = task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 584, in engine_step
request_outputs = await self.engine.step_async(virtual_engine)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 253, in step_async
output = await self.model_executor.execute_model_async(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 159, in execute_model_async
output = await make_async(self.driver_worker.execute_model
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 273, in execute_model
output = self.model_runner.execute_model(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1357, in execute_model
hidden_or_intermediate_states = model_executable(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 422, in forward
model_output = self.model(input_ids, positions, kv_caches,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 322, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 245, in forward
hidden_states = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 175, in forward
attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 97, in forward
return self.impl.forward(query,
File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/xformers.py", line 598, in forward
out = PagedAttention.forward_prefix(
File "/usr/local/lib/python3.10/dist-packages/vllm/attention/ops/paged_attn.py", line 208, in forward_prefix
context_attention_fwd(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/attention/ops/prefix_prefill.py", line 765, in context_attention_fwd
_fwd_kernel[grid](
File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 167, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 416, in run
self.cache[device][key] = compile(
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 191, in compile
module = src.make_ir(options)
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 117, in make_ir
return ast_to_ttir(self.fn, self, options=options)
File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1231, in ast_to_ttir
raise CompilationError(fn.src, node, repr(e)) from e
triton.compiler.errors.CompilationError: at 114:24: off_v = (
bn[:, None] * stride_v_cache_bs +
cur_kv_head * stride_v_cache_h +
offs_d[None, :] * stride_v_cache_d +
(start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
k = tl.load(K_cache + off_k,
mask=dim_mask[:, None] &
((start_n + offs_n[None, :]) < cur_batch_ctx_len),
other=0.0) # [D,N]
qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32) # [M,N]
qk += tl.dot(q, k)
^
AssertionError('Both operands must be same type. First operand (bf16) and second operand (uint8)')
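The assertion above says the Triton kernel tried to `tl.dot` a bf16 query block against a K block loaded straight from the uint8-backed FP8 KV cache. The shape of the fix can be shown without Triton or a GPU: the cached codes have to be dequantized (scale and cast back to the compute dtype) before the dot product. All helper names below are illustrative, not vLLM's actual functions.

```python
# Framework-free sketch of FP8-style KV-cache storage: values are stored as
# scaled 8-bit codes, and must be dequantized back to the compute dtype
# before any matmul. Names are illustrative, not vLLM internals.

def quantize_fp8_sim(x, scale):
    """Simulate fp8-e4m3-style storage: scale down, round to an integer code."""
    code = round(x / scale)
    return max(-448, min(448, code))  # e4m3's finite range is roughly +/-448

def dequantize_fp8_sim(code, scale):
    """Cast the stored code back to the compute dtype before tl.dot."""
    return code * scale

scale = 0.25
k_cache = [quantize_fp8_sim(v, scale) for v in [1.0, -2.5, 3.75]]  # stored as ints
q = [0.5, 1.0, -1.0]

# The kernel's `qk += tl.dot(q, k)` only type-checks once k is back in the
# compute dtype; dotting q directly against k_cache is the reported mismatch.
k = [dequantize_fp8_sim(c, scale) for c in k_cache]
qk = sum(qi * ki for qi, ki in zip(q, k))
print(qk)  # -5.75
```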
INFO 08-01 13:49:58 config.py:482] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
WARNING 08-01 13:49:58 arg_utils.py:766] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 08-01 13:49:58 config.py:818] Chunked prefill is enabled with max_num_batched_tokens=512.
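What the log above means by "chunked prefill with max_num_batched_tokens=512" can be sketched in a few lines: instead of running the whole long prompt's prefill in one pass, the scheduler feeds it to the model in fixed-size token batches. This is only an illustration of the splitting, not vLLM's scheduler code.

```python
# Illustrative-only sketch of chunked prefill: split a long prompt's prefill
# into token ranges of at most max_num_batched_tokens each.

def chunk_prefill(prompt_len, max_num_batched_tokens=512):
    """Yield (start, end) token ranges, each at most max_num_batched_tokens long."""
    for start in range(0, prompt_len, max_num_batched_tokens):
        yield start, min(start + max_num_batched_tokens, prompt_len)

chunks = list(chunk_prefill(1300))
print(chunks)  # [(0, 512), (512, 1024), (1024, 1300)]
```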
FP8 KV cache is being enabled, and it does not seem to be compatible with chunked prefill. I think we should disable chunked prefill in this case.
EDIT: this doesn't seem to trigger for Llama FP8, so it may be an issue specific to MoE models.
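Until the kernel is fixed, the workaround suggested above is to disable chunked prefill explicitly when using an FP8 KV cache. A minimal sketch of the engine arguments, mirroring the failing invocation from the traceback (the model name is a placeholder for any FP8-KV checkpoint; `enable_chunked_prefill` is the engine argument behind the `--enable-chunked-prefill` CLI flag mentioned in the warning):

```python
# Build the engine kwargs separately so the workaround is explicit.
engine_kwargs = dict(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder FP8-KV checkpoint
    tensor_parallel_size=8,
    max_model_len=512 * 1024,
    kv_cache_dtype="fp8",
    enable_chunked_prefill=False,  # workaround: avoid the FP8-KV + chunked-prefill path
)

# from vllm import LLM
# llm = LLM(**engine_kwargs)  # requires GPUs and the checkpoint; not run here
```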
Your current environment
🐛 Describe the bug
When I set max_position_embeddings to 512K or higher, an illegal memory access error was encountered.
(VllmWorkerProcess pid=29143) ERROR 07-15 09:41:46 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: Error Internal, Traceback (most recent call last):
  File "/home/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/vllm/vllm/worker/worker.py", line 173, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/vllm/vllm/worker/model_runner.py", line 874, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/vllm/vllm/worker/model_runner.py", line 1243, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/vllm/vllm/model_executor/models/mixtral.py", line 348, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/vllm/vllm/model_executor/models/mixtral.py", line 276, in forward
    hidden_states, residual = layer(positions, hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/vllm/vllm/model_executor/models/mixtral.py", line 222, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/vllm/vllm/model_executor/models/mixtral.py", line 168, in forward
    qkv, _ = self.qkv_proj(hidden_states)
  File "/home/vllm/vllm/model_executor/layers/linear.py", line 317, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
  File "/home/vllm/vllm/model_executor/layers/quantization/fp8.py", line 353, in apply
    output = ops.cutlass_scaled_mm(
  File "/home/vllm/vllm/_custom_ops.py", line 34, in wrapper
    return fn(*args, **kwargs)
  File "/home/vllm/vllm/_custom_ops.py", line 236, in cutlass_scaled_mm
    torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 854, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: Error Internal
(VllmWorkerProcess pid=29144) ERROR 07-15 09:41:46 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: Error Internal, Traceback (most recent call last):
  File "/home/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)