vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: illegal memory access when increase max_model_length on FP8 models #6429

Open · IEI-mjx opened this issue 1 month ago

IEI-mjx commented 1 month ago

Your current environment

# Using pip install vllm
vllm==v0.5.1

🐛 Describe the bug

# My python script to test long text
import time

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams


def run_Mixtral():
    tokenizer = AutoTokenizer.from_pretrained("/mnt/beegfs2/maojunxiong/Mixtral-8x7B-Instruct-v0.1-FP8-KV2", trust_remote_code=True)
    sampling_params = SamplingParams(max_tokens=100, temperature=1, top_p=0.01, top_k=1)

    llm = LLM(model="/mnt/beegfs2/maojunxiong/Mixtral-8x7B-Instruct-v0.1-FP8-KV2", tensor_parallel_size=4, disable_custom_all_reduce=True, max_num_seqs=1, enforce_eager=False, kv_cache_dtype='fp8')

    with open('/mnt/beegfs2/maojunxiong/long_code_test1.txt', 'r', encoding='utf-8') as file:
        lines = file.readlines()
        third_line = lines[30].strip()
        prompts = third_line.split('<sep>')[0].strip().replace("<n>", "\n")
        prompt_ids = tokenizer.encode(prompts, return_tensors="pt")
        num_tokens = len(prompt_ids[0])
        print("prompt_tokens:", num_tokens)

    outputs = []
    start_time = time.time()
    outputs = llm.generate(prompts, sampling_params)
    end_time = time.time()
    total_tokens = 0

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        token = len(tokenizer.encode(generated_text))
        total_tokens += token
        print(f"Generated text: {generated_text!r}")

    print("inference_time:", (end_time - start_time))
    print("total_tokens:", total_tokens)

if __name__ == "__main__":
    run_Mixtral()

When I set max_position_embeddings to 512K or higher, an illegal memory access is encountered.
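A minimal sketch of how such a window can be configured (the issue does not say exactly how max_position_embeddings was raised, so treat the paths and numbers below as assumptions, not the exact steps used here):

# Hypothetical sketch: raise the context window before loading. This is only one
# plausible way to reproduce the setup; values are illustrative.
import json

config_path = "/mnt/beegfs2/maojunxiong/Mixtral-8x7B-Instruct-v0.1-FP8-KV2/config.json"
with open(config_path) as f:
    cfg = json.load(f)
cfg["max_position_embeddings"] = 512 * 1024   # 512K tokens
with open(config_path, "w") as f:
    json.dump(cfg, f, indent=2)

# Alternatively, cap the engine at load time instead of editing the checkpoint:
# llm = LLM(model="/mnt/beegfs2/maojunxiong/Mixtral-8x7B-Instruct-v0.1-FP8-KV2",
#           max_model_len=512 * 1024, kv_cache_dtype='fp8', ...)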

(VllmWorkerProcess pid=29143) ERROR 07-15 09:41:46 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: Error Internal, Traceback (most recent call last):
  File "/home/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/vllm/vllm/worker/worker.py", line 173, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/vllm/vllm/worker/model_runner.py", line 874, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/vllm/vllm/worker/model_runner.py", line 1243, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/vllm/vllm/model_executor/models/mixtral.py", line 348, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/vllm/vllm/model_executor/models/mixtral.py", line 276, in forward
    hidden_states, residual = layer(positions, hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/vllm/vllm/model_executor/models/mixtral.py", line 222, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/vllm/vllm/model_executor/models/mixtral.py", line 168, in forward
    qkv, _ = self.qkv_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/vllm/vllm/model_executor/layers/linear.py", line 317, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
  File "/home/vllm/vllm/model_executor/layers/quantization/fp8.py", line 353, in apply
    output = ops.cutlass_scaled_mm(
  File "/home/vllm/vllm/_custom_ops.py", line 34, in wrapper
    return fn(*args, **kwargs)
  File "/home/vllm/vllm/_custom_ops.py", line 236, in cutlass_scaled_mm
    torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 854, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: Error Internal

(Worker pid=29144 reports the identical traceback.)

robertgshaw2-neuralmagic commented 1 month ago

@comaniac FYI

comaniac commented 1 month ago

Will take a look when I get time. Meanwhile, I have 2 questions:

  1. Could you make this script self-contained (so it does not require other files, such as the prompt input)?
  2. Out of curiosity, Mixtral-8x7B should only support a context length of up to 32k. You set it to 512k (and I assume the result quality is still acceptable to you) because of RoPE?

IEI-mjx commented 1 month ago

Will take a look when I get time. Meanwhile, I have 2 questions:

  1. Could you make this script self-contained (so it does not require other files, such as the prompt input)?
  2. Out of curiosity, Mixtral-8x7B should only support a context length of up to 32k. You set it to 512k (and I assume the result quality is still acceptable to you) because of RoPE?

For Q1: Sure, you can substitute any other prompt input for this one (the prompt length just needs to scale with max_model_len). For Q2: For now we don't care about quality; we will use something like RoPE scaling to improve it later. I just want to test whether FP8-KV-cache models can be extended to a longer context length (perhaps twice that of the non-quantized models?).
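As a rough back-of-envelope for that expectation (illustrative arithmetic using Mixtral-8x7B's published shape, not a measurement): halving the bytes per cached element roughly doubles the number of tokens that fit in the same KV-cache budget.

# Back-of-envelope KV-cache sizing for Mixtral-8x7B (32 layers, 8 KV heads,
# head_dim 128); illustrative only.
layers, kv_heads, head_dim = 32, 8, 128
elems_per_token = layers * 2 * kv_heads * head_dim           # K and V per token
fp16_bytes = elems_per_token * 2                              # 131072 B = 128 KiB/token
fp8_bytes = elems_per_token * 1                               # 65536 B  =  64 KiB/token
print(fp16_bytes // 1024, "KiB/token with fp16 KV cache")
print(fp8_bytes // 1024, "KiB/token with fp8 KV cache")
# For a fixed KV-cache budget, an fp8 cache therefore fits roughly 2x the tokens.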

By the way, I am also testing v0.5.2 against this issue. I will update this issue if I get new results!
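Also, regarding Q1: a self-contained variant of the test script could synthesize a long prompt instead of reading the local file. The token target and the repeated snippet below are placeholders, so treat this as a sketch rather than the exact script used here.

# Hypothetical self-contained repro: build a synthetic long prompt in memory.
import time

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "/mnt/beegfs2/maojunxiong/Mixtral-8x7B-Instruct-v0.1-FP8-KV2"
TARGET_PROMPT_TOKENS = 400_000  # scale this relative to max_model_len

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
prompt = "def add(a, b):\n    return a + b\n" * (TARGET_PROMPT_TOKENS // 10)
print("prompt_tokens:", len(tokenizer.encode(prompt)))

llm = LLM(model=MODEL, tensor_parallel_size=4, disable_custom_all_reduce=True,
          max_num_seqs=1, enforce_eager=False, kv_cache_dtype='fp8')
start = time.time()
outputs = llm.generate(prompt, SamplingParams(max_tokens=100, temperature=1, top_p=0.01, top_k=1))
print("inference_time:", time.time() - start)
print(outputs[0].outputs[0].text)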

IEI-mjx commented 1 month ago

I also tested Llama3-8B-FP8-KV (generated using AutoFP8, https://github.com/neuralmagic/AutoFP8) with context length = 512K and encountered the same error. When I tested Llama3-8B or Mixtral-8x7B with context length = 512K, everything was fine, but the FP8-KV quantized versions failed. I guess the FP8 kernel may not support long contexts?

comaniac commented 1 month ago

Thanks for the investigation. If that's the case, then the problem is actually in the paged attention kernel rather than the MoE kernel.

IEI-mjx commented 1 month ago

With v0.5.2, FP8 models can't be loaded successfully at all.

See the traceback here:

rank0: Traceback (most recent call last):
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/longtext_Mixtral_KV.py", line 37, in <module>
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/longtext_Mixtral_KV.py", line 9, in run_Mixtral
rank0:     llm = LLM(model="/mnt/beegfs2/maojunxiong/Mixtral-8x7B-Instruct-v0.1-FP8-KV2", tensor_parallel_size=8, disable_custom_all_reduce=True, max_num_seqs=1, enforce_eager=False, kv_cache_dtype='fp8')
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/entrypoints/llm.py", line 150, in __init__
rank0:     self.llm_engine = LLMEngine.from_engine_args(
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/engine/llm_engine.py", line 421, in from_engine_args
rank0:     engine = cls(
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/engine/llm_engine.py", line 249, in __init__
rank0:     self.model_executor = executor_class(
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
rank0:     super().__init__(*args, **kwargs)
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/executor/executor_base.py", line 46, in __init__
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/executor/multiproc_gpu_executor.py", line 84, in _init_executor
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/executor/multiproc_gpu_executor.py", line 135, in _run_workers
rank0:     driver_worker_output = driver_worker_method(*args, **kwargs)
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/worker/worker.py", line 139, in load_model
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/worker/model_runner.py", line 256, in load_model
rank0:     self.model = get_model(model_config=self.model_config,
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
rank0:     return loader.load_model(model_config=model_config,
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/model_executor/model_loader/loader.py", line 270, in load_model
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/model_executor/models/mixtral.py", line 436, in load_weights
rank0:     weight_loader(param, loaded_weight)
rank0:   File "/mnt/beegfs2/maojunxiong/vllm_0.5.2/vllm/model_executor/model_loader/weight_utils.py", line 439, in default_weight_loader
rank0:     assert param.size() == loaded_weight.size()

comaniac commented 1 month ago

Does that work without fp8 kv-cache?

IEI-mjx commented 1 month ago

Does that work without fp8 kv-cache?

Yes. Only the fp8 kv_cache versions can't be loaded. I have tested the fp8 kv_cache versions of both Llama3 and Mixtral, and both failed!

comaniac commented 1 month ago

Looks like the checkpoint format issue? cc @mgoin
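If it is, one quick way to check (a hedged sketch; the "scale" substring filter assumes AutoFP8-style tensor names and may need adjusting for other checkpoints) is to list the scale tensors and their shapes in the safetensors shards and compare them with what the loader expects:

# Hypothetical debugging sketch: dump scale tensors and shapes from the FP8
# checkpoint to see which parameter trips the size assert during loading.
import glob

from safetensors import safe_open

CKPT = "/mnt/beegfs2/maojunxiong/Mixtral-8x7B-Instruct-v0.1-FP8-KV2"
for shard in sorted(glob.glob(f"{CKPT}/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            if "scale" in name:
                print(name, f.get_slice(name).get_shape())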

mgoin commented 1 month ago

We identified this issue today, unfortunately. I have resolved it on main in this PR: https://github.com/vllm-project/vllm/pull/6081. However, it does seem the 0.5.2 release is affected, cc @robertgshaw2-neuralmagic for visibility. I would recommend using 0.5.1 or building main from source for kv cache checkpoints at the moment.

robertgshaw2-neuralmagic commented 1 month ago

Sorry guys, I broke this trying to get DeepSeek working. We should get a model with kv scales into the CI.
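A hedged sketch of what such a CI regression test could look like (the model path is a placeholder, not an existing test asset): load a small checkpoint that ships kv-cache scales and run one generation, so a broken weight loader fails CI instead of user scripts.

# Hypothetical regression test sketch for kv-scale checkpoints.
import pytest
from vllm import LLM, SamplingParams

KV_SCALE_MODEL = "path/to/some-small-fp8-kv-checkpoint"  # placeholder

@pytest.mark.parametrize("kv_cache_dtype", ["fp8"])
def test_load_and_generate_with_kv_scales(kv_cache_dtype):
    llm = LLM(model=KV_SCALE_MODEL, kv_cache_dtype=kv_cache_dtype,
              enforce_eager=True, max_model_len=1024)
    out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
    assert out and out[0].outputs[0].text is not None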

IEI-mjx commented 1 month ago

We identified this issue today, unfortunately. I have resolved it on main in this PR: #6081. However, it does seem the 0.5.2 release is affected, cc @robertgshaw2-neuralmagic for visibility. I would recommend using 0.5.1 or building main from source for kv cache checkpoints at the moment.

I tested the main branch (built from source); fp8 kv_scale models can now be loaded successfully! However, the issue I reported here is still not solved: the same error traceback occurs when I extend the context length to 512K or longer (384K is OK) on Mixtral-8x7B-FP8-KV.

IEI-mjx commented 1 month ago

vLLM version: 0.5.3.post1
Model: Mixtral-8x7B, Mixtral-8x7B-FP8 KV_cache
Testing script: same as before

Result: For Mixtral-8x7B, when I set the context length to 1024K, vLLM gets stuck at processing the prompt. As far as I know, when the context length exceeds the supported range, vLLM should OOM or display something like "The model's max seq len (2097152) is larger than the maximum number of tokens that can be stored in KV cache (1475936)". On the vLLM v0.5.2/main version, that is what happens.

For Mixtral-8x7B-FP8 KV_cache, inference with a 4K context length works, but it fails when I set it to 256K or 512K. See the traceback here (all worker processes report the same error; only one copy is shown):

(VllmWorkerProcess pid=15584) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: at 114:24:
                off_v = (
                    bn[:, None] * stride_v_cache_bs +
                    cur_kv_head * stride_v_cache_h +
                    offs_d[None, :] * stride_v_cache_d +
                    (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
                k = tl.load(K_cache + off_k,
                            mask=dim_mask[:, None] &
                            ((start_n + offs_n[None, :]) < cur_batch_ctx_len),
                            other=0.0)  # [D,N]

                qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)  # [M,N]
                qk += tl.dot(q, k)
                      ^
AssertionError('Both operands must be same type. First operand (bf16) and second operand (uint8)'), Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1222, in ast_to_ttir
    generator.visit(fn.parse())
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
    ret = super().visit(node)
  File "/usr/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  ... (repeated Triton AST-visitor frames omitted) ...
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1027, in visit_Call
    return fn(*args, **extra_kwargs, **kws)
  File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 1018, in dot
    return semantic.dot(input, other, acc, allow_tf32, max_num_imprecise_acc, out_dtype, _builder)
  File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1207, in dot
    assert_dtypes_valid(lhs.dtype, rhs.dtype, builder.options)
  File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1190, in assert_dtypes_valid
    assert lhs_dtype == rhs_dtype, f"Both operands must be same type. First operand ({lhs_dtype}) and second operand ({rhs_dtype})"
AssertionError: Both operands must be same type. First operand (bf16) and second operand (uint8)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/vllm/vllm/worker/worker_base.py", line 65, in start_worker_execution_loop
    output = self.execute_model(execute_model_req=None)
  File "/home/vllm/vllm/worker/worker_base.py", line 272, in execute_model
    output = self.model_runner.execute_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/vllm/vllm/worker/model_runner.py", line 1314, in execute_model
    hidden_or_intermediate_states = model_executable(
  ...
  File "/home/vllm/vllm/model_executor/models/mixtral.py", line 374, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  ...
  File "/home/vllm/vllm/model_executor/models/mixtral.py", line 296, in forward
    hidden_states, residual = layer(positions, hidden_states,
  ...
  File "/home/vllm/vllm/model_executor/models/mixtral.py", line 233, in forward
    hidden_states = self.self_attn(
  ...
  File "/home/vllm/vllm/model_executor/models/mixtral.py", line 179, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/home/vllm/vllm/attention/layer.py", line 97, in forward
    return self.impl.forward(query,
  File "/home/vllm/vllm/attention/backends/xformers.py", line 598, in forward
    out = PagedAttention.forward_prefix(
  File "/home/vllm/vllm/attention/ops/paged_attn.py", line 205, in forward_prefix
    context_attention_fwd(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/vllm/vllm/attention/ops/prefix_prefill.py", line 765, in context_attention_fwd
    _fwd_kernel[grid](
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 167, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 416, in run
    self.cache[device][key] = compile(
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 191, in compile
    module = src.make_ir(options)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 117, in make_ir
    return ast_to_ttir(self.fn, self, options=options)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1231, in ast_to_ttir
    raise CompilationError(fn.src, node, repr(e)) from e
triton.compiler.errors.CompilationError: at 114:24:
                ... (same kernel source as above) ...
                qk += tl.dot(q, k)
                      ^
AssertionError('Both operands must be same type. First operand (bf16) and second operand (uint8)')

(Workers pid=15582 and pid=15583 report the identical error.)
multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 298, in visit_compound_statement (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] ret_type = self.visit(stmt) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] ret = super().visit(node) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/lib/python3.10/ast.py", line 418, in visit (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return visitor(node) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 448, in visit_AugAssign (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] self.visit(assign) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] ret = super().visit(node) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/lib/python3.10/ast.py", line 418, in visit (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return visitor(node) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 428, in visit_Assign (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] values = self.visit(node.value) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] ret = super().visit(node) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/lib/python3.10/ast.py", line 418, in visit (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return visitor(node) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 477, in visit_BinOp (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] rhs = self.visit(node.right) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] ret = super().visit(node) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/lib/python3.10/ast.py", line 418, in visit (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return visitor(node) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1027, in visit_Call (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] 
return fn(*args, extra_kwargs, kws) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 27, in wrapper (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return fn(*args, *kwargs) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 1018, in dot (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return semantic.dot(input, other, acc, allow_tf32, max_num_imprecise_acc, out_dtype, _builder) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1207, in dot (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] assert_dtypes_valid(lhs.dtype, rhs.dtype, builder.options) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1190, in assert_dtypes_valid (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] assert lhs_dtype == rhs_dtype, f"Both operands must be same type. First operand ({lhs_dtype}) and second operand ({rhs_dtype})" (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] AssertionError: Both operands must be same type. First operand (bf16) and second operand (uint8) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] The above exception was the direct cause of the following exception: (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] Traceback (most recent call last): (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] output = executor(args, kwargs) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return func(*args, kwargs) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/worker/worker_base.py", line 65, in start_worker_execution_loop (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] output = self.execute_model(execute_model_req=None) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/worker/worker_base.py", line 272, in execute_model (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] output = self.model_runner.execute_model( (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return func(*args, *kwargs) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File 
"/home/vllm/vllm/worker/model_runner.py", line 1314, in execute_model (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] hidden_or_intermediate_states = model_executable( (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return self._call_impl(args, kwargs) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return forward_call(*args, kwargs) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/model_executor/models/mixtral.py", line 374, in forward (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] hidden_states = self.model(input_ids, positions, kv_caches, (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return self._call_impl(*args, *kwargs) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return forward_call(args, kwargs) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/model_executor/models/mixtral.py", line 296, in forward (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] hidden_states, residual = layer(positions, hidden_states, (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return self._call_impl(*args, kwargs) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return forward_call(*args, *kwargs) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/model_executor/models/mixtral.py", line 233, in forward (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] hidden_states = self.self_attn( (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return self._call_impl(args, kwargs) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return forward_call(*args, kwargs) (VllmWorkerProcess pid=15582) ERROR 07-25 
10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/model_executor/models/mixtral.py", line 179, in forward (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] attn_output = self.attn(q, k, v, kv_cache, attn_metadata) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return self._call_impl(*args, *kwargs) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return forward_call(args, kwargs) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/attention/layer.py", line 97, in forward (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return self.impl.forward(query, (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/attention/backends/xformers.py", line 598, in forward (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] out = PagedAttention.forward_prefix( (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/attention/ops/paged_attn.py", line 205, in forward_prefix (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] context_attention_fwd( (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return func(*args, kwargs) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/home/vllm/vllm/attention/ops/prefix_prefill.py", line 765, in context_attention_fwd (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] _fwd_kernel[grid]( (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 167, in (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return lambda *args, *kwargs: self.run(grid=grid, warmup=False, args, kwargs) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 416, in run (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] self.cache[device][key] = compile( (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 191, in compile (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] module = src.make_ir(options) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 117, in make_ir (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] return ast_to_ttir(self.fn, self, options=options) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] File 
"/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1231, in ast_to_ttir (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] raise CompilationError(fn.src, node, repr(e)) from e (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] triton.compiler.errors.CompilationError: at 114:24: off_v = ( (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] bn[:, None] stride_v_cache_bs + (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] cur_kv_head stride_v_cache_h + (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] offs_d[None, :] stride_v_cache_d + (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] (start_n + offs_n[:, None]) % block_size stride_v_cache_bl) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] k = tl.load(K_cache + off_k, (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] mask=dim_mask[:, None] & (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] ((start_n + offs_n[None, :]) < cur_batch_ctx_len), (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] other=0.0) # [D,N] (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32) # [M,N] (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] qk += tl.dot(q, k) (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] ^ (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] AssertionError('Both operands must be same type. First operand (bf16) and second operand (uint8)') (VllmWorkerProcess pid=15582) ERROR 07-25 10:36:24 multiproc_worker_utils.py:226] rank0: Traceback (most recent call last): rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1222, in ast_to_ttir

rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit rank0: ret = super().visit(node) rank0: File "/usr/lib/python3.10/ast.py", line 418, in visit rank0: return visitor(node) rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 303, in visit_Module rank0: ast.NodeVisitor.generic_visit(self, node) rank0: File "/usr/lib/python3.10/ast.py", line 426, in generic_visit

rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit rank0: ret = super().visit(node) rank0: File "/usr/lib/python3.10/ast.py", line 418, in visit rank0: return visitor(node) rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 376, in visit_FunctionDef

rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 298, in visit_compound_statement rank0: ret_type = self.visit(stmt) rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit rank0: ret = super().visit(node) rank0: File "/usr/lib/python3.10/ast.py", line 418, in visit rank0: return visitor(node) rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 885, in visit_For

rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 298, in visit_compound_statement rank0: ret_type = self.visit(stmt) rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit rank0: ret = super().visit(node) rank0: File "/usr/lib/python3.10/ast.py", line 418, in visit rank0: return visitor(node) rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 448, in visit_AugAssign

rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit rank0: ret = super().visit(node) rank0: File "/usr/lib/python3.10/ast.py", line 418, in visit rank0: return visitor(node) rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 428, in visit_Assign rank0: values = self.visit(node.value) rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit rank0: ret = super().visit(node) rank0: File "/usr/lib/python3.10/ast.py", line 418, in visit rank0: return visitor(node) rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 477, in visit_BinOp rank0: rhs = self.visit(node.right) rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit rank0: ret = super().visit(node) rank0: File "/usr/lib/python3.10/ast.py", line 418, in visit rank0: return visitor(node) rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1027, in visit_Call rank0: return fn(args, extra_kwargs, kws) rank0: File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 27, in wrapper rank0: return fn(args, **kwargs) rank0: File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 1018, in dot rank0: return semantic.dot(input, other, acc, allow_tf32, max_num_imprecise_acc, out_dtype, _builder) rank0: File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1207, in dot rank0: assert_dtypes_valid(lhs.dtype, rhs.dtype, builder.options) rank0: File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1190, in assert_dtypes_valid rank0: assert lhs_dtype == rhs_dtype, f"Both operands must be same type. First operand ({lhs_dtype}) and second operand ({rhs_dtype})" rank0: AssertionError: Both operands must be same type. First operand (bf16) and second operand (uint8)

rank0: The above exception was the direct cause of the following exception:

rank0: Traceback (most recent call last): rank0: File "/home/vllm/longtext_Mixtral_KV.py", line 37, in

rank0: File "/home/vllm/longtext_Mixtral_KV.py", line 22, in run_Mixtral rank0: outputs = llm.generate(prompts, sampling_params) rank0: File "/home/vllm/vllm/utils.py", line 838, in inner rank0: return fn(*args, kwargs) rank0: File "/home/vllm/vllm/entrypoints/llm.py", line 316, in generate rank0: outputs = self._run_engine(use_tqdm=use_tqdm) rank0: File "/home/vllm/vllm/entrypoints/llm.py", line 569, in _run_engine rank0: step_outputs = self.llm_engine.step() rank0: File "/home/vllm/vllm/engine/llm_engine.py", line 911, in step rank0: output = self.model_executor.execute_model( rank0: File "/home/vllm/vllm/executor/distributed_gpu_executor.py", line 76, in execute_model rank0: driver_outputs = self._driver_execute_model(execute_model_req) rank0: File "/home/vllm/vllm/executor/multiproc_gpu_executor.py", line 141, in _driver_execute_model rank0: return self.driver_worker.execute_model(execute_model_req) rank0: File "/home/vllm/vllm/worker/worker_base.py", line 272, in execute_model rank0: output = self.model_runner.execute_model( rank0: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context rank0: return func(*args, *kwargs) rank0: File "/home/vllm/vllm/worker/model_runner.py", line 1314, in execute_model rank0: hidden_or_intermediate_states = model_executable( rank0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank0: return self._call_impl(args, kwargs) rank0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank0: return forward_call(*args, kwargs) rank0: File "/home/vllm/vllm/model_executor/models/mixtral.py", line 374, in forward rank0: hidden_states = self.model(input_ids, positions, kv_caches, rank0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank0: return self._call_impl(*args, *kwargs) rank0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank0: return forward_call(args, kwargs) rank0: File "/home/vllm/vllm/model_executor/models/mixtral.py", line 296, in forward rank0: hidden_states, residual = layer(positions, hidden_states, rank0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank0: return self._call_impl(*args, kwargs) rank0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank0: return forward_call(*args, *kwargs) rank0: File "/home/vllm/vllm/model_executor/models/mixtral.py", line 233, in forward rank0: hidden_states = self.self_attn( rank0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank0: return self._call_impl(args, kwargs) rank0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank0: return forward_call(*args, kwargs) rank0: File "/home/vllm/vllm/model_executor/models/mixtral.py", line 179, in forward rank0: attn_output = self.attn(q, k, v, kv_cache, attn_metadata) rank0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank0: return self._call_impl(*args, *kwargs) rank0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank0: return forward_call(args, kwargs) rank0: File "/home/vllm/vllm/attention/layer.py", line 97, in forward rank0: return 
self.impl.forward(query, rank0: File "/home/vllm/vllm/attention/backends/xformers.py", line 598, in forward rank0: out = PagedAttention.forward_prefix( rank0: File "/home/vllm/vllm/attention/ops/paged_attn.py", line 205, in forward_prefix

rank0: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context rank0: return func(*args, **kwargs) rank0: File "/home/vllm/vllm/attention/ops/prefix_prefill.py", line 765, in context_attention_fwd

rank0: File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 167, in rank0: return lambda *args, *kwargs: self.run(grid=grid, warmup=False, args, *kwargs) rank0: File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 416, in run rank0: self.cache[device][key] = compile( rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 191, in compile rank0: module = src.make_ir(options) rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 117, in make_ir rank0: return ast_to_ttir(self.fn, self, options=options) rank0: File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1231, in ast_to_ttir rank0: raise CompilationError(fn.src, node, repr(e)) from e rank0: triton.compiler.errors.CompilationError: at 114:24: off_v = ( rank0: bn[:, None] stride_v_cache_bs + rank0: cur_kv_head stride_v_cache_h + rank0: offs_d[None, :] stride_v_cache_d + rank0: (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl) rank0: k = tl.load(K_cache + off_k, rank0: mask=dim_mask[:, None] & rank0: ((start_n + offs_n[None, :]) < cur_batch_ctx_len), rank0: other=0.0) # [D,N]

rank0: qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32) # M,N: qk += tl.dot(q, k)

rank0: AssertionError('Both operands must be same type. First operand (bf16) and second operand (uint8)')

florianbaud commented 3 weeks ago

Hello, I had the same issue with the neuralmagic/Mistral-Nemo-Instruct-2407-FP8 model, and I found out that vLLM enables chunked prefill by default when max_model_len is above 32k. The issue seems to come from that option. Try running without it by setting enable_chunked_prefill to False, for example:
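A minimal, untested sketch of the offline setup with chunked prefill explicitly disabled; the model path, `max_model_len`, and `kv_cache_dtype` below are just placeholders taken from this thread, keep whatever you already use:

```
from vllm import LLM

# Sketch only: model and max_model_len are placeholders.
# The relevant part is enable_chunked_prefill=False, which overrides the
# default that vLLM applies when max_model_len is larger than 32k.
llm = LLM(
    model="neuralmagic/Mistral-Nemo-Instruct-2407-FP8",
    kv_cache_dtype="fp8",
    max_model_len=61440,
    enable_chunked_prefill=False,
)
```

For the OpenAI-compatible API server, the equivalent is passing `--enable-chunked-prefill=False` on the command line.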

robertgshaw2-neuralmagic commented 3 weeks ago

Hello, I had the same issue with the neuralmagic/Mistral-Nemo-Instruct-2407-FP8 model, and I found out that vLLM enables chunked prefill by default when max_model_len is above 32k. The issue seems to come from that option. Try running without it by setting enable_chunked_prefill to False.

@florianbaud are you running on v0.5.3.post1?

We resolved a difficult bug in cutlass with this PR: https://github.com/vllm-project/vllm/pull/6852, which will be included in the next release.
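If it helps to double-check which build is actually installed, the version string is plain package metadata (this assumes a normal pip or Docker install):

```
import vllm

# Prints the installed vLLM version string, e.g. "0.5.3.post1"
print(vllm.__version__)
```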

florianbaud commented 3 weeks ago

@robertgshaw2-neuralmagic, I'm running a Docker image built from the latest commit (c8a7e93273ff4338d6f89f8a63ff16426ac240b8, Wed Jul 31 23:51:09 2024).

robertgshaw2-neuralmagic commented 3 weeks ago

c8a7e93

Can you share the error message?

florianbaud commented 3 weeks ago

Yes, the error message is:

INFO 08-01 13:49:58 api_server.py:308] vLLM API server version 0.5.3.post1
INFO 08-01 13:49:58 api_server.py:309] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, model='neuralmagic/Mistral-Nemo-Instruct-2407-FP8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='fp8', quantization_param_path=None, max_model_len=61440, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=8, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=4, max_logprobs=20, disable_log_stats=False, quantization='fp8', rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=61440, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 08-01 13:49:58 config.py:482] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
WARNING 08-01 13:49:58 arg_utils.py:766] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 08-01 13:49:58 config.py:818] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 08-01 13:49:58 llm_engine.py:175] Initializing an LLM engine (v0.5.3.post1) with config: model='neuralmagic/Mistral-Nemo-Instruct-2407-FP8', speculative_config=None, tokenizer='neuralmagic/Mistral-Nemo-Instruct-2407-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=61440, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=neuralmagic/Mistral-Nemo-Instruct-2407-FP8, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-01 13:49:58 selector.py:161] Cannot use FlashAttention-2 backend for FP8 KV cache.
INFO 08-01 13:49:58 selector.py:54] Using XFormers backend.
INFO 08-01 13:49:59 model_runner.py:719] Starting to load model neuralmagic/Mistral-Nemo-Instruct-2407-FP8...
WARNING 08-01 13:49:59 fp8.py:43] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
INFO 08-01 13:49:59 selector.py:161] Cannot use FlashAttention-2 backend for FP8 KV cache.
INFO 08-01 13:49:59 selector.py:54] Using XFormers backend.
INFO 08-01 13:50:00 weight_utils.py:225] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:00<00:01,  1.99it/s]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:01<00:00,  1.88it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  2.07it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  2.03it/s]
WARNING 08-01 13:50:01 utils.py:563] Using KV cache scaling factor 1.0 for fp8_e4m3. This may cause accuracy issues. Please make sure k/v_scale scaling factors are available in the fp8 checkpoint.
INFO 08-01 13:50:02 model_runner.py:731] Loading model weights took 12.9013 GB
INFO 08-01 13:50:02 gpu_executor.py:102] # GPU blocks: 7683, # CPU blocks: 6553
INFO 08-01 13:50:05 model_runner.py:1018] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-01 13:50:05 model_runner.py:1022] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-01 13:50:05 model_runner.py:1219] Graph capturing finished in 1 secs.
WARNING 08-01 13:50:05 serving_embedding.py:170] embedding_mode is False. Embedding API will not work.
INFO 08-01 13:50:05 api_server.py:284] Available routes are:
INFO 08-01 13:50:05 api_server.py:289] Route: /openapi.json, Methods: HEAD, GET
INFO 08-01 13:50:05 api_server.py:289] Route: /docs, Methods: HEAD, GET
INFO 08-01 13:50:05 api_server.py:289] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 08-01 13:50:05 api_server.py:289] Route: /redoc, Methods: HEAD, GET
INFO 08-01 13:50:05 api_server.py:289] Route: /health, Methods: GET
INFO 08-01 13:50:05 api_server.py:289] Route: /tokenize, Methods: POST
INFO 08-01 13:50:05 api_server.py:289] Route: /detokenize, Methods: POST
INFO 08-01 13:50:05 api_server.py:289] Route: /v1/models, Methods: GET
INFO 08-01 13:50:05 api_server.py:289] Route: /version, Methods: GET
INFO 08-01 13:50:05 api_server.py:289] Route: /v1/chat/completions, Methods: POST
INFO 08-01 13:50:05 api_server.py:289] Route: /v1/completions, Methods: POST
INFO 08-01 13:50:05 api_server.py:289] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 08-01 13:50:15 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-01 13:50:25 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-01 13:50:27 logger.py:36] Received request cmpl-d125a3a3f6eb46e694abf0b763a200ff-0: prompt: '<s>[INST]Hello[/INST]', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16384, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1, 3, 22177, 4], lora_request: None, prompt_adapter_request: None.
INFO 08-01 13:50:27 async_llm_engine.py:173] Added request cmpl-d125a3a3f6eb46e694abf0b763a200ff-0.
INFO:     172.11.2.3:46498 - "POST /v1/completions HTTP/1.1" 200 OK
ERROR 08-01 13:50:28 async_llm_engine.py:56] Engine background task failed
ERROR 08-01 13:50:28 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1222, in ast_to_ttir
Exception in callback _log_task_completion(error_callback=<bound method...7fa8468bbfd0>>)(<Task finishe...d (uint8)')")>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:36
ERROR 08-01 13:50:28 async_llm_engine.py:56]     generator.visit(fn.parse())
handle: <Handle _log_task_completion(error_callback=<bound method...7fa8468bbfd0>>)(<Task finishe...d (uint8)')")>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:36>
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
Traceback (most recent call last):
ERROR 08-01 13:50:28 async_llm_engine.py:56]     ret = super().visit(node)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1222, in ast_to_ttir
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/lib/python3.10/ast.py", line 418, in visit
    generator.visit(fn.parse())
ERROR 08-01 13:50:28 async_llm_engine.py:56]     return visitor(node)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 303, in visit_Module
    ret = super().visit(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]     ast.NodeVisitor.generic_visit(self, node)
  File "/usr/lib/python3.10/ast.py", line 418, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/lib/python3.10/ast.py", line 426, in generic_visit
    return visitor(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]     self.visit(item)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 303, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
  File "/usr/lib/python3.10/ast.py", line 426, in generic_visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     ret = super().visit(node)
    self.visit(item)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/lib/python3.10/ast.py", line 418, in visit
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     return visitor(node)
    ret = super().visit(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 376, in visit_FunctionDef
  File "/usr/lib/python3.10/ast.py", line 418, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     self.visit_compound_statement(node.body)
    return visitor(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 298, in visit_compound_statement
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 376, in visit_FunctionDef
ERROR 08-01 13:50:28 async_llm_engine.py:56]     ret_type = self.visit(stmt)
    self.visit_compound_statement(node.body)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     ret = super().visit(node)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 298, in visit_compound_statement
    ret_type = self.visit(stmt)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/lib/python3.10/ast.py", line 418, in visit
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     return visitor(node)
    ret = super().visit(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 885, in visit_For
  File "/usr/lib/python3.10/ast.py", line 418, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     self.visit_compound_statement(node.body)
    return visitor(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 298, in visit_compound_statement
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 885, in visit_For
ERROR 08-01 13:50:28 async_llm_engine.py:56]     ret_type = self.visit(stmt)
    self.visit_compound_statement(node.body)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 298, in visit_compound_statement
ERROR 08-01 13:50:28 async_llm_engine.py:56]     ret = super().visit(node)
    ret_type = self.visit(stmt)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/lib/python3.10/ast.py", line 418, in visit
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     return visitor(node)
    ret = super().visit(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 448, in visit_AugAssign
  File "/usr/lib/python3.10/ast.py", line 418, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     self.visit(assign)
    return visitor(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 448, in visit_AugAssign
ERROR 08-01 13:50:28 async_llm_engine.py:56]     ret = super().visit(node)
    self.visit(assign)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/lib/python3.10/ast.py", line 418, in visit
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     return visitor(node)
    ret = super().visit(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 428, in visit_Assign
  File "/usr/lib/python3.10/ast.py", line 418, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     values = self.visit(node.value)
    return visitor(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 428, in visit_Assign
    values = self.visit(node.value)
ERROR 08-01 13:50:28 async_llm_engine.py:56]     ret = super().visit(node)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
    ret = super().visit(node)
  File "/usr/lib/python3.10/ast.py", line 418, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/lib/python3.10/ast.py", line 418, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     return visitor(node)
    return visitor(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 477, in visit_BinOp
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 477, in visit_BinOp
ERROR 08-01 13:50:28 async_llm_engine.py:56]     rhs = self.visit(node.right)
    rhs = self.visit(node.right)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     ret = super().visit(node)
    ret = super().visit(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/lib/python3.10/ast.py", line 418, in visit
  File "/usr/lib/python3.10/ast.py", line 418, in visit
ERROR 08-01 13:50:28 async_llm_engine.py:56]     return visitor(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1027, in visit_Call
    return visitor(node)
ERROR 08-01 13:50:28 async_llm_engine.py:56]     return fn(*args, **extra_kwargs, **kws)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 27, in wrapper
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1027, in visit_Call
ERROR 08-01 13:50:28 async_llm_engine.py:56]     return fn(*args, **kwargs)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 1018, in dot
    return fn(*args, **extra_kwargs, **kws)
ERROR 08-01 13:50:28 async_llm_engine.py:56]     return semantic.dot(input, other, acc, allow_tf32, max_num_imprecise_acc, out_dtype, _builder)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1207, in dot
  File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 27, in wrapper
ERROR 08-01 13:50:28 async_llm_engine.py:56]     assert_dtypes_valid(lhs.dtype, rhs.dtype, builder.options)
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1190, in assert_dtypes_valid
    return fn(*args, **kwargs)
ERROR 08-01 13:50:28 async_llm_engine.py:56]     assert lhs_dtype == rhs_dtype, f"Both operands must be same type. First operand ({lhs_dtype}) and second operand ({rhs_dtype})"
  File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 1018, in dot
ERROR 08-01 13:50:28 async_llm_engine.py:56] AssertionError: Both operands must be same type. First operand (bf16) and second operand (uint8)
    return semantic.dot(input, other, acc, allow_tf32, max_num_imprecise_acc, out_dtype, _builder)
ERROR 08-01 13:50:28 async_llm_engine.py:56]
  File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1207, in dot
ERROR 08-01 13:50:28 async_llm_engine.py:56] The above exception was the direct cause of the following exception:
ERROR 08-01 13:50:28 async_llm_engine.py:56]
    assert_dtypes_valid(lhs.dtype, rhs.dtype, builder.options)
ERROR 08-01 13:50:28 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
  File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1190, in assert_dtypes_valid
ERROR 08-01 13:50:28 async_llm_engine.py:56]     return_value = task.result()
    assert lhs_dtype == rhs_dtype, f"Both operands must be same type. First operand ({lhs_dtype}) and second operand ({rhs_dtype})"
ERROR 08-01 13:50:28 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 641, in run_engine_loop
AssertionError: Both operands must be same type. First operand (bf16) and second operand (uint8)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 641, in run_engine_loop
    result = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 584, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 253, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 159, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 273, in execute_model
    output = self.model_runner.execute_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1357, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 422, in forward
    model_output = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 322, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 245, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 175, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 97, in forward
    return self.impl.forward(query,
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/xformers.py", line 598, in forward
    out = PagedAttention.forward_prefix(
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/ops/paged_attn.py", line 208, in forward_prefix
    context_attention_fwd(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/ops/prefix_prefill.py", line 765, in context_attention_fwd
    _fwd_kernel[grid](
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 167, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 416, in run
    self.cache[device][key] = compile(
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 191, in compile
    module = src.make_ir(options)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 117, in make_ir
    return ast_to_ttir(self.fn, self, options=options)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1231, in ast_to_ttir
    raise CompilationError(fn.src, node, repr(e)) from e
triton.compiler.errors.CompilationError: at 114:24:        off_v = (
            bn[:, None] * stride_v_cache_bs +
            cur_kv_head * stride_v_cache_h +
            offs_d[None, :] * stride_v_cache_d +
            (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
        k = tl.load(K_cache + off_k,
                    mask=dim_mask[:, None] &
                    ((start_n + offs_n[None, :]) < cur_batch_ctx_len),
                    other=0.0)  # [D,N]

        qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)  # [M,N]
        qk += tl.dot(q, k)
                        ^
AssertionError('Both operands must be same type. First operand (bf16) and second operand (uint8)')
INFO 08-01 13:50:28 async_llm_engine.py:180] Aborted request cmpl-d125a3a3f6eb46e694abf0b763a200ff-0.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1222, in ast_to_ttir
    generator.visit(fn.parse())
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
    ret = super().visit(node)
  File "/usr/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 303, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/usr/lib/python3.10/ast.py", line 426, in generic_visit
    self.visit(item)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
    ret = super().visit(node)
  File "/usr/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 376, in visit_FunctionDef
    self.visit_compound_statement(node.body)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 298, in visit_compound_statement
    ret_type = self.visit(stmt)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
    ret = super().visit(node)
  File "/usr/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 885, in visit_For
    self.visit_compound_statement(node.body)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 298, in visit_compound_statement
    ret_type = self.visit(stmt)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
    ret = super().visit(node)
  File "/usr/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 448, in visit_AugAssign
    self.visit(assign)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
    ret = super().visit(node)
  File "/usr/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 428, in visit_Assign
    values = self.visit(node.value)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
    ret = super().visit(node)
  File "/usr/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 477, in visit_BinOp
    rhs = self.visit(node.right)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1105, in visit
    ret = super().visit(node)
  File "/usr/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1027, in visit_Call
    return fn(*args, **extra_kwargs, **kws)
  File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 1018, in dot
    return semantic.dot(input, other, acc, allow_tf32, max_num_imprecise_acc, out_dtype, _builder)
  File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1207, in dot
    assert_dtypes_valid(lhs.dtype, rhs.dtype, builder.options)
  File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1190, in assert_dtypes_valid
    assert lhs_dtype == rhs_dtype, f"Both operands must be same type. First operand ({lhs_dtype}) and second operand ({rhs_dtype})"
AssertionError: Both operands must be same type. First operand (bf16) and second operand (uint8)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 189, in __call__
    with collapse_excgroups():
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_utils.py", line 93, in collapse_excgroups
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 250, in stream_response
    async for chunk in self.body_iterator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 229, in completion_stream_generator
    async for prompt_idx, res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 333, in consumer
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 324, in consumer
    raise item
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 308, in producer
    async for item in iterator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 776, in generate
    async for output in self._process_request(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 892, in _process_request
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 888, in _process_request
    async for request_output in stream:
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 93, in __anext__
    raise result
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 641, in run_engine_loop
    result = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 584, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 253, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 159, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 273, in execute_model
    output = self.model_runner.execute_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1357, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 422, in forward
    model_output = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 322, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 245, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 175, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 97, in forward
    return self.impl.forward(query,
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/xformers.py", line 598, in forward
    out = PagedAttention.forward_prefix(
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/ops/paged_attn.py", line 208, in forward_prefix
    context_attention_fwd(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/ops/prefix_prefill.py", line 765, in context_attention_fwd
    _fwd_kernel[grid](
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 167, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 416, in run
    self.cache[device][key] = compile(
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 191, in compile
    module = src.make_ir(options)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 117, in make_ir
    return ast_to_ttir(self.fn, self, options=options)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/code_generator.py", line 1231, in ast_to_ttir
    raise CompilationError(fn.src, node, repr(e)) from e
triton.compiler.errors.CompilationError: at 114:24:        off_v = (
            bn[:, None] * stride_v_cache_bs +
            cur_kv_head * stride_v_cache_h +
            offs_d[None, :] * stride_v_cache_d +
            (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
        k = tl.load(K_cache + off_k,
                    mask=dim_mask[:, None] &
                    ((start_n + offs_n[None, :]) < cur_batch_ctx_len),
                    other=0.0)  # [D,N]

        qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)  # [M,N]
        qk += tl.dot(q, k)
                        ^
AssertionError('Both operands must be same type. First operand (bf16) and second operand (uint8)')
mgoin commented 3 weeks ago
INFO 08-01 13:49:58 config.py:482] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
WARNING 08-01 13:49:58 arg_utils.py:766] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 08-01 13:49:58 config.py:818] Chunked prefill is enabled with max_num_batched_tokens=512.

The FP8 KV cache is being enabled here, and it does not seem to be compatible with chunked prefill. I think we should disable chunked prefill in this case.

EDIT: this doesn't seem to trigger for Llama FP8, so maybe it is an issue with MoE models
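Until that is sorted out, explicitly disabling chunked prefill (as the warning above already suggests) should keep long-context FP8 KV cache runs off the Triton prefix-prefill kernel that rejects the uint8 cache. Below is a minimal offline sketch of that workaround; the model path is a placeholder, not one of the checkpoints from this issue:

```python
from vllm import LLM, SamplingParams

# Workaround sketch (not a fix): keep the FP8 KV cache but turn chunked
# prefill off explicitly, so prefill never hits the prefix-prefill kernel.
# "/path/to/fp8-model" is a placeholder path.
llm = LLM(
    model="/path/to/fp8-model",
    kv_cache_dtype="fp8",
    max_model_len=65536,           # >32K would otherwise enable chunked prefill by default
    enable_chunked_prefill=False,  # same effect as --enable-chunked-prefill=False on the server
)

outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```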