Yep, I can confirm. I think it's undocumented that using chunked prefill and fp8 KV cache together is not supported? I get this error on a dual-4090 machine:
2024-06-03T14:15:05.332567820Z raise CompilationError(fn.src, node, repr(e)) from e
2024-06-03T14:15:05.332573240Z triton.compiler.errors.CompilationError: at 114:24: off_v = (
2024-06-03T14:15:05.332578110Z bn[:, None] * stride_v_cache_bs +
2024-06-03T14:15:05.332588389Z cur_kv_head * stride_v_cache_h +
2024-06-03T14:15:05.332593288Z offs_d[None, :] * stride_v_cache_d +
2024-06-03T14:15:05.332598147Z (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
2024-06-03T14:15:05.332602996Z k = tl.load(K_cache + off_k,
2024-06-03T14:15:05.332607825Z mask=dim_mask[:, None] &
2024-06-03T14:15:05.332612695Z ((start_n + offs_n[None, :]) < cur_batch_ctx_len),
2024-06-03T14:15:05.332617564Z other=0.0) # [D,N]
2024-06-03T14:15:05.332622383Z
2024-06-03T14:15:05.332627142Z qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32) # [M,N]
2024-06-03T14:15:05.332632532Z qk += tl.dot(q, k)
2024-06-03T14:15:05.332637411Z ^
2024-06-03T14:15:05.332642260Z AssertionError('Both operands must be same type. First operand (fp16) and second operand (uint8)')
Some other engine args that I used, in case they're relevant:
--quantization gptq --dtype float16 --enforce-eager --tensor-parallel-size 2
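In case it's useful, here is roughly how that setup maps onto the offline API; this is only a sketch under the assumption that the keyword arguments mirror the CLI flags, and the model name is a placeholder GPTQ checkpoint, not the one I actually ran:

```python
# Hedged reproduction sketch: the model name is a placeholder and the keyword
# arguments are assumed to correspond 1:1 to the engine flags listed above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-13B-GPTQ",  # placeholder GPTQ checkpoint
    quantization="gptq",
    dtype="float16",
    enforce_eager=True,
    tensor_parallel_size=2,
    kv_cache_dtype="fp8",
    enable_chunked_prefill=True,  # combining this with fp8 KV cache hits the error above
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```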
Let me make a PR to raise an error for now. cc @comaniac, I believe you made this work before. Did you use KV cache dtype fp8?
It should work with the xFormers backend with paged attention, but I'm not sure whether that works with GPTQ.
Same issue here. I am using Llama 3.1 8B, which has a 128k context length. Chunked prefill is automatically enabled for models above a certain sequence length (128k exceeds it), and I found that I had to set --enable-chunked-prefill False in order to use --kv-cache-dtype fp8.
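For anyone else hitting this, a short sketch of that workaround through the offline API (the model name is just an example of a 128k-context checkpoint; the key point is passing the flag explicitly, since leaving it unset lets the long-context default re-enable chunked prefill):

```python
# Workaround sketch: pass enable_chunked_prefill=False explicitly so the
# automatic long-context default does not turn it back on. Model name is an example.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # 128k context length
    kv_cache_dtype="fp8",
    enable_chunked_prefill=False,  # unset would be auto-enabled for long-context models
)
```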
That's not expected. I'll file a PR to automatically disable chunked prefill for now if fp8 kv-cache is enabled.
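For context, a hypothetical sketch of what such a guard could look like during engine-argument validation; the function and attribute names are illustrative, not the actual vLLM code:

```python
# Hypothetical validation guard: if fp8 KV cache is requested together with
# chunked prefill, fall back to non-chunked prefill instead of failing at runtime.
import logging

logger = logging.getLogger(__name__)

def reconcile_chunked_prefill(enable_chunked_prefill: bool, kv_cache_dtype: str) -> bool:
    """Return the effective chunked-prefill setting for the given KV cache dtype."""
    if enable_chunked_prefill and kv_cache_dtype.startswith("fp8"):
        # The Triton prefix-prefill kernel cannot yet consume the fp8 (uint8) cache.
        logger.warning(
            "Chunked prefill is not supported with kv_cache_dtype=%s; disabling it.",
            kv_cache_dtype,
        )
        return False
    return enable_chunked_prefill
```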
I know it's super long, but here's the full trace (on a Tesla V100, hence fp16 instead of bf16):
Seems like a Triton kernel issue, and it looks fixable. Let me take a look.
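For anyone following along, here is a minimal standalone sketch of the mismatch and the kind of cast a kernel-side fix might use. The kernel name, block size, and the choice of the e5m2 fp8 variant are my assumptions, not taken from the vLLM kernel, and it assumes a recent Triton plus a GPU that supports fp8-to-fp16 conversion:

```python
# Minimal illustration of the fp16-vs-uint8 tl.dot failure and one way around it.
# Everything here is illustrative; it is not the vLLM prefix-prefill kernel.
import torch
import triton
import triton.language as tl

BLOCK = 16  # tl.dot requires block dimensions of at least 16

@triton.jit
def demo_dot(q_ptr, k_ptr, out_ptr, BLOCK: tl.constexpr):
    rows = tl.arange(0, BLOCK)[:, None]
    cols = tl.arange(0, BLOCK)[None, :]
    q = tl.load(q_ptr + rows * BLOCK + cols)      # fp16 query tile
    k_raw = tl.load(k_ptr + rows * BLOCK + cols)  # uint8 tile, i.e. raw fp8 cache bytes
    # tl.dot(q, k_raw) fails to compile with the same "fp16 vs uint8" assertion as
    # in the trace above; reinterpret the bytes as fp8 and upcast before the dot.
    k = k_raw.to(tl.float8e5, bitcast=True).to(tl.float16)
    acc = tl.dot(q, k)                            # fp32 accumulator
    tl.store(out_ptr + rows * BLOCK + cols, acc)

q = torch.randn(BLOCK, BLOCK, device="cuda", dtype=torch.float16)
k_bytes = torch.randint(0, 256, (BLOCK, BLOCK), device="cuda", dtype=torch.uint8)
out = torch.empty(BLOCK, BLOCK, device="cuda", dtype=torch.float32)
demo_dot[(1,)](q, k_bytes, out, BLOCK=BLOCK)
print(out)
```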
Also, is this comment still relevant?
This PR is directly relevant as the solution: https://github.com/vllm-project/vllm/pull/3234 (although the PR is incomplete and incorrect)
Additional context on FP8 KV Cache: https://github.com/vllm-project/vllm/issues/4532 with solution: https://github.com/vllm-project/vllm/pull/4893
Btw, why is this not on the testing path? Where should such a test be added as a regression test?
Your current environment
H100 (but I believe it happens on any machine)
🐛 Describe the bug
Chunked prefill combined with fp8 KV cache seems to be broken with a type-incompatibility error.