tchaton opened this issue 1 year ago

I am trying to use DeepSpeed Inference with Diffusers on a T4 GPU, but it fails with a Triton error. I reported the bug on DeepSpeed for better tracking: https://github.com/microsoft/DeepSpeed/issues/2702. Here is the error trace associated with the inference; it seems related to Triton caching.
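In case a stale compiled-kernel cache is the culprit, here is a minimal way to rule it out (assuming Triton's default cache location, ~/.triton/cache, and that no custom TRITON_CACHE_DIR is set):

```python
import shutil
from pathlib import Path

# Triton caches compiled kernels under ~/.triton/cache by default; deleting
# the directory forces a clean recompilation on the next run.
cache_dir = Path.home() / ".triton" / "cache"
if cache_dir.exists():
    shutil.rmtree(cache_dir)
```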
It would be better if you could attach the kernel that caused the problem. Is it triton_flash_attn_kernel?
Hey @jokeren, yes, it is. I didn't include it since it was in the trace; apologies for that. Here it is: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/ops/transformer/inference/triton_ops.py#L119. Do you have any idea why there is a KeyError and how to debug it? It would be fantastic if I could run this model on a T4.
Flash attention used to work on A100 only, for reasons I don't remember clearly 🤣 @ptillet, is that still true?
Note: I tried the original Flash Attention and it seems to produce the same results as the Triton version, but it works on T4 and is slightly faster. I am not blocked anymore, but it would be great to have this resolved. https://github.com/Lightning-AI/stablediffusion/pull/8
If I remember correctly, the forward pass used to work on pre-Ampere hardware, but the backward pass only worked on Ampere and later. It may be the case that now neither works on Turing :D I'm still working on A100 performance optimizations, but I agree that the forward pass should work well on all hardware. I don't think there's any major roadblock against this. I'll look into it.
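For anyone gating these code paths at runtime, here is a sketch of the capability check implied above; the sm_75 / sm_80 cutoffs are my assumption (T4 is Turing sm_75, A100 is Ampere sm_80):

```python
import torch

# Gate flash-attention paths by compute capability, per the behavior
# described in this thread (an assumption, not an official support matrix).
major, _minor = torch.cuda.get_device_capability()
flash_fwd_ok = major >= 7   # fwd pass reportedly works pre-Ampere (e.g. T4)
flash_bwd_ok = major >= 8   # bwd pass reportedly needs Ampere+ (e.g. A100)
```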
Hey @ptillet. Thanks for the update. Please, ping me once you have a PR ready.
I wonder whether there has been any progress on making flash attention work with the latest Triton on a T4 GPU? Is the forward pass working, at least?
The new tutorial should work on pre-Ampere hardware for the fwd pass: https://github.com/openai/triton/blob/main/python/tutorials/06-fused-attention.py. Let me know if it doesn't.
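A minimal fwd-only smoke test against that tutorial, assuming 06-fused-attention.py has been copied locally as fused_attention.py and exposes the tutorial's `attention = _attention.apply` entry point with a (q, k, v, sm_scale) signature (check the file for the exact current signature):

```python
import torch
from fused_attention import attention  # hypothetical local copy of the tutorial

# (batch, heads, seq_len, head_dim) in fp16, as in the tutorial's benchmark.
q, k, v = (torch.randn(1, 4, 1024, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))
out = attention(q, k, v, 0.5)  # fwd only; skip .backward() on pre-Ampere GPUs
print(out.shape)  # expect torch.Size([1, 4, 1024, 64])
```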
It failed in my test running the fwd pass on a T4. My env: Ubuntu 22.04, with cuda_11.7.1_515.65.01_linux installed. The error message is as follows [from the latest head of the triton main repo at 34817ecc954a6f4ca7b4dfb352fdde1f8bd49ca5]:

python: /mnt/styoun/projects/triton/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector

[triton-2.0.0.post1]

error: 'tt.reduce' op inferred type(s) 'tensor<128xf32, #triton_gpu.slice<{dim = 1, parent = #triton_gpu.mma<{versionMajor = 1, versionMinor = 0, warpsPerCTA = [4, 1]}>}>>' are incompatible with return type(s) of operation 'tensor<128xf32, #triton_gpu.slice<{dim = 1, parent = #triton_gpu.mma<{versionMajor = 1, versionMinor = 2, warpsPerCTA = [2, 2]}>}>>'
Traceback (most recent call last):
File "

[updates] I tried the same from pytorch 1.12.0+cu113, but it failed again as before, with the same error messages.
Is there a specific CUDA version and Triton commit that makes flash attention work on a T4? If so, please let me know.
I also tried on a V100 with torch 1.12 + cu116 and the latest release, triton==2.0.0.post1, but it failed with the same error:

error: 'tt.reduce' op inferred type(s) 'tensor<128xf32, #triton_gpu.slice<{dim = 1, parent = #triton_gpu.mma<{versionMajor = 1, versionMinor = 0, warpsPerCTA = [4, 1]}>}>>' are incompatible with return type(s) of operation 'tensor<128xf32, #triton_gpu.slice<{dim = 1, parent = #triton_gpu.mma<{versionMajor = 1, versionMinor = 2, warpsPerCTA = [2, 2]}>}>>'
Traceback (most recent call last):
Yes, we acknowledge the issue concerning the MMA-to-MMA conversion error. While it hasn't been our top priority due to existing workarounds, we have indeed raised its importance a bit. As such, we anticipate a resolution in the near future.
Thanks for the reply. I wonder what workaround I can use to make it run on a T4. Is there a specific release or commit of Triton that works, at least for the flash attention forward pass?
I thought I mentioned it somewhere else, but I cannot remember where. You can try to store the result of the dot to a piece of temporary global memory and then reload it into a tensor.
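If it helps, here is a minimal single-block sketch of that trick; the kernel, shapes, and scratch-buffer handling are illustrative, not the actual flash-attention kernel. The first tl.dot result is spilled to a caller-allocated global buffer and reloaded, so the second tl.dot never consumes an MMA-layout operand directly:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def two_dots_kernel(q_ptr, k_ptr, v_ptr, tmp_ptr, out_ptr,
                    N: tl.constexpr, D: tl.constexpr):
    offs_n = tl.arange(0, N)
    offs_d = tl.arange(0, D)
    # Q (N x D), K loaded transposed as (D x N), V (N x D); all row-major.
    q = tl.load(q_ptr + offs_n[:, None] * D + offs_d[None, :])
    kT = tl.load(k_ptr + offs_n[None, :] * D + offs_d[:, None])
    v = tl.load(v_ptr + offs_n[:, None] * D + offs_d[None, :])

    s = tl.dot(q, kT)  # first MMA; `s` carries an MMA layout

    # Workaround: round-trip `s` through temporary global memory so the
    # second tl.dot reloads it in a fresh layout instead of converting
    # MMA -> MMA (the conversion that crashes on Turing/Volta).
    tl.store(tmp_ptr + offs_n[:, None] * N + offs_n[None, :], s)
    tl.debug_barrier()
    s = tl.load(tmp_ptr + offs_n[:, None] * N + offs_n[None, :])

    o = tl.dot(s.to(tl.float16), v)  # second MMA
    tl.store(out_ptr + offs_n[:, None] * D + offs_d[None, :], o)


N, D = 64, 64
q = torch.randn(N, D, device="cuda", dtype=torch.float16)
k = torch.randn(N, D, device="cuda", dtype=torch.float16)
v = torch.randn(N, D, device="cuda", dtype=torch.float16)
tmp = torch.empty(N, N, device="cuda", dtype=torch.float32)  # scratch buffer
out = torch.empty(N, D, device="cuda", dtype=torch.float32)
two_dots_kernel[(1,)](q, k, v, tmp, out, N=N, D=D)
```

The extra global-memory round trip costs bandwidth, but it can be enough to sidestep the conversion until a compiler fix lands.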
MMA conversion was supposedly fixed in #2627 I think?