triton-lang / triton

Development repository for the Triton language and compiler
https://triton-lang.org/
MIT License

CSE and LICM don't work as expected with exp in the loop #2961

Open Li-dongyang opened 7 months ago

Li-dongyang commented 7 months ago

I noticed that the comment

"CSE and LICM don't work as expected with exp in the loop"

appears in /python/triton/ops/flash_attention.py (credits to Adam P. Goucher @apgoucher).

Can someone explain the reason for this remark? Has the problem since been solved? Thank you so much.

https://github.com/openai/triton/blob/e2bdc8973feb41fc60d31472bdbe3b80c3ad8405/python/triton/ops/flash_attention.py#L59-L63
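For context, here is a minimal, hypothetical MLIR example (not taken from flash_attention.py) of the kind of pattern the comment refers to: a loop-invariant exponential inside a loop, which LICM should hoist and CSE should de-duplicate as long as the compiler can see the op is pure:

```mlir
// Hypothetical standalone example. The exponential's operand %x is
// defined outside the loop, so the op is loop-invariant. Because
// math.exp is declared pure, upstream -loop-invariant-code-motion can
// hoist it out of the loop and -cse can remove duplicate copies.
func.func @hoistable(%x: f32, %lb: index, %ub: index, %step: index) -> f32 {
  %init = arith.constant 0.0 : f32
  %sum = scf.for %i = %lb to %ub step %step iter_args(%acc = %init) -> (f32) {
    %e = math.exp %x : f32            // loop-invariant, hoistable
    %next = arith.addf %acc, %e : f32
    scf.yield %next : f32
  }
  return %sum : f32
}
```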

lipracer commented 7 months ago

This may be an issue with upstream MLIR; I will investigate first.

lipracer commented 7 months ago

I printed out the MLIR and found that the exp operation is constructed in this form:

```mlir
%146 = tt.extern_elementwise %145 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32, #triton_gpu.slice<{dim = 1, parent = #mma}>>) -> tensor<128xf32, #triton_gpu.slice<{dim = 1, parent = #mma}>> loc(#loc36)
```

Why isn't the MLIR math dialect used here? I also found that exp is converted to exp2 in the convert-triton-gpu-to-llvm pass (presumably via the identity exp(x) = exp2(x * log2(e))). I don't know much about this part of the code, but if we built math.exp directly and only converted it to exp2 during lowering, that would not block MLIR's optimizations such as CSE and LICM.
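For comparison, here is a sketch of what the printed op might look like if it were built with the MLIR math dialect instead (assuming the same tensor type and #triton_gpu encoding are kept); math.exp carries the pure/speculatable traits that upstream CSE and LICM key on:

```mlir
// Hypothetical math-dialect form of the same operation.
%146 = math.exp %145 : tensor<128xf32, #triton_gpu.slice<{dim = 1, parent = #mma}>>
```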