Why does the running speed of triton in automatic tuning mode be much faster than that of directly giving the optimal configuration?

triton-lang / triton

Development repository for the Triton language and compiler

https://triton-lang.org/

MIT License

13.52k stars 1.67k forks source link

Why does the running speed of triton in automatic tuning mode be much faster than that of directly giving the optimal configuration? #4956

Open qiyuxinlin opened 1 month ago

qiyuxinlin commented 1 month ago

By printing the results of autotune, I got the best configuration, but if I use the best configuration directly and test within do_bench funciton, the speed is not as fast as in autotune mode, and the speed difference is much different. I want to know what additional operations have done by autotune. Does this mean I can't achieve the speed of autotune in production? Here is autotune give me the result: BLOCK_M: 64, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 3, enable_warp_specialization: False, enable_persistent: False

peterbell10 commented 1 month ago

In theory it should be running exactly the same code, but without a reproducer it's hard to tell. It's possible your benchmarks are confused by external factors like the caches being more warm for one run or the clock rate of the GPU changing between runs.

qiyuxinlin commented 1 month ago

I'm pretty sure it's not affected by other factors, because I tested it many times on an empty GPU, and the performance gap between with and without autotune is huge. The fused-attention given by the official website of my previous test was also better than flash- Attention.I suspect that it can only reach this speed in autotune, but I have not tested it

qiyuxinlin commented 1 month ago

Could it be that it caches some data when selecting the best configuration?

peterbell10 commented 1 month ago

Could you provide a script that reproduces the performance difference you're seeing?