qiyuxinlin opened this issue 1 month ago
In theory it should be running exactly the same code, but without a reproducer it's hard to tell. It's possible your benchmarks are confused by external factors, like the caches being warmer for one run or the GPU clock rate changing between runs.
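For example, one common way to reduce the warm-cache effect is to do a number of untimed warm-up launches before measuring with CUDA events. This is a generic sketch of such a helper (not Triton's own `do_bench`, just an illustration of the idea):

```python
import torch


def time_kernel(fn, warmup_iters=25, timed_iters=100):
    """Time a GPU callable with CUDA events, after warm-up runs.

    Warm-up launches amortize one-time costs (JIT compilation, cold caches)
    so they don't pollute the measurement.
    """
    for _ in range(warmup_iters):
        fn()
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(timed_iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / timed_iters  # average ms per launch
```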
I'm fairly sure it's not other factors, because I tested it many times on an otherwise idle GPU, and the performance gap between runs with and without autotune is huge. In my earlier tests, the fused-attention example from the official website was also faster than Flash-Attention. I suspect it only reaches that speed under autotune, but I haven't tested this.
Could it be that autotune caches some data while selecting the best configuration?
Could you provide a script that reproduces the performance difference you're seeing?
By printing the autotune results I got the best configuration, but when I use that configuration directly and benchmark it inside the do_bench function, it is not as fast as in autotune mode, and the gap is large. I want to know what additional operations autotune performs. Does this mean I can't reach autotune's speed in production? Here is the result autotune gave me:

BLOCK_M: 64, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 3, enable_warp_specialization: False, enable_persistent: False
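To make the comparison concrete, this is roughly what I mean, shown as a minimal sketch (a simple vector-add kernel with placeholder BLOCK_SIZE/num_warps/num_stages values, not the attention kernel discussed here): the same kernel is timed once through `@triton.autotune` and once launched directly with a hard-coded configuration, both via `triton.testing.do_bench`.

```python
import torch
import triton
import triton.language as tl
from triton.testing import do_bench


@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 512}, num_warps=4, num_stages=3),
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=8, num_stages=3),
    ],
    key=["n_elements"],
)
@triton.jit
def add_autotuned(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


@triton.jit
def add_fixed(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


n = 1 << 24
x = torch.rand(n, device="cuda")
y = torch.rand(n, device="cuda")
out = torch.empty_like(x)

# Autotuned launch: the tuner chooses BLOCK_SIZE/num_warps/num_stages itself,
# so the grid is a callable that reads the selected config.
grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
ms_auto = do_bench(lambda: add_autotuned[grid](x, y, out, n))

# Fixed-config launch: pass the configuration autotune reported as best
# (values here are placeholders) directly as launch arguments.
grid_fixed = (triton.cdiv(n, 1024),)
ms_fixed = do_bench(
    lambda: add_fixed[grid_fixed](x, y, out, n, BLOCK_SIZE=1024, num_warps=8, num_stages=3)
)

print(f"autotuned: {ms_auto:.3f} ms, fixed config: {ms_fixed:.3f} ms")
```

In my case, the second measurement (fixed config) is noticeably slower than the first, even though the configuration is the one autotune selected.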