intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

[Productize Flash Attention performance #5] replace Q with slm store/load in attention #1463

Closed · Dewei-Wang-sh closed this 1 month ago

quintinwang5 commented 1 month ago

This feature depends on https://github.com/intel/intel-xpu-backend-for-triton/issues/1461. Currently getting the following failures:

[convert-triton-to-tritongpu-warp]: ***********************************************

[convert-triton-to-tritongpu-warp]: this has tt.dot, but workload do not match any

[convert-triton-to-tritongpu-warp]: ***********************************************

Coredump:
LLVM ERROR: TritonGPU module should contain a triton_gpu.num-warps attribute
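For context, this error indicates that downstream TritonGPU passes expect the module to carry a `triton_gpu.num-warps` attribute, which is normally attached when `convert-triton-to-tritongpu-warp` recognizes the workload; since the pass reports that the `tt.dot` workload does not match any known pattern, the attribute is never set. A minimal sketch of a well-formed TritonGPU module header (the attribute values here are illustrative, not taken from this issue):

```mlir
// Hypothetical module header after a successful conversion pass.
// "triton_gpu.num-warps" is the attribute the LLVM ERROR refers to;
// the concrete values (4 warps, 16 threads/warp) are assumptions.
module attributes {"triton_gpu.num-warps" = 4 : i32,
                   "triton_gpu.threads-per-warp" = 16 : i32} {
  // ... lowered tt.func / tt.dot ops would appear here ...
}
```

In other words, the coredump is a downstream symptom of the pattern-match failure above, not an independent bug.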
Dewei-Wang-sh commented 1 month ago

PR merged.