intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

[Performance] Register spill in flash attention forward kernel #2317

Open · chengjunlu opened this issue 6 days ago

chengjunlu commented 6 days ago

The register spill is mainly caused by the long live ranges of values defined by tt.load operations in the Triton kernel.

Rescheduling each tt.load operation close to its user shortens these live ranges and eliminates the register spill in large GRF mode.
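The following toy Python model shows why moving a load next to its user reduces register pressure. It is not IGC's or Triton's actual scheduler. `max_pressure` counts the peak number of simultaneously live values in straight-line code. `sink_loads` is a hypothetical pass that moves each load to just before its first use. The `Inst` record, the opcodes, and both functions are assumptions made for illustration. The model also counts every value as one unit, even though a real tile occupies many GRF registers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inst:
    dest: str                   # value defined by this instruction
    op: str                     # hypothetical opcode: "load", "dot", "exp", "add"
    srcs: tuple[str, ...] = ()  # values read by this instruction


def max_pressure(prog: list[Inst]) -> int:
    """Peak count of simultaneously live values in straight-line code.

    Simplification: an instruction's operands and its result are counted as
    live together, and every value costs one unit (real tiles use many GRFs).
    """
    last_use = {}
    for i, inst in enumerate(prog):
        for s in inst.srcs:
            last_use[s] = i
    live: set[str] = set()
    peak = 0
    for i, inst in enumerate(prog):
        live.add(inst.dest)
        peak = max(peak, len(live))
        # A value dies after its last use; never-used values die immediately.
        live = {v for v in live if last_use.get(v, -1) > i}
    return peak


def sink_loads(prog: list[Inst]) -> list[Inst]:
    """Hypothetical pass: move every load to just before its first user.

    Assumes loads are side-effect free and no store in between aliases them,
    so reordering is legal. Other instructions keep their relative order.
    """
    loads = {inst.dest: inst for inst in prog if inst.op == "load"}
    out: list[Inst] = []
    placed: set[str] = set()

    def place(name: str) -> None:
        # Emit a pending load, after any loads its own operands depend on.
        if name in loads and name not in placed:
            placed.add(name)
            for s in loads[name].srcs:
                place(s)
            out.append(loads[name])

    for inst in prog:
        if inst.op == "load":
            continue
        for s in inst.srcs:
            place(s)
        out.append(inst)
    # Loads with no user are dead; append them at the end in original order.
    out.extend(inst for name, inst in loads.items() if name not in placed)
    return out


# Flash-attention-like toy: all tiles are loaded up front, then consumed.
before = [
    Inst("q", "load"), Inst("k0", "load"), Inst("k1", "load"),
    Inst("v0", "load"), Inst("v1", "load"),
    Inst("s0", "dot", ("q", "k0")), Inst("p0", "exp", ("s0",)),
    Inst("o0", "dot", ("p0", "v0")),
    Inst("s1", "dot", ("q", "k1")), Inst("p1", "exp", ("s1",)),
    Inst("o1", "dot", ("p1", "v1")),
    Inst("acc", "add", ("o0", "o1")),
]
after = sink_loads(before)
print(max_pressure(before), max_pressure(after))  # 6 4
```

In this toy sequence, loading every tile up front keeps up to six values live at once. Sinking the loads lowers the peak to four. Reductions of this kind can let a kernel fit within its register budget instead of spilling.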

IGC is working on improving its instruction scheduling during code generation.

Until that improvement is available in IGC, we need a quick workaround for the flash attention case on the Triton side.

chengjunlu commented 5 days ago

Comments from @whitneywhtsang:

Note: A new agama release should be available this week, and it includes an improvement to instruction scheduling. IMO, we should rely on IGC instruction scheduling instead of implementing our own in Triton.

Let's wait for the new agama.