intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

[Performance] Register spill in flash attention forward kernel #2317

Open chengjunlu opened 1 month ago

chengjunlu commented 1 month ago

The register spill is mainly caused by the long live range of the value defined by the tt.load operation in the Triton kernel.

Rescheduling the tt.load operation close to its user eliminates the register spill in large GRF mode.
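To illustrate why moving a load next to its user shortens its live range, here is a minimal toy model in plain Python. It is only a sketch: it is not how IGC allocates registers, the instruction names (`q`, `k`, `v`, `qk`, `p`, `acc`) are a hypothetical attention-like sequence, and the sizes are arbitrary units rather than real GRF counts.

```python
def peak_pressure(schedule, sizes):
    """Return the peak sum of live value sizes for a straight-line schedule.

    schedule: list of (defined_value, [used_values]) in program order.
    sizes:    dict mapping value name -> size in arbitrary register units.
    A value is treated as live from its definition through its last use.
    """
    # Record the index of the last instruction that reads each value.
    last_use = {}
    for idx, (_, srcs) in enumerate(schedule):
        for src in srcs:
            last_use[src] = idx

    live = set()
    peak = 0
    for idx, (dst, _) in enumerate(schedule):
        live.add(dst)
        peak = max(peak, sum(sizes[v] for v in live))
        # Drop values that are not read by any later instruction.
        live = {v for v in live if last_use.get(v, -1) > idx}
    return peak


# Hypothetical sizes, all equal for simplicity.
sizes = {name: 4 for name in ("q", "k", "v", "qk", "p", "acc")}

# Load of v issued early: v stays live across the qk and p computations.
early = [("q", []), ("k", []), ("v", []),
         ("qk", ["q", "k"]), ("p", ["qk"]), ("acc", ["p", "v"])]

# Load of v sunk next to its only user (acc).
late = [("q", []), ("k", []),
        ("qk", ["q", "k"]), ("p", ["qk"]),
        ("v", []), ("acc", ["p", "v"])]

print(peak_pressure(early, sizes), peak_pressure(late, sizes))
```

In this toy model the early schedule peaks higher than the late one, so with a register budget between the two peaks only the early schedule would spill.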

The IGC team is working on improving instruction scheduling during code generation.

Until that is available in IGC, we need a quick workaround for the flash attention case on the Triton side.

chengjunlu commented 1 month ago

Comments from @whitneywhtsang:

Note: A new agama release should be available this week, and it contains an improvement to instruction scheduling. IMO, we should rely on IGC instruction scheduling instead of implementing one in Triton.

Let's wait for the new agama release.

chengjunlu commented 1 day ago

After working closely with the IGC team, we can reduce the register spill size to 0 on the default pass.

I need to double-check that the new internal IGC driver works properly, to make sure all the changes are available in the latest IGC driver.

Then we wait for the new agama release.