[Performance] Register spill in flash attention forward kernel

intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

MIT License


Issue #2317

Status: Open. Opened by chengjunlu 2 months ago.

chengjunlu commented 2 months ago

The register spill is mainly caused by the long live ranges of values defined by tt.load operations in the Triton kernel.

Rescheduling each tt.load operation close to its user eliminates the register spill in large GRF mode.
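To illustrate why moving a load next to its user helps, here is a toy Python model. It is not Triton's or IGC's scheduler, only a sketch under stated assumptions: the kernel is a straight-line list of SSA instructions, register pressure is approximated as the peak number of simultaneously live values, there are no stores that could alias the loads, and load addresses are kernel arguments rather than results of other loads. The names Instr, max_live and sink_loads are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    dest: str        # SSA value defined by this instruction
    op: str          # "load" or any compute op (hypothetical toy IR)
    operands: tuple  # SSA values or kernel arguments used


def max_live(instrs):
    """Peak count of simultaneously live SSA values (toy register pressure)."""
    def_idx, last_use = {}, {}
    for i, ins in enumerate(instrs):
        for v in ins.operands:
            if v in def_idx:  # names never defined here are kernel arguments
                last_use[v] = i
        def_idx[ins.dest] = i
    peak = 0
    for i in range(len(instrs)):
        # A value is live from its definition up to its last use.
        live = sum(1 for v, d in def_idx.items() if d <= i <= last_use.get(v, d))
        peak = max(peak, live)
    return peak


def sink_loads(instrs):
    """Move every load to just before its first user (assumes no aliasing stores)."""
    loads = [ins for ins in instrs if ins.op == "load"]
    others = [ins for ins in instrs if ins.op != "load"]
    result, placed = [], set()
    for ins in others:
        for load in loads:
            if load.dest in ins.operands and load.dest not in placed:
                result.append(load)
                placed.add(load.dest)
        result.append(ins)
    # Loads with no user keep their place at the end of the block.
    result.extend(ld for ld in loads if ld.dest not in placed)
    return result


# All loads issued up front, as in the spilling kernel (hypothetical example).
prog = [
    Instr("a", "load", ("pa",)),
    Instr("b", "load", ("pb",)),
    Instr("c", "load", ("pc",)),
    Instr("d", "load", ("pd",)),
    Instr("x", "mul", ("a", "b")),
    Instr("y", "mul", ("x", "c")),
    Instr("z", "mul", ("y", "d")),
]
print(max_live(prog), max_live(sink_loads(prog)))  # prints: 5 3
```

In this toy model the peak live-value count drops from 5 to 3 once each load sits next to its consumer; in a real kernel the loaded values are whole tensor tiles, so the same shortening of live ranges saves far more GRF space.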

The IGC team is working on improving instruction scheduling during code generation.

Until that is available in IGC, we need a quick workaround for the flash attention case on the Triton side.

chengjunlu commented 2 months ago

Comments from @whitneywhtsang:

Note: A new Agama release should be available this week, and it contains an improvement to instruction scheduling. IMO, we should rely on IGC instruction scheduling instead of implementing our own in Triton.

Let's wait for the new Agama release.

chengjunlu commented 2 weeks ago

After working closely with the IGC team, we can reduce the register spill size to 0 in the default pass.

I need to double-check that the new internal IGC driver works properly, to make sure all the changes are included in the latest IGC driver.

After that, we will wait for the new Agama release.