intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

[Benchmarks][Upstream PyTorch 2.5] `Triton` and `XeTLA` softmax performance degrades in comparison with `torch 2.1` / `ipex 2.1` test proxies #2106

Open ESI-SYD opened 1 week ago

ESI-SYD commented 1 week ago
  1. The Triton/XeTLA ratio stays the same except for attention, where the absolute XeTLA attention numbers degraded.
  2. Both the Triton and XeTLA softmax cases degraded, so the Triton/XeTLA ratio is unchanged.

details: https://github.com/intel/intel-xpu-backend-for-triton/pull/1905#issuecomment-2320701513

vlad-penkin commented 1 week ago

@ESI-SYD what is the root cause of this issue? Can you pinpoint it to a particular torch operation?

@anmyachev, to proceed further with analysis/triaging, please create a minimal reproducer for the Triton kernel path.
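A minimal reproducer of the kind requested might look like the sketch below. This is an illustrative standalone softmax kernel in the style of the Triton tutorials, not the actual benchmark code; the shapes, block size, and `device="xpu"` placement are assumptions.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    # One program instance handles one row of the (contiguous) input.
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK_SIZE)
    mask = offs < n_cols
    x = tl.load(in_ptr + row * n_cols + offs, mask=mask, other=-float("inf"))
    # Numerically stable softmax: subtract the row max before exponentiating.
    x = x - tl.max(x, axis=0)
    num = tl.exp(x)
    out = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * n_cols + offs, out, mask=mask)

# Illustrative shape; assumes the XPU backend is installed and available.
x = torch.randn(4096, 1024, device="xpu")
y = torch.empty_like(x)
BLOCK = triton.next_power_of_2(x.shape[1])
softmax_kernel[(x.shape[0],)](y, x, x.shape[1], BLOCK_SIZE=BLOCK)
torch.testing.assert_close(y, torch.softmax(x, dim=1))
```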

ESI-SYD commented 1 week ago

> @ESI-SYD what is the root cause of this issue? Can you pinpoint it to a particular torch operation?

There are two main changes to the benchmark timing method after applying the draft:

  1. Kernels are submitted without synchronization: https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/python/triton/testing.py#L214

  2. Elapsed time is taken from timestamps between two barriers, which is not accurate; chengjun previously gave a detailed explanation. See the sketch after this list.
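For illustration, here is a hedged sketch contrasting the two timing strategies. The `torch.xpu` event and synchronize calls mirror the `torch.cuda` API available in upstream PyTorch 2.5; the workload, iteration count, and helper names are assumptions, not the benchmark's actual code.

```python
import time
import torch

def time_with_host_sync(fn, n_iters=100):
    # Old-style timing (torch 2.1 / ipex 2.1 proxies): a device sync after
    # every launch, so each iteration measures exactly one kernel execution.
    fn()  # warmup
    torch.xpu.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        fn()
        torch.xpu.synchronize()
    return (time.perf_counter() - start) / n_iters * 1e3  # milliseconds

def time_with_event_timestamps(fn, n_iters=100):
    # New-style timing: kernels are submitted without intermediate syncs and
    # elapsed time comes from device event timestamps recorded around the
    # whole batch, so barrier/queue overhead is folded into the measurement.
    fn()  # warmup
    torch.xpu.synchronize()
    start_evt = torch.xpu.Event(enable_timing=True)
    end_evt = torch.xpu.Event(enable_timing=True)
    start_evt.record()
    for _ in range(n_iters):
        fn()  # no sync between submissions
    end_evt.record()
    torch.xpu.synchronize()
    return start_evt.elapsed_time(end_evt) / n_iters  # milliseconds

x = torch.randn(4096, 1024, device="xpu")  # illustrative shape
print(time_with_host_sync(lambda: torch.softmax(x, dim=1)))
print(time_with_event_timestamps(lambda: torch.softmax(x, dim=1)))
```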

anmyachev commented 6 days ago

https://github.com/intel/intel-xpu-backend-for-triton/pull/2149#issuecomment-2337632244