chengjunlu opened 5 months ago
Collecting accurate Triton performance data is the highest priority for the coming Triton Demo on Jun 25.
I have added post-review comments to the PR that closed this issue; see https://github.com/intel/intel-xpu-backend-for-triton/pull/1136#discussion_r1605220737.
I am concerned that the benchmarks compute timing differently from the `do_bench` utility that Triton uses.
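For context, `triton.testing.do_bench` warms the kernel up and then reports a statistic over many timed runs. Below is a minimal stdlib-only sketch of that warmup + repeat + median structure; the real `do_bench` times with GPU events and flushes the L2 cache between runs, so this is only an illustration of the approach, not a substitute.

```python
import time
import statistics

def bench_wall_time(fn, warmup=25, rep=100):
    """Rough sketch of do_bench-style timing using host wall time.

    The real triton.testing.do_bench uses GPU events and cache flushing;
    only the warmup / repeated-measurement / median shape is kept here.
    """
    for _ in range(warmup):              # warm up caches and JIT state
        fn()
    times_ms = []
    for _ in range(rep):                 # time each run individually
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1e3)
    return statistics.median(times_ms)   # median is robust to outliers

ms = bench_wall_time(lambda: sum(range(10_000)), warmup=5, rep=20)
print(f"median: {ms:.4f} ms")
```

A benchmark that instead times a single run, or times outside the device synchronization point, will disagree with `do_bench`-style numbers, which is the concern raised above.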
As the legacy profiler in public Torch doesn't support XPU, we need to use Kineto to profile the Triton kernels. The alternative solution is to enable Proton, which is tracked in https://github.com/intel/intel-xpu-backend-for-triton/issues/1145
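Kineto is the backend of `torch.profiler`, so profiling a Triton (or any GPU) kernel goes through that API. A hedged sketch, assuming `ProfilerActivity.XPU` is available in recent XPU-enabled PyTorch builds (it is not in older ones); the matmul below is just a stand-in for a Triton kernel launch:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Kineto backs torch.profiler. ProfilerActivity.XPU is assumed to exist
# in recent XPU-enabled builds; fall back to CPU-only profiling otherwise.
use_xpu = (
    hasattr(ProfilerActivity, "XPU")
    and hasattr(torch, "xpu")
    and torch.xpu.is_available()
)
activities = [ProfilerActivity.CPU]
if use_xpu:
    activities.append(ProfilerActivity.XPU)

device = "xpu" if use_xpu else "cpu"
x = torch.randn(512, 512, device=device)

with profile(activities=activities) as prof:
    y = x @ x                      # stand-in for a Triton kernel launch
    if use_xpu:
        torch.xpu.synchronize()    # ensure device work is captured

# Per-op / per-kernel summary; Triton kernels appear under their kernel names.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```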
Change the title of this issue to be more precise.
PyTorch Kineto for XPU requires a separate PTI package, intel-pti-dev_p_0.9.0.32, which is not included in the PTDB package so far.
I will try it with a Triton kernel to see if it works properly.
Added @ZzEeKkAa as assignee to this issue because he has already worked on the Kineto profiler integration into Triton.
Here is PR #1905 from @ZzEeKkAa.
Speaking of PTI, can we use it for `elapsed_time`? That would unblock the long-running https://github.com/pytorch/pytorch/pull/126456.
@chengjunlu could you please provide detailed instructions on how to use Kineto for PyTorch profiling in general and for Triton kernel profiling in particular. The use cases in scope are:

- Triton UTs
- Triton tutorials
- Torch Inductor UTs relevant to Triton
- PyTorch/Benchmark E2E tests
> Speaking of PTI, can we use it for `elapsed_time`? That would unblock the long-running pytorch/pytorch#126456.
`elapsed_time` is more general: it can be used to profile the E2E GPU time, including bubble time that may be caused by kernel-scheduling gaps. I am not sure that is possible with PTI.
> @chengjunlu could you please provide detailed instructions on how to use Kineto for PyTorch profiling in general and for Triton kernel profiling in particular. The use cases in scope are:
>
> - Triton UTs
> - Triton tutorials
> - Torch Inductor UTs relevant to Triton
> - PyTorch/Benchmark E2E tests
I have consolidated all the information on performance profiling here.
There are two ways used in Torch + Triton to measure GPU kernel performance:

1. Event-based timing via `elapsed_time`.
2. The Kineto profiler.

Note: PyTorch XPU 2.5 doesn't support `elapsed_time` and raises an exception. Triton XPU works around this approximately by using wall time instead of GPU timestamps. I will mark the affected use cases with *Triton Workaround* for notice.
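The wall-time workaround keeps the event-style API shape while substituting host timestamps. A minimal stdlib sketch of that fallback; the `_WallTimeEvent` class is hypothetical, mirroring the `record()`/`elapsed_time()` shape of `torch.cuda.Event`, and the `sleep` is a stand-in for a kernel launch plus synchronize:

```python
import time

class _WallTimeEvent:
    """Hypothetical stand-in for a GPU timing event, illustrating the
    wall-time workaround: record() stores a host timestamp instead of a
    device timestamp, so the result also includes host-side overhead
    (hence "approximate")."""
    def __init__(self):
        self._t = None

    def record(self):
        self._t = time.perf_counter()

    def elapsed_time(self, end):
        # Milliseconds, matching the unit of the real event API.
        return (end._t - self._t) * 1e3

start, end = _WallTimeEvent(), _WallTimeEvent()
start.record()
time.sleep(0.05)                  # stand-in for kernel launch + synchronize
end.record()
ms = start.elapsed_time(end)
print(f"{ms:.2f} ms")
```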
Here are the cases using the 1st way:

- *Triton Workaround*
- *Triton Workaround*
- *Triton Workaround*
Here are the cases using 2nd way:
Note: https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/scripts/patch-pytorch.sh can be applied to allow using `elapsed_time` for the cases specified above as using the 1st way. It is used in CI and in the developer scripts.
Kineto is blocked by an issue in Intel PTI: it is not able to trace Triton kernels launched through the SYCL API.
We can use the first way as a workaround to get approximate performance profiling with the patch https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/scripts/patch-pytorch.sh.
For PyTorch 2.5 out-of-the-box support, we have to use wall time as a workaround for now.
The changes have been pushed to the PR https://github.com/intel/intel-xpu-backend-for-triton/pull/1905
There are no standalone profiler tools for Triton XPU now.
We used to use:
Triton has a new component for profiling the performance of Triton kernels. It is worth supporting it for Triton XPU.