intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

[Profiler] Use Kineto to accurately profile the Triton XPU kernel execution time. #1066

Open chengjunlu opened 5 months ago

chengjunlu commented 5 months ago

There is no standalone profiler tool for Triton XPU at the moment.

We used to use:

  1. The legacy Torch profiler with the IPEX extension. (This is going to be removed from IPEX.)
  2. The new Torch profiler with Kineto extended by IPEX. (This depends on Kineto and Torch.)
  3. A synchronization wait on the host to measure performance. (This is not accurate because it includes host overheads; see the sketch below.)
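
A minimal sketch of approach 3 (host-side synchronization timing), assuming an XPU build of PyTorch; `run_kernel` is a placeholder for the Triton kernel launch:

```python
import time
import torch

def run_kernel():
    ...  # placeholder: launch the Triton kernel under test, e.g. my_kernel[grid](args)

torch.xpu.synchronize()                  # drain any previously queued work
start = time.perf_counter()
run_kernel()
torch.xpu.synchronize()                  # block until the kernel finishes on the device
elapsed_ms = (time.perf_counter() - start) * 1e3
# elapsed_ms also contains Python, launch, and synchronization overheads
# on the host, so it overestimates the pure kernel time.
print(f"{elapsed_ms:.3f} ms (host-side, includes overheads)")
```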

Triton has a new component for profiling the performance of Triton kernels. It is worth supporting it for Triton XPU.

tdeng5 commented 4 months ago

Collecting accurate Triton performance data is the highest priority for the upcoming Triton demo on Jun 25.

etiotto commented 4 months ago

I have added post-review comments to the PR that closed this issue; see https://github.com/intel/intel-xpu-backend-for-triton/pull/1136#discussion_r1605220737.

I am concerned that the benchmarks compute timing differently from the do_bench helper that Triton uses.
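
For reference, Triton's timing helper is `triton.testing.do_bench`; a minimal usage sketch, with the kernel wrapper as a placeholder:

```python
import triton.testing

def run_kernel():
    ...  # placeholder: launch the Triton kernel under test

# do_bench warms up, launches the callable repeatedly on the device,
# and returns an aggregated time per call in milliseconds.
ms = triton.testing.do_bench(run_kernel)
print(f"do_bench time: {ms:.3f} ms")
```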

chengjunlu commented 2 months ago

As the legacy profiler in public Torch doesn't support XPU, we need to use Kineto to profile the Triton kernel. The alternative solution is to enable Proton, which is tracked in https://github.com/intel/intel-xpu-backend-for-triton/issues/1145.

Changed the title of this issue to be more precise.

chengjunlu commented 1 month ago

The PyTorch Kineto for XPU requires a separate PTI package, intel-pti-dev_p_0.9.0.32, which is not included in the PTDB package so far.

I will try it with a Triton kernel to see if it works properly.

chengjunlu commented 1 month ago

Added @ZzEeKkAa as an assignee on this issue because he has already worked on the Kineto profiler integration for Triton.

Here is the PR #1905 from @ZzEeKkAa

ZzEeKkAa commented 1 month ago

Speaking of PTI, can we use it for elapsed_time? That would unblock the long-running https://github.com/pytorch/pytorch/pull/126456.

vlad-penkin commented 1 month ago

@chengjunlu could you please provide detailed instructions on how to use Kineto for PyTorch profiling in general and Triton kernel profiling in particular? The use cases in scope are:

  • Triton UTs
  • Triton tutorials
  • Torch Inductor UTs relevant to Triton
  • PyTorch/Benchmark E2E tests

chengjunlu commented 1 month ago

Speaking of PTI, can we use it for elapsed_time? That would unblock the long-running pytorch/pytorch#126456.

elapsed_time is more general: it can be used to profile the E2E GPU time, including bubble time that may be caused by kernel scheduling gaps. I am not sure that is possible with PTI.
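
A minimal sketch of event-based elapsed_time timing, assuming `torch.xpu.Event` mirrors the `torch.cuda.Event` API (as noted further below, PyTorch XPU 2.5 does not yet support elapsed_time, so this pattern needs a patched build):

```python
import torch

start = torch.xpu.Event(enable_timing=True)
end = torch.xpu.Event(enable_timing=True)

start.record()
# ... enqueue one or more kernels here; any scheduling bubbles between
# them are included in the measured interval ...
end.record()
torch.xpu.synchronize()

# Difference of the two GPU timestamps, in milliseconds.
print(start.elapsed_time(end))
```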

chengjunlu commented 1 month ago

@chengjunlu could you please provide detailed instructions on how to use Kineto for PyTorch profiling in general and Triton kernel profiling in particular? The use cases in scope are:

  • Triton UTs
  • Triton tutorials
  • Torch Inductor UTs relevant to Triton
  • PyTorch/Benchmark E2E tests

I am combining all the information on performance profiling here.

There are two ways to measure GPU kernel performance in Torch + Triton:

  1. Diff two event timestamps via elapsed_time.
  2. Use Kineto to profile kernel time through PTI.

Note: PyTorch XPU 2.5 doesn't support elapsed_time and raises an exception. Triton XPU works around this approximately by using wall time instead of the GPU timestamp. I will mark those use cases with "Triton Workaround" for notice.

Here are the cases using the 1st way:

Here are the cases using the 2nd way:
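
For reference, a minimal sketch of the 2nd way using the Kineto-backed torch.profiler; `ProfilerActivity.XPU` and the `xpu_time_total` sort key assume a PyTorch build with XPU profiler support:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def run_kernel():
    ...  # placeholder: launch the Triton kernel under test

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU]) as prof:
    run_kernel()
    torch.xpu.synchronize()

# Device-side kernel times are collected by Kineto (via PTI on XPU).
print(prof.key_averages().table(sort_by="xpu_time_total", row_limit=10))
```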

whitneywhtsang commented 1 month ago

Note: https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/scripts/patch-pytorch.sh can be applied to allow using elapsed_time for the cases specified above as using the 1st way. It is used in CI and in the developer scripts.

chengjunlu commented 1 month ago

Kineto is blocked by an issue in Intel PTI: it is not able to trace Triton kernels launched through the SYCL API.

We can use the first way as a workaround to get approximate performance profiling with the patch https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/scripts/patch-pytorch.sh.

For PyTorch 2.5 out-of-the-box support, we have to use wall time as a workaround for now.

The changes have been pushed to PR https://github.com/intel/intel-xpu-backend-for-triton/pull/1905.