intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

[Performance] Setup a microbenchmark to monitor non-gemm kernel performance #592

Open LiyangLingIntel opened 4 months ago

LiyangLingIntel commented 4 months ago

We need a microbenchmark to check performance regularly and guarantee there is no large regression after changes. Currently we already have 130+ Triton non-gemm kernels extracted from PyTorch E2E models: https://github.com/intel/intel-xpu-backend-for-triton/tree/liyang/micro-benchmark/benchmark/inductor_kernels
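As a sketch of how such a microbenchmark could be driven, the harness below times a registry of kernels and records a median per-kernel latency. The kernel names and workloads are placeholders, and `time.perf_counter` is a simplified stand-in: a real Triton harness would use `triton.testing.do_bench`, which synchronizes the device and discards outliers.

```python
import time
from statistics import median

def bench(fn, warmup=10, rep=50):
    """Median wall-clock time of fn() in milliseconds.

    Simplified stand-in for triton.testing.do_bench (which also
    handles device synchronization and outlier rejection).
    """
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(rep):
        start = time.perf_counter()
        fn()
        times.append((time.perf_counter() - start) * 1e3)
    return median(times)

# Hypothetical kernel registry: name -> zero-arg callable.
# A real registry would wrap the extracted Inductor kernels.
kernels = {
    "softmax_fused": lambda: sum(range(1000)),  # placeholder workload
    "reduction": lambda: max(range(1000)),      # placeholder workload
}

results = {name: bench(fn) for name, fn in kernels.items()}
```

The resulting `results` dict can be dumped to JSON per run, giving CI a stable artifact to diff against.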

There are several points that need to be decided; see the discussion below.

LiyangLingIntel commented 4 months ago

Refer to benchmark/scripts to extract kernels from PyTorch E2E models. The next task is to analyze the extracted kernels and set the scope of the microbenchmark cases.
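One way to scope the cases is to bucket the extracted kernels by the characteristic Triton ops they call. The sketch below uses hypothetical in-memory sources and naive string matching; in practice the sources would be read from the files under benchmark/inductor_kernels and matched more carefully.

```python
# Hypothetical kernel sources keyed by file name (illustrative only).
sources = {
    "triton_red_fused_sum.py": "tmp = tl.sum(x, axis=0)",
    "triton_poi_fused_add.py": "tl.atomic_add(out_ptr, val, mask)",
    "triton_poi_fused_relu.py": "y = tl.maximum(x, 0.0)",
}

# Naive tagging by the presence of characteristic Triton calls.
TAGS = {"reduction": "tl.sum", "atomic": "tl.atomic_add"}

def categorize(src: str) -> str:
    for tag, marker in TAGS.items():
        if marker in src:
            return tag
    return "elementwise"

buckets = {}
for name, src in sources.items():
    buckets.setdefault(categorize(src), []).append(name)
```

Bucketing like this would let the benchmark report per-category results instead of one flat list of 130+ kernels.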

tdeng5 commented 2 months ago

We need to provide a solid comparison method to confirm that Triton's performance is good, e.g. comparing E2E/kernel performance on PVC + IPEX against the NV platform.

LiyangLingIntel commented 2 months ago

> We need to provide a solid comparison method to confirm that Triton's performance is good, e.g. comparing E2E/kernel performance on PVC + IPEX against the NV platform.

Yes. Given the fused Triton kernels from PyTorch E2E models, to set up the benchmark we should decide:

  1. A convincing methodology to monitor kernel performance. For PyTorch models, the speedup ratio is calculated as graph mode (using Triton) over eager mode, and the speedup-ratio gap is then compared against CUDA. But for Triton kernels alone, it is hard to map an Inductor-generated Triton kernel back to eager-mode operations, and it is unrealistic to reproduce each Triton kernel in other libraries such as oneDNN or XeTLA for a same-platform comparison on Intel hardware. So an initial proposal is to compare kernel performance with CUDA: categorize the kernels (memory-bound, computation-bound, ...) and apply suitable scaling factors to the performance data to narrow the hardware differences. This needs more discussion.
  2. The target and plan for execution. How much performance data do we target? What is the scope of kernels? Will the kernels change with PyTorch updates? Kernel analysis takes significant effort, and we don't want to redo it back and forth.
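The comparison in point 1 could be expressed roughly as below. The scaling factors and timings are illustrative placeholders, not measured data; real factors would need to be calibrated per category, e.g. from memory-bandwidth or FLOPS ratios between the two parts.

```python
def speedup(eager_ms: float, triton_ms: float) -> float:
    """Speedup ratio as used for PyTorch E2E models: eager / graph mode."""
    return eager_ms / triton_ms

# Per-category scaling factors meant to narrow hardware differences
# between the Intel and NVIDIA parts. Placeholder values only.
SCALE = {"memory_bound": 0.8, "compute_bound": 0.6}

def normalized_gap(xpu_ms: float, cuda_ms: float, category: str) -> float:
    """Ratio of scaled CUDA time to XPU time; ~1.0 means parity
    after accounting for the assumed hardware difference."""
    return (cuda_ms / SCALE[category]) / xpu_ms

# With the placeholder factor 0.8, an XPU kernel at 1.25 ms vs a CUDA
# kernel at 1.0 ms comes out at parity.
gap = normalized_gap(xpu_ms=1.25, cuda_ms=1.0, category="memory_bound")
```

The per-category factor is the contentious part ("this needs more discussion" above); the arithmetic itself is trivial once the factors are agreed on.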

Once the above points are resolved, we can consider adding this non-gemm benchmark to https://github.com/intel/intel-xpu-backend-for-triton/issues/879 and tracking it in CI or nightly runs.

vlad-penkin commented 2 months ago

The purpose of this microbenchmark is to check for performance regressions.
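A regression check against a stored baseline might look like the sketch below. The kernel names, timings, and 10% threshold are illustrative; a real run would load the baseline from a previous CI artifact.

```python
# Baseline and current per-kernel times in ms (illustrative values).
baseline = {"softmax_fused": 0.50, "reduction": 0.80}
current  = {"softmax_fused": 0.52, "reduction": 1.10}

THRESHOLD = 1.10  # flag anything more than 10% slower than baseline

regressions = {
    name: current[name] / baseline[name]
    for name in baseline
    if current[name] / baseline[name] > THRESHOLD
}
# Here "reduction" is flagged (~1.375x slower); "softmax_fused"
# at 1.04x stays under the threshold.
```

CI could then fail the run whenever `regressions` is non-empty and print the offending kernels with their slowdown ratios.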

vlad-penkin commented 2 months ago

Split this ticket into two: base (existing kernels - softmax) and additions (generalized / standalone reduction and atomic add kernels).

LiyangLingIntel commented 2 months ago

> Split this ticket into two: base (existing kernels - softmax) and additions (generalized / standalone reduction and atomic add kernels).

Two issues have been filed to track these two steps:

tdeng5 commented 2 months ago

Proposing to treat this as an umbrella issue; sub-issues will be created to track the details.