Open LiyangLingIntel opened 1 month ago
Information from @whitneywhtsang
FYI: The failures reported in CI are due to enabling large 2d block load https://github.com/intel/intel-xpu-backend-for-triton/commit/a74da7d7c0a862940a9f6bcf21bfa5f55f608085.
This task is still in progress.
This issue may be partially caused by our use of a larger DPAS layout with `repCluster`: the allocation analysis algorithm models the layout as NvidiaMmaLayout, so the calculated scratch buffer size differs from what the kernel actually needs.
On the other hand, the original allocation algorithm itself does not seem entirely correct either: it can allocate oversized shared memory. More investigation is needed to find an appropriate solution.
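To make the mismatch concrete, here is a minimal, hypothetical sketch (not the actual Triton allocation analysis, and the tile shapes and `repCluster` values are assumed for illustration only): if the scratch size is derived from a layout's per-tile footprint, then modeling a DPAS layout with replication as an unreplicated NVIDIA MMA layout yields a different buffer size.

```python
def scratch_bytes(tile_m, tile_n, rep_cluster, elem_bytes=2):
    """Scratch bytes for one tile, scaled by the repCluster replication.

    Hypothetical helper for illustration; not a real Triton API.
    """
    rep_m, rep_n = rep_cluster
    return tile_m * rep_m * tile_n * rep_n * elem_bytes

# DPAS layout with a 2x2 repCluster (assumed values).
dpas = scratch_bytes(8, 16, rep_cluster=(2, 2))   # 1024 bytes

# The same logical tile, but modeled as an NVIDIA MMA layout with no
# replication, as the allocation analysis would see it.
nmma = scratch_bytes(8, 16, rep_cluster=(1, 1))   # 256 bytes

# The two estimates disagree, so the allocated shared memory no longer
# matches what the generated kernel actually reads and writes.
print(dpas, nmma)
```

If the analysis under-allocates relative to the real layout, kernels can corrupt adjacent shared memory; if it over-allocates, it wastes the limited shared-memory budget, which matches the "oversized shared memory" observation above.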
Running gemm kernels like `gemm_splitk_benchmark.py` with the latest `llvm-target` branch will fail for