Open LiyangLingIntel opened 1 month ago
Information from @whitneywhtsang
FYI: The failures reported in CI are due to enabling large 2d block load https://github.com/intel/intel-xpu-backend-for-triton/commit/a74da7d7c0a862940a9f6bcf21bfa5f55f608085.
This task is still in progress.
This issue may be partially caused by our use of a larger DPAS layout with `repCluster`: the allocation analysis algorithm models the layout as NvidiaMmaLayout, so the calculated scratch buffer size differs from what the kernel actually needs.
On the other hand, the original allocation algorithm itself does not seem entirely correct either: it can allocate oversized shared memory. More investigation is needed to find an appropriate solution.
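To make the mismatch concrete, here is a minimal, hypothetical sketch (not the actual Triton allocation analysis, and the tile shapes and `repCluster` values are assumed for illustration only): if the scratch size is derived from a layout's per-tile footprint, then modeling a DPAS layout with replication as an unreplicated NVIDIA MMA layout yields a different buffer size.

```python
def scratch_bytes(tile_m, tile_n, rep_cluster, elem_bytes=2):
    """Scratch bytes for one tile, scaled by the repCluster replication.

    Hypothetical helper for illustration; not a real Triton API.
    """
    rep_m, rep_n = rep_cluster
    return tile_m * rep_m * tile_n * rep_n * elem_bytes

# DPAS layout with a 2x2 repCluster (assumed values).
dpas = scratch_bytes(8, 16, rep_cluster=(2, 2))   # 1024 bytes

# The same logical tile, but modeled as an NVIDIA MMA layout with no
# replication, as the allocation analysis would see it.
nmma = scratch_bytes(8, 16, rep_cluster=(1, 1))   # 256 bytes

# The two estimates disagree, so the allocated shared memory no longer
# matches what the generated kernel actually reads and writes.
print(dpas, nmma)
```

If the analysis under-allocates relative to the real layout, kernels can corrupt adjacent shared memory; if it over-allocates, it wastes the limited shared-memory budget, which matches the "oversized shared memory" observation above.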
Running gemm kernels like `gemm_splitk_benchmark.py` with the latest `llvm-target` branch will fail for