Closed prathams417 closed 2 months ago
The Triton kernel contains tl.dot, which is currently lowered to a loop of scalar FMA instructions. To get better performance, we need to lower tl.dot to DPAS instructions. We have a work item for that, but it is not yet complete, so we should revisit the performance once that work is done.
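As a rough illustration of the scalar lowering described above, here is a minimal sketch in plain Python. It is not Triton and not the backend's generated code; the function names are hypothetical. It spells out a tile-level dot product as one multiply-add per innermost iteration, which is the pattern a matrix instruction such as DPAS is meant to replace with a block-level multiply-accumulate.

```python
# Illustrative only: a tile-level dot product written as the scalar
# multiply-add loop that an unoptimized lowering effectively executes.
# Names here are hypothetical and do not come from Triton or the XPU backend.

def dot_scalar_fma(a, b):
    """Multiply an M x K tile by a K x N tile, one scalar FMA per step."""
    m, k, n = len(a), len(b), len(b[0])
    acc = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                # One scalar multiply-add per innermost iteration.
                acc[i][j] = a[i][p] * b[p][j] + acc[i][j]
    return acc


def fma_count(m, k, n):
    """Number of scalar multiply-adds the loop above issues for one tile."""
    return m * k * n


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
result = dot_scalar_fma(a, b)
print(result)
print(fma_count(16, 16, 16))
```

The multiply-add count grows as M * K * N per tile, so the cost of a scalar lowering rises quickly with tile size.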
@whitneywhtsang I don't see tutorial 11-grouped-gemm.py. Am I correct in understanding that its current name is 08-grouped-gemm.py?
Yes, it is renamed to 08-grouped-gemm.py.
I got the following numbers on PVC. @whitneywhtsang @etiotto can we consider the performance sufficient and close the issue?
~/intel-xpu-backend-for-triton$ python python/tutorials/08-grouped-gemm.py
(I): Detected 37312 spills, recompiling the kernel using large GRF mode
(I): Kernel has now 26944 spills
(I): Detected 37312 spills, recompiling the kernel using large GRF mode
(I): Kernel has now 26944 spills
(I): Detected 12800 spills, recompiling the kernel using large GRF mode
(I): Kernel has now 0 spills
(I): Detected 12800 spills, recompiling the kernel using large GRF mode
(I): Kernel has now 0 spills
group-gemm-performance:
        N   cuBLAS   Triton
0   128.0  0.02080  0.02640
1   256.0  0.02688  0.04480
2   512.0  0.04256  0.14976
3  1024.0  0.11200  1.00160
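To make the gap in the table easier to read, the per-size ratio can be computed directly. This is a small sketch that uses only the numbers printed above. The column names are taken as printed and the output does not state units, but the ratio does not depend on them.

```python
# Numbers copied from the group-gemm-performance table above:
# (N, cuBLAS column, Triton column). Units are not stated in the output.
rows = [
    (128, 0.02080, 0.02640),
    (256, 0.02688, 0.04480),
    (512, 0.04256, 0.14976),
    (1024, 0.11200, 1.00160),
]

# Ratio of the Triton column to the cuBLAS column for each problem size.
ratios = {n: triton / cublas for n, cublas, triton in rows}

for n, ratio in ratios.items():
    print(f"N={n}: Triton/cuBLAS = {ratio:.2f}x")
```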
> can we consider the performance sufficient and close the issue?
I think so.
Current output of test 11:
Triton performance should be similar to cuBLAS