08-grouped-gemm.py poor performance

intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

MIT License


08-grouped-gemm.py poor performance #348

Closed prathams417 closed 2 months ago

prathams417 commented 8 months ago

Current output of tutorial 11 (11-grouped-gemm.py):

group-gemm-performance:
        N   cuBLAS        Triton
0   128.0  0.11488  276574.06250
1   256.0  0.12080  276332.68750
2   512.0  0.14360  276066.46875
3  1024.0  0.21360  275607.40625

Triton performance should be similar to cuBLAS. Steps to reproduce:

python3 -m pip install matplotlib pandas tabulate -q
python3 python/tutorials/11-grouped-gemm.py

etiotto commented 8 months ago

The Triton kernel contains tl.dot, which is currently lowered to a loop of scalar FMA instructions. To get better performance, we need to lower tl.dot to DPAS instructions. We have a work item for that, but it is not yet complete, so we should revisit the performance once that work is done.
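For readers unfamiliar with the distinction, here is a rough plain-Python sketch of what "a loop containing scalar FMA instructions" means for a tile product. It is only an illustration, not the backend's actual lowering; the function name `dot_scalar_fma` and the tile sizes are hypothetical. Every output element is built up one multiply-add at a time, so an M x K by K x N tile costs M*N*K separate operations.

```python
def dot_scalar_fma(a, b):
    """Multiply an M x K tile by a K x N tile using one scalar multiply-add per step.

    Illustrative only: this mimics the shape of a scalar FMA loop,
    not the code the Intel backend actually emits.
    """
    m, k, n = len(a), len(b), len(b[0])
    acc = [[0.0] * n for _ in range(m)]
    fma_count = 0
    for i in range(m):
        for j in range(n):
            for p in range(k):
                # One scalar fused multiply-add: acc += a * b
                acc[i][j] += a[i][p] * b[p][j]
                fma_count += 1
    return acc, fma_count


# Tiny hypothetical 2 x 2 tiles, chosen so the result is easy to check by hand.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
result, fma_count = dot_scalar_fma(a, b)
print(result, fma_count)
```

A matrix instruction such as DPAS would replace many of these scalar steps with a single operation over a small tile, which is why the lowering matters so much for the numbers reported here.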

anmyachev commented 2 months ago

@whitneywhtsang I don't see tutorial 11-grouped-gemm.py. Am I correct in understanding that its current name is: 08-grouped-gemm.py?

whitneywhtsang commented 2 months ago

> @whitneywhtsang I don't see tutorial 11-grouped-gemm.py. Am I correct in understanding that its current name is: 08-grouped-gemm.py?

Yes, it was renamed to 08-grouped-gemm.py.

anmyachev commented 2 months ago

I got the following numbers on PVC. @whitneywhtsang @etiotto can we consider the performance sufficient and close the issue?

~/intel-xpu-backend-for-triton$ python python/tutorials/08-grouped-gemm.py 
(I): Detected 37312 spills, recompiling the kernel using large GRF mode
(I): Kernel has now 26944 spills
(I): Detected 37312 spills, recompiling the kernel using large GRF mode
(I): Kernel has now 26944 spills
(I): Detected 12800 spills, recompiling the kernel using large GRF mode
(I): Kernel has now 0 spills
(I): Detected 12800 spills, recompiling the kernel using large GRF mode
(I): Kernel has now 0 spills
group-gemm-performance:
        N   cuBLAS   Triton
0   128.0  0.02080  0.02640
1   256.0  0.02688  0.04480
2   512.0  0.04256  0.14976
3  1024.0  0.11200  1.00160
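To make the gap in the table above easier to judge, the following sketch computes the Triton/cuBLAS ratio per row. The numbers are copied from the output above; the thread does not state the unit or which direction is better, so the ratio is left unitless and the variable names (`rows`, `ratios`) are illustrative only.

```python
# (N, cuBLAS, Triton) values copied from the benchmark output above.
rows = [
    (128, 0.02080, 0.02640),
    (256, 0.02688, 0.04480),
    (512, 0.04256, 0.14976),
    (1024, 0.11200, 1.00160),
]

# Ratio of the Triton column to the cuBLAS column for each N.
ratios = {n: triton / cublas for n, cublas, triton in rows}

for n, ratio in ratios.items():
    print(f"N={n:5d}  Triton/cuBLAS = {ratio:.2f}x")
```

The ratio grows with N, from roughly 1.3x at N=128 to roughly 8.9x at N=1024.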

whitneywhtsang commented 2 months ago

> can we consider the performance sufficient and close the issue?

I think so.