triton-lang / triton

Development repository for the Triton language and compiler
https://triton-lang.org/
MIT License

Surprised by the performance of Triton! #3747

Open sleepwalker2017 opened 4 months ago

sleepwalker2017 commented 4 months ago

I did a benchmark to check the time cost (us) of cuBLAS and Triton on various shapes, and I found that the Triton kernel is faster than cuBLAS most of the time.

Is that expected? Has anyone else gotten the same result? Thank you!

The numbers in the sheet are the time cost (us), not peak computation throughput.

[image: table of cuBLAS vs. Triton times (us) across shapes]
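For reference, here is a minimal sketch of how such a timing comparison can be done with `triton.testing.do_bench` (assuming `matmul` is the Python wrapper from the tutorial script; the `matmul_tutorial` module name here is hypothetical):

```python
import torch
import triton

# `matmul` is assumed to be the wrapper defined in
# python/tutorials/03-matrix-multiplication.py, saved locally.
from matmul_tutorial import matmul  # hypothetical module name

M, N, K = 4096, 4096, 4096
a = torch.randn((M, K), device="cuda", dtype=torch.float16)
b = torch.randn((K, N), device="cuda", dtype=torch.float16)

# do_bench returns a runtime in milliseconds; multiply by 1e3 for microseconds.
us_cublas = triton.testing.do_bench(lambda: torch.matmul(a, b)) * 1e3
us_triton = triton.testing.do_bench(lambda: matmul(a, b)) * 1e3
print(f"cuBLAS: {us_cublas:.1f} us  Triton: {us_triton:.1f} us")
```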

yunjiangster commented 2 months ago

Does cuBLAS do any grid tuning? If not, Triton probably has an advantage.
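For reference, the tutorial kernel does exactly this kind of grid tuning on the Triton side via `@triton.autotune` (abbreviated sketch from python/tutorials/03-matrix-multiplication.py; the full script lists more configs and the kernel body):

```python
import triton
import triton.language as tl

# Triton benchmarks every config below the first time a new (M, N, K) key is
# seen and caches the fastest one for that shape.
@triton.autotune(
    configs=[
        triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 256,
                       'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 8},
                      num_stages=3, num_warps=8),
        triton.Config({'BLOCK_SIZE_M': 64, 'BLOCK_SIZE_N': 64,
                       'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8},
                      num_stages=5, num_warps=2),
        # ...the tutorial lists several more configs
    ],
    key=['M', 'N', 'K'],
)
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr,
                  BLOCK_SIZE_K: tl.constexpr, GROUP_SIZE_M: tl.constexpr):
    ...  # kernel body elided; see the tutorial for the full implementation
```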

sleepwalker2017 commented 2 months ago

Does cuBLAS do any grid tuning? If not, Triton probably has an advantage.

https://github.com/triton-lang/triton/blob/main/python/tutorials/03-matrix-multiplication.py This is the script I used for the benchmark. I modified it to return the time cost instead of the throughput (see the sketch below).

Does it use tuning? I don't think so. But it seems most people use cuBLAS this way?
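The modification is small: the tutorial's benchmark converts the measured milliseconds into TFLOP/s, so returning the time instead is a one-line change, e.g.:

```python
# Original line in the tutorial's benchmark function (TFLOP/s):
#   perf = lambda ms: 2 * M * N * K * 1e-12 / (ms * 1e-3)
# Returning the time cost in microseconds instead:
perf = lambda ms: ms * 1e3
```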

sleepwalker2017 commented 1 month ago

Does cuBLAS do any grid tuning? If not, Triton probably has an advantage.

I'm curious: does cuBLAS need tuning? It seems it will auto-tune to find the best algorithm.

mnicely commented 1 month ago

cuBLAS relies on heuristics to find the best kernel based on the input parameters. Heuristics return the best kernels 90+% of the time.

You can autotune on top of this using cublasLtMatmulAlgoGetHeuristic; an example can be found here.
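Conceptually, that autotuning is: ask the heuristic for its top N candidates, time each one on your real problem, and keep the fastest. The real API is C; here is a hedged Python sketch of only the selection loop, with stand-in callables in place of cuBLASLt algo handles:

```python
import torch
import triton

def pick_fastest(candidates):
    """Time each candidate and keep the fastest.

    Stands in for the loop you would write in C around
    cublasLtMatmulAlgoGetHeuristic: the heuristic returns a ranked list of
    candidate algorithms, and you benchmark each one on your actual shape.
    """
    best_name, best_ms = None, float("inf")
    for name, fn in candidates.items():
        ms = triton.testing.do_bench(fn)  # runtime in milliseconds
        if ms < best_ms:
            best_name, best_ms = name, ms
    return best_name, best_ms

a = torch.randn((4096, 4096), device="cuda", dtype=torch.float16)
b = torch.randn((4096, 4096), device="cuda", dtype=torch.float16)
# Stand-in candidates; with cuBLASLt each entry would be one heuristic result.
print(pick_fastest({"torch.matmul": lambda: torch.matmul(a, b)}))
```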

For more information, check out this developer blog.

sleepwalker2017 commented 1 month ago


Thank you!

I read the cuBLAS manual; it seems that on SM80 and later GPUs there is no need to tune GEMM?

cublasGemmAlgo_t type is an enumerant to specify the algorithm for matrix-matrix multiplication on GPU architectures up to sm_75. On sm_80 and newer GPU architectures, this enumerant has no effect. cuBLAS has the following algorithm options:

[image: table of cublasGemmAlgo_t options]
mnicely commented 1 month ago

Hi @sleepwalker2017, sorry I dropped the ball on this.

What you're seeing in the manual is cuBLAS shifting its efforts to cuBLASLt for power users of GEMMs a few years ago. This was done for better flexibility and the addition of fusions.

Think of it this way: cublasGemmAlgo_t + cublasGemmEx is for sm75 and below; cublasLtMatmulAlgoGetHeuristic is for sm80+.
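A quick way to check which path applies on a given GPU (a small sketch using PyTorch's device-capability query):

```python
import torch

# sm75 and below -> cublasGemmEx + cublasGemmAlgo_t selection;
# sm80 and newer -> the enum is ignored, use cuBLASLt heuristics instead.
major, minor = torch.cuda.get_device_capability()
if (major, minor) <= (7, 5):
    print("sm75 or older: cublasGemmAlgo_t selection with cublasGemmEx applies")
else:
    print("sm80+: cublasGemmAlgo_t is ignored; use cublasLtMatmulAlgoGetHeuristic")
```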

sleepwalker2017 commented 1 month ago

Hi @sleepwalker2017, sorry I dropped the ball on this.

What you're seeing in the manual is cuBLAS shifting its efforts to cuBLASLt for power users of GEMMs a few years ago. This was done for better flexibility and the addition of fusions.

Think of it this way: cublasGemmAlgo_t + cublasGemmEx is for sm75 and below; cublasLtMatmulAlgoGetHeuristic is for sm80+.

Thank you. But most LLM inference frameworks invoke cuBLAS directly without tuning. It seems there is large potential for improvement here?