Open ericheatspersimmons opened 5 months ago
A preliminary search suggests that block_m: 16, block_n: 512, block_k: 32, num_stages: 2, num_warps: 4, split_k: 16, group_m: 4 is a reasonable guess, but it's still significantly slower than reported. I get ~60 us and a bandwidth of 1100 GB/s; the reported numbers are close to 40 us and 1600 GB/s.
Upgrading to near-HEAD Triton (the nightly from 4/24) seems to have made things slower. Can you reproduce the numbers in the blog at current HEAD of Triton?
Can you share the hyperparameter settings you used for the various problem sizes? With the defaults in the repo, I get about 350 us for the M=1, N=K=8192 case, which is almost 10x slower than reported.
(This is after adding CUDA graphs to the example for benchmarking purposes and moving the creation of the C matrix outside the gemm_split_k function call.)
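For reference, here is a rough sketch of how the bandwidth figures relate to the timings for the M=1, N=K=8192 case. It assumes the bandwidth is simply the weight-matrix bytes divided by kernel time (the activation and output vectors are negligible at M=1), and it assumes 1 byte moved per weight element. That byte count is my guess, chosen because it roughly matches the quoted numbers, not something taken from the repo. The function name and the bytes_per_weight parameter are made up for illustration.

```python
def effective_bandwidth_gbs(n, k, time_us, bytes_per_weight=1.0):
    """Weight bytes read per second, in GB/s (1 GB = 1e9 bytes).

    bytes_per_weight is an assumption: 1.0 roughly matches the figures
    quoted in this issue; use 0.5 for packed int4 or 2.0 for fp16.
    """
    weight_bytes = n * k * bytes_per_weight
    seconds = time_us * 1e-6
    return weight_bytes / seconds / 1e9


if __name__ == "__main__":
    # Timings mentioned in this issue: reported, tuned guess, repo defaults.
    for time_us in (40, 60, 350):
        gbs = effective_bandwidth_gbs(8192, 8192, time_us)
        print(f"{time_us:>4} us -> {gbs:.0f} GB/s")
```

Under these assumptions, 60 us works out to about 1118 GB/s and 40 us to about 1678 GB/s, which is in the same ballpark as the 1100 and 1600 GB/s figures above.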