pytorch-labs / applied-ai

Applied AI experiments and examples for PyTorch
BSD 3-Clause "New" or "Revised" License

Reproduce post numbers #23

Open ericheatspersimmons opened 5 months ago

ericheatspersimmons commented 5 months ago

Can you share the hyperparameter settings you used for the various problem sizes? With the defaults in the repo, I get about 350us for the M=1, N=K=8192 case, almost 10x slower than reported.

(This is after adding CUDA graphs to the example for benchmarking purposes, and moving the creation of the C matrix outside the gemm_split_k call.)
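The issue doesn't show the modified benchmark, so the sketch below is only an illustration of the two changes described: the output tensor is allocated once outside the timed region, and the kernel is replayed via a CUDA graph so launch overhead doesn't dominate at M=1. The function name `bench_cuda_graph` and the `torch.matmul` stand-in for the repo's `gemm_split_k` are hypothetical.

```python
import torch

def bench_cuda_graph(fn, warmup=3, iters=100):
    """Time fn (which must do no allocations) by replaying a captured CUDA graph."""
    # Warm up on a side stream so capture sees steady-state memory/state.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(warmup):
            fn()
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        fn()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        g.replay()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1e3 / iters  # microseconds per call

if torch.cuda.is_available():
    M, N, K = 1, 8192, 8192
    a = torch.randn(M, K, device="cuda", dtype=torch.float16)
    b = torch.randn(K, N, device="cuda", dtype=torch.float16)
    # C is created once, outside the benchmarked call, as described above.
    c = torch.empty(M, N, device="cuda", dtype=torch.float16)
    # Stand-in for gemm_split_k(a, b, c); any capture-safe kernel works here.
    us = bench_cuda_graph(lambda: torch.matmul(a, b, out=c))
    print(f"{us:.1f} us/iter")
```

Capture requires that the timed function perform no allocations, which is exactly why moving the C-matrix creation out of `gemm_split_k` matters here.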

ericheatspersimmons commented 5 months ago

A preliminary search suggests that:

block_m: 16, block_n: 512, block_k: 32, num_stages: 2, num_warps: 4, split_k: 16, group_m: 4 seems like a reasonable guess, but it's still significantly slower than reported: I get ~60us at ~1100 GB/s of bandwidth, versus the reported ~40us at ~1600 GB/s.
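As a sanity check on those bandwidth figures: at M=1 the traffic is dominated by streaming the N×K weight matrix once, and assuming fp8 inputs (1 byte per element, which I believe is what these split-k kernels target), the reported times and bandwidths are mutually consistent:

```python
# Relate kernel time to achieved bandwidth for the M=1, N=K=8192 case,
# assuming traffic is dominated by one pass over the fp8 weight matrix.
M, N, K = 1, 8192, 8192
weight_bytes = N * K  # fp8: 1 byte/element; A and C are negligible at M=1

def achieved_gbps(time_us, bytes_moved):
    """Bytes moved per kernel invocation divided by its duration, in GB/s."""
    return bytes_moved / (time_us * 1e-6) / 1e9

print(achieved_gbps(60, weight_bytes))  # ~1118 GB/s, matching my ~1100 GB/s
print(achieved_gbps(40, weight_bytes))  # ~1678 GB/s, matching the reported ~1600 GB/s
```

So the ~60us vs ~40us gap and the ~1100 vs ~1600 GB/s gap are the same observation viewed two ways, and closing one closes the other.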

ericheatspersimmons commented 5 months ago

Upgrading to near HEAD of Triton (the nightly from 4/24) seems to have made things slower. Can you reproduce the numbers in the blog post at the current HEAD of Triton?