siboehm / SGEMM_CUDA

Fast CUDA matrix multiplication from scratch
https://siboehm.com/articles/22/CUDA-MMM
MIT License

How to change the autotune setting for kernel 9? #6

Open ghostplant opened 4 months ago

ghostplant commented 4 months ago

I only get 15 TFLOPS on an A100 (sm80) and 6 TFLOPS on a 2080 Ti (sm75).

With proper tuning, it should be able to reach > 17 TFLOPS on the A100 and > 12 TFLOPS on the 2080 Ti, right?

siboehm commented 4 months ago

In scripts/ there's the script I used for autotuning. Feel free to try that.

ghostplant commented 4 months ago

Now I get 11 TFLOPS on the 2080 Ti and 17 TFLOPS on the A100. Is that reasonable?

siboehm commented 4 months ago

Seems pretty reasonable to me. It depends on which A100 you have. In the post I quote some numbers that I got.