siboehm / SGEMM_CUDA

Fast CUDA matrix multiplication from scratch
https://siboehm.com/articles/22/CUDA-MMM
MIT License
420 stars, 55 forks

use tensor cores #2

Open MustafaFayez opened 1 year ago

MustafaFayez commented 1 year ago

Great repo! I learned a lot from it and the blog, thank you!

I was wondering if there is an easy way to rewrite the kernels to use tensor cores on Volta and newer architectures.

Thanks.

siboehm commented 1 year ago

Tbh I'm not sure. I've been wanting to give this a try at some point. The loop hierarchy in this repo is largely taken from the CUTLASS docs, and CUTLASS also supports tensor cores, so I assume the overall hierarchy would stay more or less the same.
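As a rough illustration of what the innermost level could look like with tensor cores, here is a minimal sketch using the CUDA WMMA API from `mma.h`. It is not code from this repo or from CUTLASS. The kernel name `wmma_gemm_naive`, the launcher `launch_wmma_gemm`, the row-major layouts, the half-precision inputs with float accumulation, and the requirement that M, N, and K are multiples of 16 are all assumptions made for the example. Each warp computes one 16x16 tile of C, and there is no shared-memory tiling yet.

```cuda
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Tile shape handled by one warp-level WMMA operation.
constexpr int WMMA_M = 16;
constexpr int WMMA_N = 16;
constexpr int WMMA_K = 16;

// Hypothetical kernel: C = A * B, one warp per 16x16 output tile.
// A is M x K (row-major, half), B is K x N (row-major, half),
// C is M x N (row-major, float). Assumes M, N, K are multiples of 16.
__global__ void wmma_gemm_naive(const half *A, const half *B, float *C,
                                int M, int N, int K) {
  // Tile coordinates of this warp. blockDim.x is a multiple of 32, so all
  // threads of a warp share the same tileRow and tileCol.
  int tileCol = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
  int tileRow = blockIdx.y * blockDim.y + threadIdx.y;
  if (tileRow * WMMA_M >= M || tileCol * WMMA_N >= N) return;

  wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> aFrag;
  wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> bFrag;
  wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> accFrag;
  wmma::fill_fragment(accFrag, 0.0f);

  // Walk along the K dimension, one 16-wide slice at a time.
  for (int k = 0; k < K; k += WMMA_K) {
    const half *aTile = A + tileRow * WMMA_M * K + k;  // leading dimension K
    const half *bTile = B + k * N + tileCol * WMMA_N;  // leading dimension N
    wmma::load_matrix_sync(aFrag, aTile, K);
    wmma::load_matrix_sync(bFrag, bTile, N);
    wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);    // acc += a * b
  }

  float *cTile = C + tileRow * WMMA_M * N + tileCol * WMMA_N;
  wmma::store_matrix_sync(cTile, accFrag, N, wmma::mem_row_major);
}

// Hypothetical launcher: 4 x 4 warps per block, so each block covers a
// 64 x 64 region of C. Pointers are device pointers.
void launch_wmma_gemm(const half *dA, const half *dB, float *dC,
                      int M, int N, int K) {
  dim3 block(128, 4);  // 128 threads in x = 4 warps, times 4 rows of warps
  dim3 grid((N + 63) / 64, (M + 63) / 64);
  wmma_gemm_naive<<<grid, block>>>(dA, dB, dC, M, N, K);
}
```

This needs a Volta or newer target (for example `nvcc -arch=sm_70`). The block tiling and warp tiling levels of the hierarchy would presumably stay in place around it, with the per-thread register tile replaced by the warp-level fragments.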

MustafaFayez commented 1 year ago

Sorry, somehow I missed your comment. Yes, I looked at the CUTLASS implementation and it is similar to yours. I like yours because it teaches beginners like me how to optimize GEMMs step by step. I will keep following this repo in case you decide to implement the TC version in the future.

I am also thinking about doing it myself, and will comment here if I do.