triton-lang / triton

Development repository for the Triton language and compiler
https://triton-lang.org/
MIT License

tl.dot for matrix size 32x8x16 (m-n-k) #3212

Open Begunner opened 7 months ago

Begunner commented 7 months ago

Could tl.dot support the 32x8x16 (m-n-k) mma shape, which is supported by Tensor Cores?

When developing operators with Triton, it is sometimes essential to make the N dimension of a block as small as possible, yet the smallest size supported by tl.dot is 16.
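A common workaround until smaller shapes are supported is to zero-pad the N dimension up to tl.dot's 16-element minimum and slice the result back down. This is a sketch of the idea using NumPy as a stand-in for a Triton kernel (the names `N_PAD` etc. are illustrative, not Triton API):

```python
import numpy as np

M, N, K = 32, 8, 16   # desired shape; N=8 is below tl.dot's minimum of 16
N_PAD = 16            # pad N up to the smallest size tl.dot accepts

a = np.random.rand(M, K).astype(np.float32)
b = np.random.rand(K, N).astype(np.float32)

# Zero-pad B along N so the dot shape becomes 32x16x16, then slice back.
b_padded = np.zeros((K, N_PAD), dtype=np.float32)
b_padded[:, :N] = b

c = (a @ b_padded)[:, :N]   # result is the desired 32x8 output
assert np.allclose(c, a @ b)
```

The padding wastes half of the multiply-accumulate work for N=8, which is exactly why native support for the smaller mma shape would help.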

I've found a related comment from @jon-chuang. In the linked thread, the 32x8x16 mma shape is listed as supported. Could Triton support it in the future?

It seems reasonable that in this case, Triton would not use mma instructions, but rather ordinary FMA instructions. This, however, appears to be unimplemented. The list of supported sizes is here.
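For context, an FMA-based dot has no shape constraints at all: each output element is just a chain of scalar fused multiply-adds. A minimal sketch in plain Python (not Triton code, purely illustrative):

```python
# Naive FMA-style dot: each output element is accumulated with scalar
# multiply-adds, so any m-n-k shape works, unlike Tensor Core mma shapes.
def fma_dot(a, b):
    M, K = len(a), len(a[0])
    N = len(b[0])
    c = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = 0.0
            for k in range(K):
                acc += a[m][k] * b[k][n]  # one fused multiply-add per step
            c[m][n] = acc
    return c
```

Lowering small-N tl.dot calls to this kind of FMA loop would sacrifice Tensor Core throughput but lift the size restriction.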

To my understanding, Triton also does not support optimizing other "edge-cases" when it comes to dot perf, for instance tall-and-skinny matmuls.

Originally posted by @jon-chuang in https://github.com/openai/triton/issues/2266#issuecomment-1719002471

jlebar commented 7 months ago

I don't see a reason not to support this, but like many features in Triton, it may be in a "patches welcome" situation until and unless one of the Triton maintainers needs this feature themselves.