Could tl.dot support the 32x8x16 (M-N-K) MMA shape that Tensor Cores support?
When developing operators with Triton, it is sometimes essential to keep the N dimension of a block as small as possible, yet the smallest size tl.dot supports is 16.
I've found a related comment from @jon-chuang. According to that link, the 32x8x16 MMA shape is supported by the hardware. Could Triton support it in a future release?
It seems reasonable that in this case, Triton would not use mma instructions, but rather ordinary FMA instructions. This, however, appears to be unimplemented. The list of supported sizes is here.
To my understanding, Triton also does not optimize other edge cases of dot performance, for instance tall-and-skinny matmuls.
I don't see a reason not to support this, but like many features in Triton, it may be in a "patches welcome" situation until and unless one of the Triton maintainers needs this feature themselves.
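In the meantime, a common workaround is to pad the N dimension up to the supported minimum of 16 (e.g. with a masked tl.load that fills the extra columns with zeros) and then discard the padded columns on store. The numeric effect can be sketched in NumPy; the sizes and names below are illustrative, not from Triton's API:

```python
import numpy as np

M, K, N = 32, 16, 8  # desired small-N matmul
N_PAD = 16           # minimum N dimension that tl.dot accepts

a = np.random.rand(M, K).astype(np.float32)
b = np.random.rand(K, N).astype(np.float32)

# Zero-pad B's N dimension up to the supported minimum,
# mimicking what a masked tl.load would produce in a kernel.
b_pad = np.zeros((K, N_PAD), dtype=np.float32)
b_pad[:, :N] = b

c_pad = a @ b_pad    # stands in for tl.dot on the padded block
c = c_pad[:, :N]     # keep only the real columns on store

assert np.allclose(c, a @ b)
```

The padded columns are all zeros, so they contribute nothing to the result; the cost is wasted compute on the unused half of the tile, which is exactly why native support for smaller shapes (or an FMA fallback) would be preferable.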
Originally posted by @jon-chuang in https://github.com/openai/triton/issues/2266#issuecomment-1719002471