Could tl.dot support the 32x8x16 (M-N-K) MMA shape that Tensor Cores support?
When developing operators with Triton, it is sometimes essential to keep the N dimension of a block as small as possible, yet the smallest size tl.dot supports is 16.
I've found a related comment from @jon-chuang. According to that link, the 32x8x16 MMA shape is supported by the hardware. Could Triton support it in a future release?
It seems reasonable that in this case, Triton would not use mma instructions, but rather ordinary FMA instructions. This, however, appears to be unimplemented. The list of supported sizes is here.
To my understanding, Triton also does not optimize other edge cases of dot performance, for instance tall-and-skinny matmuls.
I don't see a reason not to support this, but like many features in Triton, it may be in a "patches welcome" situation until and unless one of the Triton maintainers needs this feature themselves.
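In the meantime, a common workaround is to pad the N dimension up to the supported minimum of 16 (e.g. with a masked tl.load that fills the extra columns with zeros) and then discard the padded columns on store. The numeric effect can be sketched in NumPy; the sizes and names below are illustrative, not from Triton's API:

```python
import numpy as np

M, K, N = 32, 16, 8  # desired small-N matmul
N_PAD = 16           # minimum N dimension that tl.dot accepts

a = np.random.rand(M, K).astype(np.float32)
b = np.random.rand(K, N).astype(np.float32)

# Zero-pad B's N dimension up to the supported minimum,
# mimicking what a masked tl.load would produce in a kernel.
b_pad = np.zeros((K, N_PAD), dtype=np.float32)
b_pad[:, :N] = b

c_pad = a @ b_pad    # stands in for tl.dot on the padded block
c = c_pad[:, :N]     # keep only the real columns on store

assert np.allclose(c, a @ b)
```

The padded columns are all zeros, so they contribute nothing to the result; the cost is wasted compute on the unused half of the tile, which is exactly why native support for smaller shapes (or an FMA fallback) would be preferable.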
Originally posted by @jon-chuang in https://github.com/openai/triton/issues/2266#issuecomment-1719002471