mbrookhart commented 19 hours ago

While working on some higher-dimensional tensor kernels, I noticed poor performance because layouts wouldn't propagate to local loads. Since we already allow layout folding with local store and local alloc, this seems like an oversight.
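For readers unfamiliar with the idea, here is a minimal toy sketch of what folding a layout conversion into a local load means. It is not the code in this PR and does not use Triton's real classes: the `Op` type, the `fold_convert_into_local_load` helper, and the layout strings are all hypothetical placeholders, and the program is assumed to be a simple linear chain where each op consumes the previous one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Op:
    """Toy IR op (hypothetical, not a Triton class)."""
    name: str    # e.g. "local_load" or "convert_layout"
    layout: str  # layout of the op's result; the strings are placeholders


def fold_convert_into_local_load(ops: List[Op]) -> List[Op]:
    """Fold `convert_layout(local_load(x))` into one local_load.

    Assumes a linear chain: each op consumes the result of the op before it.
    """
    out: List[Op] = []
    for op in ops:
        follows_load = bool(out) and out[-1].name == "local_load"
        if op.name == "convert_layout" and follows_load:
            # Let the load produce the target layout directly and
            # drop the now-redundant conversion.
            out[-1] = Op("local_load", op.layout)
            continue
        out.append(op)
    return out


chain = [
    Op("local_alloc", "shared"),
    Op("local_load", "blocked"),
    Op("convert_layout", "dot_operand"),
    Op("dot", "mma"),
]
folded = fold_convert_into_local_load(chain)
print([(op.name, op.layout) for op in folded])
```

In this toy model the load ends up producing the layout its consumer wants, so the separate conversion step disappears.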

The change gives a 40% speedup on certain kernels on NVIDIA GPUs.

This also removes asserts in lowering for higher dimensional kernels. As far as I can tell, those restrictions aren't required in practice.

New contributor declaration

[x] I am not making a trivial change, such as fixing a typo in a comment.
[x] I have written a PR description following these rules.
[x] I have run pre-commit run --from-ref origin/main --to-ref HEAD.
[x] I have added tests.
[x] The lit tests I have added follow these best practices

Jokeren commented 19 hours ago

> This also removes asserts in lowering for higher dimensional kernels. As far as I can tell, those restrictions aren't required in practice.

Please retain these asserts for now. There are some known issues with 3d convert layout.

For layout propagation itself, I'll defer it to @ThomasRaoux

ThomasRaoux commented 17 hours ago

> > This also removes asserts in lowering for higher dimensional kernels. As far as I can tell, those restrictions aren't required in practice.
>
> Please retain these asserts for now. There are some known issues with 3d convert layout.
>
> For layout propagation itself, I'll defer it to @ThomasRaoux

Good to know. @mbrookhart, can you separate it out for now? Someone can help figure out the problems.

mbrookhart commented 17 hours ago

I put the asserts back in and added the requested checks to the MLIR. Thanks @ThomasRaoux @Jokeren!

triton-lang / triton

Allow Layouts to propagate to local_load #5219
