iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

[GPU] Gather -> matmul fusion support #18457

Open IanWood1 opened 3 weeks ago

IanWood1 commented 3 weeks ago

Note: Similar to https://github.com/iree-org/iree/issues/18447 but for matmul. We want to support fusing gather-like linalg.generic ops with matmul ops.
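For reference, the fusion in question looks roughly like the following reduced sketch: a gather-like `linalg.generic` whose body reads from a source tensor via `tensor.extract` at dynamic indices, feeding a `linalg.batch_matmul`. The shapes mirror the `tensor<8x7x5xf32>` case from this issue, but the function name, source/index shapes, and operand names are illustrative, not taken from the linked gists.

```mlir
// Hypothetical reduced example (not the exact IR from the gists).
func.func @gather_matmul(%source: tensor<128x5xf32>,
                         %indices: tensor<8x7xi32>,
                         %rhs: tensor<8x5x3xf32>) -> tensor<8x7x3xf32> {
  %empty = tensor.empty() : tensor<8x7x5xf32>
  // Gather-like generic: each (b, i, j) element is loaded from %source at a
  // dynamically computed row, which blocks ordinary elementwise fusion.
  %gathered = linalg.generic {
      indexing_maps = [affine_map<(b, i, j) -> (b, i)>,
                       affine_map<(b, i, j) -> (b, i, j)>],
      iterator_types = ["parallel", "parallel", "parallel"]}
      ins(%indices : tensor<8x7xi32>)
      outs(%empty : tensor<8x7x5xf32>) {
  ^bb0(%idx: i32, %out: f32):
    %row = arith.index_cast %idx : i32 to index
    %j = linalg.index 2 : index
    %val = tensor.extract %source[%row, %j] : tensor<128x5xf32>
    linalg.yield %val : f32
  } -> tensor<8x7x5xf32>
  // Consumer matmul; without fusion, %gathered is materialized per batch.
  %cst = arith.constant 0.0 : f32
  %init = tensor.empty() : tensor<8x7x3xf32>
  %fill = linalg.fill ins(%cst : f32)
      outs(%init : tensor<8x7x3xf32>) -> tensor<8x7x3xf32>
  %mm = linalg.batch_matmul
      ins(%gathered, %rhs : tensor<8x7x5xf32>, tensor<8x5x3xf32>)
      outs(%fill : tensor<8x7x3xf32>) -> tensor<8x7x3xf32>
  return %mm : tensor<8x7x3xf32>
}
```

The goal is for codegen to fold the gather into the matmul's operand reads instead of materializing the intermediate `tensor<8x7x5xf32>`.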

Problem

Because the tensor sizes in this example are small (tensor<8x7x5xf32>), it does not trigger errors from excessive shared memory allocation. However, inspecting the IR dump (and/or using larger tensor sizes) shows that each batch of the 'gathered' tensor is fully materialized, i.e. a 7x5xf32 slice per batch (and codegen fails when a larger vector size is used).

Another problem is that the LLVMGPUVectorize pipeline is being selected. Apparently, either LLVMGPUVectorDistribute or the igemm pipeline should be used instead.

IR/Logs

https://gist.github.com/IanWood1/2f6b5c6af9597d47efbd2506f0cc19b9 contains the executable sources & the original linalg IR.

Here is a dump of IR after each pass https://gist.githubusercontent.com/IanWood1/1c2bdb053a4929dca98c019768ffae41/raw/7ab58055d4be208e6cede980a13121dbbf49eac9/pre-gather-matmul.mlir.

cc @MaheshRavishankar

qedawkins commented 3 weeks ago

I'm going to try to enable the igemm pipeline for this. One of the main pieces required to turn the pipeline on by default is https://github.com/iree-org/iree/pull/18394.

After that lands, we will still need some configuration updates.