iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

Matmul dispatch failing to compile due to shared memory allocation #17669

Open · IanNod opened this issue 1 week ago

IanNod commented 1 week ago

What happened?

The matmul dispatch over-allocates shared memory on the MI300X (gfx942) target, and compilation fails with the following error:

./vae_decomp_f32_dps/compiled_vae_decode$async_dispatch_18.mlir:9:6: error: 'func.func' op uses 86016 bytes of shared memory; exceeded the limit of 65536 bytes
func.func @decode$async_dispatch_18_matmul_transpose_b_16384x512x512_f32() {
     ^
./vae_decomp_f32_dps/compiled_vae_decode$async_dispatch_18.mlir:2:2: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"rocm", "rocm-hsaco-fb", {iree.gpu.target = #iree_gpu.target<arch = "gfx942", features = "", wgp = <compute = fp64|fp32|fp16|int64|int32|int16|int8, storage = b64|b32|b16|b8, subgroup = shuffle|arithmetic, dot = dp4xi8toi32, mma = [, ], subgroup_size_choices = [64], max_workgroup_sizes = [1024, 1024, 1024], max_thread_count_per_workgroup = 1024, max_workgroup_memory_bytes = 65536>>, ukernels = "none", waves_per_eu = 2 : i64}>
hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb", {iree.gpu.target = #iree_gpu.target<arch = "gfx942", features = "", wgp = <compute = fp64|fp32|fp16|int64|int32|int16|int8, storage = b64|b32|b16|b8, subgroup = shuffle|arithmetic, dot = dp4xi8toi32, mma = [, ], subgroup_size_choices = [64], max_workgroup_sizes = [1024, 1024, 1024], max_thread_count_per_workgroup = 1024, max_workgroup_memory_bytes = 65536>>, ukernels = "none", waves_per_eu = 2 : i64}>) {
  ^
failed to translate executables
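For scale, the reported 86016 bytes is 21504 f32 elements, roughly 1.31x the 64 KiB budget (max_workgroup_memory_bytes = 65536) of the gfx942 target. The sketch below is a minimal model of how the workgroup shared-memory footprint of a tiled matmul is typically accounted (A tile plus B tile, doubled when prefetching double-buffers them); the tile sizes in it are hypothetical examples, not values taken from this dispatch:

```python
# Rough LDS-budget check for a tiled f32 matmul_transpose_b: an A tile of
# M_tile x K_tile and a B tile of N_tile x K_tile are staged in shared memory,
# and software prefetching keeps two copies of each.
# The tile sizes below are hypothetical, NOT read from the failing dispatch.

BYTES_F32 = 4
LDS_LIMIT = 65536      # max_workgroup_memory_bytes for gfx942
REPORTED = 86016       # bytes reported by the diagnostic

def lds_bytes(m_tile, n_tile, k_tile, double_buffer=True):
    elements = (m_tile + n_tile) * k_tile   # A tile + B tile elements
    copies = 2 if double_buffer else 1      # prefetch double-buffering
    return elements * BYTES_F32 * copies

print(f"reported {REPORTED} B vs limit {LDS_LIMIT} B "
      f"({REPORTED / LDS_LIMIT:.2f}x over)")   # 1.31x over

# Example: a 128x128 workgroup tile with K_tile = 32, double-buffered:
# 2 * (128 + 128) * 32 * 4 = 65536 B, i.e. exactly at the limit, so any
# padding or a larger K tile no longer fits.
print(lds_bytes(128, 128, 32))                 # 65536
```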

Steps to reproduce your issue

compile command:

iree-compile --iree-input-type=torch --iree-vm-bytecode-module-output-format=flatbuffer-binary \
  --iree-hal-target-backends=rocm --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false \
  --iree-hal-target-backends=rocm --iree-rocm-target-chip=gfx942 \
  --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-rocm-waves-per-eu=2 \
  --iree-flow-enable-aggressive-fusion --iree-codegen-llvmgpu-use-vector-distribution=false \
  --iree-global-opt-propagate-transposes=true --iree-opt-const-eval=false \
  --iree-opt-outer-dim-concat=true --iree-vm-target-truncate-unsupported-floats \
  --iree-llvmgpu-enable-prefetch=true --iree-opt-data-tiling=false \
  --iree-codegen-gpu-native-math-precision=true --iree-rocm-waves-per-eu=2 \
  --iree-flow-inline-constants-max-byte-length=1 \
  --iree-preprocessing-pass-pipeline="builtin.module(iree-preprocessing-transpose-convolution-pipeline, util.func(iree-preprocessing-pad-to-intrinsics))" \
  vae_decomp_f32_dps/configured_compiled_vae_decode\$async_dispatch_18.mlir -o vae.vmfb

input dispatch IR: https://gist.github.com/IanNod/283d68f9aea0dcb50e94d2b2820bbb21
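In case it helps triage, here is a small driver sketch (not part of the original report) that re-runs the compile command above and checks stderr for the shared-memory diagnostic. It assumes an iree-compile built at the commit listed under Version information is on PATH and that the gist IR has been saved to the input path used in the command; flags that appear twice in the original command are listed once, and the shell escaping of the $ in the filename is dropped because no shell is involved:

```python
# Sketch: re-run the reproduction command and look for the diagnostic.
# Assumes iree-compile (built at c1e542d6) is on PATH and the dispatch IR from
# the gist is saved at INPUT_MLIR. Duplicate flags from the original command
# are listed once; behavior is unchanged.
import subprocess

INPUT_MLIR = "vae_decomp_f32_dps/configured_compiled_vae_decode$async_dispatch_18.mlir"

cmd = [
    "iree-compile",
    "--iree-input-type=torch",
    "--iree-vm-bytecode-module-output-format=flatbuffer-binary",
    "--iree-hal-target-backends=rocm",
    "--mlir-print-debuginfo",
    "--mlir-print-op-on-diagnostic=false",
    "--iree-rocm-target-chip=gfx942",
    "--iree-rocm-waves-per-eu=2",
    "--iree-flow-enable-aggressive-fusion",
    "--iree-codegen-llvmgpu-use-vector-distribution=false",
    "--iree-global-opt-propagate-transposes=true",
    "--iree-opt-const-eval=false",
    "--iree-opt-outer-dim-concat=true",
    "--iree-vm-target-truncate-unsupported-floats",
    "--iree-llvmgpu-enable-prefetch=true",
    "--iree-opt-data-tiling=false",
    "--iree-codegen-gpu-native-math-precision=true",
    "--iree-flow-inline-constants-max-byte-length=1",
    "--iree-preprocessing-pass-pipeline=builtin.module("
    "iree-preprocessing-transpose-convolution-pipeline, "
    "util.func(iree-preprocessing-pad-to-intrinsics))",
    INPUT_MLIR,
    "-o", "vae.vmfb",
]

result = subprocess.run(cmd, capture_output=True, text=True)
if "exceeded the limit" in result.stderr:
    print("reproduced: shared-memory over-allocation diagnostic emitted")
else:
    print(f"exit code {result.returncode}; diagnostic not seen")
```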

What component(s) does this issue relate to?

Compiler

Version information

c1e542d6370473244a8fa9178615cb8a6041b489

Additional context

No response