
iree.abi.output does not work as expected for complex<f32> #15652

Open · okkwon opened this issue 12 months ago

okkwon commented 12 months ago

What happened?

Input module.mlir:

func.func @add_20x20xcomplex64_20x20xcomplex64_20x20xcomplex64(%arg0: tensor<20x20xcomplex<f32>>, %arg1: tensor<20x20xcomplex<f32>>, %arg2: tensor<20x20xcomplex<f32>> {iree.abi.output = 0 : index}) -> tensor<20x20xcomplex<f32>> {
  %0 = stablehlo.add %arg0, %arg1 : tensor<20x20xcomplex<f32>>
  return %0 : tensor<20x20xcomplex<f32>>
}

compile command:

iree-compile --iree-hal-target-backends=llvm-cpu --iree-input-demote-i64-to-i32=false --iree-input-demote-f64-to-f32=false --iree-opt-demote-f64-to-f32=false module.mlir -o module.vmfb

dump: https://gist.github.com/okkwon/a072d670831ba80580481db9aece2184

The same op with a different element type, e.g. f16, works as expected (no stream.resource.alloca is generated).

dump for f16: https://gist.github.com/okkwon/389e2dda53cc4bd69c0a51b0e53c68a6
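For reference, the f16 repro is presumably the same function with the element type swapped, as in the sketch below (reconstructed from the description; the exact input module is in the gist):

func.func @add_20x20xf16_20x20xf16_20x20xf16(%arg0: tensor<20x20xf16>, %arg1: tensor<20x20xf16>, %arg2: tensor<20x20xf16> {iree.abi.output = 0 : index}) -> tensor<20x20xf16> {
  %0 = stablehlo.add %arg0, %arg1 : tensor<20x20xf16>
  return %0 : tensor<20x20xf16>
}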

One major difference is that for complex, multiple ops are generated, and they are not inlined into the top-level function that operates on !hal.buffer values. iree.abi.output is not honored during the private function lowering; for example, IPO removes the operand.
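For context, here is a rough sketch of what one would expect when iree.abi.output is honored: after the ABI transformation, the result is exported into the caller-provided storage rather than a freshly allocated buffer. hal.tensor.import/hal.tensor.export and the into(...) storage clause are real IREE constructs, but the wrapper below (including the @expected_wrapper name) is an assumption for illustration and is not taken from the dumps above:

// Hypothetical expected lowering (sketch, not from the actual dump):
func.func @expected_wrapper(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view {
  %lhs = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<20x20xcomplex<f32>>
  %rhs = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<20x20xcomplex<f32>>
  %sum = stablehlo.add %lhs, %rhs : tensor<20x20xcomplex<f32>>
  // The result should reuse the caller-provided storage in %arg2 (output index 0),
  // so no stream.resource.alloca should appear later in the pipeline.
  %view = hal.tensor.export %sum into(%arg2 : !hal.buffer_view) : tensor<20x20xcomplex<f32>> -> !hal.buffer_view
  return %view : !hal.buffer_view
}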

okkwon commented 12 months ago

@benvanik @MaheshRavishankar This is an extended use case of iree.abi.output, and it looks like we need more discussion.