Inconsistency between iree-compile and standalone torch-mlir-opt compile

What happened?

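An inconsistency was found when lowering the Inception_v4_vaiq_int8 model (see https://github.com/nod-ai/SHARK-TestSuite/issues/190): lowering to linalg with standalone torch-mlir-opt and then compiling with iree-compile succeeds, while compiling the torch-level MLIR directly with iree-compile fails.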

Passed: standalone torch-mlir-opt + iree: onnx -> torch -> linalg -> vmfb

/home/chi/src/torch-mlir/build/bin/torch-mlir-opt -pass-pipeline='builtin.module(func.func(convert-torch-onnx-to-torch),torch-lower-to-backend-contract,func.func(cse,canonicalize),torch-backend-to-linalg-on-tensors-backend-pipeline)' Inception_v4_vaiq_int8.default.torch-onnx.mlir > Inception_v4_vaiq_int8.default.onnx.linalg.mlir

/home/chi/src/iree-build/tools/iree-compile --iree-input-demote-i64-to-i32 --iree-hal-target-backends=llvm-cpu  Inception_v4_vaiq_int8.default.onnx.linalg.mlir > Inception_v4_vaiq_int8.default.vmfb

/home/chi/src/iree-build/tools/iree-run-module --module=Inception_v4_vaiq_int8.default.vmfb --input="32x3x224x224xf32=@inference_input.0.bin"  --output=@inference_output.0.bin  --output=@inference_output.1.bin  --output=@inference_output.2.bin  --output=@inference_output.3.bin  --output=@inference_output.4.bin  --output=@inference_output.5.bin  --output=@inference_output.6.bin  --output=@inference_output.7.bin  --output=@inference_output.8.bin  --output=@inference_output.9.bin  --output=@inference_output.10.bin  --output=@inference_output.11.bin  --output=@inference_output.12.bin  --output=@inference_output.13.bin  --output=@inference_output.14.bin  --output=@inference_output.15.bin  --output=@inference_output.16.bin  --output=@inference_output.17.bin  --output=@inference_output.18.bin  --output=@inference_output.19.bin  --output=@inference_output.20.bin  --output=@inference_output.21.bin  --output=@inference_output.22.bin  --output=@inference_output.23.bin  --output=@inference_output.24.bin  --output=@inference_output.25.bin  --output=@inference_output.26.bin  --output=@inference_output.27.bin  --output=@inference_output.28.bin  --output=@inference_output.29.bin  --output=@inference_output.30.bin  --output=@inference_output.31.bin
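The iree-run-module invocation above repeats the --output flag once per batch element (32 times). A minimal sketch of generating that command line instead of typing it out, assuming the same file names as in this report; the helper name build_run_command is hypothetical and not part of IREE:

```python
import shlex


def build_run_command(module, input_spec, num_outputs, tool="iree-run-module"):
    """Build the argv list for iree-run-module with one --output flag per result.

    Only the flags already used in this report (--module, --input, --output)
    are emitted; the output file naming follows the report's pattern.
    """
    args = [tool, f"--module={module}", f"--input={input_spec}"]
    args += [f"--output=@inference_output.{i}.bin" for i in range(num_outputs)]
    return args


cmd = build_run_command(
    module="Inception_v4_vaiq_int8.default.vmfb",
    input_spec="32x3x224x224xf32=@inference_input.0.bin",
    num_outputs=32,
)

# Print a shell-ready command line (shlex.join requires Python 3.8+).
print(shlex.join(cmd))
```

The list form can also be passed directly to subprocess.run, which avoids shell quoting altogether.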

Failed: iree: onnx -> vmfb:

/home/chi/src/iree-build/tools/iree-compile --iree-input-demote-i64-to-i32 --iree-hal-target-backends=llvm-cpu  Inception_v4_vaiq_int8.default.onnx.torch.mlir > Inception_v4_vaiq_int8.default.vmfb

Failed log:

failed to translate executables
Inception_v4_vaiq_int8.default.onnx.torch.mlir:403:12: error: One or more operations with large vector sizes (8192 bytes) were found:

    %399 = torch.operator "onnx.Relu"(%398) : (!torch.vtensor<[32,192,25,25],f32>) -> !torch.vtensor<[32,192,25,25],f32> 
           ^
<unknown>:0: note:   %cst_0 = arith.constant dense<1.250000e-01> : vector<1x192x3x5xf32>

Inception_v4_vaiq_int8.default.onnx.torch.mlir:397:12: note:   %21 = arith.extsi %20 : vector<1x192x3x5xi8> to vector<1x192x3x5xi32>

    %393 = torch.operator "onnx.DequantizeLinear"(%392, %303, %301) : (!torch.vtensor<[32,192,52,52],si8>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[32,192,52,52],f32> 
           ^
Inception_v4_vaiq_int8.default.onnx.torch.mlir:397:12: note:   %22 = arith.sitofp %21 : vector<1x192x3x5xi32> to vector<1x192x3x5xf32>

Inception_v4_vaiq_int8.default.onnx.torch.mlir:397:12: note:   %23 = arith.mulf %22, %cst_0 : vector<1x192x3x5xf32>

Inception_v4_vaiq_int8.default.onnx.torch.mlir:397:12: note:   %24 = vector.transfer_write %23, %18[%c0, %c0, %c0, %c0], %19 {in_bounds = [true, true, true, true]} : vector<1x192x3x5xf32>, tensor<1x192x3x?xf32>

Inception_v4_vaiq_int8.default.onnx.torch.mlir:403:12: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>
    %399 = torch.operator "onnx.Relu"(%398) : (!torch.vtensor<[32,192,25,25],f32>) -> !torch.vtensor<[32,192,25,25],f32> 
           ^
Inception_v4_vaiq_int8.default.onnx.torch.mlir:403:12: note: see current operation: 
"hal.executable.variant"() ({
  "hal.executable.export"() ({
  ^bb0(%arg18: !hal.device):
    %72 = "arith.constant"() <{value = 12 : index}> : () -> index
    %73 = "arith.constant"() <{value = 1 : index}> : () -> index
    "hal.return"(%72, %73, %73) : (index, index, index) -> ()
  }) {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>, #hal.interface.binding<0, 2>], layout = #hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer, ReadOnly>, <2, storage_buffer>]>]>, ordinal = 0 : index, sym_name = "torch_jit$async_dispatch_19_conv_2d_nchw_fchw_32x192x25x25x192x3x3_f32"} : () -> ()
  "builtin.module"() ({
    "func.func"() <{function_type = () -> (), sym_name = "torch_jit$async_dispatch_19_conv_2d_nchw_fchw_32x192x25x25x192x3x3_f32"}> ({
      %0 = "arith.constant"() <{value = dense<0.000000e+00> : vector<1x2x1x2xf32>}> : () -> vector<1x2x1x2xf32>
      %1 = "arith.constant"() <{value = dense<1.250000e-01> : vector<1x192x3x5xf32>}> : () -> vector<1x192x3x5xf32>
      %2 = "arith.constant"() <{value = 0 : i8}> : () -> i8
      %3 = "arith.constant"() <{value = 8 : index}> : () -> index
      %4 = "arith.constant"() <{value = 3 : index}> : () -> index
      %5 = "arith.constant"() <{value = 25 : index}> : () -> index
      %6 = "arith.constant"() <{value = 2 : index}> : () -> index
      %7 = "arith.constant"() <{value = 16 : index}> : () -> index
      %8 = "arith.constant"() <{value = 1 : index}> : () -> index
      %9 = "arith.constant"() <{value = 32 : index}> : () -> index
      %10 = "arith.constant"() <{value = 192 : index}> : () -> index
      %11 = "arith.constant"() <{value = 0.000000e+00 : f32}> : () -> f32
      %12 = "arith.constant"() <{value = 0 : index}> : () -> index
      %13 = "arith.constant"() <{value = 145961472 : index}> : () -> index
      %14 = "arith.constant"() <{value = 147288576 : index}> : () -> index
      %15 = "arith.constant"() <{value = 16613376 : index}> : () -> index
      %16 = "hal.interface.binding.subspan"(%12) {alignment = 64 : index, binding = 0 : index, descriptor_flags = 1 : i32, descriptor_type = #hal.descriptor_type<storage_buffer>, operandSegmentSizes = array<i32: 1, 0>, set = 0 : index} : (index) -> !flow.dispatch.tensor<readonly:tensor<32x192x52x52xi8>>
      %17 = "hal.interface.binding.subspan"(%13) {alignment = 64 : index, binding = 1 : index, descriptor_flags = 1 : i32, descriptor_type = #hal.descriptor_type<storage_buffer>, operandSegmentSizes = array<i32: 1, 0>, set = 0 : index} : (index) -> !flow.dispatch.tensor<readonly:tensor<192x192x3x3xf32>>
      %18 = "hal.interface.binding.subspan"(%14) {alignment = 64 : index, binding = 1 : index, descriptor_flags = 1 : i32, descriptor_type = #hal.descriptor_type<storage_buffer>, operandSegmentSizes = array<i32: 1, 0>, set = 0 : index} : (index) -> !flow.dispatch.tensor<readonly:tensor<192xf32>>
      %19 = "hal.interface.binding.subspan"(%15) {alignment = 64 : index, binding = 2 : index, descriptor_type = #hal.descriptor_type<storage_buffer>, operandSegmentSizes = array<i32: 1, 0>, set = 0 : index} : (index) -> !flow.dispatch.tensor<writeonly:tensor<32x192x25x25xf32>>
      %20 = "hal.interface.workgroup.id"() {dimension = 0 : index} : () -> index
      %21 = "hal.interface.workgroup.count"() {dimension = 0 : index} : () -> index
      %22 = "affine.apply"(%20) <{map = affine_map<()[s0] -> (s0 * 16)>}> : (index) -> index
      %23 = "affine.apply"(%21) <{map = affine_map<()[s0] -> (s0 * 16)>}> : (index) -> index
      %24 = "flow.dispatch.tensor.load"(%16) <{operandSegmentSizes = array<i32: 1, 0, 0, 0, 0>, static_offsets = array<i64: 0, 0, 0, 0>, static_sizes = array<i64: 32, 192, 51, 51>, static_strides = array<i64: 1, 1, 1, 1>}> : (!flow.dispatch.tensor<readonly:tensor<32x192x52x52xi8>>) -> tensor<32x192x51x51xi8>
      "scf.for"(%22, %10, %23) ({
      ^bb0(%arg0: index):
        %25 = "flow.dispatch.tensor.load"(%19, %arg0) <{operandSegmentSizes = array<i32: 1, 0, 1, 0, 0>, static_offsets = array<i64: 0, -9223372036854775808, 0, 0>, static_sizes = array<i64: 32, 16, 25, 25>, static_strides = array<i64: 1, 1, 1, 1>}> : (!flow.dispatch.tensor<writeonly:tensor<32x192x25x25xf32>>, index) -> tensor<32x16x25x25xf32>
        %26 = "flow.dispatch.tensor.load"(%17, %arg0) <{operandSegmentSizes = array<i32: 1, 0, 1, 0, 0>, static_offsets = array<i64: -9223372036854775808, 0, 0, 0>, static_sizes = array<i64: 16, 192, 3, 3>, static_strides = array<i64: 1, 1, 1, 1>}> : (!flow.dispatch.tensor<readonly:tensor<192x192x3x3xf32>>, index) -> tensor<16x192x3x3xf32>
        %27 = "flow.dispatch.tensor.load"(%18, %arg0) <{operandSegmentSizes = array<i32: 1, 0, 1, 0, 0>, static_offsets = array<i64: -9223372036854775808>, static_sizes = array<i64: 16>, static_strides = array<i64: 1>}> : (!flow.dispatch.tensor<readonly:tensor<192xf32>>, index) -> tensor<16xf32>
        %28 = "scf.for"(%12, %9, %8, %25) ({
        ^bb0(%arg1: index, %arg2: tensor<32x16x25x25xf32>):
          %29 = "scf.for"(%12, %7, %6, %arg2) ({
          ^bb0(%arg3: index, %arg4: tensor<32x16x25x25xf32>):
            %30 = "tensor.extract_slice"(%26, %arg3) <{operandSegmentSizes = array<i32: 1, 1, 0, 0>, static_offsets = array<i64: -9223372036854775808, 0, 0, 0>, static_sizes = array<i64: 2, 192, 3, 3>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<16x192x3x3xf32>, index) -> tensor<2x192x3x3xf32>
            %31 = "scf.for"(%12, %5, %8, %arg4) ({
            ^bb0(%arg5: index, %arg6: tensor<32x16x25x25xf32>):
              %32 = "affine.apply"(%arg5) <{map = affine_map<(d0) -> (d0 * 2)>}> : (index) -> index
              %33 = "scf.for"(%12, %5, %6, %arg6) ({
              ^bb0(%arg7: index, %arg8: tensor<32x16x25x25xf32>):
                %34 = "affine.min"(%arg7) <{map = affine_map<(d0) -> (-d0 + 25, 2)>}> : (index) -> index
                %35 = "affine.apply"(%arg7) <{map = affine_map<(d0) -> (d0 * 2)>}> : (index) -> index
                %36 = "affine.apply"(%34) <{map = affine_map<(d0) -> (d0 * 2 + 1)>}> : (index) -> index
                %37 = "tensor.extract_slice"(%24, %arg1, %32, %35, %36) <{operandSegmentSizes = array<i32: 1, 3, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0, -9223372036854775808, -9223372036854775808>, static_sizes = array<i64: 1, 192, 3, -9223372036854775808>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<32x192x51x51xi8>, index, index, index, index) -> tensor<1x192x3x?xi8>
                %38 = "tensor.empty"(%36) : (index) -> tensor<1x192x3x?xf32>
                %39 = "vector.create_mask"(%8, %10, %4, %36) : (index, index, index, index) -> vector<1x192x3x5xi1>
                %40 = "vector.transfer_read"(%37, %12, %12, %12, %12, %2, %39) <{in_bounds = [true, true, true, true], operandSegmentSizes = array<i32: 1, 4, 1, 1>, permutation_map = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>}> : (tensor<1x192x3x?xi8>, index, index, index, index, i8, vector<1x192x3x5xi1>) -> vector<1x192x3x5xi8>
                %41 = "arith.extsi"(%40) : (vector<1x192x3x5xi8>) -> vector<1x192x3x5xi32>
                %42 = "arith.sitofp"(%41) : (vector<1x192x3x5xi32>) -> vector<1x192x3x5xf32>
                %43 = "arith.mulf"(%42, %1) <{fastmath = #arith.fastmath<none>}> : (vector<1x192x3x5xf32>, vector<1x192x3x5xf32>) -> vector<1x192x3x5xf32>
                %44 = "vector.transfer_write"(%43, %38, %12, %12, %12, %12, %39) <{in_bounds = [true, true, true, true], operandSegmentSizes = array<i32: 1, 1, 4, 1>, permutation_map = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>}> : (vector<1x192x3x5xf32>, tensor<1x192x3x?xf32>, index, index, index, index, vector<1x192x3x5xi1>) -> tensor<1x192x3x?xf32>
                %45 = "tensor.extract_slice"(%arg8, %arg1, %arg3, %arg5, %arg7, %34) <{operandSegmentSizes = array<i32: 1, 4, 1, 0>, static_offsets = array<i64: -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808>, static_sizes = array<i64: 1, 2, 1, -9223372036854775808>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<32x16x25x25xf32>, index, index, index, index, index) -> tensor<1x2x1x?xf32>
                %46 = "vector.create_mask"(%8, %6, %8, %34) : (index, index, index, index) -> vector<1x2x1x2xi1>
                %47 = "vector.transfer_write"(%0, %45, %12, %12, %12, %12, %46) <{in_bounds = [true, true, true, true], operandSegmentSizes = array<i32: 1, 1, 4, 1>, permutation_map = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>}> : (vector<1x2x1x2xf32>, tensor<1x2x1x?xf32>, index, index, index, index, vector<1x2x1x2xi1>) -> tensor<1x2x1x?xf32>
                %48 = "affine.apply"(%34) <{map = affine_map<(d0) -> (d0 * 2 - 1)>}> : (index) -> index
                %49 = "tensor.extract_slice"(%47, %34) <{operandSegmentSizes = array<i32: 1, 0, 1, 0>, static_offsets = array<i64: 0, 0, 0, 0>, static_sizes = array<i64: 1, 2, 1, -9223372036854775808>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<1x2x1x?xf32>, index) -> tensor<1x2x1x?xf32>
                %50 = "tensor.extract_slice"(%49, %34) <{operandSegmentSizes = array<i32: 1, 0, 1, 0>, static_offsets = array<i64: 0, 0, 0, 0>, static_sizes = array<i64: 1, 2, 1, -9223372036854775808>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<1x2x1x?xf32>, index) -> tensor<1x2x?xf32>
                %51 = "scf.for"(%12, %10, %3, %50) ({
                ^bb0(%arg9: index, %arg10: tensor<1x2x?xf32>):
                  %63 = "scf.for"(%12, %4, %8, %arg10) ({
                  ^bb0(%arg11: index, %arg12: tensor<1x2x?xf32>):
                    %64 = "scf.for"(%12, %4, %8, %arg12) ({
                    ^bb0(%arg13: index, %arg14: tensor<1x2x?xf32>):
                      %65 = "tensor.extract_slice"(%44, %arg9, %arg11, %arg13, %48) <{operandSegmentSizes = array<i32: 1, 3, 1, 0>, static_offsets = array<i64: 0, -9223372036854775808, -9223372036854775808, -9223372036854775808>, static_sizes = array<i64: 1, 8, 1, -9223372036854775808>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<1x192x3x?xf32>, index, index, index, index) -> tensor<1x8x1x?xf32>
                      %66 = "tensor.extract_slice"(%30, %arg9, %arg11, %arg13) <{operandSegmentSizes = array<i32: 1, 3, 0, 0>, static_offsets = array<i64: 0, -9223372036854775808, -9223372036854775808, -9223372036854775808>, static_sizes = array<i64: 2, 8, 1, 1>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<2x192x3x3xf32>, index, index, index) -> tensor<2x8x1x1xf32>
                      %67 = "tensor.extract_slice"(%65, %48) <{operandSegmentSizes = array<i32: 1, 0, 1, 0>, static_offsets = array<i64: 0, 0, 0, 0>, static_sizes = array<i64: 1, 8, 1, -9223372036854775808>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<1x8x1x?xf32>, index) -> tensor<1x8x?xf32>
                      %68 = "tensor.extract_slice"(%66) <{operandSegmentSizes = array<i32: 1, 0, 0, 0>, static_offsets = array<i64: 0, 0, 0, 0>, static_sizes = array<i64: 2, 8, 1, 1>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<2x8x1x1xf32>) -> tensor<2x8x1xf32>
                      %69 = "linalg.conv_1d_ncw_fcw"(%67, %68, %arg14) <{dilations = dense<1> : vector<1xi64>, operandSegmentSizes = array<i32: 2, 1>, strides = dense<2> : vector<1xi64>}> ({
                      ^bb0(%arg15: f32, %arg16: f32, %arg17: f32):
                        %70 = "arith.mulf"(%arg15, %arg16) <{fastmath = #arith.fastmath<none>}> : (f32, f32) -> f32
                        %71 = "arith.addf"(%arg17, %70) <{fastmath = #arith.fastmath<none>}> : (f32, f32) -> f32
                        "linalg.yield"(%71) : (f32) -> ()
                      }) {linalg.memoized_indexing_maps = [affine_map<(d0, d1, d2, d3, d4) -> (d0, d3, d2 * 2 + d4)>, affine_map<(d0, d1, d2, d3, d4) -> (d1, d3, d4)>, affine_map<(d0, d1, d2, d3, d4) -> (d0, d1, d2)>]} : (tensor<1x8x?xf32>, tensor<2x8x1xf32>, tensor<1x2x?xf32>) -> tensor<1x2x?xf32>
                      "scf.yield"(%69) : (tensor<1x2x?xf32>) -> ()
                    }) : (index, index, index, tensor<1x2x?xf32>) -> tensor<1x2x?xf32>
                    "scf.yield"(%64) : (tensor<1x2x?xf32>) -> ()
                  }) : (index, index, index, tensor<1x2x?xf32>) -> tensor<1x2x?xf32>
                  "scf.yield"(%63) : (tensor<1x2x?xf32>) -> ()
                }) : (index, index, index, tensor<1x2x?xf32>) -> tensor<1x2x?xf32>
                %52 = "tensor.insert_slice"(%51, %49, %34) <{operandSegmentSizes = array<i32: 1, 1, 0, 1, 0>, static_offsets = array<i64: 0, 0, 0, 0>, static_sizes = array<i64: 1, 2, 1, -9223372036854775808>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<1x2x?xf32>, tensor<1x2x1x?xf32>, index) -> tensor<1x2x1x?xf32>
                %53 = "tensor.insert_slice"(%52, %47, %34) <{operandSegmentSizes = array<i32: 1, 1, 0, 1, 0>, static_offsets = array<i64: 0, 0, 0, 0>, static_sizes = array<i64: 1, 2, 1, -9223372036854775808>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<1x2x1x?xf32>, tensor<1x2x1x?xf32>, index) -> tensor<1x2x1x?xf32>
                %54 = "vector.transfer_read"(%27, %arg3, %11) <{in_bounds = [true], operandSegmentSizes = array<i32: 1, 1, 1, 0>, permutation_map = affine_map<(d0) -> (d0)>}> : (tensor<16xf32>, index, f32) -> vector<2xf32>
                %55 = "vector.broadcast"(%54) : (vector<2xf32>) -> vector<1x1x2x2xf32>
                %56 = "vector.transpose"(%55) <{permutation = array<i64: 0, 3, 1, 2>}> : (vector<1x1x2x2xf32>) -> vector<1x2x1x2xf32>
                %57 = "vector.transfer_read"(%53, %12, %12, %12, %12, %11, %46) <{in_bounds = [true, true, true, true], operandSegmentSizes = array<i32: 1, 4, 1, 1>, permutation_map = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>}> : (tensor<1x2x1x?xf32>, index, index, index, index, f32, vector<1x2x1x2xi1>) -> vector<1x2x1x2xf32>
                %58 = "arith.addf"(%57, %56) <{fastmath = #arith.fastmath<none>}> : (vector<1x2x1x2xf32>, vector<1x2x1x2xf32>) -> vector<1x2x1x2xf32>
                %59 = "arith.cmpf"(%58, %0) <{fastmath = #arith.fastmath<none>, predicate = 9 : i64}> : (vector<1x2x1x2xf32>, vector<1x2x1x2xf32>) -> vector<1x2x1x2xi1>
                %60 = "arith.select"(%59, %58, %0) : (vector<1x2x1x2xi1>, vector<1x2x1x2xf32>, vector<1x2x1x2xf32>) -> vector<1x2x1x2xf32>
                %61 = "vector.transfer_write"(%60, %53, %12, %12, %12, %12, %46) <{in_bounds = [true, true, true, true], operandSegmentSizes = array<i32: 1, 1, 4, 1>, permutation_map = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>}> : (vector<1x2x1x2xf32>, tensor<1x2x1x?xf32>, index, index, index, index, vector<1x2x1x2xi1>) -> tensor<1x2x1x?xf32>
                %62 = "tensor.insert_slice"(%61, %arg8, %arg1, %arg3, %arg5, %arg7, %34) <{operandSegmentSizes = array<i32: 1, 1, 4, 1, 0>, static_offsets = array<i64: -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808>, static_sizes = array<i64: 1, 2, 1, -9223372036854775808>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<1x2x1x?xf32>, tensor<32x16x25x25xf32>, index, index, index, index, index) -> tensor<32x16x25x25xf32>
                "scf.yield"(%62) : (tensor<32x16x25x25xf32>) -> ()
              }) : (index, index, index, tensor<32x16x25x25xf32>) -> tensor<32x16x25x25xf32>
              "scf.yield"(%33) : (tensor<32x16x25x25xf32>) -> ()
            }) : (index, index, index, tensor<32x16x25x25xf32>) -> tensor<32x16x25x25xf32>
            "scf.yield"(%31) : (tensor<32x16x25x25xf32>) -> ()
          }) : (index, index, index, tensor<32x16x25x25xf32>) -> tensor<32x16x25x25xf32>
          "scf.yield"(%29) : (tensor<32x16x25x25xf32>) -> ()
        }) : (index, index, index, tensor<32x16x25x25xf32>) -> tensor<32x16x25x25xf32>
        "flow.dispatch.tensor.store"(%28, %19, %arg0) <{operandSegmentSizes = array<i32: 1, 1, 0, 1, 0, 0>, static_offsets = array<i64: 0, -9223372036854775808, 0, 0>, static_sizes = array<i64: 32, 16, 25, 25>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<32x16x25x25xf32>, !flow.dispatch.tensor<writeonly:tensor<32x192x25x25xf32>>, index) -> ()
        "scf.yield"() : () -> ()
      }) : (index, index, index) -> ()
      "func.return"() : () -> ()
    }) {translation_info = #iree_codegen.translation_info<CPUConvTileAndDecomposeExpert>} : () -> ()
  }) : () -> ()
  "hal.executable.variant_end"() : () -> ()
}) {sym_name = "embedded_elf_x86_64", target = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>} : () -> ()
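For reference, the sizes of the vectors flagged in the log can be checked by hand. The sketch below is only arithmetic on the types printed above; it assumes the 8192 bytes in the error message is the size limit being enforced (it happens to equal 512 times the native_vector_size of 16 shown in the target attributes, but that derivation is an assumption, not something the log states):

```python
import re
from math import prod

# Byte widths for the element types that appear in the flagged ops.
ELEMENT_BYTES = {"f32": 4, "i32": 4, "i8": 1}

# Taken from the error message; interpreted here as the limit (assumption).
REPORTED_LIMIT_BYTES = 8192


def vector_bytes(type_str):
    """Return the total byte size of an MLIR vector type such as vector<1x192x3x5xf32>."""
    match = re.fullmatch(r"vector<((?:\d+x)+)(\w+)>", type_str)
    if match is None:
        raise ValueError(f"not a static vector type: {type_str}")
    dims = [int(d) for d in match.group(1).rstrip("x").split("x")]
    return prod(dims) * ELEMENT_BYTES[match.group(2)]


flagged = [
    "vector<1x192x3x5xf32>",
    "vector<1x192x3x5xi32>",
    "vector<1x192x3x5xi8>",
]

for type_str in flagged:
    size = vector_bytes(type_str)
    status = "over" if size > REPORTED_LIMIT_BYTES else "within"
    print(f"{type_str}: {size} bytes ({status} {REPORTED_LIMIT_BYTES})")
```

The f32 and i32 vectors hold 1 x 192 x 3 x 5 = 2880 elements of 4 bytes each, so they exceed the reported figure, while the i8 vector on its own does not. In the dumped dispatch these vectors come from the dequantization ops (extsi, sitofp, mulf) in the same dispatch as the convolution.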

Steps to reproduce your issue

git clone https://github.com/nod-ai/SHARK-TestSuite
cd SHARK-TestSuite/e2eshark/

Passed: standalone torch-mlir-opt + iree: onnx -> torch -> linalg -> vmfb

python ./run.py --torchmlirbuild ../../torch-mlir/build --tolerance 0.001 0.001 --cachedir ./huggingface_cache --ireebuild ../../iree-build -f onnx -g models --mode onnx --report --tests onnx/models/Inception_v4_vaiq_int8 --torchtolinalg

Failed: iree: onnx -> vmfb:

python ./run.py --tolerance 0.001 0.001 --cachedir ./huggingface_cache --ireebuild ../../iree-build -f onnx -g models --mode onnx --report --tests onnx/models/Inception_v4_vaiq_int8

The Inception_v4_vaiq_int8.default.torch-onnx.mlir file can be found in SHARK-TestSuite/e2eshark/test-run/onnx/models/Inception_v4_vaiq_int8.

What component(s) does this issue relate to?

Compiler

Version information

iree: candidate-20240704.944
torch-mlir: ca0e9066755b35c0889c6ab792265b0886325f50

Additional context

No response

iree-org / iree

Inconsistency between iree-compile and standalone torch-mlir-opt compile #17832
