
Incorrect Numerics for a f32 Depthwise Conv Op #18600

Open zjgarvey opened 1 month ago

zjgarvey commented 1 month ago

What happened?

The following depthwise convolution op (ingested from an ONNX model) produces outputs through IREE on CPU that differ substantially from the results of onnxruntime's CPU implementation.

module {
  func.func @main(%arg0: !torch.vtensor<[1,256,112,112],f32>, %arg1: !torch.vtensor<[256,1,3,3],f32>, %arg2: !torch.vtensor<[256],f32>) -> !torch.vtensor<[1,256,56,56],f32> attributes {torch.onnx_meta.ir_version = 10 : si64, torch.onnx_meta.opset_version = 21 : si64, torch.onnx_meta.producer_name = "", torch.onnx_meta.producer_version = ""} {
    %none = torch.constant.none
    %0 = torch.operator "onnx.Conv"(%arg0, %arg1, %arg2) {torch.onnx.group = 256 : si64, torch.onnx.kernel_shape = [3 : si64, 3 : si64], torch.onnx.pads = [1 : si64, 1 : si64, 1 : si64, 1 : si64], torch.onnx.strides = [2 : si64, 2 : si64]} : (!torch.vtensor<[1,256,112,112],f32>, !torch.vtensor<[256,1,3,3],f32>, !torch.vtensor<[256],f32>) -> !torch.vtensor<[1,256,56,56],f32> 
    return %0 : !torch.vtensor<[1,256,56,56],f32>
  }
}
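
For reference, the expected output spatial size follows the standard convolution shape formula: floor((112 + 2*1 - 3) / 2) + 1 = 56 in each dimension. As a minimal sketch for cross-checking the numerics by hand (this is not IREE's or onnxruntime's implementation, and the random inputs are hypothetical), a NumPy reference for this depthwise convolution could look like:

import numpy as np

def depthwise_conv2d_nchw(x, w, b, stride=2, pad=1):
    # x: (N, C, H, W), w: (C, 1, KH, KW), b: (C,) -- one 3x3 filter per channel
    n, c, h, wd = x.shape
    kh, kw = w.shape[2], w.shape[3]
    oh = (h + 2 * pad - kh) // stride + 1
    ow = (wd + 2 * pad - kw) // stride + 1
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    # Initialize the accumulator with the broadcast bias, as in the linalg IR below.
    out = np.broadcast_to(b.reshape(1, c, 1, 1), (n, c, oh, ow)).copy()
    for i in range(oh):
        for j in range(ow):
            # Window of shape (N, C, KH, KW); each channel convolves with its own filter.
            win = xp[:, :, i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[:, :, i, j] += np.sum(win * w[:, 0, :, :], axis=(2, 3))
    return out

x = np.random.rand(1, 256, 112, 112).astype(np.float32)
w = np.random.rand(256, 1, 3, 3).astype(np.float32)
b = np.random.rand(256).astype(np.float32)
print(depthwise_conv2d_nchw(x, w, b).shape)  # (1, 256, 56, 56)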

Compilation succeeds when targeting mi300, but execution crashes in iree-run-module.

Steps to reproduce your issue

Set up the SHARK test suite for a local IREE build (Linux):

Copy/paste the following to start setting up the test suite. You may need to change the Python executable name (python3.11 below) to whatever your system / IREE build prefers.

git clone https://github.com/nod-ai/SHARK-TestSuite.git
cd SHARK-TestSuite/alt_e2eshark/
python3.11 -m venv ts.venv
source ts.venv/bin/activate
pip install --upgrade pip
pip install -r base_requirements.txt
pip install --no-deps -r torch_mlir_requirements.txt

Then run the following, edited with the path to your IREE build (if it was built with Python bindings):

IREE_BUILD_DIR=<replace with path to iree-build> && \
source ${IREE_BUILD_DIR}/.env && export PYTHONPATH

If iree-compile and iree-run-module are not on your PATH, add them.

Run the test

python run.py -t conv_depthwise -v -m cl-onnx-iree

Inspect the results

The run.py script should generate a sub-directory ./test-run/conv_depthwise_stride_2/. With the mode cl-onnx-iree, it should also generate a /commands/ directory containing the compile and run-module commands. Inspect inference_comparison.log to see printouts of the inputs, outputs, and gold outputs. A programmatic check is sketched below.
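
To quantify the mismatch rather than eyeballing the log, a sketch like the following can be used; the .npy file names are hypothetical and assume you have dumped the IREE output and the gold (onnxruntime) output yourself:

import numpy as np

# Hypothetical dumps of the two outputs; adjust paths to your test-run artifacts.
iree_out = np.load("iree_output.npy")
gold_out = np.load("gold_output.npy")

abs_err = np.abs(iree_out - gold_out)
print("max abs err: ", abs_err.max())
print("mean abs err:", abs_err.mean())
print("allclose:    ", np.allclose(iree_out, gold_out, rtol=1e-4, atol=1e-4))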

Test on GPU

python run.py -t conv_depthwise -v -m cl-onnx-iree -b rocm -d hip -ica "iree-hip-target=gfx942"

Fails on iree-run-module (the stage is called "compiled_inference").

What component(s) does this issue relate to?

Frontends, MLIR, Compiler

Version information

Local build at commit ae6e5d323ecc63e421c79768087f63dc42490cd2

Additional context

Affects a few models, e.g. "maxvit_rmlp_base_rw_224.sw_in12k".

zjgarvey commented 1 month ago

If it is helpful, here is some associated linalg IR:

module {
  ml_program.global private mutable @global_seed(dense<0> : tensor<i64>) : tensor<i64>
  func.func @main(%arg0: tensor<1x256x112x112xf32>, %arg1: tensor<256x1x3x3xf32>, %arg2: tensor<256xf32>) -> tensor<1x256x56x56xf32> {
    %cst = arith.constant 0.000000e+00 : f32
    %padded = tensor.pad %arg0 low[0, 0, 1, 1] high[0, 0, 1, 1] {
    ^bb0(%arg3: index, %arg4: index, %arg5: index, %arg6: index):
      tensor.yield %cst : f32
    } : tensor<1x256x112x112xf32> to tensor<1x256x114x114xf32>
    %0 = tensor.empty() : tensor<1x256x56x56xf32>
    %broadcasted = linalg.broadcast ins(%arg2 : tensor<256xf32>) outs(%0 : tensor<1x256x56x56xf32>) dimensions = [0, 2, 3] 
    %collapsed = tensor.collapse_shape %arg1 [[0, 1], [2], [3]] : tensor<256x1x3x3xf32> into tensor<256x3x3xf32>
    %1 = linalg.depthwise_conv_2d_nchw_chw {dilations = dense<1> : vector<2xi64>, strides = dense<2> : vector<2xi64>} ins(%padded, %collapsed : tensor<1x256x114x114xf32>, tensor<256x3x3xf32>) outs(%broadcasted : tensor<1x256x56x56xf32>) -> tensor<1x256x56x56xf32>
    return %1 : tensor<1x256x56x56xf32>
  }
}
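
As a sanity check that this decomposition is equivalent to the original grouped convolution, here is a sketch in PyTorch (assuming torch is available in your environment; it is not part of the repro commands above) comparing the direct grouped conv against the pad + bias-broadcast + depthwise-accumulate form from the IR:

import torch

x = torch.rand(1, 256, 112, 112)
w = torch.rand(256, 1, 3, 3)
b = torch.rand(256)

# Direct grouped conv, matching the onnx.Conv attributes (group=256, stride=2, pad=1).
direct = torch.nn.functional.conv2d(x, w, bias=b, stride=2, padding=1, groups=256)

# Mirror of the linalg decomposition: explicit pad, bias broadcast as the
# accumulator init, then a bias-free depthwise conv accumulated on top.
padded = torch.nn.functional.pad(x, (1, 1, 1, 1))
init = b.reshape(1, 256, 1, 1).expand(1, 256, 56, 56)
decomposed = init + torch.nn.functional.conv2d(padded, w, stride=2, groups=256)

print(torch.allclose(direct, decomposed, atol=1e-5))  # expect True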