onnx / onnx-mlir

Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure
Apache License 2.0

Cannot compile bidaf-9.onnx in the model zoo. #1442

Closed negiyas closed 1 year ago

negiyas commented 2 years ago

The latest main branch of onnx-mlir cannot compile bidaf-9.onnx from the ONNX model zoo. The following errors occur during compilation.

$ wget https://github.com/onnx/models/blob/main/text/machine_comprehension/bidirectional_attention_flow/model/bidaf-9.onnx?raw=true -O bidaf-9.onnx
...
$ ./build/Debug/bin/onnx-mlir bidaf-9.onnx
Warning: ONNX CategoryMapper in your model is using Opset 1, which is quite old. Please consider regenerating your model with a newer Opset.
Warning: ONNX CategoryMapper in your model is using Opset 1, which is quite old. Please consider regenerating your model with a newer Opset.
Warning: ONNX CategoryMapper in your model is using Opset 1, which is quite old. Please consider regenerating your model with a newer Opset.
Warning: ONNX CategoryMapper in your model is using Opset 1, which is quite old. Please consider regenerating your model with a newer Opset.
error: Invalid axis value
error: shape inference failed

The result of visualizing the model with https://netron.app/ is as follows.

[image: Netron view around the failing Squeeze_28 node]

 node {
    input: "start_max_index"
    output: "Squeeze_28"
    name: "Squeeze_28"
    op_type: "Squeeze"
    attribute {
      name: "axes"
      ints: 2
      type: INTS
    }
  }

The error comes from an onnx.Squeeze op. According to the shape-inference results, the input rank is 2, so onnx.Squeeze with axes=2 is invalid (the axis must be less than 2).
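The axis-validity rule that triggers the error can be sketched in plain Python (a simplified check mirroring the ONNX Squeeze specification; the function name is illustrative, not onnx-mlir's actual code):

```python
def squeeze_axes_valid(input_rank, axes):
    # ONNX Squeeze: every axis must lie in [-r, r-1] for an input of rank r,
    # so axes=[2] is out of range once the input is inferred to be rank 2.
    return all(-input_rank <= a < input_rank for a in axes)

squeeze_axes_valid(3, [2])  # valid for the rank-3 tensor the graph promises
squeeze_axes_valid(2, [2])  # invalid for the rank-2 tensor shape inference sees
```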

It seems that the mismatch comes from the preceding onnx.Scan op in the graph. The output of the onnx.Scan op should have shape cx1x1 according to the graph, but shape inference reports rank 2, which is inconsistent. (The input of the onnx.Scan op has shape cx1x1 in the graph, and shape inference reports rank 3, which is consistent.)
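The rank bookkeeping for onnx.Scan can be sketched as follows (a simplified model of the opset-9 Scan shape rules, assuming one scan input and default axes/directions; the function and -1 dynamic-dimension convention are illustrative):

```python
def scan_shapes(scan_input_shape, body_scan_output_shape):
    # ONNX Scan (opset >= 9, one scan input, default axes/directions):
    # the body sees the scan input with its leading sequence axis removed,
    # and each scan output gains that sequence axis back when the
    # per-iteration results are stacked.
    seq_len = scan_input_shape[0]
    body_input_shape = scan_input_shape[1:]
    scan_output_shape = (seq_len,) + tuple(body_scan_output_shape)
    return body_input_shape, scan_output_shape

# Scan input cx1x1 (c dynamic, written -1): body sees 1x1, and a 1x1
# body scan output should stack back to a rank-3 cx1x1 result.
body_in, scan_out = scan_shapes((-1, 1, 1), (1, 1))
```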

negiyas commented 2 years ago

It seems that the imported onnx.Scan op is incorrect.

The netron.app outputs ( https://netron.app/ ) of the onnx.Scan op ("Scan_20") are as follows. [image: Netron view of Scan_20]

I generated an MLIR file with ./build/Debug/bin/onnx-mlir bidaf-9.onnx --EmitONNXBasic. The imported onnx.Scan op in the generated bidaf-9.onnx.mlir file is shown below; it differs from the original model in two ways.

  1. The fourth input type should not be tensor<1x1xf32>, but tensor<-1x1x1xf32> (rank 3 with a dynamic leading dimension).
  2. The fourth and fifth output types should not be tensor<1x1xf32>, but tensor<-1x1x1xf32>.
    %302:5 = "onnx.Scan"(%299, %300, %301, %298) ({
    ^bb0(%arg4: tensor<1x1xf32>, %arg5: tensor<1x1xf32>, %arg6: tensor<1xf32>, %arg7: tensor<1x1xf32>):
      %316 = "onnx.Greater"(%arg7, %arg4) {onnx_node_name = "start_max_Greater_12"} : (tensor<1x1xf32>, tensor<1x1xf32>) -> tensor<*xi1>
      %317 = "onnx.Where"(%316, %arg7, %arg4) {onnx_node_name = "start_max_Where_13"} : (tensor<*xi1>, tensor<1x1xf32>, tensor<1x1xf32>) -> tensor<*xf32>
      %318 = "onnx.Where"(%316, %arg6, %arg5) {onnx_node_name = "start_max_Where_14"} : (tensor<*xi1>, tensor<1xf32>, tensor<1x1xf32>) -> tensor<*xf32>
      %319 = "onnx.Constant"() {value = dense<1.000000e+00> : tensor<1xf32>} : () -> tensor<1xf32>
      %320 = "onnx.Add"(%arg6, %319) {onnx_node_name = "start_max_Add_15"} : (tensor<1xf32>, tensor<1xf32>) -> tensor<1xf32>
      %321 = "onnx.Identity"(%317) {onnx_node_name = "start_max_Identity_16"} : (tensor<*xf32>) -> tensor<1x1xf32>
      %322 = "onnx.Identity"(%318) {onnx_node_name = "start_max_Identity_17"} : (tensor<*xf32>) -> tensor<1x1xf32>
      %323 = "onnx.Identity"(%317) {onnx_node_name = "start_max_Identity_18"} : (tensor<*xf32>) -> tensor<1x1xf32>
      %324 = "onnx.Identity"(%318) {onnx_node_name = "start_max_Identity_19"} : (tensor<*xf32>) -> tensor<1x1xf32>
      onnx.Return %321, %322, %320, %323, %324 : tensor<1x1xf32>, tensor<1x1xf32>, tensor<1xf32>, tensor<1x1xf32>, tensor<1x1xf32>
    }) {input_names = ["start_max__v_subgraph", "start_max__i_subgraph", "start_max__counter_subgraph", "Log11393_Output_0_subgraph"], num_scan_inputs = 1 : si64, onnx_node_name = "Scan_20", output_names = ["start_max__v", "start_max__i", "start_max__counter", "start_max_value", "start_max_index"], scan_input_directions = [0], scan_output_directions = [0, 0]} : (tensor<1x1xf32>, tensor<1x1xf32>, tensor<1xf32>, tensor<*xf32>) -> (tensor<1x1xf32>, tensor<1x1xf32>, tensor<1xf32>, tensor<1x1xf32>, tensor<1x1xf32>)
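To see where the extra leading dimension on the scan outputs comes from, here is a minimal plain-Python re-implementation of Scan semantics with a running-max body like the subgraph above (scalars stand in for the 1x1 tensors; all names are illustrative, not onnx-mlir code):

```python
def running_max_body(v, i, counter, x):
    # Mirrors the Scan body from the issue: Greater / Where / Add / Identity.
    greater = x > v
    new_v = x if greater else v
    new_i = counter if greater else i
    # 3 updated states (v, i, counter), then 2 scan outputs (value, index).
    return new_v, new_i, counter + 1.0, new_v, new_i

def scan(init_states, sequence, body, n_states):
    # Minimal ONNX-Scan-like loop: forward direction, one scan input.
    states, stacked = list(init_states), None
    for x in sequence:                       # iterate over the sequence axis
        results = body(*states, x)           # body sees one slice per step
        states = list(results[:n_states])
        outs = results[n_states:]
        # Stacking the per-iteration outputs prepends the sequence axis.
        stacked = ([[o] for o in outs] if stacked is None
                   else [acc + [o] for acc, o in zip(stacked, outs)])
    return states, stacked

states, outs = scan([float("-inf"), -1.0, 0.0], [0.3, 0.9, 0.5],
                    running_max_body, 3)
```

Each stacked output has one more (sequence-length) dimension than the body's per-iteration output, which is why the scan outputs should be rank 3 when the body yields rank-2 values.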


negiyas commented 2 years ago

Results of other tests that may be related to the bidaf-9 compilation issue.

See the attached chart for details.

bidafScanIssue-20220525.pptx

negiyas commented 2 years ago

Found the following facts by investigating the Bidaf-9 model, the opsets of onnx.Scan, and onnx-mlir.

negiyas commented 1 year ago

Closing this issue because onnx.Scan is now supported by onnx-mlir.