onnx / onnx-mlir

Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure
Apache License 2.0

Cannot compile bidaf-9.onnx in the model zoo. #1442

Closed negiyas closed 1 year ago

negiyas commented 2 years ago

The latest main branch of onnx-mlir cannot compile bidaf-9.onnx from the ONNX model zoo. The following errors occur during compilation.

$ wget https://github.com/onnx/models/blob/main/text/machine_comprehension/bidirectional_attention_flow/model/bidaf-9.onnx?raw=true -O bidaf-9.onnx
...
$ ./build/Debug/bin/onnx-mlir bidaf-9.onnx
Warning: ONNX CategoryMapper in your model is using Opset 1, which is quite old. Please consider regenerating your model with a newer Opset.
Warning: ONNX CategoryMapper in your model is using Opset 1, which is quite old. Please consider regenerating your model with a newer Opset.
Warning: ONNX CategoryMapper in your model is using Opset 1, which is quite old. Please consider regenerating your model with a newer Opset.
Warning: ONNX CategoryMapper in your model is using Opset 1, which is quite old. Please consider regenerating your model with a newer Opset.
error: Invalid axis value
error: shape inference failed

The result of visualizing the model with https://netron.app/ is as follows.

[image: Netron view around the failing Squeeze_28 node]

 node {
    input: "start_max_index"
    output: "Squeeze_28"
    name: "Squeeze_28"
    op_type: "Squeeze"
    attribute {
      name: "axes"
      ints: 2
      type: INTS
    }
  }

The error comes from an onnx.Squeeze op. According to the shape-inference results, the input rank is 2, so onnx.Squeeze with axes=2 is invalid (the axis must be less than 2).
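The axis-validity rule that triggers the error can be sketched in plain Python (a simplified check mirroring the ONNX Squeeze specification; the function name is illustrative, not onnx-mlir's actual code):

```python
def squeeze_axes_valid(input_rank, axes):
    # ONNX Squeeze: every axis must lie in [-r, r-1] for an input of rank r,
    # so axes=[2] is out of range once the input is inferred to be rank 2.
    return all(-input_rank <= a < input_rank for a in axes)

squeeze_axes_valid(3, [2])  # valid for the rank-3 tensor the graph promises
squeeze_axes_valid(2, [2])  # invalid for the rank-2 tensor shape inference sees
```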

It seems that the mismatch comes from the preceding onnx.Scan op in the graph. The output of the onnx.Scan op should have shape cx1x1 according to the graph, but shape inference reports rank 2, which is inconsistent. (The input of the onnx.Scan op has shape cx1x1 in the graph, and shape inference reports rank 3, which is consistent.)
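The rank bookkeeping for onnx.Scan can be sketched as follows (a simplified model of the opset-9 Scan shape rules, assuming one scan input and default axes/directions; the function and -1 dynamic-dimension convention are illustrative):

```python
def scan_shapes(scan_input_shape, body_scan_output_shape):
    # ONNX Scan (opset >= 9, one scan input, default axes/directions):
    # the body sees the scan input with its leading sequence axis removed,
    # and each scan output gains that sequence axis back when the
    # per-iteration results are stacked.
    seq_len = scan_input_shape[0]
    body_input_shape = scan_input_shape[1:]
    scan_output_shape = (seq_len,) + tuple(body_scan_output_shape)
    return body_input_shape, scan_output_shape

# Scan input cx1x1 (c dynamic, written -1): body sees 1x1, and a 1x1
# body scan output should stack back to a rank-3 cx1x1 result.
body_in, scan_out = scan_shapes((-1, 1, 1), (1, 1))
```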

negiyas commented 2 years ago

It seems that the imported onnx.Scan op is incorrect.

The netron.app outputs ( https://netron.app/ ) of the onnx.Scan op ("Scan_20") are as follows. [image: Netron view of Scan_20]

I generated an MLIR file with ./build/Debug/bin/onnx-mlir bidaf-9.onnx --EmitONNXBasic. The imported onnx.Scan op in the generated bidaf-9.onnx.mlir file is shown below; it differs from the original model in two ways.

  1. The fourth input type should not be tensor<1x1xf32>, but tensor<-1x1x1xf32> (rank 3 with a dynamic leading dimension).
  2. The fourth and fifth output types should not be tensor<1x1xf32>, but tensor<-1x1x1xf32>.
    %302:5 = "onnx.Scan"(%299, %300, %301, %298) ({
    ^bb0(%arg4: tensor<1x1xf32>, %arg5: tensor<1x1xf32>, %arg6: tensor<1xf32>, %arg7: tensor<1x1xf32>):
      %316 = "onnx.Greater"(%arg7, %arg4) {onnx_node_name = "start_max_Greater_12"} : (tensor<1x1xf32>, tensor<1x1xf32>) -> tensor<*xi1>
      %317 = "onnx.Where"(%316, %arg7, %arg4) {onnx_node_name = "start_max_Where_13"} : (tensor<*xi1>, tensor<1x1xf32>, tensor<1x1xf32>) -> tensor<*xf32>
      %318 = "onnx.Where"(%316, %arg6, %arg5) {onnx_node_name = "start_max_Where_14"} : (tensor<*xi1>, tensor<1xf32>, tensor<1x1xf32>) -> tensor<*xf32>
      %319 = "onnx.Constant"() {value = dense<1.000000e+00> : tensor<1xf32>} : () -> tensor<1xf32>
      %320 = "onnx.Add"(%arg6, %319) {onnx_node_name = "start_max_Add_15"} : (tensor<1xf32>, tensor<1xf32>) -> tensor<1xf32>
      %321 = "onnx.Identity"(%317) {onnx_node_name = "start_max_Identity_16"} : (tensor<*xf32>) -> tensor<1x1xf32>
      %322 = "onnx.Identity"(%318) {onnx_node_name = "start_max_Identity_17"} : (tensor<*xf32>) -> tensor<1x1xf32>
      %323 = "onnx.Identity"(%317) {onnx_node_name = "start_max_Identity_18"} : (tensor<*xf32>) -> tensor<1x1xf32>
      %324 = "onnx.Identity"(%318) {onnx_node_name = "start_max_Identity_19"} : (tensor<*xf32>) -> tensor<1x1xf32>
      onnx.Return %321, %322, %320, %323, %324 : tensor<1x1xf32>, tensor<1x1xf32>, tensor<1xf32>, tensor<1x1xf32>, tensor<1x1xf32>
    }) {input_names = ["start_max__v_subgraph", "start_max__i_subgraph", "start_max__counter_subgraph", "Log11393_Output_0_subgraph"], num_scan_inputs = 1 : si64, onnx_node_name = "Scan_20", output_names = ["start_max__v", "start_max__i", "start_max__counter", "start_max_value", "start_max_index"], scan_input_directions = [0], scan_output_directions = [0, 0]} : (tensor<1x1xf32>, tensor<1x1xf32>, tensor<1xf32>, tensor<*xf32>) -> (tensor<1x1xf32>, tensor<1x1xf32>, tensor<1xf32>, tensor<1x1xf32>, tensor<1x1xf32>)
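To see where the extra leading dimension on the scan outputs comes from, here is a minimal plain-Python re-implementation of Scan semantics with a running-max body like the subgraph above (scalars stand in for the 1x1 tensors; all names are illustrative, not onnx-mlir code):

```python
def running_max_body(v, i, counter, x):
    # Mirrors the Scan body from the issue: Greater / Where / Add / Identity.
    greater = x > v
    new_v = x if greater else v
    new_i = counter if greater else i
    # 3 updated states (v, i, counter), then 2 scan outputs (value, index).
    return new_v, new_i, counter + 1.0, new_v, new_i

def scan(init_states, sequence, body, n_states):
    # Minimal ONNX-Scan-like loop: forward direction, one scan input.
    states, stacked = list(init_states), None
    for x in sequence:                       # iterate over the sequence axis
        results = body(*states, x)           # body sees one slice per step
        states = list(results[:n_states])
        outs = results[n_states:]
        # Stacking the per-iteration outputs prepends the sequence axis.
        stacked = ([[o] for o in outs] if stacked is None
                   else [acc + [o] for acc, o in zip(stacked, outs)])
    return states, stacked

states, outs = scan([float("-inf"), -1.0, 0.0], [0.3, 0.9, 0.5],
                    running_max_body, 3)
```

Each stacked output has one more (sequence-length) dimension than the body's per-iteration output, which is why the scan outputs should be rank 3 when the body yields rank-2 values.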


negiyas commented 2 years ago

Results of other tests that may be related to the bidaf-9 compilation issue.

See the attached chart for details.

bidafScanIssue-20220525.pptx

negiyas commented 2 years ago

Found the following facts by investigating the Bidaf-9 model, the opsets of onnx.Scan, and onnx-mlir.

negiyas commented 1 year ago

Closing this issue because onnx.Scan is now supported by onnx-mlir.