onnx / onnx-mlir

Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure
Apache License 2.0
770 stars 321 forks source link

Re-enable the convolution based models and remove nnpa-saturation=true #2959

Closed hamptonm1 closed 1 month ago

hamptonm1 commented 1 month ago

Based on the commentary from @negiyas and the following PR: https://github.com/onnx/onnx-mlir/issues/2817

It appears that the convolution models cannot be enabled with the --dimParams option and needs to be disabled. This essentially means the test will not be added to the dynamic shape test but will still run as expected for the other checks. With the following update, nnpa-saturation does not need to be changed to true but it can remain as the default which is false.

AlexandreEichenberger commented 1 month ago

IMO, we want to continue testing resnet and others with dynamic sizes. Didn't they work in the past? Is there a regression?

For convolution based tests, typically the size of the filter is known, but maybe not the size of the image or batch size. Is it a case where we set too much as unknown?

hamptonm1 commented 1 month ago

IMO, we want to continue testing resnet and others with dynamic sizes. Didn't they work in the past? Is there a regression?

For convolution based tests, typically the size of the filter is known, but maybe not the size of the image or batch size. Is it a case where we set too much as unknown?

@AlexandreEichenberger They never worked in the past with dynamic sizes. Please see the PR when @negiyas introduced dimParams : https://github.com/onnx/onnx-mlir/pull/2781/files#diff-b11532abe6cd5ac912cbc97cc1d62432aa93a6ff1045324b908cfaf3309e5b2f. The tests were commented out in this PR which indicates it never worked for dynamic shape. Before dimParams was introduced it worked. The PR referenced above shows if there are too many unknowns, the backend tests will fail with some sort of runtime error. Not all backend tests can enable the dimParams option.

AlexandreEichenberger commented 1 month ago

I just ran

make -j 12 check-onnx-backend-dynamic
[...]
.sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss[1/6] Wed Oct  2 15:55:17 2024 (0s) Importing ONNX Model to MLIR Module from "resnet50.onnx"
[2/6] Wed Oct  2 15:55:17 2024 (0s) Compiling and Optimizing MLIR Module
[3/6] Wed Oct  2 15:55:19 2024 (2s) Translating MLIR Module to LLVM and Generating LLVM Optimized Bitcode
[4/6] Wed Oct  2 15:55:21 2024 (4s) Generating Object from LLVM Bitcode
[5/6] Wed Oct  2 15:55:24 2024 (7s) Linking and Generating the Output Shared Library
[6/6] Wed Oct  2 15:55:24 2024 (7s) Compilation completed
.sssssssssssssssssssssssssssssssssssssssssssssssssssssss
----------------------------------------------------------------------
Ran 2894 tests in 229.174s

OK (skipped=2361)

so resnet runs successfully on Mac

hamptonm1 commented 1 month ago

I just ran

make -j 12 check-onnx-backend-dynamic
[...]
.sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss[1/6] Wed Oct  2 15:55:17 2024 (0s) Importing ONNX Model to MLIR Module from "resnet50.onnx"
[2/6] Wed Oct  2 15:55:17 2024 (0s) Compiling and Optimizing MLIR Module
[3/6] Wed Oct  2 15:55:19 2024 (2s) Translating MLIR Module to LLVM and Generating LLVM Optimized Bitcode
[4/6] Wed Oct  2 15:55:21 2024 (4s) Generating Object from LLVM Bitcode
[5/6] Wed Oct  2 15:55:24 2024 (7s) Linking and Generating the Output Shared Library
[6/6] Wed Oct  2 15:55:24 2024 (7s) Compilation completed
.sssssssssssssssssssssssssssssssssssssssssssssssssssssss
----------------------------------------------------------------------
Ran 2894 tests in 229.174s

OK (skipped=2361)

so resnet runs successfully on Mac

Then why do these test fail if we are not using nnpa-saturation = true which includes resent?

hamptonm1 commented 1 month ago

These tests have been commented out since April :

    #test_inception_v1_cpu,zdnn_conv2d
    #test_resnet50_cpu,zdnn_conv2d
    #test_shufflenet_cpu,zdnn_matmul_op
    #test_squeezenet_cpu,zdnn_conv
    #test_vgg19_cpu,zdnn_conv
tungld commented 1 month ago

check-onnx-backend-dynamic is not for NNPA. It would be check-onnx-backend-dynamic-nnpa

imaihal commented 1 month ago

In my z16 environment, without --nnpa-satuation=true, I got errors about range violation both static and dynamic tests. With --nnpa-satuation=true, it seems both static and dynamic tests pass.

hamptonm1 commented 1 month ago

In my z16 environment, without --nnpa-satuation=true, I got errors about range violation both static and dynamic tests. With --nnpa-satuation=true, it seems both static and dynamic tests pass.

  • Static test TEST_CASE_BY_USER=test_inception_v1_cpu,test_resnet50_cpu,test_shufflenet_cpu,test_squeezenet_cpu,test_vgg19_cpu make -j5 check-onnx-backend-nnpa
  • Dynamic test TEST_CASE_BY_USER=test_inception_v1_cpu,test_resnet50_cpu,test_shufflenet_cpu,test_squeezenet_cpu,test_vgg19_cpu make -j5 check-onnx-backend-nnpa-dyamic

@imaihal Interesting! Thanks so much.... @AlexandreEichenberger did you test test_inception_v1_cp seems like that fails without nnpa-saturation .

hamptonm1 commented 1 month ago

I will close this PR given @AlexandreEichenberger feedback. He wants to keep the nnpa-saturation=true and further debug why the convolution backend tests are not working.