onnx / onnx-mlir

Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure
Apache License 2.0

Implement decomposition from `onnx.Sum` to sequence of `onnx.Add` #2964

Closed srcarroll closed 1 month ago

jenkins-droid commented 1 month ago

Can one of the admins verify this patch?

srcarroll commented 1 month ago

I can't find any current lowering path for onnx.Sum. I figured it would be best to add decomposition pattern here rather than do a direct conversion to stablehlo. I'm open to suggestions if this is not a desirable change.
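To illustrate the idea behind the PR (a conceptual sketch only, not the actual onnx-mlir pattern, which is written against the ONNX dialect in C++/TableGen): `onnx.Sum` is variadic while `onnx.Add` is binary, so an N-input Sum can be rewritten as a left-to-right chain of N-1 Adds, with the single-input case passing the operand through untouched.

```python
from functools import reduce

import numpy as np


def decompose_sum(inputs):
    # Hypothetical sketch of the decomposition: a variadic sum
    # becomes a left-to-right chain of binary adds.
    if len(inputs) == 1:
        # One-input case: no Add is emitted; the input passes through.
        return inputs[0]
    return reduce(np.add, inputs)
```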

jenkins-droid commented 1 month ago

Can one of the admins verify this patch?

jenkins-droid commented 1 month ago

Can one of the admins verify this patch?

srcarroll commented 1 month ago

just realized i messed up the one input case. fixing now

jenkins-droid commented 1 month ago

Can one of the admins verify this patch?

srcarroll commented 1 month ago

not sure what's up with the jenkins failures though

AlexandreEichenberger commented 1 month ago

Starting a new batch of CI, let me look at the PR soon.

srcarroll commented 1 month ago

> Starting a new batch of CI, let me look at the PR soon.

Thanks. I still need to figure out the test failure too. I thought fixing the one input case would have resolved it, but it didn't. I'm unfamiliar with the test that's failing, so I will have to dig.

AlexandreEichenberger commented 1 month ago

@srcarroll here is the log of the error for Linux amd64

/usr/lib/python3.10/subprocess.py:526: CalledProcessError
----------------------------- Captured stdout call -----------------------------
[1/6] Thu Oct  3 09:08:26 2024 (0s) Importing ONNX Model to MLIR Module from "test_sum_one_input.onnx"
[2/6] Thu Oct  3 09:08:26 2024 (0s) Compiling and Optimizing MLIR Module
----------------------------- Captured stderr call -----------------------------
error: failed to legalize unresolved materialization from ('tensor<?xf32>') to 'tensor<*xf32>' that remained live after conversion
=========================== short test summary info ============================
 Debug/test.py::OnnxBackendNodeModelTest::test_sum_one_input_cpu - subprocess...
1 failed, 521 passed, 2344 skipped in 84.49s (0:01:24)
make[3]: *** [test/backend/CMakeFiles/check-onnx-backend-dynamic.dir/build.make:71: test/backend/CMakeFiles/check-onnx-backend-dynamic] Error 1
make[2]: *** [CMakeFiles/Makefile2:14510: test/backend/CMakeFiles/check-onnx-backend-dynamic.dir/all] Error 2
make[2]: *** Waiting for unfinished jobs....
bringing up nodes...
bringing up nodes...

It feels like something is missing for the shape inference. If you convert the sum into a series of adds before shape inference, then shape inference should be able to operate normally. Maybe you need to place the pattern earlier than it currently is.

You can run the test directly in your env

TEST_CASE_BY_USER=test_sum_one_input_cpu make -j 12 check-onnx-backend-dynamic

You can add compiler options (for debugging) by defining the ONNX_MLIR_FLAGS env var

Hope this helps.

AlexandreEichenberger commented 1 month ago

Note on check-onnx-backend-dynamic: we force all the input sizes to be dynamic (-1) at compilation time and then run the model with the original sizes set at run time. That ensures that all of the ONNX tests run both with static shapes (as they are in the ONNX repo) and with dynamic shapes (as we force them to be here).
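Conceptually, the dynamic variant of the backend tests can be sketched like this (a rough illustration of the description above; the helper name is made up, and the real harness operates on ONNX model protos, not plain lists):

```python
def make_inputs_dynamic(shape):
    # Sketch of what the dynamic backend test does conceptually:
    # every static dimension is rewritten to -1 (dynamic) when the
    # model is compiled; the original concrete sizes are supplied
    # again when the model is run.
    return [-1 for _ in shape]
```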

srcarroll commented 1 month ago

@AlexandreEichenberger thanks for the pointers. I'm actually having a hard time setting up my environment to run the Python tests and haven't had time to look into fixing it, so I'm just guessing at the problem based on the error. I think this error would happen any time the result type of `onnx.Sum` is an unranked tensor but the inputs are ranked, hence the complaint about an unrealized cast from `tensor<?xf32>` to `tensor<*xf32>`. One solution I can think of is to cast the result of the final `onnx.Add` to the original `onnx.Sum` result type when they don't match. I'll push soon.

I was able to reproduce the error on an example IR and have added that case to the lit test to verify the fix.
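The fix described in this comment boils down to a type-reconciliation check. A hypothetical sketch (the helper name and string encoding of MLIR types are made up for illustration; whether the actual patch uses `onnx.Cast` or some other cast op is an assumption, not something stated in the thread):

```python
def reconcile_result_type(add_result_type, sum_result_type):
    # The final onnx.Add in the chain may carry a refined ranked type
    # (e.g. 'tensor<?xf32>') while the original onnx.Sum result was
    # unranked ('tensor<*xf32>'). Casting the Add result back to the
    # original type keeps all uses type-consistent and avoids the
    # "failed to legalize unresolved materialization" error in the log.
    if add_result_type != sum_result_type:
        return f"cast {add_result_type} -> {sum_result_type}"
    return "no cast needed"
```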

jenkins-droid commented 1 month ago

Can one of the admins verify this patch?

AlexandreEichenberger commented 1 month ago

@jenkins-droid test this please

AlexandreEichenberger commented 1 month ago

Yes, setting up is a bit tricky, mostly getting ONNX installed (I do a pip install -e third_party/onnx; protobuf sometimes causes problems). That is why you can also develop on the docker images that we publish, where everything should work smoothly.

AlexandreEichenberger commented 1 month ago

@jenkins-droid test this please

AlexandreEichenberger commented 1 month ago

(I think our Jenkins is having problems, folks are working on this)

AlexandreEichenberger commented 1 month ago

@Sunny-Anand mind doing a "test this please" on this PR once the Jenkins issue is fixed, thanks

Sunny-Anand commented 1 month ago

All checks have passed.

srcarroll commented 1 month ago

@AlexandreEichenberger thanks! yes please do merge.

AlexandreEichenberger commented 1 month ago

@srcarroll thanks for your contributions, much appreciated

jenkins-droid commented 1 month ago

Jenkins Linux amd64 Build #15775 [push] Implement decomposition ... started at 18:50

jenkins-droid commented 1 month ago

Jenkins Linux s390x Build #15778 [push] Implement decomposition ... started at 19:50

jenkins-droid commented 1 month ago

Jenkins Linux ppc64le Build #14805 [push] Implement decomposition ... started at 20:02

jenkins-droid commented 1 month ago

Jenkins Linux amd64 Build #15775 [push] Implement decomposition ... passed after 1 hr 18 min

jenkins-droid commented 1 month ago

Jenkins Linux s390x Build #15778 [push] Implement decomposition ... passed after 1 hr 42 min

jenkins-droid commented 1 month ago

Jenkins Linux ppc64le Build #14805 [push] Implement decomposition ... passed after 2 hr 17 min