Open sarmentow opened 1 month ago
Okay, just ran the tests in onnxruntime/test/python/onnxruntime_test_python.py and we're failing test_memory_arena_shrinkage, test_run_model2, test_run_model2_contiguous, and test_run_model_symbolic_input.
All of the test_run tests fail with the same error, which appears to corroborate what the issue says about getting very different results:
AssertionError:
Not equal to tolerance rtol=1e-05, atol=1e-08
Mismatched elements: 3 / 3 (100%)
Max absolute difference: 12.
Max relative difference: 4.
Will look into it later.
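For readers unfamiliar with this report format, here is a minimal pure-Python sketch of the element-wise check behind it, assuming the usual rule that an element passes when |actual - desired| <= atol + rtol * |desired|. The helper name `allclose_report` and the sample values are hypothetical; the values were picked only to give the same summary figures as above and are not the real test data.

```python
def allclose_report(actual, desired, rtol=1e-05, atol=1e-08):
    """Summarize an element-wise tolerance check over two flat sequences."""
    if len(actual) != len(desired):
        raise ValueError("sequences must have the same length")
    abs_diffs = [abs(a - d) for a, d in zip(actual, desired)]
    # An element passes when |actual - desired| <= atol + rtol * |desired|.
    mismatched = sum(
        diff > atol + rtol * abs(d) for diff, d in zip(abs_diffs, desired)
    )
    # Relative difference is measured against the desired value; zeros are skipped.
    rel_diffs = [diff / abs(d) for diff, d in zip(abs_diffs, desired) if d != 0]
    return {
        "mismatched": mismatched,
        "total": len(desired),
        "max_abs_diff": max(abs_diffs, default=0.0),
        "max_rel_diff": max(rel_diffs, default=0.0),
    }


# Hypothetical values, NOT taken from the failing tests.
desired = [1.0, 2.0, 3.0]
actual = [5.0, 10.0, 15.0]
report = allclose_report(actual, desired)
print(report)
```

A relative difference of 4 means the result is off by several times the expected magnitude, which points to a wrong computation rather than floating-point rounding noise.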
I'm experiencing the same issue when trying to use the exported ONNX model from: https://github.com/IBM/ai-on-z-fraud-detection/blob/main/ccf_220_keras_lstm_static-OS.ipynb
The model runs fine on x86 architecture, but on RISC-V, it produces completely incorrect outputs using the same code and input data.
Describe the issue
We have successfully built ONNX Runtime Python wheels targeting the RISC-V architecture, using both the cross-compilation process outlined in the documentation and an emulated RISC-V Docker container running Ubuntu 22.04. Both builds completed without errors.
However, after training a model in PyTorch and exporting it to the ONNX format, we observed that the inference results from the ONNX Runtime Python package vary significantly across platforms. Specifically, the results from the RISC-V wheels we built (both the cross-compiled and the emulated versions) do not match the expected outputs seen from running inference in PyTorch before the ONNX export, nor do they match the outputs produced by the ONNX Runtime x64 wheel on the same model.
This leads us to believe that the issue lies in the ONNX Runtime's support for RISC-V.
Example Outputs
To illustrate the discrepancy, after training a PyTorch model, we get the following outputs for the input [0] when using the pre-built ONNX Runtime wheels for x64:
In contrast, the output from the RISC-V wheel for the same model and input is:
Both outputs are from the same model, using the same input, highlighting the inconsistency.
Investigation
Through extensive troubleshooting, we have identified that this discrepancy occurs specifically when using torch.nn.Linear layers. Basic arithmetic operators (e.g., +, -, *, /) do not cause any issues. Furthermore, saving the model in PyTorch's .pth format and running inference in the RISC-V environment works as expected, so we are fairly sure the problem lies in ONNX Runtime's handling of RISC-V rather than in the model itself.
Reproduction
We have included the PyTorch training code, the Dockerfile for the build environment, and the scripts used to compare inference results between the platforms below.
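Independently of those scripts, a single linear layer can be checked on each platform against a dependency-free reference. The sketch below is only an illustration under stated assumptions: `linear_reference` and `run_onnx` are hypothetical helper names, the weights are made up, and it assumes the exported model has one float32 input and that its parameters match the ones passed to the reference. It uses the standard relation y = x W^T + b with the weight stored as [out_features][in_features].

```python
def linear_reference(x, weight, bias):
    """Compute y = x @ weight.T + bias on nested lists.

    x: [batch][in_features]
    weight: [out_features][in_features]
    bias: [out_features]
    """
    return [
        [sum(xi * wi for xi, wi in zip(row, w_row)) + b
         for w_row, b in zip(weight, bias)]
        for row in x
    ]


def run_onnx(model_path, x):
    """Run a single-input ONNX model on the CPU provider; return the first output."""
    # Imported lazily so the reference above works without these packages.
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: np.asarray(x, dtype=np.float32)})
    return outputs[0].tolist()


# Hypothetical parameters for illustration only.
weight = [[0.5, -1.0], [2.0, 0.25]]
bias = [0.5, -0.25]
x = [[1.0, 2.0]]
expected = linear_reference(x, weight, bias)
print(expected)
```

On each platform, the output of `run_onnx` for a model exported with these same parameters could then be compared to `expected` within a small tolerance. If x64 agrees with the reference and RISC-V does not, the fault would appear to lie in the runtime's matrix-multiply path rather than in the export.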
Model Training Code
Dockerfile for Build Environment
We also built CMake from source to meet the version requirements for building ONNX Runtime. I have pushed an image to Docker Hub with CMake 3.30 installed to save you the hassle:
docker pull sarmentow/onnxruntime-build-env-with-cmake
ONNX Runtime Inference Comparison Code
Build Process
We used the following command to build the ONNX Runtime wheel for RISC-V (the build.py file is at tools/ci_build/build.py):
Testing Environment
We utilized the following Dockerfile for the testing environment:
Dependencies
We installed ONNX Runtime and other necessary packages as listed in the requirements.txt file:
We used an alternative pip index to install NumPy RISC-V wheels, which we believe are not causing the issue.
Conclusion
Based on our testing, the issue appears to be specific to ONNX Runtime's support for RISC-V, particularly when using certain layers such as torch.nn.Linear. One linear layer is enough to see the discrepancies between platforms. We hope this information helps in diagnosing the problem, and we are happy to assist further if needed.
Thank you for your attention to this matter. We look forward to your insights.
Urgency
The issue is urgent as my team depends on this functionality to ship a project this week. We'd be extremely grateful for some attention on this.
Target platform
RISC-V
Build script
We used the following command to build the ONNX Runtime wheel for RISC-V (the build.py file is at tools/ci_build/build.py):
Inside a container running the Docker image at sarmentow/onnxruntime-build-env-with-cmake
Error / output
To illustrate the discrepancy, after training a PyTorch model, we get the following outputs for the input [0] when using the pre-built ONNX Runtime wheels for x64:
In contrast, the output from the RISC-V wheel for the same model and input is:
Both outputs are from the same model, using the same input, highlighting the inconsistency.
Visual Studio Version
No response
GCC / Compiler Version
11.4.0