Open FZR95 opened 8 months ago
Sorry, our team doesn't have access to this kind of hardware, so we cannot debug the issue. If you know where the problem is, you are welcome to help us fix it.
I'm having what seems to be a similar issue (v1.18.0).
This issue has been confirmed on VisionFive 2 and Yocto QEMU.
$ python3 label_image.py --model mobilenetv2-10.onnx --label ./synset.txt --image kitten.jpg
Result: VisionFive 2 and Yocto QEMU
Inference result:
class=n03868863 oxygen mask ; probability=6.158878
class=n03045698 cloak ; probability=5.869998
class=n03196217 digital clock ; probability=5.623226
class=n03770439 miniskirt, mini ; probability=5.581253
class=n04254680 soccer ball ; probability=5.542237
Expected behavior
"tabby, tabby cat" should be displayed as the top inference result.
I get the correct results on a Raspberry Pi or x86 PC.
When I ran onnx_test_runner on Yocto QEMU, gemm_activation_fusion failed and the results were different from x86. gemm_activation_fusion does not fail on an x86 PC.
2024-06-12 06:59:47.394317600 [E:onnxruntime:Default, dataitem_request.cc:212 RunImpl] gemm_activation_fusion:output=z:expected 2.34025 (4015c6b4), got 0.389875 (3ec79db5), diff: 1.95038, tol=0.00334025 idx=4. 12 of 12 differ
2024-06-12 06:59:47.395344300 [E:onnxruntime:Default, testcase_request.cc:194 CalculateAndLogStats] gemm_activation_fusion: result differs. Dataset:/usr/share/onnxruntime/test/testdata/transform/gemm_activation_fusion/test_data_set_0
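For anyone reading the log line above: the hex values in parentheses are the raw float32 bit patterns of the expected and actual outputs. Below is a minimal sketch that decodes them and redoes the tolerance check. The rule `atol + rtol * |expected|` with both set to 1e-3 is an assumption on my part. It reproduces the `tol=0.00334025` in the log, but I have not confirmed it against the test runner source.

```python
import struct


def f32_from_hex(bits_hex: str) -> float:
    """Interpret an 8-digit hex string as a big-endian IEEE-754 float32."""
    return struct.unpack(">f", bytes.fromhex(bits_hex))[0]


def within_tolerance(expected: float, actual: float,
                     atol: float = 1e-3, rtol: float = 1e-3) -> bool:
    # Assumed rule (inferred from the logged tol value, not from the source):
    # |expected - actual| <= atol + rtol * |expected|
    return abs(expected - actual) <= atol + rtol * abs(expected)


# Bit patterns copied from the gemm_activation_fusion log line.
expected = f32_from_hex("4015c6b4")
actual = f32_from_hex("3ec79db5")
diff = abs(expected - actual)
tol = 1e-3 + 1e-3 * abs(expected)

print(f"expected={expected:.6g} actual={actual:.6g} diff={diff:.6g} tol={tol:.6g}")
print("within tolerance:", within_tolerance(expected, actual))
```

The difference is several hundred times larger than the tolerance, so this does not look like ordinary floating-point rounding noise between platforms.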
Describe the issue
I am trying to build onnxruntime for a RISCV target platform (LicheeRV Nano). I built libonnxruntime.so successfully with the cross compiler (riscv64-unknown-linux-musl), and no errors were reported. The same test project and model run without errors on both a Raspberry Pi (ARM) and a PC, where the model accuracy is normal, but on the RISCV target the accuracy drops to about 15%, which means the output is effectively wrong.
I have tried several build options, but none of them helped: -Donnxruntime_DISABLE_CONTRIB_OPS=ON -Donnxruntime_DONT_VECTORIZE=ON -Donnxruntime_DISABLE_ABSEIL=ON
The model is a simple 3-layer CNN including conv1d, relu and maxpool. I can at least confirm that the problem occurs after the first convolutional block (can't be sure it's the conv OP's fault).
I am out of ideas at the moment, so I'd appreciate it if you could give me some pointers!
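One way to narrow down where the outputs first go wrong is to expose every intermediate tensor as a graph output, run the same input on the PC and on the RISCV board, and compare the dumps in node order. The sketch below is only a rough outline of that idea. The helper names (`dump_node_outputs`, `first_divergence`) and the tensor names in the demo are hypothetical, the tolerances are arbitrary, and exposing intermediate tensors may stop some operator fusions from being applied, so the instrumented model can behave differently from the original one.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def dump_node_outputs(model_path: str, feeds: dict) -> Tuple[List[str], Dict[str, list]]:
    """Run the model with every node output exposed; return (node order, values)."""
    # Imported here so the comparison helper below works without these packages.
    import onnx
    import onnxruntime as ort

    model = onnx.load(model_path)
    existing = {o.name for o in model.graph.output}
    # graph.node is topologically sorted, so this list follows execution order.
    order = [out for node in model.graph.node for out in node.output if out]
    for name in order:
        if name not in existing:
            model.graph.output.add().name = name
    sess = ort.InferenceSession(model.SerializeToString(),
                                providers=["CPUExecutionProvider"])
    values = sess.run(order, feeds)
    return order, {n: v.ravel().tolist() for n, v in zip(order, values)}


def first_divergence(order: Sequence[str],
                     reference: Dict[str, list],
                     target: Dict[str, list],
                     atol: float = 1e-3,
                     rtol: float = 1e-3) -> Optional[str]:
    """Return the first tensor name (in node order) whose values disagree."""
    for name in order:
        ref, tgt = reference[name], target[name]
        if len(ref) != len(tgt):
            return name
        for r, t in zip(ref, tgt):
            if math.isnan(t) or abs(r - t) > atol + rtol * abs(r):
                return name
    return None


# Toy demo with made-up tensor names and values (not real model output).
pc_dump = {"conv1_out": [1.0, 2.0], "relu1_out": [1.0, 2.0], "pool1_out": [2.0]}
board_dump = {"conv1_out": [1.0, 2.0005], "relu1_out": [1.0, 0.3], "pool1_out": [0.3]}
print(first_divergence(["conv1_out", "relu1_out", "pool1_out"], pc_dump, board_dump))
```

In practice each machine would call `dump_node_outputs` on the same model and input, save the result (for example as JSON), and the two files would then be compared on one machine with `first_divergence`.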
Urgency
No response
Target platform
RISCV (LicheeRV Nano)
Build script
cmake ../cmake \
  -Donnxruntime_GCC_STATIC_CPP_RUNTIME=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -Donnxruntime_BUILD_SHARED_LIB=ON \
  -Donnxruntime_BUILD_UNIT_TESTS=OFF \
  -Donnxruntime_ENABLE_CPUINFO=OFF \
  -DCMAKE_TOOLCHAIN_FILE=linux_lichee_crosscompile_toolchain.cmake
linux_lichee_crosscompile_toolchain.cmake
SET(CMAKE_SYSTEM_NAME Linux)
SET(CMAKE_SYSTEM_VERSION 1)
set(tools "/opt/host-tools/gcc/riscv64-linux-musl-x86_64")
set(CMAKE_C_COMPILER ${tools}/bin/riscv64-unknown-linux-musl-gcc)
set(CMAKE_CXX_COMPILER ${tools}/bin/riscv64-unknown-linux-musl-g++)
set(CMAKE_SYSTEM_PROCESSOR riscv64)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -mcpu=c906fdv -march=rv64imafdcv0p7xthead -mcmodel=medany -mabi=lp64d")
SET(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
SET(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
SET(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
SET(CMAKE_FIND_ROOT_PATH_MODE_PACKAGE ONLY)
Error / output
RISCV platform:
ONNX Runtime version: 1.18.0
Input Node Name/Shape (0): input : 1x9x128
Output Node Name/Shape (0): output : -1x6
Accuracy: 15.23 %

PC platform:
ONNX Runtime version: 1.18.0
Input Node Name/Shape (0): input : 1x9x128
Output Node Name/Shape (0): output : -1x6
Accuracy: 86.09 %
Visual Studio Version
No response
GCC / Compiler Version
C++ compiler version : 10.2.0