microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License
13.71k stars 2.79k forks

[Build] Encountered error executing OPs on RISCV platform #20030

Open FZR95 opened 4 months ago

FZR95 commented 4 months ago

Describe the issue

I am trying to build onnxruntime for a RISCV target platform (LicheeRV Nano). I built libonnxruntime.so successfully with the cross compiler (riscv64-unknown-linux-musl), and the build reported no errors. The same test project and model run without errors on both a Raspberry Pi (ARM) and a PC, with normal model accuracy, but on the RISCV target platform the accuracy drops to 15%, which means the output is effectively completely wrong.

I have tried several build options, but none of them helped:
-Donnxruntime_DISABLE_CONTRIB_OPS=ON
-Donnxruntime_DONT_VECTORIZE=ON
-Donnxruntime_DISABLE_ABSEIL=ON

The model is a simple 3-layer CNN consisting of conv1d, relu and maxpool. I can at least confirm that the problem appears after the first convolutional block (I can't be sure it is the Conv op's fault).
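One way to narrow this down further is to dump the intermediate tensors on both platforms (for example by exposing them as extra graph outputs) and look for the first one that disagrees. The sketch below shows only the comparison step, using the standard library. The function name `first_divergence`, the tensor names, the values and the tolerances are all hypothetical and chosen for illustration; they do not come from the model in this issue.

```python
import math


def first_divergence(reference, device, rtol=1e-3, atol=1e-3):
    """Return (name, index, ref_value, dev_value) for the first mismatch, or None.

    `reference` and `device` are hypothetical dicts mapping a tensor name to a
    flat list of floats, inserted in graph order (dicts keep insertion order).
    """
    for name, ref_values in reference.items():
        dev_values = device[name]
        if len(ref_values) != len(dev_values):
            # Shape mismatch: report the two lengths instead of element values.
            return (name, None, len(ref_values), len(dev_values))
        for i, (r, d) in enumerate(zip(ref_values, dev_values)):
            # A NaN on either side counts as a mismatch as well.
            if math.isnan(r) or math.isnan(d) or abs(r - d) > atol + rtol * abs(r):
                return (name, i, r, d)
    return None


# Made-up values purely for illustration.
reference = {"conv1_out": [0.5, 1.25, -0.75], "relu1_out": [0.5, 1.25, 0.0]}
device = {"conv1_out": [0.5, 1.25, -0.75], "relu1_out": [0.5, 0.10, 0.0]}

print(first_divergence(reference, device))
```

With real dumps, the first tensor name returned would point at the op whose output first differs between the PC and the RISCV target, which is more specific than "somewhere after the first convolutional block".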

I am out of ideas at the moment, so I'd appreciate any suggestions!

Urgency

No response

Target platform

RISCV (LicheeRV Nano)

Build script

cmake ../cmake \
  -Donnxruntime_GCC_STATIC_CPP_RUNTIME=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -Donnxruntime_BUILD_SHARED_LIB=ON \
  -Donnxruntime_BUILD_UNIT_TESTS=OFF \
  -Donnxruntime_ENABLE_CPUINFO=OFF \
  -DCMAKE_TOOLCHAIN_FILE=linux_lichee_crosscompile_toolchain.cmake

linux_lichee_crosscompile_toolchain.cmake

SET(CMAKE_SYSTEM_NAME Linux)
SET(CMAKE_SYSTEM_VERSION 1)
set(tools "/opt/host-tools/gcc/riscv64-linux-musl-x86_64")
set(CMAKE_C_COMPILER ${tools}/bin/riscv64-unknown-linux-musl-gcc)
set(CMAKE_CXX_COMPILER ${tools}/bin/riscv64-unknown-linux-musl-g++)
set(CMAKE_SYSTEM_PROCESSOR riscv64)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -mcpu=c906fdv -march=rv64imafdcv0p7xthead -mcmodel=medany -mabi=lp64d")
SET(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
SET(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
SET(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
SET(CMAKE_FIND_ROOT_PATH_MODE_PACKAGE ONLY)

Error / output

RISCV platform:
ONNX Runtime version: 1.18.0
Input Node Name/Shape (0): input : 1x9x128
Output Node Name/Shape (0): output : -1x6
Accuracy: 15.23 %

PC platform:
ONNX Runtime version: 1.18.0
Input Node Name/Shape (0): input : 1x9x128
Output Node Name/Shape (0): output : -1x6
Accuracy: 86.09 %

Visual Studio Version

No response

GCC / Compiler Version

C++ compiler version : 10.2.0

snnn commented 4 months ago

Sorry, our team doesn't have access to this kind of hardware, so we cannot debug the issue. If you find out where the problem is, you are welcome to help us fix it.

NobuoTsukamoto commented 1 month ago

I'm having what seems to be a similar issue (v1.18.0).
This issue has been confirmed on VisionFive 2 and Yocto QEMU.

Target platform

Error / output

$ python3 label_image.py --model mobilenetv2-10.onnx --label ./synset.txt --image kitten.jpg 

Result: VisionFive 2 and Yocto QEMU

Inference result:
  class=n03868863 oxygen mask ; probability=6.158878
  class=n03045698 cloak ; probability=5.869998
  class=n03196217 digital clock ; probability=5.623226
  class=n03770439 miniskirt, mini ; probability=5.581253
  class=n04254680 soccer ball ; probability=5.542237

Expected behavior

The inference results should show tabby, tabby cat.
I get the correct results on a Raspberry Pi or x86 PC.
When I ran onnx_test_runner on Yocto QEMU, gemm_activation_fusion failed and the results were different from x86. gemm_activation_fusion does not fail on an x86 PC.

2024-06-12 06:59:47.394317600 [E:onnxruntime:Default, dataitem_request.cc:212 RunImpl] gemm_activation_fusion:output=z:expected 2.34025 (4015c6b4), got 0.389875 (3ec79db5), diff: 1.95038, tol=0.00334025 idx=4. 12 of 12 differ
2024-06-12 06:59:47.395344300 [E:onnxruntime:Default, testcase_request.cc:194 CalculateAndLogStats] gemm_activation_fusion: result differs. Dataset:/usr/share/onnxruntime/test/testdata/transform/gemm_activation_fusion/test_data_set_0
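
For anyone reading the log line above, the hex values in parentheses appear to be the raw float32 bit patterns of the expected and actual outputs. The sketch below decodes them with Python's `struct` module and recomputes the difference. The tolerance formula `atol + rtol * |expected|` with both set to 1e-3 is an assumption: it happens to reproduce the logged `tol` value, but it has not been checked against the onnx_test_runner source.

```python
import struct


def f32_from_hex(bits):
    """Interpret an 8-digit hex string as a big-endian IEEE 754 float32."""
    return struct.unpack(">f", bytes.fromhex(bits))[0]


# Bit patterns copied from the gemm_activation_fusion log line.
expected = f32_from_hex("4015c6b4")
got = f32_from_hex("3ec79db5")
diff = abs(expected - got)

# Assumption: tol = atol + rtol * |expected| with atol = rtol = 1e-3.
tol = 1e-3 + 1e-3 * abs(expected)

print(f"expected={expected:.6g} got={got:.6g} diff={diff:.6g} tol={tol:.6g}")
print("within tolerance:", diff <= tol)
```

The decoded values match the decimals printed in the log, so the mismatch is in the computed result itself and not in how the numbers were formatted. The difference is several hundred times larger than the tolerance.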