microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Mobile] Same model returns different results between Java and Android environments #16472

Open zhangjh opened 1 year ago

zhangjh commented 1 year ago

Describe the issue

I tried using ONNX Runtime to run the chinese-clip model in an Android environment. Because I am familiar with Java, I wrote the code in Java first, and its result is the same as the official Python version's. When I moved the code to Android Studio, I found that the result is different. I have already checked the input tensor and the code dependencies; they are all the same. Below are my core code fragments for running the ONNX model.

        // Load the quantized model from the app resources and create a session.
        InputStream imgInput = this.getClass().getClassLoader().getResourceAsStream("img_quant.onnx");
        assert imgInput != null;
        OrtSession session = ortEnv.createSession(IOUtils.toByteArray(imgInput));
        // ...
        // Bind the tensor to the model's first input and run inference.
        Map<String, OnnxTensor> inputMap = new HashMap<>();
        String inputName = session.getInputNames().iterator().next();
        String outputName = session.getOutputNames().iterator().next();
        inputMap.put(inputName, inputTensor);
        OrtSession.Result result = session.run(inputMap);

The input tensors are the same (I saved the data and checked it with vimdiff): Android Studio / IDEA

The outputs are different: Android Studio / IDEA
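Since floating-point kernels can legitimately differ across platforms, a tolerance-based comparison helps tell small numerical drift apart from a genuine mismatch like the one reported here. Below is a minimal, hypothetical helper; the class name, the sample values, and the `1e-3` threshold are illustrative, not taken from the issue:

```java
// Minimal sketch: compare two inference outputs element-wise.
public class TensorDiff {
    // Returns the largest absolute element-wise difference between a and b.
    static float maxAbsDiff(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch: " + a.length + " vs " + b.length);
        }
        float max = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = Math.abs(a[i] - b[i]);
            if (d > max) max = d;
        }
        return max;
    }

    public static void main(String[] args) {
        // Illustrative values standing in for the saved desktop/Android outputs.
        float[] desktopOut = {0.123f, -0.456f, 0.789f};
        float[] androidOut = {0.1231f, -0.4555f, 0.789f};
        float diff = maxAbsDiff(desktopOut, androidOut);
        // Small drift suggests accumulated rounding from different kernels;
        // a large gap points at a real bug such as accumulator saturation.
        System.out.println(diff <= 1e-3f ? "within tolerance" : "mismatch");
    }
}
```

A check like this, run over the saved tensor dumps, is more robust than a byte-for-byte `vimdiff` of the printed values.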

And the core dependencies are the same:

com.microsoft.onnxruntime:onnxruntime-android:1.15.0
org.nd4j:nd4j-native-platform:1.0.0-M2.1

Any thoughts or comments on this problem?

To reproduce

See above for details

Urgency

No response

Platform

Android

OS Version

Android Studio emulator, arm64-v8a

ONNX Runtime Installation

Released Package

Compiler Version (if 'Built from Source')

No response

Package Name (if 'Released Package')

onnxruntime-android

ONNX Runtime Version or Commit ID

1.15.0

ONNX Runtime API

Java/Kotlin

Architecture

ARM64

Execution Provider

Default CPU

Execution Provider Library Version

No response

wejoncy commented 1 year ago

Thanks for reporting the issue.

Could you please share the model file here, or a Google Drive link to it? The saved input/output data in NumPy format would also be helpful for addressing this issue.

zhangjh commented 1 year ago

Attachments:

- The image model
- The image that was run with the model
- The inputTensor data from the run in the Java IDE (IDEA)
- The inputTensor data from the run in Android Studio
- The output data after running the model in the Java IDE (IDEA)
- The output data after running the model in Android Studio

Thank you for your help.

zhangjh commented 1 year ago

Here is my build.gradle configuration:

plugins {
    id 'com.android.application'
}

android {
    namespace 'me.zhangjh.smart.search'
    compileSdk 33

    defaultConfig {
        applicationId "me.zhangjh.smart.search"
        minSdk 26
        targetSdk 33
        versionCode 1
        versionName "1.0"
        resConfigs "zh","en"

        ndk {
            abiFilters 'armeabi-v7a', 'arm64-v8a'
        }

        testInstrumentationRunner "androidx.test.runner.AndroidJUnitRunner"
    }

    buildTypes {
        release {
            shrinkResources true
            minifyEnabled true
            proguardFiles getDefaultProguardFile('proguard-android-optimize.txt'), 'proguard-rules.pro'
        }
    }
    compileOptions {
        sourceCompatibility JavaVersion.VERSION_1_9
        targetCompatibility JavaVersion.VERSION_1_9
    }
//    dexOptions {
//        dexInProcess true
//        preDexLibraries true
//        javaMaxHeapSize "6g"
//    }
    testOptions {
        unitTests.includeAndroidResources = true
        unitTests.all {
            useJUnitPlatform()
        }
    }
    packagingOptions {
        exclude 'META-INF/DEPENDENCIES'
        exclude 'META-INF/NOTICE'
        exclude 'META-INF/LICENSE'
        exclude 'META-INF/LICENSE.txt'
        exclude 'META-INF/NOTICE.txt'

        exclude 'META-INF/native-image/linux-x86/jnijavacpp/jni-config.json'
        exclude 'META-INF/native-image/linux-x86_64/jnijavacpp/jni-config.json'
        exclude 'META-INF/native-image/linux-ppc64le/jnijavacpp/jni-config.json'
        exclude 'META-INF/native-image/linux-arm64/jnijavacpp/jni-config.json'
        exclude 'META-INF/native-image/linux-armhf/jnijavacpp/jni-config.json'
        exclude 'META-INF/native-image/android-x86_64/jnijavacpp/jni-config.json'
        exclude 'META-INF/native-image/android-x86/jnijavacpp/jni-config.json'
        exclude 'META-INF/native-image/android-arm/jnijavacpp/jni-config.json'
        exclude 'META-INF/native-image/android-arm64/jnijavacpp/jni-config.json'
        exclude 'META-INF/native-image/ios-x86_64/jnijavacpp/jni-config.json'
        exclude 'META-INF/native-image/ios-arm64/jnijavacpp/jni-config.json'
        exclude 'META-INF/native-image/windows-x86_64/jnijavacpp/jni-config.json'
        exclude 'META-INF/native-image/windows-x86/jnijavacpp/jni-config.json'
        exclude 'META-INF/native-image/macosx-arm64/jnijavacpp/jni-config.json'
        exclude 'META-INF/native-image/macosx-x86_64/jnijavacpp/jni-config.json'

        exclude 'META-INF/native-image/linux-x86/jnijavacpp/reflect-config.json'
        exclude 'META-INF/native-image/linux-ppc64le/jnijavacpp/reflect-config.json'
        exclude 'META-INF/native-image/linux-x86_64/jnijavacpp/reflect-config.json'
        exclude 'META-INF/native-image/linux-arm64/jnijavacpp/reflect-config.json'
        exclude 'META-INF/native-image/linux-armhf/jnijavacpp/reflect-config.json'
        exclude 'META-INF/native-image/android-x86_64/jnijavacpp/reflect-config.json'
        exclude 'META-INF/native-image/android-x86/jnijavacpp/reflect-config.json'
        exclude 'META-INF/native-image/android-arm/jnijavacpp/reflect-config.json'
        exclude 'META-INF/native-image/android-arm64/jnijavacpp/reflect-config.json'
        exclude 'META-INF/native-image/ios-x86_64/jnijavacpp/reflect-config.json'
        exclude 'META-INF/native-image/ios-arm64/jnijavacpp/reflect-config.json'
        exclude 'META-INF/native-image/windows-x86_64/jnijavacpp/reflect-config.json'
        exclude 'META-INF/native-image/windows-x86/jnijavacpp/reflect-config.json'
        exclude 'META-INF/native-image/macosx-x86_64/jnijavacpp/reflect-config.json'
        exclude 'META-INF/native-image/macosx-arm64/jnijavacpp/reflect-config.json'
    }
}

dependencies {

    implementation 'androidx.appcompat:appcompat:1.6.1'
    implementation 'com.google.android.material:material:1.9.0'
    implementation 'androidx.constraintlayout:constraintlayout:2.1.4'

    implementation 'com.microsoft.onnxruntime:onnxruntime-android:1.15.1'
    implementation 'org.nd4j:nd4j-native:1.0.0-M2.1:macosx-arm64'
    implementation 'org.nd4j:nd4j-native-platform:1.0.0-M2.1'

    implementation group: 'com.alibaba', name: 'fastjson', version: '2.0.34'

    androidTestImplementation 'junit:junit:4.13.2'
    androidTestImplementation 'androidx.test:runner:1.5.2'
    androidTestImplementation 'androidx.test.ext:junit:1.1.5'

    testImplementation 'junit:junit:4.13.2'
    testImplementation("org.junit.jupiter:junit-jupiter-api:5.7.0")
}

The versions of the core dependencies 'nd4j' and 'onnx' are the same as in the Java project.

Young-Flash commented 1 year ago

I also encountered the same issue with chinese-clip. I checked the input tensor and the model byte[] in memory on Android to make sure they are the same as in Java, but the result is always different (the Java inference result is correct). It confused me for a long time; I searched Google high and low but could not find the answer.

any update on this @zhangjh, @wejoncy ?

greyovo commented 1 year ago

Same here. But I found that converting the ONNX model to the ORT format and using the *.with_runtime_opt.ort version may narrow the result gap a bit (differences are still observed, but they are acceptable).

See here and here ....

I also observed that the quantized model exhibits this problem while the original (non-quantized) model does not.

zhangjh commented 1 year ago

I've forgotten how I solved the issue because a long time has passed. Maybe we can set this issue aside, since I found that it now works well in the Android environment; still, I wonder why there are differences between the Android and Java environments.

wejoncy commented 10 months ago

Hi @zhangjh @greyovo @Young-Flash, I checked each op's output carefully. The differences come from specialized optimizations on different platforms (Linux x86-64, Android arm64/arm32), and the error accumulates across layers. Does the error affect the final output in your end-to-end scenario?

A simple solution for now is to use NNAPI, which gives the same outputs as your Python use case, though it might be a bit slower.
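In the Java API, NNAPI can be enabled through `SessionOptions` before the session is created. The following is a configuration sketch, not a complete program; `modelBytes` is a hypothetical name standing in for the bytes loaded from `img_quant.onnx` in the issue's code:

```java
// Sketch: enable the NNAPI execution provider (Android only).
// Ops that NNAPI does not support fall back to the default CPU provider.
OrtEnvironment ortEnv = OrtEnvironment.getEnvironment();
OrtSession.SessionOptions options = new OrtSession.SessionOptions();
options.addNnapi();
OrtSession session = ortEnv.createSession(modelBytes, options);  // modelBytes: hypothetical
```

`addNnapi()` is provided by the onnxruntime-android package; on non-Android builds it throws an exception because the provider is not compiled in.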

If it ends up producing totally different outputs in tasks such as classification/detection, we will try to tweak the MatMul algorithm to improve the precision.

Young-Flash commented 10 months ago

> I checked each op's output carefully. The differences come from specialized optimizations on different platforms (Linux x86-64, Android arm64/arm32), and the error accumulates across layers.

Does it involve all the ops in the model, or just a specific one?

> Does the error affect the final output in your end-to-end scenario?

Yes, it does affect the final output in the end-to-end scenario (text-image matching); it makes the result unacceptable.

> If it ends up producing totally different outputs in tasks such as classification/detection, we will try to tweak the MatMul algorithm to improve the precision.

Thanks in advance.

wejoncy commented 10 months ago

> Does it involve all the ops in the model, or just a specific one?

MatMulInteger is the culprit.

Potential saturation might be the root cause.
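To illustrate how saturation could corrupt a quantized matmul: some int8 kernels accumulate products in a narrow intermediate register before widening, so large activation-weight products can clip. The following is a simplified, hypothetical Java demonstration, not ONNX Runtime's actual MatMulInteger kernel; it contrasts a saturating 16-bit accumulator with an exact 32-bit one:

```java
// Illustration of accumulator saturation in an int8 dot product, assuming
// a hypothetical kernel that sums uint8*int8 products in a saturating
// 16-bit accumulator versus a full 32-bit accumulator.
public class SaturationDemo {
    // Sum of products with a saturating signed 16-bit intermediate accumulator.
    static int dotSat16(int[] a, int[] b) {
        short acc = 0;
        for (int i = 0; i < a.length; i++) {
            int sum = acc + a[i] * b[i];
            // Clamp to the signed 16-bit range, mimicking saturating adds.
            if (sum > Short.MAX_VALUE) sum = Short.MAX_VALUE;
            if (sum < Short.MIN_VALUE) sum = Short.MIN_VALUE;
            acc = (short) sum;
        }
        return acc;
    }

    // Reference: full 32-bit accumulation, no saturation.
    static int dot32(int[] a, int[] b) {
        int acc = 0;
        for (int i = 0; i < a.length; i++) acc += a[i] * b[i];
        return acc;
    }

    public static void main(String[] args) {
        // Large uint8 activations times large int8 weights overflow 16 bits fast.
        int[] act = new int[8];
        int[] wgt = new int[8];
        java.util.Arrays.fill(act, 200);  // uint8 activation values
        java.util.Arrays.fill(wgt, 100);  // int8 weight values
        System.out.println("saturated: " + dotSat16(act, wgt));  // prints 32767
        System.out.println("exact:     " + dot32(act, wgt));     // prints 160000
    }
}
```

The clipped value bears no resemblance to the exact sum, which would explain why the quantized model's error is large enough to flip end-to-end results rather than just drift slightly.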

Please use the NNAPI EP as a temporary workaround if your scenario is urgent.

We will update this issue once we figure out a solution.