microsoft / onnxruntime

Mobilenetv2 classification result is wrong #2150

Closed: yizhaoyanbo closed this issue 4 years ago

yizhaoyanbo commented 5 years ago

Describe the bug I have trained a mobilenetv2 classification model using pytorch 1.1 and exported it to an onnx model. I have tried to do the prediction using opencv4's dnn module and the result is correct. But I get the wrong result using the onnxruntime C++ CPU API with the same image and the same preprocessing.

Urgency none

System information

To Reproduce

  1. load the onnx model
  2. preprocess the input image (convert to float, normalize with mean and std)
  3. run the prediction (input in RGB format, 3x224x224; output tensor with 4 values)
  4. take the output tensor and apply softmax to obtain the final result (see the sketch below)
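
For reference, a minimal Python sketch of these four steps using the onnxruntime Python API is below; the 1/255 scaling, the ImageNet mean/std values and the .onnx file name are assumptions on my side, the actual C++ preprocessing is in the attached zip.

    # Minimal sketch of steps 1-4 with the onnxruntime Python API.
    # Assumptions: 1/255 scaling, ImageNet mean/std, model file name.
    import numpy as np
    import onnxruntime as rt
    from PIL import Image

    def preprocess(path):
        img = Image.open(path).convert('RGB').resize((224, 224), Image.BILINEAR)
        x = np.asarray(img).astype('float32') / 255.0                 # HWC, RGB, [0, 1]
        mean = np.array([0.485, 0.456, 0.406], dtype='float32')       # assumed ImageNet stats
        std = np.array([0.229, 0.224, 0.225], dtype='float32')
        x = (x - mean) / std
        return np.transpose(x, (2, 0, 1))[np.newaxis, ...]            # 1x3x224x224

    def softmax(v):
        e = np.exp(v - np.max(v))                                     # subtract max for stability
        return e / e.sum()

    sess = rt.InferenceSession('mobilenetv2_pov4_test.onnx')          # 1. load the onnx model
    x = preprocess('bike.jpg')                                        # 2. preprocess the image
    logits = sess.run(None, {sess.get_inputs()[0].name: x})[0][0]     # 3. run the prediction
    print(softmax(logits))                                            # 4. softmax over the 4 scores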

Expected behavior I set "bike.jpg" as the input image. The correct result is the "arm" label with confidence 0.99, but onnxruntime's result is the "arm" label with confidence 0.0008.

Screenshots I have put the model, the label file and the result screenshots in the following .zip file.

Additional context resources.zip

Thanks.

faxu commented 5 years ago

Have you tried exporting the model from PyTorch 1.2? CC @spandantiwari

yizhaoyanbo commented 5 years ago

Have you tried exporting the model from PyTorch 1.2? CC @spandantiwari

I have updated the pytorch version from 1.1 to 1.2 and re-exported the onnx model. But onnxruntime crashed when calling the session.Run() API:

        // score model & input tensor, get back output tensor
        output_tensors = session.Run(Ort::RunOptions{ nullptr }, input_node_names.data(), &input_tensor, 1, output_node_names.data(), 1);

The onnx model exported from pytorch 1.2 is in the attached zip file: model_pytorch12.zip

hariharans29 commented 5 years ago

Can you please share your full scoring program? What does input_tensor look like? What shape is it? What pre-processing is required for this model? Did ORT return any error message and fail while running Run()?

As an aside, could you please try running with the 1.0 release to see if some bug that was fixed as part of that release solves this issue? Thanks!
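
As a side note on the questions above, the model's declared input/output names and shapes can be read directly from an InferenceSession in Python, which is a quick way to check that the C++ input_tensor matches what the model expects (a sketch; the .onnx file name from the shared zip is an assumption):

    # Print the model's declared input/output names, shapes and element types.
    # Substitute the .onnx file name from the shared zip.
    import onnxruntime as rt

    sess = rt.InferenceSession('mobilenetv2_pov4_test.onnx')
    for i in sess.get_inputs():
        print('input :', i.name, i.shape, i.type)
    for o in sess.get_outputs():
        print('output:', o.name, o.shape, o.type)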

yizhaoyanbo commented 5 years ago

sample_classify.zip

I have updated pytorch to version 1.3 and onnx to version 1.6.0 and re-exported the onnx model. It crashed at the following code when using the onnxruntime v1.0 cpu version:

    printf("Using Onnxruntime C++ API\n");
    Ort::Session session(env, model_path, session_options); //!!!! crashed here

I have put the onnx model and sample code to the zip file.

pytorch version: 1.3, onnx version: 1.6.0, onnxruntime version: 1.0 (CPU), OS: Windows 10
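
Before digging into the C++ load crash, a low-cost check is to run the ONNX checker over the re-exported file to rule out a structurally broken export; a minimal sketch (the .onnx file name inside the attached zip is an assumption):

    # Structural validation of the re-exported model with the onnx checker.
    # The .onnx file name inside the attached zip is an assumption.
    import onnx

    model = onnx.load('mobilenetv2_pov4_test.onnx')
    onnx.checker.check_model(model)                  # raises if the graph is malformed
    print('IR version:', model.ir_version)
    print('opsets:', [(o.domain, o.version) for o in model.opset_import])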

hariharans29 commented 5 years ago

Thanks for sharing. I will take a look.

hariharans29 commented 5 years ago

Where can I get bike.jpg? truth.jpg won't work, as it contains other content besides bike.jpg...

Also, as a simple sanity check, I tried loading the latest model with all optimizers enabled (just like in your C++ code) and creating a session in Python, and the model loads fine. Let me see if I can get the right results in Python once you share bike.jpg, and then we can see what happens in C++...

hariharans29 commented 5 years ago

Also - did you try importing the same ONNX model into the OpenCV DNN module too? If the same ONNX model imported via OpenCV yields different results than ORT, then it may be an ORT issue. Otherwise, this could be a model export issue. I looked at the model and it seems pretty simple - not too many layers - so I am wondering if this could be a PyTorch export issue wherein the graph is not structurally correct...
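
One way to localize the divergence is to feed the exact same preprocessed blob to both the OpenCV DNN importer and ORT and compare the raw outputs; a minimal sketch, where a random blob stands in for the real preprocessed image:

    # Feed an identical 1x3x224x224 float32 blob to OpenCV DNN and ORT and compare raw outputs.
    import cv2
    import numpy as np
    import onnxruntime as rt

    blob = np.random.rand(1, 3, 224, 224).astype('float32')      # stand-in for the real input

    net = cv2.dnn.readNetFromONNX('mobilenetv2_pov4_test.onnx')
    net.setInput(blob)
    cv_out = net.forward()

    sess = rt.InferenceSession('mobilenetv2_pov4_test.onnx')
    ort_out = sess.run(None, {sess.get_inputs()[0].name: blob})[0]

    print('max abs diff:', np.abs(cv_out - ort_out).max())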

hariharans29 commented 5 years ago

I found bike.jpg in your initial resources zip. I tried it with ORT 1.0 and the latest model you shared, and I could repro the original issue - the other label has a softmax score of 1.0. Here is my script -

ORT 1.0

import onnxruntime as rt
import numpy as np
from PIL import Image

def preprocess(image):
    image = image.resize((224, 224), Image.BILINEAR)

    # Convert to BGR
    image = np.array(image)[:, :, [2, 1, 0]].astype('float32')

    # HWC -> CHW
    image = np.transpose(image, [2, 0, 1])

    # Normalize
    mean_vec = np.array([0.485, 0.456, 0.406])
    std_vec = np.array([0.229, 0.224, 0.225])
    for i in range(image.shape[0]):
        image[i, :, :] = image[i, :, :] - mean_vec[i]
        image[i, :, :] = image[i, :, :] / std_vec[i]

    image = np.expand_dims(image, axis=0)    
    return image

def softmax(arr):
    sum = 0
    max = np.max(arr)
    print(max)
    for i in arr:
        print(i)
        sum += np.exp(i - max)    # subtract max in numerator and denominator to avoid overflow

    res = []
    for i in arr:
        res.append(np.exp(i - max) / sum)

    return res    

so = rt.SessionOptions()
so.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL

img = Image.open(r'bike.jpg')
img_data = preprocess(img)
print (img_data.shape)

sess = rt.InferenceSession(r'mobilenetv2_pov4_test.onnx', so)
input_name = sess.get_inputs()[0].name

pred_onnx = sess.run(None, {input_name: img_data})
print(pred_onnx)

print(softmax(pred_onnx[0][0]))

Could you please share your PyTorch training script so we can see whether the model was exported to ONNX correctly by the PyTorch exporter? Also, please let us know if you were able to load the same ONNX model in OpenCV 4 and get the right results, in which case the issue probably lies with ORT. Thanks.

CC: @spandantiwari @BowenBao

yizhaoyanbo commented 5 years ago

thanks @hariharans29

  1. I have tried loading the same onnx model using the opencv 4.1.0 dnn module and I get the right results.
  2. The python scripts used to export the onnx model are in the following zip file, with both the mobilenetv2 base model and the trained classifier model. python_model_and_scripts.zip

hariharans29 commented 4 years ago

Great, thanks. Can you also please share how you invoke OpenCV 4's dnn module to consume the ONNX file? That would speed things up for me...

yizhaoyanbo commented 4 years ago

I have put the sample code files that use the OpenCV 4.1 dnn module in the following zip file. The onnx file is loaded in the constructor and the prediction is implemented in the predict() function. opencv_classify.zip
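
A rough Python equivalent of that structure (load the ONNX file once in the constructor, run inference in predict()) with OpenCV's dnn module could look like the sketch below; the resize, scale and mean/std values are assumptions, the actual C++ code is in the zip.

    # Rough Python equivalent of the described C++ class: the ONNX file is loaded once in the
    # constructor and inference runs in predict(). Preprocessing constants are assumptions.
    import cv2
    import numpy as np

    class Classifier:
        def __init__(self, model_path):
            self.net = cv2.dnn.readNetFromONNX(model_path)

        def predict(self, image_bgr):
            img = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
            img = cv2.resize(img, (224, 224)).astype(np.float32) / 255.0
            img = (img - [0.485, 0.456, 0.406]) / [0.229, 0.224, 0.225]   # assumed ImageNet stats
            blob = np.transpose(img, (2, 0, 1))[np.newaxis, ...].astype(np.float32)
            self.net.setInput(blob)
            logits = self.net.forward().flatten()
            e = np.exp(logits - logits.max())                             # softmax over class scores
            return e / e.sum()

    # usage: print(Classifier('mobilenetv2_pov4_test.onnx').predict(cv2.imread('bike.jpg')))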

yizhaoyanbo commented 4 years ago

@hariharans29 Will ORT fix this bug in the next version? Thanks.

hariharans29 commented 4 years ago

Hi @yizhaoyanbo,

I am working on some other items for the next release right now and haven't had time to debug this yet. I hope to get to it soon. I think the difficult part is identifying the cause of the divergence; the fix itself (I think) will be low dev cost. I will keep you updated.

Thanks for checking back.

hariharans29 commented 4 years ago

Hi @yizhaoyanbo,

I took a brief look at it today - I am still not able to narrow down the possible causes. Meanwhile, I noticed that the exported ONNX model is an opset 9 model. Is it possible to export to a newer opset from PyTorch (opset 10 or opset 11) and try again?
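
For reference, re-exporting to a newer opset from PyTorch only requires passing opset_version to the exporter; a minimal sketch, where how the trained model is restored and the file names are assumptions (the real export scripts are in the shared zip):

    # Re-export the trained classifier at opset 11. How the trained model is restored and
    # the file names are assumptions; the actual export scripts are in the shared zip.
    import torch

    model = torch.load('mobilenetv2_classifier.pth', map_location='cpu')  # assumed checkpoint
    model.eval()

    dummy = torch.randn(1, 3, 224, 224)
    torch.onnx.export(model, dummy, 'mobilenetv2_opset11.onnx',
                      opset_version=11,
                      input_names=['input'], output_names=['output'])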

faxu commented 4 years ago

@yizhaoyanbo Were you able to get this working with the latest ORT release and the latest version of PyTorch? Let us know if more assistance is needed.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

stale[bot] commented 4 years ago

This issue has been automatically closed due to inactivity. Please reactivate if further support is needed.