onnx / models

A collection of pre-trained, state-of-the-art models in the ONNX format
http://onnx.ai/models/
Apache License 2.0

ArcFace model query #156

Open hariharans29 opened 5 years ago

hariharans29 commented 5 years ago

Hello,

Was setting the spatial attribute to 0 in the BatchNormalization nodes of the ArcFace model intended? A user notes that setting spatial=1 returns the right result as well. So I am trying to understand whether setting spatial = 0 (the non-default value) for the opset 8 model was an accident.
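
For reference, a minimal sketch for checking which value the downloaded model actually carries (the file name resnet100.onnx is assumed to be the model zoo download):

import onnx

# List the "spatial" attribute of every BatchNormalization node in the model.
model = onnx.load("resnet100.onnx")
for node in model.graph.node:
    if node.op_type == "BatchNormalization":
        spatial = next((attr.i for attr in node.attribute if attr.name == "spatial"), None)
        print(node.name, "spatial =", spatial)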

CC: @abhinavs95

Thanks!

pranavsharma commented 5 years ago

@abhinavs95 Any update on this?

abhinavs95 commented 5 years ago

Hi @hariharans29 @pranavsharma

The ArcFace model was prepared using MXNet and then converted to ONNX format using the MXNet to ONNX converter.

For BatchNorm, MXNet computes mean and variance per feature, which is why we explicitly set spatial=0 when translating BatchNorm layers from MXNet to ONNX.

pranavsharma commented 5 years ago

@abhinavs95 can this model be updated to use spatial=1? The ONNX standard has dropped support for spatial=0 from opset 10 onwards, and onnxruntime doesn't plan to support it.
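
For reference, the usual route to a newer opset is the ONNX version converter; a minimal sketch (untested on this particular model, and the converter may reject BatchNormalization nodes whose attributes or input shapes do not match the spec):

import onnx
from onnx import version_converter

# Attempt to upgrade the model zoo ArcFace model (path assumed) to opset 10.
model = onnx.load("resnet100.onnx")
try:
    upgraded = version_converter.convert_version(model, 10)
    onnx.save(upgraded, "resnet100_opset10.onnx")
except Exception as exc:
    # The conversion can fail if the spatial=0 BatchNormalization nodes cannot be adapted.
    print("Opset upgrade failed:", exc)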

abhinavs95 commented 5 years ago

The spatial parameter is set to 0 in the MXNet to ONNX converter probably due to behavior of MXNet batchnorm: https://github.com/apache/incubator-mxnet/blob/745a41ca1a6d74a645911de8af46dece03db93ea/python/mxnet/contrib/onnx/mx2onnx/_op_translations.py#L357

I'll try to see if this model can be converted with spatial=1.

pranavsharma commented 5 years ago

@abhinavs95 did you get a chance to address this? Thanks!

pranavsharma commented 5 years ago

@abhinavs95 any update on this? Thanks!

abhinavs95 commented 5 years ago

@pranavsharma Changing the spatial parameter cannot be done through the MXNet-to-ONNX converter API as I had hoped; it requires modifying the source code. I am currently busy with another project, so I will provide an update when I get a chance to work on this.

arsdragonfly commented 5 years ago

@abhinavs95 any update on this? onnxruntime does not (and possibly will not) support spatial==0 on its CPU provider, making tensorrt-inference-server unable to load exported models (see here).

mathisdon commented 5 years ago

There are more models in the ONNX model zoo with this bug: Yolov3 and Duc are also unusable in ONNX Runtime for the same reason. When will this be fixed?

prasanthpul commented 5 years ago

Yolov3 is not impacted by this and has been successfully tested as-is.

Duc and ArcFace models need to be updated to a newer ONNX version. Hopefully @abhinavs95 can make the necessary modifications soon.

Mut1nyJD commented 5 years ago

Just to reiterate: even on the GPU backend with ONNXRuntime (v0.4 or v0.5), the current model in the repository produces wrong results; the feature vector returned from the final fc layer is always NaN. I strongly suggest retiring this model and maybe replacing it with a PyTorch version of the same thing until MXNet updates their ONNX exporter to the latest specification.

17702513221 commented 5 years ago

@pranavsharma Please tell me how to use yolov3 (keras-to-onnx). When I use it in tensorrt-inference-server I get lots of NaNs.

prasanthpul commented 5 years ago

@Mut1nyJD can you contribute a replacement model please?

Mut1nyJD commented 5 years ago

@prasanthpul

Working on it. I am currently training a new version from scratch, using a PyTorch implementation (the model seems to export to ONNX fine in general) with the MS1M dataset, but this is going to take a while since I have it at low priority.

luan1412167 commented 5 years ago

@Mut1nyJD Do you have a runnable ArcFace model?

Mut1nyJD commented 4 years ago

@luan1412167

I am afraid I am still training; I have it at low priority, which is why it takes time. Hopefully soon. I will check whether an intermediate snapshot is exportable, but I don't see why not.

luan1412167 commented 4 years ago

@Mut1nyJD Does the ArcFace model exported from PyTorch to ONNX give the right results?

hariharans29 commented 4 years ago

Hi guys,

I think I finally cracked this issue. I supported non-spatial mode in ORT in this PR - https://github.com/microsoft/onnxruntime/pull/2092 but it still won't run the ArcFace model in the ONNX zoo.

This is because the ArcFace model is an invalid ONNX model: it violates the ONNX spec (https://github.com/onnx/onnx/blob/master/docs/Changelog.md#BatchNormalization-7). It has BatchNorm nodes with spatial == 0, but the input shapes don't adhere to the required shape.

The spec says that the inputs (scale, B, mean, var) should have shape (C, D1, D2, …, Dn) when spatial == 0.

But in the model they have shape [C], which is only allowed for spatial == 1.

So supporting non-spatial mode in ORT will not solve this problem. This is a bug in the MXNet exporter: it actually means spatial == 1 but still stamps the BatchNormalization nodes with spatial == 0. The output results are correct when we run the model assuming spatial == 1. So the model doesn't need re-conversion; it only needs an update to the model proto to set spatial == 1 in all the BN nodes, and then it will run correctly in ORT.
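
A minimal sketch of the shape check described above (the path is assumed to be the model zoo download; the scale/B/mean/var names are taken from each node's own input list):

import onnx

model = onnx.load("resnet100.onnx")
# Map every initializer name to its shape.
init_shapes = {init.name: list(init.dims) for init in model.graph.initializer}

for node in model.graph.node:
    if node.op_type == "BatchNormalization":
        # Inputs 1..4 are scale, B, mean and var; 1-D [C] shapes mean the model
        # really behaves as spatial == 1 despite the attribute saying otherwise.
        print(node.name, [init_shapes.get(name) for name in node.input[1:5]])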

luan1412167 commented 4 years ago

@hariharans29 You can try with this model here. It has spatial=0 and has been reshaped. I look forward to your results.

hariharans29 commented 4 years ago

Hi @luan1412167 - I actually think it should be the opposite (spatial == 1).

hariharans29 commented 4 years ago

Hi all,

I wrote a simple script to "correct" (not re-convert from base model) the ONNX model zoo ArcFace model from here - https://github.com/onnx/models/tree/master/vision/body_analysis/arcface.

This link contains the model (named resnet100.onnx) and test data. The script to correct the model is below (it is not possible to attach the corrected model, as its size exceeds the allowed limits):

import onnx

model = onnx.load(r'arcface_mxnet\resnet100.onnx')

for node in model.graph.node:
    if(node.op_type == "BatchNormalization"):
        for attr in node.attribute:
            if (attr.name == "spatial"):
                attr.i = 1

onnx.save(model, r'updated_resnet100.onnx')

I checked the results in ONNXRuntime (using the test data provided in the same link) after correction and the result looks okay. Please use the corrected model if you have immediate inferencing needs.
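
If you just want a quick sanity check without the test runner, here is a minimal sketch of a single ONNX Runtime pass over the corrected model (a random input only verifies that the model loads and returns finite values, not accuracy; the 1x3x112x112 shape is the model's documented input):

import numpy as np
import onnxruntime as rt

sess = rt.InferenceSession("updated_resnet100.onnx")
input_name = sess.get_inputs()[0].name

# With spatial still set to 0 the session fails to load (or returns NaNs);
# the corrected model should return a finite embedding vector.
dummy = np.random.rand(1, 3, 112, 112).astype(np.float32)
embedding = sess.run(None, {input_name: dummy})[0]
print(embedding.shape, np.isfinite(embedding).all())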

prasanthpul commented 4 years ago

@abhinavs95 can you comment on @hariharans29's findings and how this can be fixed from MXNet side? If it cannot, then we should apply the correction to the downloadable model and eventually replace it with a model from another framework.

luan1412167 commented 4 years ago

@hariharans29 I used updated_resnet100.onnx as per your instructions above. It runs, but the result seems to be wrong. Could you check the model's results again when running it with Python and ONNX Runtime?

hariharans29 commented 4 years ago

Did you use the official resnet100.onnx from the model zoo link or your converted model to make the update?

I made the update on the official model and ran the test with all 3 test cases and the results are right.

As a double confirmation, another user made the same observation about setting spatial == 1 in the same model here - https://github.com/Microsoft/onnxruntime/issues/831.

Quoting him - "By now I figured out that the model works correctly if you change the "spatial" attribute of all BatchNormalization nodes from 0 to 1. However, I'm not really sure why that helps".

I just gave an explanation above as to why that helps.

mathisdon commented 4 years ago

I just downloaded the arcface model again from https://github.com/onnx/models/tree/master/vision/body_analysis/arcface, using the link called "248.9 MB" in the "Download" column, and ONNX Runtime still reports the same problem:

RuntimeError: [ONNXRuntimeError] : 1 : GENERAL ERROR : Exception during initialization: D:\3\s\onnxruntime\core/providers/cpu/nn/batch_norm.h:39 onnxruntime::BatchNorm::BatchNorm spatial == 1 was false. BatchNormalization kernel for CPU provider does not support non-spatial cases

hariharans29 commented 4 years ago

Hi @mathisdon ,

The model doesn't require spatial == 0. Can you please make the update to the model as suggested above and try running it?

luan1412167 commented 4 years ago

Hi @hariharans29, I have downloaded the model from the model zoo and run your script to change spatial 0 -> 1; the model is linked here. I tried with 2 different images but I get a cosine distance of 0.96, so I think it is wrong (because 2 different images should give a small cosine distance between the embeddings). Can you share the script you used to evaluate the model? (attached: the two test images, e.g. Tom_Hanks_54745)

hariharans29 commented 4 years ago

Hi,

I did not use a script. I used the onnx test runner tool in the OnnxRuntime repo. It can consume input tensor protobufs and output tensor protobufs and compare the results after each test. I downloaded the 3 test cases in the onnx model zoo link (download with test data) and used the onnx test runner tool to run each test case, and the output is correct.

What is the exact numerical cosine distance value you expect? The definition of "wrong" results seems ridden with some hidden assumptions.
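
For those who don't want to build the ORT repo tool, roughly the same check can be scripted (a sketch, assuming the standard model zoo test-data layout of test_data_set_0/input_0.pb and output_0.pb next to the model):

import numpy as np
import onnx
from onnx import numpy_helper
import onnxruntime as rt

def load_pb(path):
    # Each .pb file in the test data is a serialized TensorProto.
    tensor = onnx.TensorProto()
    with open(path, "rb") as f:
        tensor.ParseFromString(f.read())
    return numpy_helper.to_array(tensor)

sess = rt.InferenceSession("updated_resnet100.onnx")
input_name = sess.get_inputs()[0].name

x = load_pb("test_data_set_0/input_0.pb")
expected = load_pb("test_data_set_0/output_0.pb")
actual = sess.run(None, {input_name: x})[0]
print("max abs diff:", np.abs(actual - expected).max())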

luan1412167 commented 4 years ago

@hariharans29 Can you check my model? It is here.

I compute the cosine distance between two embeddings. If the two embeddings are of the same person, the cosine distance will be near 1; otherwise it will be small and near 0.

hariharans29 commented 4 years ago

Hi,

I think it is the exact opposite. It is the cosine "distance" (not similarity). When two people are different, the cosine distance will be near 1, and when they are the same, the value nears 0.
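
To make the terminology concrete, here is an illustrative sketch (not taken from either poster's code):

import numpy as np

def cosine_similarity(a, b):
    # Near 1 when the embeddings point in the same direction (same person).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    # Defined as 1 - similarity, so matching faces give a value near 0.
    return 1.0 - cosine_similarity(a, b)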

luan1412167 commented 4 years ago

Hi @hariharans29, my script for cosine similarity:

import cv2
import numpy as np
from numpy import dot
from numpy.linalg import norm
import onnxruntime as rt

def preprocess(input_data):
    img_data = input_data.astype('float32')
    # reshape the HWC image buffer directly to (1, 3, 112, 112)
    img_data = img_data.reshape(1, 3, 112, 112)

    mean_vec = np.array([0.485, 0.456, 0.406])
    stddev_vec = np.array([0.229, 0.224, 0.225])
    norm_img_data = np.zeros(img_data.shape).astype('float32')
    for i in range(img_data.shape[0]):
        norm_img_data[i,:,:] = (img_data[i,:,:]/255 - mean_vec[i]) / stddev_vec[i]

    return norm_img_data

sess = rt.InferenceSession("/home/luandd/CLionProjects/untitled/updated_resnet100.onnx")
input_name = sess.get_inputs()[0].name
label_name = sess.get_outputs()[0].name

img = cv2.imread('/home/luandd/Downloads/trump-1.jpg')
img = cv2.resize(img, (112, 112))
input_data = preprocess(img)
a = sess.run([label_name], {input_name: input_data})[0]

img = cv2.imread('/home/luandd/Downloads/barack-obama.jpeg')
img1 = cv2.resize(img, (112, 112))
input_data = preprocess(img1)
b = sess.run([label_name], {input_name: input_data})[0]

cos_sim = dot(a[0], b[0]) / (norm(a[0]) * norm(b[0]))
print(cos_sim)

I tested the 2 images above and got cos_sim = 0.9901112. I don't know why!!

leewea commented 4 years ago

@hariharans29 You can try with this model here. It has spatial=0 and has been reshaped. I look forward to your results.

Can you tell me how to reshape it? Thank you.

duonglong289 commented 4 years ago

@hariharans29 You can try with this model here. It has spatial=0 and has been reshaped. I look forward to your results.

Can you tell me how to reshape it? Thank you.

The root cause is as @hariharans29 said in this thread. I found this link, which changes the function that converts PReLU from MXNet to ONNX; it can fix this bug of the MXNet converter. After that, the model exported from MXNet might still not run because of "spatial=0" in BatchNormalization; see this link.

Alternatively, I wrote a script to post-process the model exported from MXNet to ONNX; it adds a Reshape node for the PRelu slope input (and sets spatial=1 on the BatchNormalization nodes), and it works for me.

import onnx
from onnx import checker
import logging

model = onnx.load(r"mxnet2onnx_exported_bug_model.onnx")
onnx_processed_nodes = []
onnx_processed_inputs = []
onnx_processed_outputs = []
onnx_processed_initializers = []

reshape_node = []

for ind, node in enumerate(model.graph.node):
    if node.op_type == "PRelu":
        input_node = node.input
        input_bn = input_node[0]
        input_relu_gamma = input_node[1]
        output_node = node.output[0]

        input_reshape_name = "reshape{}".format(ind)
        slope_number = "slope{}".format(ind)

        node_reshape = onnx.helper.make_node(
            op_type="Reshape",
            inputs=[input_relu_gamma, input_reshape_name],
            outputs=[slope_number],
            name=slope_number
        )

        reshape_node.append(input_reshape_name)
        node_relu = onnx.helper.make_node(
            op_type="PRelu",
            inputs=[input_bn, slope_number],
            outputs=[output_node],
            name=output_node
        )
        onnx_processed_nodes.extend([node_reshape, node_relu])

    else:
        # If "spatial = 0" does not work for "BatchNormalization", change "spatial=1"
        # else comment this "if" condition
        if node.op_type == "BatchNormalization":
            for attr in node.attribute:
                if (attr.name == "spatial"):
                    attr.i = 1
        onnx_processed_nodes.append(node)

list_new_inp = []
list_new_init = []
for name_rs in reshape_node:
    new_inp = onnx.helper.make_tensor_value_info(
        name=name_rs,
        elem_type=onnx.TensorProto.INT64,
        shape=[4]
    )
    new_init = onnx.helper.make_tensor(
        name=name_rs,
        data_type=onnx.TensorProto.INT64,
        dims=[4],
        vals=[1, -1, 1, 1]
    )

    list_new_inp.append(new_inp)
    list_new_init.append(new_init)

for k, inp in enumerate(model.graph.input):
    if "relu0_gamma" in inp.name or "relu1_gamma" in inp.name: #or "relu_gamma" in inp.name:
        new_reshape = list_new_inp.pop(0)
        onnx_processed_inputs.extend([inp, new_reshape])
    else:     
        onnx_processed_inputs.extend([inp])

for k, outp in enumerate(model.graph.output):
    onnx_processed_outputs.extend([outp])

for k, init in enumerate(model.graph.initializer):
    if "relu0_gamma" in init.name or "relu1_gamma" in init.name:
        new_reshape = list_new_init.pop(0)
        onnx_processed_initializers.extend([init, new_reshape])
    else:
        onnx_processed_initializers.extend([init])

graph = onnx.helper.make_graph(
        onnx_processed_nodes,
        "mxnet_converted_model",
        onnx_processed_inputs,
        onnx_processed_outputs
    )

graph.initializer.extend(onnx_processed_initializers)

# Check graph
checker.check_graph(graph)

onnx_model = onnx.helper.make_model(graph)

# Write model
str_input = '3,112,112'
input_shape = (1,) + tuple( [int(x) for x in str_input.split(',')] )
onnx_file_path = "mxnet2onnx_model_onnxruntime.onnx"

with open(onnx_file_path, "wb") as file_handle:
    serialized = onnx_model.SerializeToString()
    file_handle.write(serialized)
    logging.info("Input shape of the model %s ", input_shape)
    logging.info("Exported ONNX file %s saved to disk", onnx_file_path)

print("Done!!!")

SthPhoenix commented 4 years ago

Awesome work @duonglong289! I was able to convert the ONNX model zoo ArcFace to TensorRT 7 using onnx-simplifier and @hariharans29's script, but was struggling for two days to convert the original InsightFace model, and your script made that conversion work.

SthPhoenix commented 4 years ago

If anyone is interested, I have made a script to convert the original InsightFace model zoo ArcFace to ONNX and then to TensorRT, based on @duonglong289's script.

The TensorRT outputs are the same as the MXNet outputs, so it can be a drop-in replacement for the MXNet model. The TRT inference code needs some cleanup and will be released later.

NaeemKhan333 commented 3 years ago

@SthPhoenix @duonglong289 @hariharans29 How can I convert the ONNX model given in the following link to TensorRT?

https://github.com/onnx/models/tree/master/vision/body_analysis/arcface/model

Please guide me about it. Thanks

SthPhoenix commented 3 years ago

@NaeemKhan333, if you are not too strict about the specific ArcFace version, you could try using my converter, which builds a TRT engine from the original ArcFace model and gives better accuracy than the model provided in the ONNX model zoo. To use it you'll need docker, nvidia-container-toolkit and nvidia 450.xx drivers. To build the engine:

  1. Clone the repo: git clone https://github.com/SthPhoenix/InsightFace-REST.git
  2. Deploy the conversion container: bash deploy_converter.sh
  3. Inside the container shell, execute the script: python build_insight_trt.py

As a result you'll get a models folder inside the repo's root, containing the original MXNet model, the same model converted to ONNX, and finally a .plan file containing the serialized TensorRT engine.
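
If it helps, here is a minimal sketch of loading the generated engine with the TensorRT Python API (the .plan file name below is illustrative; use the path the script actually writes):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
# Deserialize the engine produced by build_insight_trt.py (file name illustrative).
with open("models/arcface_r100_v1.plan", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
print("engine loaded:", engine is not None)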

The engine will be built using TensorRT 7.1.3. If you can't use 450.xx drivers, you can edit src/Dockerfile.converter to use the TensorRT 20.03 image instead of 20.09; you'll get TensorRT 7.0, which is not recommended.

NaeemKhan333 commented 3 years ago

@SthPhoenix Thanks for such a nice explanation. But I do not want to use Docker; I want to convert without using Docker. Can you guide me on how I can do that? Thanks

SthPhoenix commented 3 years ago

@SthPhoenix Thanks for such a nice explanation. But I do not want to use Docker; I want to convert without using Docker. Can you guide me on how I can do that? Thanks

Then you can just run build_insight_trt.py, but then you need to manually install mxnet==1.7.0, onnx==1.7.0, tensorrt>=7.0.0, and CUDA and cuDNN versions compatible with your graphics driver.

NaeemKhan333 commented 3 years ago

@SthPhoenix Ok, got it. Secondly, should I use this model https://github.com/onnx/models/tree/master/vision/body_analysis/arcface/model or do I need to download the original MXNet model? Can you guide me? Thanks

SthPhoenix commented 3 years ago

The script will do everything for you, but it will use the LResNet100E-IR,ArcFace@ms1m-refine-v2 model from the InsightFace model zoo.

NaeemKhan333 commented 3 years ago

@SthPhoenix Do you have a script, or can you guide me on the inference/testing script for an input image with the converted TensorRT model after running the conversion script? Thanks in advance

NaeemKhan333 commented 3 years ago

The script will do everything for you, but it will use the LResNet100E-IR,ArcFace@ms1m-refine-v2 model from the InsightFace model zoo.

mxnet version: 1.6.0
onnx version: 1.7.0
Model file is not found. Downloading.
Downloading /models/mxnet/arcface_r100_v1.zip from http://insightface.ai/files/models/arcface_r100_v1.zip...
100%|████████████████████████████████| 237710/237710 [00:13<00:00, 17061.59KB/s]
Converting MXNet model to ONNX...
Creating intermediate copy of source model...
Applying RetinaFace specific fixes to input MXNet model before conversion...
Exporting to ONNX...
[08:27:16] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.0.0. Attempting to upgrade...
[08:27:16] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
Applying ArcFace specific fixes to output ONNX
Removing initializer from inputs in ONNX model...
Removing intermediate .symbol and .params
Building TensorRT engine...
[TensorRT] WARNING: onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
TensorRT model ready.

@SthPhoenix It does not save the converted model into my directory. Why is that?

SthPhoenix commented 3 years ago

By default the model location is the /models dir, since I'm mounting it inside Docker; you can check this dir at the disk root, or just change this path in build_insight_trt.py.

BTW: I think this conversation is already out of the scope of the original issue; feel free to open an issue at my repo if you have any problems.

nadaboulares commented 3 years ago

Hi guys, can anyone send me an example of testing the recognition ONNX model?