onnx / models

A collection of pre-trained, state-of-the-art models in the ONNX format
http://onnx.ai/models/
Apache License 2.0

MXNet-Converted ArcFace Model Slow Compared to Provided ArcFace Model #591

Open tk4218 opened 1 year ago

tk4218 commented 1 year ago

I have an ArcFace/ResNet100 model that I trained using InsightFace's MXNet training code. For inference, I converted the model to ONNX with the help of https://github.com/linghu8812/tensorrt_inference/blob/master/project/arcface/export_onnx.py.
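
For reference, the export roughly follows the pattern below (a minimal sketch assuming MXNet 1.9's `mx.onnx.export_model`; the checkpoint paths are placeholders, and 112x112 is ArcFace's standard input size):

```python
import numpy as np
import mxnet as mx

# Placeholder paths to the trained InsightFace checkpoint.
sym_file = "./model-symbol.json"
params_file = "./model-0000.params"

# Export the symbol/params pair to ONNX; ArcFace takes a 1x3x112x112 float32 input.
mx.onnx.export_model(
    sym_file,
    params_file,
    in_shapes=[(1, 3, 112, 112)],
    in_types=np.float32,
    onnx_file_path="model.onnx",
)
```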

The inference results from my converted model are correct; however, the model is extremely slow. For reference, I compared it with the ArcFace model provided in this repository (arcfaceresnet100-8.onnx): inference with my model takes ~7 seconds, whereas the provided model takes < 1 second.
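
For context, the timings come from a simple measurement along these lines (a minimal sketch assuming the ONNX Runtime Python API on the CPU execution provider; the model path is a placeholder):

```python
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
x = np.random.rand(1, 3, 112, 112).astype(np.float32)

# Warm up once, then time a single inference.
sess.run(None, {input_name: x})
start = time.perf_counter()
sess.run(None, {input_name: x})
print(f"inference took {time.perf_counter() - start:.3f}s")
```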

When comparing the two models in Netron, all of the nodes, attributes, and input/output shapes are the same (the weights differ, obviously). However, when I run the ONNX Runtime profiler on the two models, there are a few differences. I've attached the profile logs for both models.

profile_arcfaceresnet100-8.txt profile_model-opt.txt
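
In case anyone wants to reproduce the logs, they were generated with ONNX Runtime's built-in profiler, roughly as follows (a minimal sketch; the profiler writes a JSON trace whose path `end_profiling()` returns):

```python
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.enable_profiling = True  # emit a per-node JSON trace for this session

sess = ort.InferenceSession("model.onnx", opts, providers=["CPUExecutionProvider"])
x = np.random.rand(1, 3, 112, 112).astype(np.float32)
sess.run(None, {sess.get_inputs()[0].name: x})

print(sess.end_profiling())  # path of the JSON profile file
```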

There are a few differences between the two logs, mainly (mine vs. arcfaceresnet100-8):

I am not sure what differs during conversion. It is critical that my converted model runs with performance similar to the arcfaceresnet100-8 model. I've tried running my model through simplifiers, optimizers, etc., but with no improvement (see the sketch below).
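
As an example of the kind of post-processing I tried, a typical onnx-simplifier pass looks like this (a minimal sketch assuming the `onnxsim` package; it did not change the timings for me):

```python
import onnx
from onnxsim import simplify  # pip install onnx-simplifier

model = onnx.load("model.onnx")
model_simp, ok = simplify(model)  # constant-fold and remove redundant nodes
assert ok, "simplified model failed the validation check"
onnx.save(model_simp, "model-simplified.onnx")
```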

Here are my environment details:

OS: Linux Ubuntu Server 20.04
Python: 3.8
MXNet: 1.9.1
ONNX Runtime: 1.14.0
ONNX: 1.13.0
ONNX IR version: 8
ONNX opset version: 18

If anyone could provide insight into why my model performs more slowly, or why there are differences in execution, that would be extremely helpful.

tk4218 commented 1 year ago

After some further testing, I manually updated my model to match the IR and opset versions of the arcfaceresnet100-8 model (IR version 3, opset version 8), and that seems to have resolved the node differences in the profiles. I'm now seeing 51 ReorderInput/ReorderOutput, 152 Conv, 2 BatchNormalization, etc.
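
I edited the versions by hand, but roughly the same downgrade can be sketched with onnx's version converter (a sketch only; down-converting to opset 8 is not guaranteed to work for every op, and the IR version still has to be set separately):

```python
import onnx
from onnx import version_converter

model = onnx.load("model.onnx")

# Downgrade the opset to match arcfaceresnet100-8 (opset 8)...
converted = version_converter.convert_version(model, 8)
# ...and pin the IR version to match as well (IR version 3).
converted.ir_version = 3

onnx.checker.check_model(converted)
onnx.save(converted, "model-ir3-opset8.onnx")
```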

It is now pretty clear that the Conv and PRelu execution times are what is making my model slow; however, I still don't see any differences in those nodes compared with the other model. One thing to note is that the weights of my model are significantly smaller in magnitude (for example, -4.930378685479061e-25 vs. 0.00033268501283600926), though both models use float32.

I'm not sure whether the weight values could cause a slowdown, but I'm struggling to find any difference in my Conv/PRelu nodes that would explain it.
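
Given how tiny the weights are, one thing that may be worth checking is whether any float32 values fall into the subnormal range (magnitudes below ~1.18e-38), since subnormal arithmetic is slow on many CPUs; the example value above (-4.9e-25) is still a normal float, but intermediate products could underflow. A minimal sketch to count subnormal weights in the model's initializers:

```python
import numpy as np
import onnx
from onnx import numpy_helper

model = onnx.load("model.onnx")
tiny = np.finfo(np.float32).tiny  # smallest positive *normal* float32 (~1.18e-38)

for init in model.graph.initializer:
    w = numpy_helper.to_array(init)
    if w.dtype != np.float32 or w.size == 0:
        continue
    a = np.abs(w)
    n_subnormal = np.count_nonzero((a > 0) & (a < tiny))
    if n_subnormal:
        print(f"{init.name}: {n_subnormal} subnormal value(s) of {w.size}")
```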