Open Pierrick-Pochelu opened 2 years ago
Your observations are correct: we do not have to do anything during loading/initialization because all of the constants, for example, are loaded directly from the binary. We currently don't optimize conv ops, which are the predominant performance component of ResNet and other conv-based networks. We are working on it. If you are interested in contributing to the performance of ONNX-MLIR, we welcome contributions from all.
Hi @AlexandreEichenberger, Thanks for sharing those details.
I was recently exploring this project and tested the inference of the generated model. I tested efficientnet_b0 from torchvision.
On my machine, which has an old 5th Gen Intel i3:
122 ms ± 32.3 ms per loop (mean ± std. dev. of 20 runs, 10 loops each)
64.2 ms ± 10.5 ms per loop (mean ± std. dev. of 20 runs, 10 loops each)
1.2 s ± 55 ms per loop (mean ± std. dev. of 20 runs, 10 loops each)
I'm wondering whether I'm doing something wrong or this is the expected inference speed for this model.
What kinds of models is this project intended to be used with to get faster inference?
Thanks
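The numbers above are in the format IPython's `%timeit` magic prints. For reference, an equivalent stdlib sketch that produces the same kind of "mean ± std. dev. of 20 runs, 10 loops each" measurement might look like this (the `predict` function here is a hypothetical stand-in for a single inference call on the compiled model):

```python
import timeit
import statistics

def predict():
    # Hypothetical stand-in for one inference call on the compiled model.
    return sum(i * i for i in range(1000))

# Mirror %timeit's "20 runs, 10 loops each": each run times 10 back-to-back calls.
runs = timeit.repeat(predict, repeat=20, number=10)
per_loop_ms = [r / 10 * 1000 for r in runs]  # per-call time in milliseconds
mean = statistics.mean(per_loop_ms)
std = statistics.stdev(per_loop_ms)
print(f"{mean:.3g} ms ± {std:.2g} ms per loop "
      f"(mean ± std. dev. of 20 runs, 10 loops each)")
```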
Hello, I start with a ResNet50 ONNX file or a VGG19 ONNX file. I compile it to generate a .so shared library file. Everything is working fine.
Now I want to perform two things: 1) load/initialize the model in memory, and 2) predict quickly with it as data samples arrive.
I implemented it by copy/pasting from the docs, but I am hitting a critical performance issue. The initialization step is very fast (suspiciously so, <0.01 s), and the prediction is very slow compared to other inference frameworks (several minutes). I suspect the ExecutionSession constructor is not loading and initializing the model, and that the run method is loading and running it instead of only predicting.
How can I load/initialize once and perform fast predictions each time new data arrive?
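The intended pattern is to construct the session once and reuse it for every prediction. A minimal sketch of that separation, with the onnx-mlir-specific calls shown only as commented, hypothetical usage (the PyRuntime class name and `run` signature may differ between onnx-mlir versions):

```python
class CompiledModel:
    """Wraps any session object exposing run(inputs) -> outputs."""

    def __init__(self, session):
        # All one-time loading/initialization happens here, exactly once.
        self.session = session

    def predict(self, batch):
        # Hot path: only inference, nothing is reloaded from the .so file.
        return self.session.run([batch])

# Hypothetical onnx-mlir usage (API names assumed, check your version's docs):
#   import numpy as np
#   from PyRuntime import ExecutionSession
#   model = CompiledModel(ExecutionSession("resnet50.so"))  # load once
#   out = model.predict(np.random.rand(1, 3, 224, 224).astype(np.float32))
```

If `predict` still takes minutes while construction is near-instant, that is consistent with the real initialization cost being deferred into the first `run` call, which is worth confirming by timing the first and second predictions separately.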