Ultrawipf closed this issue 6 years ago.
Would you please share your code, methodology, and timing results for benchmarking onnx-tf? And what is your use case? Are you trying to serve a model with onnx-tf?
onnx-tf is not designed to be super fast, because once an ONNX model goes through onnx-tf it becomes a TF graph, and we expose this TF graph to the user (see our API: https://github.com/onnx/onnx-tensorflow/blob/master/doc/API.md); from that point on, everything is natively TensorFlow.
Regardless, I suspect protobuf deserialization takes up a big chunk of the time. I am also not sure what the impact of the format restriction (NHWC vs. NCHW, etc.) is. We would need to see the actual profiling results to be sure.
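As a rough way to separate those costs, one could time the protobuf deserialization, graph construction, and session execution independently. A minimal sketch (the model path and input shape are borrowed from the example further down; the split into three timed steps is just an illustration):

import time
import numpy as np
import onnx
from onnx_tf.backend import prepare

img = np.random.randn(1, 3, 224, 224).astype(np.float32)

t0 = time.time()
model = onnx.load('tmp/models/googlenet.onnx')  # protobuf deserialization
t1 = time.time()
tf_rep = prepare(model)                         # ONNX graph -> TF graph construction
t2 = time.time()
out = tf_rep.run(img)                           # actual TF session execution
t3 = time.time()
print("load: %.2fs  prepare: %.2fs  run: %.2fs" % (t1 - t0, t2 - t1, t3 - t2))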
Of course, I will share an example to reproduce what I mean. I am working on comparisons of weight compression and formats, where I need simple access to the weights and need to feed a lot of test data to different networks; I chose ONNX for its good documentation and cross-framework features.
What takes a long time is not the decoding of the ONNX protobuf and the graph generation by the prepare method, but the actual execution of the TensorFlow session. I want to run a lot of test images through the networks, and strangely the slowdown shows up as soon as I execute the graph taken from the prepared representation.
In my program I export the graph from the tf_rep into a new persistent session, so I do not have to use the run method for batch evaluation and can set different GPU options: tf.import_graph_def(tf_rep.predict_net.graph.as_graph_def(), name="")
But that is the only major difference from the examples and should not be the reason, because the problem is also observable with a simple program like the one below, where you can replace sess.run with the tf_rep.run function and run into the same extremely slow execution times:
import onnx
from onnx_tf.backend import prepare
import tensorflow as tf
# Prepare the inputs, here we use numpy to generate some random inputs for demo purpose
import numpy as np
img = np.random.randn(1, 3, 224, 224).astype(np.float32)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config,graph=None)
# Load the ONNX model
print("Loading")
model = onnx.load('tmp/models/googlenet.onnx')
tf_rep = prepare(model)
tf.import_graph_def(tf_rep.predict_net.graph.as_graph_def(),name="")
print("Running") # this should run fast
for i in range(10):
    output = sess.run("Softmax:0", feed_dict={"data_0:0": img})  # or: tf_rep.run(img)
    print(output)
For example Squeezenet runs perfectly fine. But googlenet and alexnet take a very long time.
While testing the simple test program I found this warning:
UserWarning: Using the pooling op in compatibility mode.This means your graph cannot be serialized.Please configure your pooling operation to only use paddings that correspond to Tensorflow SAME or VALID padding.
Maybe this is already a hint that something in these models is not fully compatible with TensorFlow. I also tested the strict option with both True and False, with no difference.
I ran a Chrome trace and it points to the compatibility pooling op as the culprit. Some padding configurations in PyTorch and Caffe are not natively supported in TensorFlow, so we implemented them in Python to ensure you can at least run the model with exactly the original configuration. These ops are executed single-threaded in Python; no doubt they are slow.
Using strict=False solves the problem. Refer to our API documentation for more info.
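If it helps to see which nodes end up on the compatibility path, one way is to dump the pooling attributes straight from the ONNX graph and look for pads that have no direct SAME or VALID equivalent. This is only an inspection sketch (the model path is an example), not the exact check onnx-tf performs internally:

import onnx

model = onnx.load('tmp/models/googlenet.onnx')  # example path
for node in model.graph.node:
    if node.op_type in ('MaxPool', 'AveragePool'):
        # keep only the integer-list attributes (kernel_shape, strides, pads)
        attrs = {a.name: list(a.ints) for a in node.attribute if a.ints}
        # ONNX pads are [x1_begin, x2_begin, ..., x1_end, x2_end]
        print(node.output[0],
              'kernel:', attrs.get('kernel_shape'),
              'strides:', attrs.get('strides'),
              'pads:', attrs.get('pads'))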
I modified your code:
import onnx
from onnx_tf.backend import prepare
import tensorflow as tf
from tensorflow.python.client import timeline
import time
# Prepare the inputs, here we use numpy to generate some random inputs for demo purpose
import numpy as np
img = np.random.randn(1, 3, 224, 224).astype(np.float32)
# Load the ONNX model
print("Loading")
model = onnx.load('googlenet.onnx')
tf_rep = prepare(model, strict=False)
print("Running") # this should run fast
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config, graph=None)
tf.import_graph_def(tf_rep.graph.as_graph_def(),name="")
start = time.time()
for i in range(10):
    output = sess.run("Softmax:0", feed_dict={"data_0:0": img},
                      options=run_options, run_metadata=run_metadata)  # or: tf_rep.run(img)
    print(output)
end = time.time()
print("time elapsed:")
print(end - start)
tl = timeline.Timeline(run_metadata.step_stats)
ctf = tl.generate_chrome_trace_format()
with open('trace_file.json', 'w') as f:
    f.write(ctf)  # open trace_file.json in chrome://tracing to inspect per-op timings
Now 10 inference runs take 2.7 sec on my machine (Power9 + Volta).
Thanks for the tests. So it was indeed the pooling operation. I had also tried strict=False before, but it made no difference here: I got the same warning and slow speed. What is your current onnx-tf version? I checked my versions again because tf_rep.graph did not exist when it should have. It turns out I had 1.1.2 installed via pip, not from source. After uninstalling that and installing from source, strict=False does work for this issue. Now I need to verify the results again, but it looks like this problem is solved.
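For reference, a quick way to double-check what is actually installed (hedged with a fallback, since not every onnx-tf release exposes __version__):

import onnx
import onnx_tf
import tensorflow as tf

print('onnx:', onnx.__version__)
print('tensorflow:', tf.__version__)
try:
    print('onnx-tf:', onnx_tf.__version__)  # may not exist in older releases
except AttributeError:
    import pkg_resources
    print('onnx-tf:', pkg_resources.get_distribution('onnx-tf').version)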
ResNet50 now completely fails during prepare with strict=False, but at least I know there is some compatibility issue. AlexNet and GoogLeNet seem to produce reasonable results.
@Ultrawipf I got the pb and test data from https://github.com/onnx/models/tree/master/models/image_classification/resnet
With strict=False, the results are as follows:
v1:
0 Passed
1 Passed
2 Passed
v2:
0 Failed
1 Passed
2 Passed
With strict=True, the results are as follows:
v1:
0 Passed
1 Passed
2 Passed
v2:
0 Failed
1 Passed
2 Passed
Test data set 0 failed only because of the low tolerance. I think this result is acceptable.
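To make the tolerance point concrete, the usual way to compare such outputs is an element-wise check with a relative and an absolute tolerance. A minimal sketch with placeholder file names and purely illustrative tolerances (not the thresholds the test suite actually uses):

import numpy as np

expected = np.load('expected.npy')  # placeholder: reference output from the test data
actual = np.load('actual.npy')      # placeholder: output produced by onnx-tf
np.testing.assert_allclose(actual, expected, rtol=1e-3, atol=1e-5)
print('outputs match within tolerance')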
That sounds great. I tested some different versions of ResNet again, and all the networks that had problems seem to run fine now. The one that failed with strict=False but ran with strict=True was downloaded a few days ago from the model page; it might have been replaced since, as the current one indeed runs correctly. I am sure onnx-tf will be a good option for running different models with TensorFlow in the future. Thanks for that.
Hello. I am trying to use a GoogLeNet built in MATLAB and exported to TensorFlow through ONNX. The same classification task that takes 5 seconds in MATLAB is taking about 300 seconds in TensorFlow. I did try it with strict=False, and there was no difference whatsoever. Would you be able to help? Attaching code below:
import numpy as np
import scipy.io as sio
import scipy
import tensorflow as tf
import cv2
import onnx
from onnx_tf.backend import prepare
model = onnx.load(r'D:\Vibhu\googlenet9.onnx')  # raw string so the backslashes are not treated as escapes
tf_rep = prepare(model,strict=False)
mat_contents = sio.loadmat('WDS7PSPCF0S11.IM0.mat')
img=mat_contents['scene']
#Some code removed for readability but essentially imgarr is a collection of images, made using the matrix img
for i in range(img.shape[2]):
    img1 = imgarr[i, :, :, :]
    img1 = np.moveaxis(img1, 0, -1)
    img1 = cv2.resize(img1, (224, 224))
    img1 = np.moveaxis(img1, -1, 0)
    print(i + 1, np.argmax(tf_rep.run(img1[np.newaxis, :, :, :])) + 1)
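For what it is worth, one cheap diagnostic would be to time each tf_rep.run call separately, to see whether the 300 seconds is spread evenly over the images or dominated by the first call. This is only a sketch reusing the variables from the loop above (img, imgarr, tf_rep, np, cv2):

import time

for i in range(img.shape[2]):
    img1 = imgarr[i, :, :, :]
    img1 = np.moveaxis(img1, 0, -1)
    img1 = cv2.resize(img1, (224, 224))
    img1 = np.moveaxis(img1, -1, 0)
    t0 = time.time()
    out = tf_rep.run(img1[np.newaxis, :, :, :])
    print(i + 1, np.argmax(out) + 1, 'run took %.3fs' % (time.time() - t0))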
@vibhuagrawal14 can you start a new issue and fill in some details, like uploading the models you used and letting us know which versions of onnx-tf and onnx you used?
The most likely reason is that you are using an outdated version of onnx-tf, but we would have to see the information required by our issue template.
@tjingrant Yes, will do that.
I currently need to compare different image classification networks, and with the onnx-tf backend some CNNs from the ONNX model page, like ResNet50, GoogLeNet, or AlexNet, run extremely slowly. It takes multiple seconds to process a single image (1, 3, 224, 224) where it should take milliseconds, which feels around 1000x slower than running the same network from a direct TensorFlow implementation converted from a Caffe model, on both CPU and GPU. The results are correct and the network does work, so I assume the input and output data are correct. SqueezeNet, on the other hand, runs perfectly fine and fast with onnx-tf. For comparison I tried to convert the "good" networks from TensorFlow to ONNX, but that fails because of unsupported rsqrt operations, so there are definitely some architecture differences.
The current TensorFlow version is 1.10.0 and onnx is 1.2.2.
It would be great to hear about possible causes and fixes.