onnx / onnx-tensorflow

Tensorflow Backend for ONNX

Slow inference with EfficientNet with dynamic batch size #998

Open palonso opened 2 years ago

palonso commented 2 years ago

Describe the bug

Our objective is to port an EfficientNet trained in PyTorch and exported to ONNX over to TensorFlow. However, we are facing slower inference times with the converted version.

After some experimentation, we found that this is related to the dynamic-axis property that we use to allow arbitrary batch sizes at inference. On the PyTorch side, this is controlled with the dynamic_axes parameter of torch.onnx.export. We exported and converted two versions of the same EfficientNet architecture, one with a fixed batch size of 1 and one with a dynamic batch size. Both models, in ONNX and SavedModel format, can be found here.
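For reference, this is roughly how the two variants are exported and converted. It is only a minimal sketch with a small stand-in module containing a depthwise convolution; the real model is the EfficientNet linked above, and the input/output names and shapes here are placeholders.

```python
import torch
import onnx
from onnx_tf.backend import prepare

# Stand-in module for illustration only (the real model is the EfficientNet above).
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 32, kernel_size=3, padding=1),
    torch.nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32),  # depthwise conv
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(32, 10),
).eval()
dummy = torch.ones(1, 1, 128, 96)  # example mel-spectrogram patch, batch size 1

# Fixed batch size: without dynamic_axes, the exported graph hard-codes batch = 1.
torch.onnx.export(model, dummy, "effnet_opset11_fixed_axis.onnx",
                  opset_version=11,
                  input_names=["melspectrogram"], output_names=["predictions"])

# Dynamic batch size: axis 0 of the input and output is marked as variable.
torch.onnx.export(model, dummy, "effnet_opset11_dynamic_axis.onnx",
                  opset_version=11,
                  input_names=["melspectrogram"], output_names=["predictions"],
                  dynamic_axes={"melspectrogram": {0: "batch"},
                                "predictions": {0: "batch"}})

# Convert both ONNX files to TensorFlow SavedModels with onnx-tf.
for name in ("effnet_opset11_fixed_axis", "effnet_opset11_dynamic_axis"):
    prepare(onnx.load(name + ".onnx")).export_graph(name)
```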

Our results show that the version of the model with a dynamic batch size is 50 times slower for the same input. Although we don't provide more models, we have reproduced this issue with other architectures such as MobileNet, but not with architectures such as ResNet. Because of this, our guess is that the problem is related to the depthwise convolution (DepthwiseConv2D in TensorFlow, and Conv2d with groups == in_channels and out_channels == K * in_channels in PyTorch).
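To make that correspondence explicit, here is a minimal sketch of the two formulations of a depthwise convolution. The channel count and spatial sizes are illustrative only; this is not the converted graph itself.

```python
import torch
import tensorflow as tf

in_channels, K = 8, 1  # depth multiplier K (EfficientNet's MBConv blocks use K = 1)

# PyTorch: a depthwise convolution is a grouped Conv2d with groups == in_channels
# and out_channels == K * in_channels.
pt_dw = torch.nn.Conv2d(in_channels, K * in_channels, kernel_size=3,
                        padding=1, groups=in_channels)

# TensorFlow: the same operation is expressed as a dedicated DepthwiseConv2D layer.
tf_dw = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding="same",
                                        depth_multiplier=K)

x_pt = torch.ones(1, in_channels, 128, 96)   # NCHW layout
x_tf = tf.ones((1, 128, 96, in_channels))    # NHWC layout

print(pt_dw(x_pt).shape)  # torch.Size([1, 8, 128, 96])
print(tf_dw(x_tf).shape)  # (1, 128, 96, 8)
```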

To Reproduce

from time import time

from essentia.standard import TensorflowPredict
from essentia import Pool
import numpy as np

def inference(model_name, pool):
    # Time the creation of the Essentia predictor plus one forward pass on the pool.
    start = time()
    TensorflowPredict(
        savedModel=model_name,
        inputs=['serving_default_melspectrogram'],
        outputs=['PartitionedCall'],
    )(pool)
    print(f"{model_name} inference time: {time() - start:.1f}s")

# A single dummy mel-spectrogram patch (batch size 1) used as input for both models.
pool = Pool()
pool.set('serving_default_melspectrogram', np.ones((1, 1, 128, 96), dtype='float32'))

# Compare the fixed-axis and dynamic-axis SavedModels on the same input.
inference("effnet_opset11_fixed_axis", pool)
inference("effnet_opset11_dynamic_axis", pool)

Produces the following output

...
[   INFO   ] Successfully loaded SavedModel: `effnet_opset11_fixed_axis`
effnet_opset11_fixed_axis inference time: 4.1s
[   INFO   ] Successfully loaded SavedModel: `effnet_opset11_dynamic_axis`
effnet_opset11_dynamic_axis inference time: 203.2s

Instructions to reproduce your problem

  1. Install Essentia: pip install essentia-tensorflow
  2. Download the models from the link above
  3. Run the code above (a TensorFlow-only cross-check is also sketched below)
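As an optional cross-check independent of Essentia, the two SavedModels can presumably also be timed directly with TensorFlow's Python API. This is only a sketch: the serving-signature keyword name is an assumption based on the exported signature and can be confirmed via infer.structured_input_signature.

```python
from time import time

import numpy as np
import tensorflow as tf

def tf_inference(model_dir):
    # Load the SavedModel and call its default serving signature directly.
    infer = tf.saved_model.load(model_dir).signatures["serving_default"]
    x = tf.constant(np.ones((1, 1, 128, 96), dtype="float32"))
    start = time()
    infer(melspectrogram=x)  # keyword name assumed from the exported signature
    print(f"{model_dir} inference time: {time() - start:.1f}s")

tf_inference("effnet_opset11_fixed_axis")
tf_inference("effnet_opset11_dynamic_axis")
```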

ONNX model file

Both the ONNX and SavedModel versions of the models are available here.

Python, ONNX, ONNX-TF, Tensorflow version

Additional context

Our final goal is to perform inference with TensorFlow models in Essentia, a C++ library with Python bindings that relies on the TensorFlow C API. We would like to solve this issue so that we can use arbitrarily large batches for GPU inference.

This issue is a follow-up to #923. In the past, we obtained faster (but still very slow) inference times with an old version of the converter based on TF 1.x.