gerbenvv opened this issue 3 years ago
Sorry for the late reply. It took some time to get a proper environment to try this out. The results are quite different from yours, though.
The calculation error doesn't seem to occur, as seen in the output: `True 0.0 True 0.0 True 0.0`
The performance result is very interesting. The first run is always very long in my environment, so I changed the code to do an initial warm-up run and then average the next 100 runs.
```python
import time

# don't count the first run, which seems to take a long time to warm up
start = time.perf_counter()
output_f = concrete_f(inputs)  # concrete_f / concrete_g / inputs come from the reproduce script
end = time.perf_counter()
print("warm-up time...")
print(end - start)  # 0.06361602060496807

print("start performance tests...")
count = 100

total_time = 0
for i in range(count):
    start = time.perf_counter()
    output_f = concrete_f(inputs)
    end = time.perf_counter()
    total_time += end - start
print("average time for tf.nn.conv2d is {}".format(total_time / count))

total_time = 0
for i in range(count):
    start = time.perf_counter()
    output_g = concrete_g(inputs)
    end = time.perf_counter()
    total_time += end - start
print("average time for tf.nn.convolution is {}".format(total_time / count))
```
Here is what I got:

```
warm-up time...
1.4917952585965395
start performance tests...
average time for tf.nn.conv2d is 0.09312625492922962
average time for tf.nn.convolution is 0.09329927522689103
```
It seems conv2d performs slightly better, but I'm not sure the difference is significant enough to be a major concern.
The different results are likely caused by different system setups. Here is mine:

- Python version: 3.8.5
- ONNX version: 1.8.1
- ONNX-TF version: 1.8.0
- Tensorflow version: 2.4.1
- tf2onnx version: 1.8.5
- GPU: Tesla P100
**Describe the bug**

There is a calculation error and a performance hit when exporting certain dilated 2D convolutions from TensorFlow to ONNX and then evaluating that ONNX model in TensorFlow again.

The problem seems to stem from a difference in calculation between `tf.nn.conv2d` and `tf.nn.convolution`. These give different results on the same hardware (the error is small, but it grows in bigger networks). Another issue is that `tf.nn.convolution` seems to be less performant than `tf.nn.conv2d`. Right now `tf.nn.convolution` is used in the conversion from ONNX -> TF; however, I believe using the dimension-specific functions (`tf.nn.conv1/2/3d`) is more appropriate, as they give no calculation error when going TF -> ONNX -> TF and are faster.

In addition to that, when using `tf.nn.convolution` the resulting ONNX contains a lot more ops than when using `tf.nn.conv2d`. However, that is a problem of tensorflow-onnx; it's just good to be aware of.

**To reproduce**
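The original reproduce script is not included here; the following is only a minimal sketch of the kind of comparison described above. The input/filter shapes and the dilation rate are illustrative assumptions, not the issue's actual values.

```python
import numpy as np
import tensorflow as tf

# Illustrative shapes (assumptions, not the issue's actual values).
inputs = tf.random.normal([1, 64, 64, 16])   # NHWC input
kernel = tf.random.normal([3, 3, 16, 32])    # HWIO filter

# The same dilated convolution computed by the two ops under discussion.
out_conv2d = tf.nn.conv2d(inputs, kernel, strides=1, padding="SAME", dilations=2)
out_convolution = tf.nn.convolution(inputs, kernel, strides=1, padding="SAME", dilations=2)

# On some setups these match exactly; on others a small difference shows up.
print(np.allclose(out_conv2d, out_convolution),
      tf.reduce_max(tf.abs(out_conv2d - out_convolution)).numpy())
```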
**ONNX model file**
If you want it, I can provide it, or you can save it from my reproduction script.
**Python, ONNX, ONNX-TF, Tensorflow version**
This section can be obtained by running `get_version.py` from the util folder.

**Additional context**
GPU used: GTX 1080 ti
**Additional speed test of the convolution functions**
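The actual speed-test script is not included here; below is only a rough sketch of how the two ops could be timed directly in eager mode. The shapes, dilation rate, and iteration count are assumptions.

```python
import timeit
import tensorflow as tf

# Illustrative shapes and dilation (assumptions).
inputs = tf.random.normal([8, 256, 256, 32])
kernel = tf.random.normal([3, 3, 32, 64])

def run_conv2d():
    # .numpy() pulls the result back to the host so async dispatch is not mis-timed.
    return tf.nn.conv2d(inputs, kernel, strides=1, padding="SAME", dilations=2).numpy()

def run_convolution():
    return tf.nn.convolution(inputs, kernel, strides=1, padding="SAME", dilations=2).numpy()

# Warm up once so first-call overhead is excluded.
run_conv2d(); run_convolution()

n = 100
print("tf.nn.conv2d      :", timeit.timeit(run_conv2d, number=n) / n)
print("tf.nn.convolution :", timeit.timeit(run_convolution, number=n) / n)
```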