onnx / keras-onnx

Convert tf.keras/Keras models to ONNX
Apache License 2.0

Loss of precision / incorrect to near-single-precision in converted tf.keras model #407

Open · cwentland0 opened this issue 4 years ago

cwentland0 commented 4 years ago

I am building and training a convolutional autoencoder in tf.keras and converting the two halves of the network (the encoder and the decoder) separately, first to ONNX (via keras2onnx) and then to a serialized TensorRT engine (via trtexec). The network only uses a few fundamental elements: 2D convolution kernels (in the encoder), 2D transpose-convolution kernels (in the decoder), fully-connected layers, flatten layers, reshape layers, ELU activations, and linear activations. The software versions used are as follows:

- TensorFlow (Keras) 2.0.0
- keras2onnx 1.6.5
- TensorRT 7.0.0.11
- CUDA 10.2 (TensorFlow uses 10.1)
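For context, the conversion pipeline is roughly the following (a sketch only; the paths are placeholders and the trtexec invocation shows just the flags I use):

import keras2onnx
from tensorflow.keras.models import load_model

# load the trained tf.keras halves (placeholder paths)
encoder = load_model('encoder.h5', compile=False)
decoder = load_model('decoder.h5', compile=False)

# convert each half to ONNX and serialize it to disk
onnx_encoder = keras2onnx.convert_keras(encoder, encoder.name)
onnx_decoder = keras2onnx.convert_keras(decoder, decoder.name)
keras2onnx.save_model(onnx_encoder, 'encoder.onnx')
keras2onnx.save_model(onnx_decoder, 'decoder.onnx')

followed by, on the command line,

trtexec --onnx=encoder.onnx --saveEngine=encoder.trt
trtexec --onnx=decoder.onnx --saveEngine=decoder.trt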

The tf.keras models are 32-bit float networks. When I test the inference of the ONNX model (with ONNX Runtime) and the TRT engine (with the TRT Python API) on random inputs, I get some very strange answers. First, the TRT encoder exactly matches the output of the TF encoder, so apparently no problem there. However, the ONNX encoder reports O(1e-6) error compared to the TF encoder! How the TRT parser is able to fix this issue I have no idea.

Even stranger are the decoder results. The TRT and ONNX decoders both report O(1e-7) error compared to the TF decoder. Strangely, the minimum and maximum error are often slightly different (O(1e-8)) between the TRT and ONNX decoder results. Even worse, on some rare occasions the minimum and maximum error values will change from test to test, despite the fact that the NumPy RNG is seeded with the same value every time. This would imply there is some stochastic behavior in the ONNX/TRT decoder inference.

My testing code is given below:

import numpy as np
import tensorrt as trt
import pycuda.driver as cuda 
import pycuda.autoinit
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = "2"
import tensorflow as tf
from tensorflow.keras.models import load_model
import onnxruntime

# seed NumPy RNG
np.random.seed(0)

# prevent TF from gobbling up device memory
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
            logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

# location of all models
modelsDir = '/home/user/testModels'

# get TRT models/memory
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with open(os.path.join(modelsDir,'decoder.trt'),"rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engineDecoder = runtime.deserialize_cuda_engine(f.read())
with open(os.path.join(modelsDir,'encoder.trt'),"rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engineEncoder = runtime.deserialize_cuda_engine(f.read())

context_dec = engineDecoder.create_execution_context()
context_enc = engineEncoder.create_execution_context()

h_input_dec = cuda.pagelocked_empty(trt.volume(engineDecoder.get_binding_shape(0)), dtype=np.float32)
h_output_dec = cuda.pagelocked_empty(trt.volume(engineDecoder.get_binding_shape(1)), dtype=np.float32)
h_input_enc = cuda.pagelocked_empty(trt.volume(engineEncoder.get_binding_shape(0)), dtype=np.float32)
h_output_enc = cuda.pagelocked_empty(trt.volume(engineEncoder.get_binding_shape(1)), dtype=np.float32)

d_input_dec = cuda.mem_alloc(h_input_dec.nbytes)
d_output_dec = cuda.mem_alloc(h_output_dec.nbytes) 
d_input_enc = cuda.mem_alloc(h_input_enc.nbytes)
d_output_enc = cuda.mem_alloc(h_output_enc.nbytes) 

# get tf.keras models
tfDecoder = load_model(os.path.join(modelsDir,'decoder.h5'),compile=False)
tfEncoder = load_model(os.path.join(modelsDir,'encoder.h5'),compile=False)

# get ONNX models/layer names
onnxDecoder = onnxruntime.InferenceSession(os.path.join(modelsDir,'decoder.onnx'))
onnxEncoder = onnxruntime.InferenceSession(os.path.join(modelsDir,'encoder.onnx'))
onnxDecoderInName  = onnxDecoder.get_inputs()[0].name
onnxDecoderOutName = onnxDecoder.get_outputs()[0].name
onnxEncoderInName  = onnxEncoder.get_inputs()[0].name
onnxEncoderOutName = onnxEncoder.get_outputs()[0].name

sampCode = (np.random.random_sample(tfDecoder.input_shape)).astype(np.float32) 
sampSol  = (np.random.random_sample(tfEncoder.input_shape)).astype(np.float32)

# decoder evaluation    
# evaluate tf.keras decoder (considered ground truth)
predSol_tf = tfDecoder.predict(sampCode)

# evaluate TRT decoder
np.copyto(h_input_dec, sampCode) 
cuda.memcpy_htod(d_input_dec, h_input_dec)
context_dec.execute(bindings=[int(d_input_dec), int(d_output_dec)])
cuda.memcpy_dtoh(h_output_dec, d_output_dec)
predSol_trt = (np.reshape(h_output_dec, tfDecoder.output_shape, order="C")).copy() 

# evaluate ONNX decoder
predSol_onnx = np.squeeze(np.asarray(onnxDecoder.run([onnxDecoderOutName], {onnxDecoderInName: sampCode})), axis=0)

print("TRT decoder max err: " + str(np.amax(predSol_tf - predSol_trt)))
print("TRT decoder min err: " + str(np.amin(predSol_tf - predSol_trt)))
print("ONNX decoder max err: " + str(np.amax(predSol_tf - predSol_onnx)))
print("ONNX decoder min err: " + str(np.amin(predSol_tf - predSol_onnx)))

# encoder evaluation
# evaluate tf.keras encoder (considered ground truth)
predCode_tf = tfEncoder.predict(sampSol)

# evaluate TRT encoder
np.copyto(h_input_enc, sampSol.flatten(order="C"))
cuda.memcpy_htod(d_input_enc, h_input_enc)
context_enc.execute(bindings=[int(d_input_enc), int(d_output_enc)])
cuda.memcpy_dtoh(h_output_enc, d_output_enc)
predCode_trt = (np.reshape(h_output_enc, tfEncoder.output_shape, order="C")).copy()

# evaluate ONNX encoder
predCode_onnx = np.squeeze(np.asarray(onnxEncoder.run([onnxEncoderOutName], {onnxEncoderInName: sampSol})), axis=0)

print("TRT encoder max err: " + str(np.amax(predCode_tf - predCode_trt)))
print("TRT encoder min err: " + str(np.amin(predCode_tf - predCode_trt)))
print("ONNX encoder max err: " + str(np.amax(predCode_tf - predCode_onnx)))
print("ONNX encoder min err: " + str(np.amin(predCode_tf - predCode_onnx)))

Some sample output is shown below (the warnings are expected, given the fixed network size):

[TensorRT] WARNING: Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT] WARNING: Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT] WARNING: Explicit batch network detected and batch size specified, use execute without batch size instead.
TRT decoder max err: 2.3841858e-07
TRT decoder min err: -2.3841858e-07
ONNX decoder max err: 2.0861626e-07
ONNX decoder min err: -2.0861626e-07
[TensorRT] WARNING: Explicit batch network detected and batch size specified, use execute without batch size instead.
TRT encoder max err: 0.0
TRT encoder min err: 0.0
ONNX encoder max err: 1.1920929e-06
ONNX encoder min err: -7.1525574e-07
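As an aside, a stricter version of the comparison above could use an explicit tolerance check instead of just printing the max/min differences. A sketch, reusing the arrays computed in the test script; the tolerances are just my guess at what is reasonable for float32:

import numpy as np

def check_close(pred_tf, pred_other, rtol=1e-5, atol=1e-6):
    diff = pred_tf - pred_other
    print("max err: " + str(np.amax(diff)) + ", min err: " + str(np.amin(diff)))
    # raises an AssertionError if any element differs beyond the tolerances
    np.testing.assert_allclose(pred_other, pred_tf, rtol=rtol, atol=atol)

check_close(predSol_tf, predSol_onnx)
check_close(predCode_tf, predCode_onnx)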

Interestingly, I note that keras2onnx reports that the maximum opset needed is opset 9, even though the ONNX documentation lists ConvTranspose as having been updated in opset 11 (ConvTranspose-1 has existed since the initial release). Even if I target opset 11, the resulting model is built with opset 9 (according to trtexec).
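For what it's worth, this is how I check the opset recorded in the exported model (a small sketch using the onnx Python package; the path is a placeholder):

import onnx

model = onnx.load('decoder.onnx')
# each opset_import entry carries a domain ('' is the default ai.onnx domain) and a version
for imp in model.opset_import:
    print(imp.domain, imp.version)
# list the operator types the graph actually uses
print(sorted({node.op_type for node in model.graph.node}))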

Ultimately, I don’t care about the ONNX model as long as the eventual TRT engine is correct. But the fact that the intermediate format (ONNX) is incorrect leaves me with no guarantee that the final format (TRT) will be correct. Therefore, I’m asking here before I move on to the TensorRT GitHub page. Any insight would be very much appreciated.

I am happy to share the trained models (TF, TRT, and ONNX) if someone else would like to inspect them. As a side note, if anyone could explain the phrase "Weights must be an initializer" in the requirements for the ConvTranspose operation found on the onnx-tensorrt page, that would be immensely appreciated.
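My (possibly wrong) reading of "Weights must be an initializer" is that onnx-tensorrt requires the ConvTranspose weight tensor to be a constant stored in the graph's initializer list rather than a tensor computed at runtime. A sketch of how I would check that in my models (placeholder path):

import onnx

model = onnx.load('decoder.onnx')
init_names = {init.name for init in model.graph.initializer}
for node in model.graph.node:
    if node.op_type == 'ConvTranspose':
        # input 0 is the data tensor, input 1 the weights, input 2 (optional) the bias
        weight_name = node.input[1]
        print(node.name, '- weights are an initializer:', weight_name in init_names)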

wenbingl commented 4 years ago

The discrepancy between TRT and ONNX Runtime seems to be unrelated to the converter itself. It looks like the ConvTranspose operator upgrade from opset 1 to opset 11 only clarifies some behavior, such as padding. Yes, the converter only tags the converted model with opset 11, but opset 9 is also fine. BTW, can the TRT engine support opset 11 now?
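If you want to force a specific opset at conversion time, convert_keras accepts a target_opset argument; roughly (paths and names are placeholders):

import keras2onnx
from tensorflow.keras.models import load_model

decoder = load_model('decoder.h5', compile=False)
# request opset 11 explicitly; the converter may still only need opset-9 operators
onnx_decoder = keras2onnx.convert_keras(decoder, decoder.name, target_opset=11)
keras2onnx.save_model(onnx_decoder, 'decoder_opset11.onnx')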

cwentland0 commented 4 years ago

According to this page, TensorRT 7.0 does support opset 11, though I honestly can't find an official NVIDIA/TensorRT documentation page that explicitly states this. The documentation at this page seems very ambiguous to me, but does explicitly reference the onnx-tensorrt backend page.
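If it helps, one way to check opset-11 support directly might be to parse an opset-11 model with the TensorRT Python API and print any parser errors (a sketch; the file name is a placeholder):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

with trt.Builder(TRT_LOGGER) as builder, \
     builder.create_network(EXPLICIT_BATCH) as network, \
     trt.OnnxParser(network, TRT_LOGGER) as parser:
    with open('decoder_opset11.onnx', 'rb') as f:
        # parse() returns False on failure; the parser then holds the error messages
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))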