tensorflow / tensorrt

TensorFlow/TensorRT integration

TensorRT rebuilds ops at run time #195

Open nrothGIT opened 4 years ago

nrothGIT commented 4 years ago

At conversion time, I call converter.convert(), converter.build(input_fn=some_input_fn), and converter.save().
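(For reference, not part of the original report: these calls are the TrtGraphConverterV2 offline-build flow. A minimal sketch, with placeholder paths, input shape, and input_fn:)

import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

def some_input_fn():
    # Yield one batch of representative inputs so build() can pre-build engines
    # for these shapes (placeholder shape, not taken from this issue).
    yield [tf.zeros([1, 224, 224, 3], tf.float32)]

converter = trt.TrtGraphConverterV2(input_saved_model_dir="saved_model_dir")
converter.convert()                        # replace supported subgraphs with TRTEngineOps
converter.build(input_fn=some_input_fn)    # pre-build engines for the yielded shapes
converter.save("trt_saved_model_dir")      # serialize the engines with the SavedModel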

During conversion and build, I see logging to the effect of:

tensorflow/core/grappler/optimizers/meta_optimizer.cc:814] Optimization results for grappler item: StatefulPartitionedCall/model_1/TRTEngineOp_1_native_segment
 2020-03-18 01:30:00.932297: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:816]   constant_folding: Graph size after: 33 nodes (0), 32 edges (0), time = 1.199ms.
 2020-03-18 01:30:00.932302: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:816]   layout: Graph size after: 33 nodes (0), 32 edges (0), time = 1.307ms.
2020-03-18 01:30:00.932307: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:816]   constant_folding: Graph size after: 33 nodes (0), 32 edges (0), time = 1.228ms.
...
2020-03-18 01:30:37.946847: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:736] Building a new TensorRT engine for StatefulPartitionedCall/model_1/TRTEngineOp_1 with input shapes: [[25,64,4,4], [25,64,4,4]]

However, at run time, after loading the saved model above and running it a couple of times, I see that the model tries to build a new engine.

2020-03-18 01:31:58.716377: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:736] Building a new TensorRT engine for PartitionedCall/StatefulPartitionedCall/model_1/TRTEngineOp_1 with input shapes: [[25,64,4,4], [25,64,4,4]]

In particular, the new name nests the previous engine name under an extra PartitionedCall prefix. Any thoughts on why a new engine would be built despite the build call, or how to debug this? I have verified that the input sizes are the same, as you can see in the logs.

EDIT: It appears this is a semi-known issue that is meant to be handled by name canonicalization here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc#L869, but that logic is not being applied correctly here, as PartitionedCall/StatefulPartitionedCall/model_1/TRTEngineOp_1 and StatefulPartitionedCall/model_1/TRTEngineOp_1 are not matched to the same engine despite having the same suffix.
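(Not part of the original report, just an illustration of the intended canonicalization: the engine cache lookup is meant to treat a name and the same name wrapped in extra function-call prefixes such as "PartitionedCall/" as the same engine, i.e. to match by suffix.)

def refers_to_same_engine(name_a: str, name_b: str) -> bool:
    # Hedged sketch only, not the actual TensorFlow implementation.
    shorter, longer = sorted((name_a, name_b), key=len)
    return longer == shorter or longer.endswith("/" + shorter)

assert refers_to_same_engine(
    "PartitionedCall/StatefulPartitionedCall/model_1/TRTEngineOp_1",
    "StatefulPartitionedCall/model_1/TRTEngineOp_1",
)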

cbodenst commented 4 years ago

I am having the same issue. I need low-latency inference for a stateful recurrent network, and the engine rebuilding adds a lot of latency. My throughput also drops after the rebuild: the application runs at ~40 FPS, and after the engine is rebuilt it drops to ~30 FPS.

I tried updating to tf-2.2-rc1; it still has the same issue.

cbodenst commented 4 years ago

@nrothGIT did you find any workaround using the information you posted?

sanjoy commented 4 years ago

CC @bixia1

nrothGIT commented 4 years ago

@cbodenst, sadly I haven't had time to dig much further into this, so I have no good fix at the moment. I'll post back if I make any progress.

tfeher commented 4 years ago

Thanks for the report. It should not try to rebuild the engines again, but it is difficult to say anything more at this stage. It might help if you can post the full logs with level 2:

TF_CPP_VMODULE=trt_engine_op=2,convert_nodes=2,convert_graph=2,segment=2,trt_shape_optimization_profiles=2,trt_engine_resource_ops=2 python my_script.py 

It would be even better if you could share a reproducer, so that we can take a closer look.

cbodenst commented 4 years ago

@tfeher I will take a look into building a reproducer, but I am not sure how to build one, since TensorRT optimizations are strongly hardware dependent.

I attached a log file. You can see there is a rebuild at the beginning of the script. After another two minutes it builds the engine again. You can just grep for "Building TensorRT engine".

I also figured out why the model is slower after rebuilding: tf-2.2.0rc2 fails to build one of the engines and falls back to the native segment. With tf-2.1.0 the whole model was converted properly. In the test you see in the logs, I converted the model directly with tf-2.2.0rc2.

My specs: CUDA 10.1, cuDNN 7.6, TensorRT 6.5, TensorFlow 2.2.0rc2

log.zip

andriiyurkiv commented 4 years ago

Hello, I'm facing a similar issue while trying to run multiple TensorRT-optimized models in TF2.

import os

import numpy as np
import tensorflow as tf
from tensorflow.python.saved_model import tag_constants

TENSORRT_WEIGHTS_ROOT = '/path/to/weights/root/'

class TensorRTModel:
    def __init__(self, model_name, weights_root=TENSORRT_WEIGHTS_ROOT):
        self.weights_root = weights_root
        self.model_name = model_name
        self.model, self.infer = self.__load_tensor_rt_model()

    def __load_tensor_rt_model(self):
        # Load the TF-TRT converted SavedModel and grab its serving signature.
        saved_model_loaded = tf.saved_model.load(os.path.join(
                self.weights_root,
                self.model_name,
                f'{self.model_name}_Conversions.TRT_FP32_MODEL'),
            tags=[tag_constants.SERVING])
        infer = saved_model_loaded.signatures['serving_default']
        return saved_model_loaded, infer

class Model1(TensorRTModel):
    # some model specific code
    pass

class Model2(TensorRTModel):
    # some model specific code
    pass

if __name__ == '__main__':
    model1 = Model1('model1')
    model2 = Model2('model2')

    for i in range(1000):
        print(i)
        model1_input = tf.constant(np.random.normal(size=(8, 640, 640, 3)).astype(np.float32))
        result = model1.infer(model1_input)
        model2_input = tf.constant(np.random.normal(size=(8, 224, 224, 3)).astype(np.float32))
        result = model2.infer(model2_input)

Model1 is optimized by TensorRT into 2 ops:
PartitionedCall/TRTEngineOp_1, input shapes: [[8,640,640,3]]
PartitionedCall/TRTEngineOp_0, input shapes: [[8,1536,18,18]]

Model2 is optimized by TensorRT into 1 op:
PartitionedCall/TRTEngineOp_0, input shapes: [[8,224,224,3]]

At runtime, the engine for PartitionedCall/TRTEngineOp_0 gets rebuilt at each iteration, while the engine for PartitionedCall/TRTEngineOp_1 is built only once.
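(A hedged diagnostic sketch, not from the original report: list the TRTEngineOp node names in each converted SavedModel to confirm that both models end up with an op named TRTEngineOp_0. It assumes a 'serving_default' signature and uses a placeholder path.)

import tensorflow as tf
from tensorflow.python.saved_model import tag_constants

def trt_engine_op_names(saved_model_dir):
    # Collect TRTEngineOp names from the signature graph and its function library.
    loaded = tf.saved_model.load(saved_model_dir, tags=[tag_constants.SERVING])
    graph_def = loaded.signatures['serving_default'].graph.as_graph_def()
    names = [n.name for n in graph_def.node if n.op == 'TRTEngineOp']
    for func in graph_def.library.function:
        names.extend(n.name for n in func.node_def if n.op == 'TRTEngineOp')
    return names

print(trt_engine_op_names('/path/to/converted/model'))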

Logs attached.

Environment: nvcr.io/nvidia/tensorflow:19.12-tf2-py3 docker container

log.txt

Is it possible to manually set Op name/prefix so those two Ops would not be considered the same entity?

bixia1 commented 4 years ago

Is it possible to manually set Op name/prefix so those two Ops would not be considered the same entity?

We merged a change for this on April 3. Can you update your tree to check that again?

andriiyurkiv commented 4 years ago

We merged a change for this on April 3. Can you update your tree to check that again?

Can you share a link to the corresponding PR?

bixia1 commented 4 years ago

https://github.com/tensorflow/tensorflow/blob/c3e12bcd6d01f3bc92cbb99f3c91fc0bec3a8562/tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc#L674-L675

https://github.com/tensorflow/tensorflow/commit/21de77485f9230eff364a4f0c8e0a55857965223

andriiyurkiv commented 4 years ago

@bixia1 @tfeher I've tried to run the code snippet I shared with you using the latest version of the TF master branch. As you can see from the logs, it no longer outputs any TensorRT engine-building related information. But at the same time, the per-iteration time remains the same, as if the engines were rebuilt at each iteration.

Those two models were optimized using the new version of TF as well.

log_latest_master.txt

cbodenst commented 4 years ago

I tested again with TensorFlow 2.2.0rc3 and it is still rebuilding. The program runs fine for 10 minutes, but then it suddenly rebuilds the engine.

cbodenst commented 4 years ago

https://github.com/tensorflow/tensorflow/blob/c3e12bcd6d01f3bc92cbb99f3c91fc0bec3a8562/tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc#L674-L675

tensorflow/tensorflow@21de774

I also tried to build the current master (f98da6aa76ab991fb06f2de24ee7a8b87348d66e). It is still rebuilding the engines.

cbodenst commented 4 years ago

Is there any activity on this issue?

bixia1 commented 4 years ago

I will look into this; it is the next thing on my to-do list.

bixia1 commented 4 years ago

@andriiyurkiv Can you post a simplified version of your test case?

bixia1 commented 4 years ago

Can one of you post a reproducer for this?

cbodenst commented 4 years ago

@bixia1 I will work on that

cbodenst commented 4 years ago

Please try this code. I'm not sure if this is hardware dependent, but it builds the engine twice: once at converter.build and once when the iteration loop starts. However, I did not get it to rebuild again after multiple predictions, as is the case in my production code.

from tensorflow.python.compiler.tensorrt import trt_convert as trt
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Flatten, Conv2D, Dense, Input
import numpy as np
import tensorflow as tf

data = tf.constant(np.empty([1, 200, 200, 3], np.float32))

def gen():
    # Input function for converter.build(): yields one batch of representative inputs.
    yield [data]

def create_model():
    # Build and save a small Keras model to convert with TF-TRT.
    i = Input((200, 200, 3), batch_size=1)
    x = Conv2D(16, (3, 3))(i)
    x = Flatten()(x)
    x = Dense(10, activation="relu")(x)
    x = Dense(10, activation="relu")(x)
    m = Model(i, x)
    m.compile("sgd", "mse")
    model_path = "my_model"
    m.save(model_path, save_format="tf")  # save in SavedModel format
    print("model created")
    return model_path

def convert_to_trt(input_path: str, output_path: str, data_gen=None, precision="FP32"):
    conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS
    conversion_params = conversion_params._replace(precision_mode=precision)
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir=input_path, conversion_params=conversion_params
    )
    converter.convert()
    print("build engine ...")
    if data_gen:
        converter.build(data_gen)
    converter.save(output_path)
    print("model converted")

if __name__ == "__main__":
    input_path = create_model()
    output_path = "trt_model"
    convert_to_trt(input_path, output_path, gen)
    # Load the converted SavedModel and run inference repeatedly.
    model = tf.saved_model.load(output_path).signatures["serving_default"]
    for i in range(100000):
        print("round:", i)
        model(data)

Here is part of the output:

build engine ...
2020-06-04 16:16:39.485947: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:982] Building a new TensorRT engine for TRTEngineOp_0 with input shapes: [[1,200,200,3]]
2020-06-04 16:16:39.486004: I tensorflow/compiler/tf2tensorrt/convert/convert_nodes.cc:1153] Linked TensorRT version: 6.0.1
2020-06-04 16:16:39.486091: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-06-04 16:16:39.486109: I tensorflow/compiler/tf2tensorrt/convert/convert_nodes.cc:1154] Loaded TensorRT version: 6.0.1
2020-06-04 16:16:39.487998: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
model converted
round: 0
2020-06-04 16:16:43.393328: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:982] Building a new TensorRT engine for PartitionedCall/TRTEngineOp_0 with input shapes: [[1,200,200,3]]

bixia1 commented 4 years ago

I don't see the problem using the test case provided in comment #11.

cbodenst commented 4 years ago

@bixia1 The problem is that the TRT engine gets built twice because of the different name prefix. The engine rebuild can also happen again after a lot of predictions, but that is not simple to reproduce. In general, this prevents high-throughput use cases.

Also, please take a look at the log files I posted some time ago:

@tfeher I will take a look into building a reproducer, but I am not sure how to build one, since TensorRT optimizations are strongly hardware dependent.

I attached a log file. You can see there is a rebuild at the beginning of the script. After another two minutes it builds the engine again. You can just grep for "Building TensorRT engine".

I also figured out why the model is slower after rebuilding: tf-2.2.0rc2 fails to build one of the engines and falls back to the native segment. With tf-2.1.0 the whole model was converted properly. In the test you see in the logs, I converted the model directly with tf-2.2.0rc2.

My specs: CUDA 10.1, cuDNN 7.6, TensorRT 6.5, TensorFlow 2.2.0rc2

log.zip

bixia1 commented 4 years ago

@cbodenst Can you point out which one is a rebuild? I used the command grep -e "Building a" -e "Finished" log to process your log, and I don't see which one is a rebuild. See the file attached here; they seem to have different shapes. processed.txt

cbodenst commented 4 years ago

@cbodenst Can you point out which one is a rebuild? I used the command grep -e "Building a" -e "Finished" log to process your log, and I don't see which one is a rebuild. See the file attached here; they seem to have different shapes. processed.txt

In your processed.txt you can see the rebuild at line 4, timestamp "2020-04-08 11:40:48.633176", 12 minutes after the processing loop started.

bixia1 commented 4 years ago

I see that the engine is not created successfully:

2020-04-08 11:28:58.582814: W tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:1002] Engine creation for StatefulPartitionedCall/TRTEngineOp_2 failed. The native segment will be used instead. Reason: Unimplemented: Transpose too large:10437228

bixia1 commented 4 years ago

@cbodenst can you try setting this environment variable to see if it helps? TF_DEBUG_TRT_ALLOW_INEFFICIENT_TRANSPOSE=1
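(Usage example, following the same pattern as the logging command earlier in this thread; the script name is a placeholder:)

TF_DEBUG_TRT_ALLOW_INEFFICIENT_TRANSPOSE=1 python my_script.py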

cbodenst commented 4 years ago

I already tried that. It helps to build the TRT engine properly, but it still results in rebuilding after some time.

bixia1 commented 4 years ago

You mean that even though the engine is built successfully, it is rebuilt after a while? Can you provide a log to show that?

cbodenst commented 4 years ago

I made another run with TF_DEBUG_TRT_ALLOW_INEFFICIENT_TRANSPOSE=1. You can see that the engine gets rebuilt at timestamp 2020-06-22 15:24:16.430878. The rebuild also fails sometimes, as you can see in the logs. log.zip

cbodenst commented 4 years ago

@bixia1 are you still on it? I would really like to use TF-TRT, but if this is not working I need to invest time into alternatives. Didn't @nrothGIT describe in his very first post what the issue is? Or is that unrelated?

bixia1 commented 4 years ago

@cbodenst Yes, I will look at this again soon.

cbodenst commented 4 years ago

Now with TF 2.3.0, it does not show any rebuild message, but the process sometimes gets stuck with the following warning:

WARNING:tensorflow:10 out of the last 10 calls to <function recreate_function.<locals>.restored_function_body at 0x7fb2b882a8c8> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for  more details.

When I created the engines initially, I got the following error message:

Running native segment forStatefulPartitionedCall/functional_1/retinanet-bbox/filtered_detections/map/while/body/_307/functional_1/retinanet-bbox/filtered_detections/map/while/TRTEngineOp_0_7 due to failure in verifying input shapes: Incorrect batch dimension, for StatefulPartitionedCall/functional_1/retinanet-bbox/filtered_detections/map/while/body/_307/functional_1/retinanet-bbox/filtered_detections/map/while/TRTEngineOp_0_7: [[0,1]]
2020-07-30 13:57:04.612783: W tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:588] Running native segment forStatefulPartitionedCall/functional_1/retinanet-bbox/filtered_detections/map/while/body/_307/functional_1/retinanet-bbox/filtered_detections/map/while/TRTEngineOp_0_9 due to failure in verifying input shapes: Incorrect batch dimension, for StatefulPartitionedCall/functional_1/retinanet-bbox/filtered_detections/map/while/body/_307/functional_1/retinanet-bbox/filtered_detections/map/while/TRTEngineOp_0_9: [[0,1]]

Do you think this is related or a new error in TF 2.3.0?

CaesarWWK commented 4 years ago

I have the same issue using the official NVIDIA Docker image of TensorFlow 2.2 with CUDA 11.0.

When I build the converter using training data, the log says it is building the engine for TRTEngineOp_0 with input shapes like [[1,21,2],[1,21,2]]. But when I use it at inference time, the log says it is building a new engine for PartitionedCall/TRTEngineOp_0 with the same input shapes as built before, [[1,21,2],[1,21,2]].

I will switch to TensorFlow 2.3 and test it again.

lminer commented 3 years ago

I'm seeing this issue as well, although TF_DEBUG_TRT_ALLOW_INEFFICIENT_TRANSPOSE=1 seems to help, at least with small loads.

louisquinn commented 3 years ago

Hi all, is anyone still working on this? I have been experiencing this issue for a while.

For us, it only happens when running with Python. In our C++ backend we don't see this behaviour.

Below is the section of the logs that shows the ops being rebuilt after about 5 minutes of runtime. After the ops are rebuilt, the program continues processing as normal. I haven't seen the ops rebuilt again after the first time.

Software and model type:

2021-04-13 17:49:53.622132: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:982] Building a new TensorRT engine for StatefulPartitionedCall/TRTEngineOp_0 with input shapes: [[1,300,300,3]]
2021-04-13 17:49:53.622685: I tensorflow/compiler/tf2tensorrt/convert/convert_nodes.cc:1153] Linked TensorRT version: 6.0.1
2021-04-13 17:49:53.622712: I tensorflow/compiler/tf2tensorrt/convert/convert_nodes.cc:1154] Loaded TensorRT version: 6.0.1
2021-04-13 17:49:53.680619: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2021-04-13 17:49:54,157:WARNING:tensorflow:Line 144:: Unresolved object in checkpoint: (root).trt_engine_resources.TRTEngineOp_0._serialized_trt_resource_filename
2021-04-13 17:49:54,158:WARNING:tensorflow:Line 144:: Unresolved object in checkpoint: (root).trt_engine_resources.TRTEngineOp_11._serialized_trt_resource_filename
2021-04-13 17:49:54,158:WARNING:tensorflow:Line 144:: Unresolved object in checkpoint: (root).trt_engine_resources.TRTEngineOp_12._serialized_trt_resource_filename
2021-04-13 17:49:54,158:WARNING:tensorflow:Line 144:: Unresolved object in checkpoint: (root).trt_engine_resources.TRTEngineOp_2._serialized_trt_resource_filename
2021-04-13 17:49:54,158:WARNING:tensorflow:Line 144:: Unresolved object in checkpoint: (root).trt_engine_resources.TRTEngineOp_3._serialized_trt_resource_filename
2021-04-13 17:49:54,158:WARNING:tensorflow:Line 144:: Unresolved object in checkpoint: (root).trt_engine_resources.TRTEngineOp_4._serialized_trt_resource_filename
2021-04-13 17:49:54,158:WARNING:tensorflow:Line 144:: Unresolved object in checkpoint: (root).trt_engine_resources.TRTEngineOp_6._serialized_trt_resource_filename
2021-04-13 17:49:54,158:WARNING:tensorflow:Line 144:: Unresolved object in checkpoint: (root).trt_engine_resources.TRTEngineOp_5._serialized_trt_resource_filename
2021-04-13 17:49:54,158:WARNING:tensorflow:Line 144:: Unresolved object in checkpoint: (root).trt_engine_resources.TRTEngineOp_7._serialized_trt_resource_filename
2021-04-13 17:49:54,158:WARNING:tensorflow:Line 144:: Unresolved object in checkpoint: (root).trt_engine_resources.TRTEngineOp_13._serialized_trt_resource_filename
2021-04-13 17:49:54,159:WARNING:tensorflow:Line 144:: Unresolved object in checkpoint: (root).trt_engine_resources.TRTEngineOp_10._serialized_trt_resource_filename
2021-04-13 17:49:54,159:WARNING:tensorflow:Line 144:: Unresolved object in checkpoint: (root).trt_engine_resources.TRTEngineOp_1._serialized_trt_resource_filename
2021-04-13 17:49:54,159:WARNING:tensorflow:Line 144:: Unresolved object in checkpoint: (root).trt_engine_resources.TRTEngineOp_8._serialized_trt_resource_filename
2021-04-13 17:49:54,159:WARNING:tensorflow:Line 144:: Unresolved object in checkpoint: (root).trt_engine_resources.TRTEngineOp_9._serialized_trt_resource_filename
2021-04-13 17:49:54,159:WARNING:tensorflow:Line 152:: A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
2021-04-13 17:51:05.557532: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:982] Building a new TensorRT engine for StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/TRTEngineOp_11 with input shapes: [[5919,4]]
2021-04-13 17:51:05.557604: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:982] Building a new TensorRT engine for StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/TRTEngineOp_12 with input shapes: [[5919,5]]
2021-04-13 17:51:05.557906: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:982] Building a new TensorRT engine for StatefulPartitionedCall/TRTEngineOp_2 with input shapes: [[5919]]
2021-04-13 17:51:05.612497: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:982] Building a new TensorRT engine for StatefulPartitionedCall/TRTEngineOp_3 with input shapes: [[5919]]
2021-04-13 17:51:05.674267: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:982] Building a new TensorRT engine for StatefulPartitionedCall/TRTEngineOp_4 with input shapes: [[5919], [5919]]
2021-04-13 17:51:05.674272: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:982] Building a new TensorRT engine for StatefulPartitionedCall/TRTEngineOp_5 with input shapes: [[5919], [5919]]
2021-04-13 17:51:05.750969: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:982] Building a new TensorRT engine for StatefulPartitionedCall/TRTEngineOp_6 with input shapes: [[5919], [5919]]
2021-04-13 17:51:05.758456: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:982] Building a new TensorRT engine for StatefulPartitionedCall/TRTEngineOp_7 with input shapes: [[5919], [5919]]
2021-04-13 17:51:05.861012: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:982] Building a new TensorRT engine for StatefulPartitionedCall/Postprocessor/TRTEngineOp_13 with input shapes: [[1,5919,4]]
2021-04-13 17:51:05.893920: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:982] Building a new TensorRT engine for StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/TRTEngineOp_10 with input shapes: [[400]]
2021-04-13 17:51:05.931622: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:982] Building a new TensorRT engine for StatefulPartitionedCall/TRTEngineOp_1 with input shapes: [[400,4]]
2021-04-13 17:51:06.208330: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:982] Building a new TensorRT engine for StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/Area/TRTEngineOp_8 with input shapes: [[400,4]]
2021-04-13 17:51:06.315875: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:982] Building a new TensorRT engine for StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ChangeCoordinateFrame/TRTEngineOp_9 with input shapes: [[400,4]]