tensorflow / tensorrt

TensorFlow/TensorRT integration

Tf2.1.0, tensorrt7, batchSize > 0 && batchSize <= mEngine.getMaxBatchSize(). Note: Batch size was: 0, but engine max batch size was: 1 #191


jxhekang commented 4 years ago

I'm trying to generate a TF-TRT model using TF 2.1.0's TrtGraphConverterV2(xxxx) interface. In TF 2.1.0's TrtGraphConverterV2, is_dynamic_op can only be True, which means the TF-TRT model should handle input images of different sizes dynamically. At first I generated one TF-TRT model (model_A) successfully, and it seemed to work well and fast. However, when I changed a few parameters in my net and regenerated the TF-TRT model (model_B), the new TF-TRT model became unstable. For example, it can run inference when I feed images of shape (batch=1, H=1000, W=600, C=3), but when I feed images of other shapes (such as batch=1, H=1024, W=600, C=3, or batch=1, H=512, W=512, C=3), I get the error below.

tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:38] DefaultLogger Can't fuse pad and convolution with same pad mode
tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:38] DefaultLogger Can't fuse pad and convolution with caffe pad mode
tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:42] DefaultLogger Parameter check failed at: ../builder/builder.cpp::setMaxBatchSize::135, condition: batchSize > 0 && batchSize <= MAX_BATCH_SIZE
tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:38] DefaultLogger Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:736] Building a new TensorRT engine for StatefulPartitionedCall/retina_net_module/retina_net_post_processor/TRTEngineOp_18 with input shapes: [[3,4]]
tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:38] DefaultLogger Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:736] Building a new TensorRT engine for StatefulPartitionedCall/retina_net_module/retina_net_post_processor/TRTEngineOp_17 with input shapes: [[0,4]]
tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:42] DefaultLogger Parameter check failed at: ../builder/builder.cpp::setMaxBatchSize::135, condition: batchSize > 0 && batchSize <= MAX_BATCH_SIZE
tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:38] DefaultLogger Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:42] DefaultLogger Parameter check failed at: engine.cpp::enqueue::292, condition: batchSize > 0 && batchSize <= mEngine.getMaxBatchSize(). Note: Batch size was: 0, but engine max batch size was: 1
tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:635] Failed to enqueue batch for TRT engine: TRTEngineOp_0
tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:506] Failed to execute engine, retrying with native segment for TRTEngineOp_0
tensorflow/core/framework/op_kernel.cc:875] Check failed: mutable_output(index) == nullptr (0x7fb7c2b26c00 vs. nullptr)
Aborted (core dumped)

I tried converting model_B inside NVIDIA's docker image (nvcr.io/nvidia/tensorflow:20.02-tf2-py3), but got almost the same error. The error says "Batch size was: 0, but engine max batch size was: 1", and I really don't know where this batch size of 0 comes from. Has anyone else run into an error like this?
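For reference, a minimal sketch of the conversion flow described above, assuming the TF 2.1.0 TrtGraphConverterV2 API; the SavedModel paths and the maximum_cached_engines value are placeholders, not the exact settings used here:

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# In TF 2.1, TrtConversionParams is a namedtuple; is_dynamic_op is
# effectively always True for the V2 converter.
params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    maximum_cached_engines=16)  # placeholder: cache engines for several input shapes

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="my_saved_model",  # placeholder path
    conversion_params=params)
converter.convert()
converter.save("my_trt_model")  # placeholder path
```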

jxhekang commented 4 years ago

After some tests, I found out why the TF-TRT model is unstable: the input image I used was a synthetic image (im_list = [128 * np.ones([576, 1024, 3]).astype(np.float32)]) in which every pixel is the constant 128. With this image, the TF-TRT model gets no valid information to pass to some kind of TRT op, and by mistake generates an empty tensor (batch size 0).
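To make the failing input concrete, here is the constant image from the test next to a non-constant probe; the random image is only an assumption about what avoids the empty-detection case:

```python
import numpy as np

# Synthetic test image from the failing run: every pixel is the constant 128,
# so the detector's post-processing finds nothing and a downstream TRT op
# ends up with an empty (batch size 0) tensor.
im_list = [128 * np.ones([576, 1024, 3]).astype(np.float32)]

# Assumption: a non-constant probe image (random noise here) gives the
# post-processor real content to detect and sidesteps the empty batch.
im_list = [np.random.uniform(0, 255, [576, 1024, 3]).astype(np.float32)]
```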

I think this is a bug that needs to be fixed in TF-TRT, because it's hard to guarantee that input images always contain a valid object or information. The "Batch size was: 0" error aborts the program, so it is a hidden danger in TF-TRT.

jxhekang commented 4 years ago

@pooyadavoodi I noticed your reply about the "Batch size was: 0" error in the link below: https://github.com/tensorflow/tensorflow/issues/33184#issuecomment-567605513 But I still can't get enough information from the TF-TRT log to handle my "Batch size was: 0" error. From the log I know the error happened in TRTEngineOp_0; however, in TF 2.1.0 there is no option to blacklist specific nodes. The only thing I could do was set minimum_segment_size to 40, and it works: the "Batch size was: 0" error no longer happens, but it may also make TF-TRT less efficient. I hope your team can handle this error in the next TF-TRT version.
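A sketch of that workaround, assuming the TF 2.1 TrtConversionParams API (the SavedModel paths are placeholders): raising minimum_segment_size keeps small subgraphs, such as the one that became TRTEngineOp_0, running as native TensorFlow instead of being converted to TensorRT engines.

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Subgraphs with fewer than 40 TRT-compatible nodes are left in native
# TensorFlow, avoiding the failing engine at the cost of less TRT coverage.
params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    minimum_segment_size=40)

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="my_saved_model",  # placeholder path
    conversion_params=params)
converter.convert()
converter.save("my_trt_model_min40")  # placeholder path
```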

sanjoy commented 4 years ago

@bixia1