tensorflow / tensorrt

TensorFlow/TensorRT integration

No speed improvements after TF-TRT optimizing #89

Closed EsmeYi closed 4 years ago

EsmeYi commented 5 years ago

A small tip which may be useful

FP16 or INT8 can improve inference speed, but not all hardware supports these precision modes. NVIDIA's support matrix lists which precision modes each GPU supports: https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html#hardware-precision-matrix
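
A quick way to check what your GPU supports (a minimal sketch using TF's device_lib; the printed description includes the compute capability you can look up in that matrix):

from tensorflow.python.client import device_lib

# list local GPUs; physical_device_desc ends with "compute capability: X.Y"
for dev in device_lib.list_local_devices():
    if dev.device_type == 'GPU':
        print(dev.physical_device_desc)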

System information

Result

Dataset: 4,952 images, 300×300 pixels

| Model | Model size | Num nodes | Batch size | mAP | Latency (ms) | img/sec |
| --- | --- | --- | --- | --- | --- | --- |
| FasterRCNN | 69M | 6931 | 64 | 0.7021 | 342316 | 15.28 |
| FasterRCNN (TF-TRT) | 53M | 6456 | 64 | 0.7019 | 334819 | 15.17 |
| MaskRCNN | 78M | 7096 | 32 | 0.6977 | 426658 | 11.67 |
| MaskRCNN (TF-TRT) | 53M | 6622 | 32 | 0.6974 | 406786 | 12.17 |
hongym7 commented 5 years ago

I seem to have the same issue, too. There is no big change.

BertrandD commented 5 years ago

Do you have the code you used to generate the TF-TRT version of your model? In your optimized graph, do you have any TRTEngineOp node?

len([1 for n in frozen_graph.node if str(n.op)=='TRTEngineOp'])

EsmeYi commented 5 years ago

@BertrandD

1. My TF version is 1.13.1, so I import tensorflow.contrib.tensorrt instead of tensorflow.python.compiler.tensorrt, and use trt.create_inference_graph() instead of trt.TrtGraphConverter() in my code.
2. In order to deploy models on TensorFlow Serving, I created the TF-TRT inference graph from a SavedModel. I don't know how to count TRTEngineOp nodes in a SavedModel.

I followed "TF-TRT Workflow With A SavedModel" and also tried the saved_model_cli convert tool.

Any help would be appreciated!
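
(For reference, one way to count TRTEngineOp nodes in a SavedModel — a rough sketch assuming TF 1.x, the 'serve' tag, and a placeholder path:)

import tensorflow as tf

with tf.Session(graph=tf.Graph()) as sess:
    # loader.load returns the MetaGraphDef, whose graph_def we can scan
    meta_graph = tf.saved_model.loader.load(sess, ['serve'], '/path/to/saved_model')
    n_trt = sum(1 for n in meta_graph.graph_def.node if n.op == 'TRTEngineOp')
    print('TRTEngineOp nodes:', n_trt)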

huaifeng1993 commented 5 years ago

I solved the problem by using nvidia-docker with the TensorFlow 19.04 container. Referring to the TensorFlow-TensorRT user guide, I found that there are only two ways to install TF-TRT: using a container, or compiling TensorFlow with TensorRT integration from source.

BertrandD commented 5 years ago

@Eloring Do you have any logs? I had problems creating an optimized graph (from an RFCN model), but by playing with and understanding the parameters I finally (with luck?) got something working... By reading the 2 links you gave I cannot figure out the problem; maybe with logs I will see something...

EsmeYi commented 5 years ago

@huaifeng1993 I have tried using a container before: docker pull nvcr.io/nvidia/tensorflow:19.05-py2. However, the container does not support ppc64le (Power CPU): standard_init_linux.go:178: exec user process caused "exec format error"

I assumed TF-TRT was built in by default in tensorflow-gpu... (maybe I was wrong...)

Thanks for your guidance; I'll try to compile from source.

EsmeYi commented 5 years ago

@BertrandD

saved_model_cli convert \
--dir "/home/yilrr/tf-serving/faster-rcnn/saved_model/versions/1" \
--output_dir "/home/yilrr/tf-serving/trt-frcnn" \
--tag_set serve \
tensorrt --precision_mode FP32 --max_batch_size 32 --is_dynamic_op True

The saved_model_cli convert tool will call tensorflow.contrib.tensorrt.create_inference_graph().
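
Roughly the same conversion can be done directly in Python (a sketch assuming the TF 1.13 contrib signature, reusing the paths from the command above):

import tensorflow.contrib.tensorrt as trt

trt.create_inference_graph(
    input_graph_def=None,   # None because we convert a SavedModel, not a GraphDef
    outputs=None,
    max_batch_size=32,
    precision_mode='FP32',
    is_dynamic_op=True,
    input_saved_model_dir='/home/yilrr/tf-serving/faster-rcnn/saved_model/versions/1',
    input_saved_model_tags=['serve'],
    output_saved_model_dir='/home/yilrr/tf-serving/trt-frcnn')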

Here are the logs from the model conversion:

2019-06-27 22:43:27.553644: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] Optimization results for grappler item: tf_graph
2019-06-27 22:43:27.553729: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583]   constant folding: Graph size after: 6441 nodes (-490), 10465 edges (-509), time = 805.309ms.
2019-06-27 22:43:27.553742: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583]   layout: Graph size after: 6468 nodes (27), 10492 edges (27), time = 253.081ms.
2019-06-27 22:43:27.553755: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583]   constant folding: Graph size after: 6456 nodes (-12), 10485 edges (-7), time = 516.809ms.
EsmeYi commented 5 years ago

@BertrandD

It seems the model was converted successfully with TF-TRT after I updated tensorflow-gpu from v1.13 to v1.14 and rebuilt the TF-TRT env, which uses tensorflow.python.compiler.tensorrt instead of tensorflow.contrib.tensorrt.
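
With the 1.14 API, the SavedModel conversion looks roughly like this (a minimal sketch; the directory names are placeholders):

from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverter(
    input_saved_model_dir='saved_model',
    max_batch_size=32,
    precision_mode='FP32',
    is_dynamic_op=True)
converter.convert()
converter.save('saved_model_trt')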

2019-06-28 11:16:03.857404: I tensorflow/compiler/tf2tensorrt/segment/segment.cc:460] There are 2722 ops of 57 different types in the graph that are not converted to TensorRT: Sum, TopKV2, Select, CropAndResize, Fill, Split, Transpose, Where, Size, GatherV2, Greater, Equal, NonMaxSuppressionV3, Reshape, Add, ResizeBilinear, Assert, LoopCond, Merge, Squeeze, Enter, DataFormatVecPermute, ZerosLike, Less, Range, Placeholder, TensorArrayV3, TensorArraySizeV3, TensorArrayScatterV3, Cast, Maximum, StridedSlice, Shape, Minimum, Switch, TensorArrayReadV3, Prod, Identity, ExpandDims, ConcatV2, Unpack, RealDiv, Pad, Slice, LogicalAnd, Mul, Round, TensorArrayWriteV3, GreaterEqual, NoOp, Pack, Exit, NextIteration, TensorArrayGatherV3, Sub, Const, Tile, (For more information see https://docs.nvidia.com/deeplearning/dgx/tf-trt-user-guide/index.html#supported-ops).
2019-06-28 11:16:04.378135: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:733] Number of TensorRT candidate segments: 18
2019-06-28 11:16:04.684423: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-06-28 11:16:04.684771: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node ClipToWindow/TRTEngineOp_0 added for segment 0 consisting of 8 nodes succeeded.
2019-06-28 11:16:04.684937: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_1 added for segment 1 consisting of 4 nodes succeeded.
2019-06-28 11:16:04.685106: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_2 added for segment 2 consisting of 18 nodes succeeded.
2019-06-28 11:16:04.685303: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_3 added for segment 3 consisting of 18 nodes succeeded.
2019-06-28 11:16:04.685498: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_4 added for segment 4 consisting of 18 nodes succeeded.
2019-06-28 11:16:04.685696: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_5 added for segment 5 consisting of 18 nodes succeeded.
2019-06-28 11:16:04.705593: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_6 added for segment 6 consisting of 442 nodes succeeded.
2019-06-28 11:16:04.708003: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_7 added for segment 7 consisting of 4 nodes succeeded.
2019-06-28 11:16:04.708203: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_8 added for segment 8 consisting of 3 nodes succeeded.
2019-06-28 11:16:04.708369: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_9 added for segment 9 consisting of 3 nodes succeeded.
2019-06-28 11:16:04.708506: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node GridAnchorGenerator/TRTEngineOp_10 added for segment 10 consisting of 8 nodes succeeded.
2019-06-28 11:16:04.708626: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node GridAnchorGenerator/TRTEngineOp_11 added for segment 11 consisting of 3 nodes succeeded.
2019-06-28 11:16:04.708736: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node GridAnchorGenerator/TRTEngineOp_12 added for segment 12 consisting of 3 nodes succeeded.
2019-06-28 11:16:04.725830: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_13 added for segment 13 consisting of 169 nodes succeeded.
2019-06-28 11:16:04.727548: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_14 added for segment 14 consisting of 7 nodes succeeded.
2019-06-28 11:16:04.728181: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_15 added for segment 15 consisting of 7 nodes succeeded.
2019-06-28 11:16:04.728442: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node SecondStagePostprocessor/TRTEngineOp_16 added for segment 16 consisting of 8 nodes succeeded.
2019-06-28 11:16:04.728586: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node SecondStagePostprocessor/TRTEngineOp_17 added for segment 17 consisting of 7 nodes succeeded.
2019-06-28 11:16:04.945385: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:716] Optimization results for grappler item: tf_graph
2019-06-28 11:16:04.945483: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 6456 nodes (-475), 10488 edges (-486), time = 764.6ms.
2019-06-28 11:16:04.945501: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   layout: Graph size after: 6483 nodes (27), 10515 edges (27), time = 245.293ms.
2019-06-28 11:16:04.945517: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 6471 nodes (-12), 10508 edges (-7), time = 489.997ms.
2019-06-28 11:16:04.945540: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   TensorRTOptimizer: Graph size after: 5741 nodes (-730), 9719 edges (-789), time = 1155.79297ms.
hongym7 commented 5 years ago

I fine-tuned an object detection model, got the .pb file, and then ran TF-TRT. My log is below.

graph_size(MB)(native_tf): 181.3
graph_size(MB)(trt): 182.1
num_nodes(native_tf): 2564
num_nodes(tftrt_total): 1594
num_nodes(trt_only): 0
time(s) (trt_conversion): 2.6404

Is this right?

EsmeYi commented 5 years ago

@hongym7 num_nodes(trt_only): 0 means your converted model doesn't have any TensorRT nodes (i.e. TRTEngineOp).

hongym7 commented 5 years ago

@Eloring Um... thank you. I need to do more research. I'll let you know what I find.

hongym7 commented 5 years ago

@Eloring My source is:

from tftrt.examples.object_detection import optimize_model
import tensorflow.contrib.tensorrt as trt
import tensorflow as tf

config_path = '/home/hong/PycharmProjects/tensorflow_models_drone/research/faster_rcnn_resnet101_drone_27.config'
checkpoint_path = '/home/hong/PycharmProjects/tensorflow_models_drone/research/train_result_drone_27/model.ckpt-18000'

frozen_graph = optimize_model(
    config_path=config_path,
    checkpoint_path=checkpoint_path,
    use_trt=True,
    precision_mode='FP16'
)

...

Is this right?

ref : https://github.com/tensorflow/tensorrt/tree/master/tftrt/examples/object_detection#od_optimize

EsmeYi commented 5 years ago

@hongym7 What are your TensorFlow and TensorRT versions? Can you show me the logs?

Here is my core code:

import os
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt
def frozen_graph_trt(
    input_frozen_graph_path,
    output_dir,
    max_batch_size,
    precision_mode,
    is_dynamic_op):
    '''
    create a TensorRT inference graph from a Frozen Graph
    '''
    output_node_names = [BOXES_NAME, CLASSES_NAME, SCORES_NAME, NUM_DETECTIONS_NAME]
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    output_frozen_graph_path = os.path.join(output_dir, 'trt_frozen_graph.pb')
    with tf.io.gfile.GFile(input_frozen_graph_path, 'rb') as f:
        graph_def = tf.compat.v1.GraphDef()
        graph_def.ParseFromString(f.read())

    trt_graph = trt.create_inference_graph(
        input_graph_def=graph_def,
        outputs=output_node_names,
        max_batch_size=max_batch_size,
        max_workspace_size_bytes=trt.DEFAULT_TRT_MAX_WORKSPACE_SIZE_BYTES,
        precision_mode=precision_mode,
        is_dynamic_op=is_dynamic_op)

    with open(output_frozen_graph_path, 'wb') as f:
        f.write(trt_graph.SerializeToString())

def ckpt_trt():
    '''
    create a TensorRT inference graph from MetaGraph And Checkpoint Files
    '''
    # use tf.graph_util.convert_variables_to_constants to freeze the ckpt into a frozen graph
    # and then call frozen_graph_trt()
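
An example invocation of the function above (the paths and values are hypothetical):

frozen_graph_trt(
    input_frozen_graph_path='faster_rcnn/frozen_inference_graph.pb',
    output_dir='trt_output',
    max_batch_size=32,
    precision_mode='FP32',
    is_dynamic_op=False)
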
hongym7 commented 5 years ago

@Eloring TF : 1.14.0 TRT : 5.0.2.6

Thank you :D

I used your source. This is the log:

2019-07-01 18:08:03.789649: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:716] Optimization results for grappler item: tf_graph
2019-07-01 18:08:03.789675: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 2407 nodes (-648), 3139 edges (-660), time = 594.514ms.
2019-07-01 18:08:03.789710: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   layout: Graph size after: 2426 nodes (19), 3161 edges (22), time = 131.374ms.
2019-07-01 18:08:03.789714: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 2422 nodes (-4), 3159 edges (-2), time = 318.721ms.

Is this right?

BertrandD commented 5 years ago

@Eloring Good! Be careful: you probably won't be able to run your 1.14 model in a 1.13 environment. You will need TF 1.14 to execute your model, and currently the 1.14.0 release does not dynamically load the TRTEngineOp.

In 1.13.1 you need to add import tensorflow.contrib.tensorrt as trt to your code to load the TRTEngineOp; in 1.14.0, TensorRT support is no longer in contrib and the dynamic load is not enabled. You need the master branch of tensorflow (or wait for the next 1.14.1 release). Dynamic loading of TensorRT was added in this commit after the 1.14.0 release: https://github.com/tensorflow/tensorflow/commit/408949d3e16cbdc481bcc46be051626f10eb1422
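
For 1.13.1, the load-time side looks roughly like the sketch below (the .pb path is a placeholder); without the contrib import, importing the graph typically fails with an error like "Op type not registered 'TRTEngineOp'":

import tensorflow as tf
import tensorflow.contrib.tensorrt  # side-effect import: registers TRTEngineOp

with tf.gfile.GFile('trt_frozen_graph.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
tf.import_graph_def(graph_def, name='')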

EsmeYi commented 5 years ago

@hongym7 Well, I guess your TF-TRT wasn't installed successfully. It's recommended to use the TensorFlow Docker container provided by NVIDIA, where TF-TRT is already compiled:

docker pull nvcr.io/nvidia/tensorflow:19.06-py2

TensorFlow Release 19.06
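
Once pulled, the container can be started with something like this (a sketch; it assumes the NVIDIA container runtime is installed, and the mount path is a placeholder):

docker run --runtime=nvidia -it --rm \
    -v /path/to/models:/models \
    nvcr.io/nvidia/tensorflow:19.06-py2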

hongym7 commented 5 years ago

@Eloring Thank you for your comment. But... is there a solution that doesn't require Docker?

EsmeYi commented 5 years ago

@hongym7 IBM Watson Machine Learning Community Edition 1.6.1 (also known as PowerAI) provides software packages for several deep learning frameworks, supporting libraries, and tools:
https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.1/navigation/wmlce_software_pkgs.html
https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.1/navigation/wmlce_download.html

The easiest way to get WML CE is using anaconda:

$ conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
$ export IBM_POWERAI_LICENSE_ACCEPT=yes

$ conda install powerai 
VincentChong123 commented 5 years ago

Hi @Eloring, @BertrandD,

Did you try ssd_mobilenet_v1 with TF-TRT? I got 25 ms for INT8 compared to 31 ms for FP32 (30 ms for FP16), with input resolution 300x300, batch size 1, and synthetic data.

Is this 1.24x speedup acceptable? I cannot find speed references beyond the links below:
http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf <- page 41, batch-1 INT8 speedup is ~1.3x depending on the algo
https://devblogs.nvidia.com/int8-inference-autonomous-vehicles-tensorrt/ <- Caffe + TRT + INT8, 4x speedup
https://github.com/NVIDIA-AI-IOT/tf_to_trt_image_classification <- no INT8 TensorRT info

The number of trt_only operations is small compared to tftrt_total; is that acceptable?
num_nodes(tftrt_total): 2885
int8: num_nodes(trt_only): 3
fp32/16: num_nodes(trt_only): 8
docker: nvcr.io/nvidia/tensorflow:19.05-py3 (NVIDIA-SMI 418.56, Python 3.5.2, tf 1.13.1)
system: ubuntu18.04, 2080ti <- supports int8

precision_mode=fp32

meta_optimizer.cc:621] Optimization results for grappler item: tf_graph
meta_optimizer.cc:623]   constant folding: Graph size after: 3379 nodes (-2748), 4233 edges (-3168), time = 447.135ms.
meta_optimizer.cc:623]   layout: Graph size after: 3394 nodes (15), 4259 edges (26), time = 118.566ms.
meta_optimizer.cc:623]   constant folding: Graph size after: 3394 nodes (0), 4259 edges (0), time = 140.997ms.
meta_optimizer.cc:623]   TensorRTOptimizer: Graph size after: 2885 nodes (-509), 3656 edges (-603), time = 416.305ms.

W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:290] Engine retrieval for batch size 9000 failed. Running native segment for Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/Area/TRTEngineOp_2
    graph_size(MB)(native_tf): 27.4
    graph_size(MB)(trt): 53.3
    num_nodes(native_tf): 6127
    num_nodes(tftrt_total): 2885
    num_nodes(trt_only): 8    <- refer (1)
    time(s) (trt_conversion): 3.3262
    ---------------------------------------------------------------------------
    finish frozen_graph 
        step 100/4096, iter_time(ms)=31.7493
        step 200/4096, iter_time(ms)=31.6466

(note1)num_nodes(trt_only): 8    
    TRTEngineOp_0
    Postprocessor/BatchMultiClassNonMaxSuppression/TRTEngineOp_5
    Postprocessor/TRTEngineOp_6
    Postprocessor/BatchMultiClassNonMaxSuppression/TRTEngineOp_4
    TRTEngineOp_1
    Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Area/TRTEngineOp_3
    Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/Area/TRTEngineOp_2
    TRTEngineOp_7

precision_mode=int8

log
meta_optimizer.cc:621] Optimization results for grappler item: tf_graph
meta_optimizer.cc:623]   constant folding: Graph size after: 3379 nodes (-2748), 4233 edges (-3168), time = 441.048ms.
meta_optimizer.cc:623]   layout: Graph size after: 3394 nodes (15), 4259 edges (26), time = 118.754ms.
meta_optimizer.cc:623]   constant folding: Graph size after: 3394 nodes (0), 4259 edges (0), time = 142.857ms.
meta_optimizer.cc:623]   TensorRTOptimizer: Graph size after: 2894 nodes (-500), 3665 edges (-594), time = 19535.6191ms.

2019-07-09 03:23:28.745222: I tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:496] Building a new TensorRT engine for TRTEngineOp_1 with batch size 9000
graph_size(MB)(native_tf): 27.4
graph_size(MB)(trt): 53.2
num_nodes(native_tf): 6127
num_nodes(tftrt_total): 2894
num_nodes(trt_only): 3          <- refer(note2)
time(s) (trt_conversion): 23.2403
    step 100/73, iter_time(ms)=2908.3175

results:
finish frozen_graph 
    step 100/4096, iter_time(ms)=25.8242
    step 200/4096, iter_time(ms)=25.3153

(note2)num_nodes(trt_only): 3  
    TRTEngineOp_0
    Postprocessor/TRTEngineOp_6
    TRTEngineOp_1
code
precision_mode = 'INT8' 

    frozen_graph = optimize_model(
        config_path=config_path,
        checkpoint_path=checkpoint_path,
        use_trt=True,
        force_nms_cpu=False,  # default true
        precision_mode=precision_mode,
        max_workspace_size_bytes=1 << 32,
        maximum_cached_engines=100,
        calib_images_dir='/N/data-sata/fast-ai-coco/coco-2014/val2014',
        num_calib_images=100,
        calib_image_shape=(300,300),
        output_path="{}.output_path.{}.graph".format(config_path,precision_mode)
    )

from tftrt.examples.object_detection import benchmark_model
statistics = benchmark_model(
    frozen_graph=frozen_graph,
    images_dir=images_dir,
    annotation_path=annotation_path,
    use_synthetic=True,
    image_shape=(300,300)
)
EsmeYi commented 5 years ago

Hi @weishengchong. Well, in my opinion, using TF-TRT to accelerate TF models can't reach the same speedup as using the TRT UFF parser to build an engine.

Using TF-TRT:

TensorRT optimizes the largest subgraphs possible in the TensorFlow graph. The more compute in the subgraph, the greater benefit obtained from TensorRT. You want most of the graph optimized and replaced with the fewest number of TensorRT nodes for best performance. Based on the operations in your graph, it’s possible that the final graph might have more than one TensorRT node.

This means each TRTEngineOp contains a serialized subgraph GraphDef, where a subgraph covers several TF nodes from the original graph, so it is expected that the TRT node count is smaller than the total TF node count.

TF-TRT produces an optimized model that runs in TensorFlow for inference; if executing a TRT engine fails, the TRT op falls back to calling the corresponding TF function.
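
One way to check how the segmentation went is to list the engine nodes in a converted frozen graph (a small sketch; the .pb path is a placeholder):

import tensorflow as tf

graph_def = tf.GraphDef()
with tf.gfile.GFile('trt_frozen_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())
engines = [n.name for n in graph_def.node if n.op == 'TRTEngineOp']
print('%d TensorRT engines:' % len(engines))
for name in engines:
    print(' ', name)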

Using TensorFlow/UFF parser:

  1. Converts TF graph to UFF file format (need add custom plugins for unsupported layers)
  2. Loads the UFF model and creates the UFF parser 
  3. Builds an optimized engine 
  4. Uses the engine to perform inference in TensorRT

For me, the most challenging step is to add custom layers (code plugins in C++)....

TensorRT provides a UFF-SSD sample. I have evaluated the sample, and the result shows TRT gives more than a 6.7x speedup (both in FP32 mode). But for SSD with TF-TRT there is little improvement, while for Faster-RCNN with TF-TRT I got a 2.1x speedup.

VincentChong123 commented 5 years ago

Hi @Eloring,

You wrote: "TensorRT provides a UFF-SSD sample... TRT gives more than a 6.7x speedup (both in FP32 mode)." Did you try INT8 UFF-SSD?

Only a ~1.3x speedup is reported for the GTX 1080 Ti:

FP32 inference time: ~9 ms
INT8 inference time: ~6 ms

Thanks again.

EsmeYi commented 5 years ago

Hi @weishengchong. I meant that going from the original TensorFlow frozen graph to a TensorRT engine gave a 6.7x speedup, with no quantization like FP16 or INT8 involved :) And no, I haven't tried INT8 UFF-SSD.

ZhuoranLyu commented 5 years ago

Hi @weishengchong @Eloring Did you guys figure out how to use TF-TRT without Docker? Using TF-TRT in Docker indeed works for me; however, I'd like to use TF-TRT with the C/C++ API in a native (Windows) env. Do you know how to do this? Thanks.

VincentChong123 commented 5 years ago

Hi @ZhuoranLyu,

I only succeeded using Docker.

I have no idea about running TRT on Windows; FYI: https://devtalk.nvidia.com/default/topic/1055484/tensorrt/deepstream_reference_apps-trt-yolo-app-windows-build/

Hi @Eloring thanks for your advice.

ZhuoranLyu commented 5 years ago

@weishengchong I ran TensorRT successfully on Windows. However, I was wondering how to run TensorFlow-TensorRT (TF-TRT) on Windows.

PetreanuAndi commented 5 years ago

Hello guys. @Eloring, I am especially interested in this thread. I am trying to convert an SSD_Resnet50_FPN model to TF-TRT. Everything works just fine: I converted both the saved_model and the inference graph, FP16 & FP32, and tried all the options (fixed input size etc.), but the output I get is:


2019-08-07 13:28:26.380795: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:716] Optimization results for grappler item: tf_graph
2019-08-07 13:28:26.381027: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 2836 nodes (-1660), 4183 edges (-1854), time = 593.697ms.
2019-08-07 13:28:26.381058: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   layout: Graph size after: 2880 nodes (44), 4255 edges (72), time = 152.605ms.
2019-08-07 13:28:26.381160: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 2880 nodes (0), 4255 edges (0), time = 195.618ms.
graph_size(MB)(native_tf): 123.3
graph_size(MB)(trt): 123.2
num_nodes(native_tf): 4496
num_nodes(tftrt_total): 2880
num_nodes(trt_only): 0
time(s) (trt_conversion): 2.9199
number of TRT ops in the converted graph: 0


There are no trt_only nodes and no TRT ops. My original TF frozen graph had a 0.0248 s inference time (1080 Ti); my TF-TRT frozen graph has a 0.0251 s inference time (so slightly higher, averaged over 1000 random images).

Is FPN or ResNet (skip connections) the cause of this failed optimization? (I mean it compiles and works, but runs slower than before optimization.) I also extract features from 4 different feature maps in the encoder, corresponding to the FPN heads (I specified all output nodes in the conversion procedure). Maybe that's why it does not optimize well? I need these 4 outputs for a fused encoding volume that is passed to an LSTM, so that's important.

Guys, anything would help at this point, Thank you very much in advance!

EsmeYi commented 4 years ago

Hi @PetreanuAndi, these are my logs from converting an ssd_resnet_50_fpn_coco model (from the TensorFlow model zoo).

2019-08-08 11:10:38.312574: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:716] Optimization results for grappler item: tf_graph
2019-08-08 11:10:38.312798: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 23991 nodes (-14028), 30371 edges (-16346), time = 3906.88501ms.
2019-08-08 11:10:38.312813: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   layout: Graph size after: 24008 nodes (17), 30401 edges (30), time = 996.894ms.
2019-08-08 11:10:38.312825: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 24008 nodes (0), 30401 edges (0), time = 1110.90796ms.
2019-08-08 11:10:38.312837: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   TensorRTOptimizer: Graph size after: 19015 nodes (-4993), 25064 edges (-5337), time = 11150.5664ms.
graph_size(MB)(native_tf): 135.2
graph_size(MB)(trt): 268.9
num_nodes(native_tf): 38019
num_nodes(tftrt_total): 19015
num_nodes(trt_only): 49
time(s) (trt_conversion): 53.7628

I noticed that there is no TensorRTOptimizer output in your logs. Did you test a common model like this one to validate that your conversion code and TF-TRT setup are good? Besides, what's the inference time of the FP16 model? Did quantization improve inference performance in your results?

ZhuoranLyu commented 4 years ago

@PetreanuAndi Did you use nv-docker or native env with tensorflow?

PetreanuAndi commented 4 years ago

Hello @Eloring @ZhuoranLyu

I have pip installed tensorflow-gpu 1.14 into a conda env. I've also tried 1.15 and the result is the same.

Should I build from source? Even though you have trt_only nodes (49), do you actually observe a speedup? How much? Can you give details on that?

FP16 and FP32 both did not improve performance. Moreover, they seem to actually hurt it (on a 1080 Ti):
---> original graph: avg 0.0248 s
---> optimized graph: avg 0.0251 s

ZhuoranLyu commented 4 years ago

@PetreanuAndi build from source or use docker

PetreanuAndi commented 4 years ago

@ZhuoranLyu I'm building from source now, but can you confirm that you actually measured a speedup (with or without trt_only nodes)? Have you also tried SSD + FPN? (Maybe the FPN aggregation has problems in the optimization process.)

I have read on another online forum that only C++ TRT gives a speedup. Thoughts on that? I will come back with prints and benchmarks after the source build finishes.

ZhuoranLyu commented 4 years ago

@PetreanuAndi First, you can try inference with Python in the Docker environment to see if it speeds up the model. From my perspective, SSD will benefit a lot from TF-TRT, especially using FP16.

EsmeYi commented 4 years ago

@PetreanuAndi Same opinion as @ZhuoranLyu. Apparently your TF-TRT environment was not installed successfully. You can use Docker or Anaconda to pull/install an integrated image/package; otherwise, compile it yourself.

PetreanuAndi commented 4 years ago

Hello guys. @Eloring , @ZhuoranLyu

I have installed TensorFlow from source: v1.14, CUDA 10.02, cuDNN 7.4. Installation went well, and graph conversion also went well, with the following output:

2019-08-16 12:25:39.216209: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node Postprocessor/TRTEngineOp_90 added for segment 90 consisting of 3 nodes succeeded.
2019-08-16 12:25:39.217472: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_91 added for segment 91 consisting of 27 nodes succeeded.
2019-08-16 12:25:39.217606: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_92 added for segment 92 consisting of 5 nodes succeeded.
2019-08-16 12:25:39.217705: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_93 added for segment 93 consisting of 3 nodes succeeded.
2019-08-16 12:25:39.310289: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:716] Optimization results for grappler item: tf_graph
2019-08-16 12:25:39.310333: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 2836 nodes (-1660), 4183 edges (-1854), time = 728.073ms.
2019-08-16 12:25:39.310337: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   layout: Graph size after: 2880 nodes (44), 4255 edges (72), time = 143.931ms.
2019-08-16 12:25:39.310341: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 2880 nodes (0), 4255 edges (0), time = 167.55ms.
2019-08-16 12:25:39.310345: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   TensorRTOptimizer: Graph size after: 1907 nodes (-973), 3218 edges (-1037), time = 7242.89307ms.
graph_size(MB)(native_tf): 123.3
graph_size(MB)(trt): 336.9
num_nodes(native_tf): 4496
num_nodes(tftrt_total): 1907
num_nodes(trt_only): 94
time(s) (trt_conversion): 10.6979
number of TRT ops in the converted graph: 94

However, inference time is the same :) I know plain SSD will benefit from TF-TRT, but I am using SSD + FPN. Can you please confirm that you get an inference speedup with TF-TRT (when you have trt_only nodes)? Now I have those nodes, but there is no speedup :)

Another forum says that TF-TRT compiled with TensorFlow in Python will not offer a speedup, but the C++ version will... Do you guys do yours in C++? I don't think that's a reasonable argument (the TensorFlow people would probably not ship such a crippled implementation).

ZhuoranLyu commented 4 years ago

@PetreanuAndi Actually, using TF-TRT under FP32 may not accelerate at all (it may even slow things down). However, it should accelerate under FP16 precision, especially on a new GPU with tensor cores like a 2080 Ti. I see a 3x speedup using FP16 with a 2080 Ti.

PetreanuAndi commented 4 years ago

Hey guys. I have built tf-nightly-gpu 1.15.0.dev20190816, with TensorRT 5 (directly from the TF package), CUDA 10.0 and cuDNN 7.6.0.

It builds/optimizes without errors, and actually outputs more trt_only nodes (102 instead of 94). Testing with FP32 yielded a poorer inference time (0.031 s on average per image), but testing with FP16 did give a slight improvement over the original model (0.021 s versus 0.028 s).

However, this improvement is still very small compared to what other people report on forums (3x etc.).

You still did not say: is your build compiled from the NVIDIA C++ TRT source, or is TRT installed along with TensorFlow (either from source or pip)?

Any other suggestions for SSD + FPN speed improvement? (This is listed as the FIRST example on the GitHub page of tensorflow/tensorrt, so I am expecting some substantial improvement, but all my efforts have been mostly in vain.)

thank you!

ZhuoranLyu commented 4 years ago

The model is built with TensorFlow in Python and optimized with TF-TRT, just as the example shows. No need to implement the model with NVIDIA TensorRT in C++, because I am not familiar with C++.

zhenpalapala commented 4 years ago

@PetreanuAndi Same opinion as @ZhuoranLyu. Apparently your TF-TRT environment was not installed successfully. You can use Docker or Anaconda to pull/install an integrated image/package; otherwise, compile it yourself.

@hongym7 Well, I guess your TF-TRT wasn't installed successfully. It's recommended to use the TensorFlow Docker container provided by NVIDIA, where TF-TRT is already compiled:

docker pull nvcr.io/nvidia/tensorflow:19.06-py2

TensorFlow Release 19.06

Hi, I just met the same issue as you. My TF version is 1.14, TF Serving version is 1.13, the OS is Linux, and TensorRT is 5.1. I want to use TensorRT to speed up my model, but the output looks like this:

2019-08-19 09:16:25.414934: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:716] Optimization results for grappler item: tf_graph
2019-08-19 09:16:25.414982: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 554 nodes (-256), 616 edges (-258), time = 544.394ms.
2019-08-19 09:16:25.414990: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   layout: Graph size after: 561 nodes (7), 618 edges (2), time = 118.45ms.
2019-08-19 09:16:25.414998: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 556 nodes (-5), 618 edges (0), time = 376.114ms.

It seems TensorRT doesn't work. I tried to docker pull nvcr.io/nvidia/tensorflow:19.06-py3, but met this error: 'unauthorized: authentication required'. Could you please give me some advice? Thanks a lot.

PetreanuAndi commented 4 years ago

Hello @zhenpalapala

It seems that TensorFlow 1.14 has some issues; I found that out on another forum. Try this instead:

sudo apt-get install --no-install-recommends cuda-10-0 libcudnn7=7.6.0.64-1+cuda10.0 libcudnn7-dev=7.6.0.64-1+cuda10.0

sudo pip install tf-nightly-gpu==1.15.0.dev20190816

Using this nightly TF 1.15, try converting your model to FP16. This is the only setup that actually improved the speed of my SSD FPN model; nothing else worked. Hope this helps.

zhenpalapala commented 4 years ago

Thanks a lot @PetreanuAndi. I found that I hit the 'unauthorized: authentication required' issue when pulling nvcr.io/nvidia/tensorflow:19.06-py3 just because of a bad internet connection. I found that using this container can really speed up the provided ResNet model, but it didn't work with my own model; it even slowed it down.

My original models are saved as CKPT files; I want to optimize TensorFlow Serving performance with NVIDIA TensorRT.

I converted the CKPT files to a saved_model using the code below:

synth.load(args.checkpoint, modified_hp)
sess = synth.session
output_graph_def = tf.graph_util.convert_variables_to_constants(
    sess=sess,
    input_graph_def=sess.graph_def,
    output_node_names=output_node_names.split(","))
tf.saved_model.simple_save(
    session=sess,
    export_dir=args.export_dir,
    inputs={"input_lengths": tf.get_default_graph().get_tensor_by_name('input_lengths:0'),
            "split_infos": tf.get_default_graph().get_tensor_by_name('split_infos:0'),
            "inputs": tf.get_default_graph().get_tensor_by_name("inputs:0")},
    outputs={"linear_wav_outputs": audio.inv_spectrogram_tensorflow(
        tf.get_default_graph().get_tensor_by_name("Tacotron_model/inference/cbhg_linear_specs_projection/projection_cbhg_linear_specs_projection/BiasAdd:0")[0],
        hparams)},
    legacy_init_op=None)

Then I use this Docker command to run TensorRT:

docker run --rm --gpus all -it \
    -v /tmp:/tmp nvcr.io/nvidia/tensorflow:19.06-py3 \
    /usr/local/bin/saved_model_cli convert \
    --dir 'my_saved_model' \
    --output_dir 'my_saved_model_trt' \
    --tag_set serve \
    tensorrt --precision_mode FP16 --max_batch_size 1 --is_dynamic_op True

In the end, I use a Docker command to put the final model on Serving. But this model doesn't work; on the contrary, inference slows down and there are some warnings:

E external/org_tensorflow/tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] layout failed: Invalid argument: The graph is already optimized by layout optimizer.
…
2019-08-21 08:01:58.396573: W external/org_tensorflow/tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:647] Engine creation for TRTEngineOp_21 failed. The native segment will be used instead. Reason: Invalid argument: Node Tacotron_model/inference/encoder_LSTM/bidirectional_rnn/bw/bw/while/encoder_bw_LSTM/BiasAdd should have an input named 'Tacotron_model/inference/encoder_LSTM/bidirectional_rnn/bw/bw/while/encoder_bw_LSTM/MatMul' but it is not available

Is this because of a wrong process for changing CKPT to saved_model? I can't figure it out.

If anyone has met this trouble before, please give me some advice. Thanks a lot!

austingg commented 4 years ago

Recently I have been working on TF-TRT on a Tesla T4. I have found that SSD-like models speed up little with TF-TRT FP32 and about 2x with FP16 (which uses Tensor Cores). Besides, I have found that the NMS op runs on the CPU, so memcpyHtoD costs a lot of time.

pooyadavoodi commented 4 years ago

TF-TRT has got a lot of improvements in 1.14. Please use that one.

The NVIDIA container that has TF1.14 is 19.07: https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#matrix

pooyadavoodi commented 4 years ago

For NMS, if you can use combined_non_max_suppression in your graph, then you get a much better speedup, especially because TF-TRT optimizes that op.

If you use the object detection API, you can use the submodule of tensorflow/models to get combined_nms as follows:

  • The config file that you need to change for NMS is pipeline.config.
  • In the post_processing section of the config file, there is batch_non_max_suppression that specifies NMS configurations. Add this new field to the NMS config: combined_nms: true
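
For reference, the underlying TF op looks roughly like this (a sketch; the box/class counts and thresholds below are placeholder SSD-like values, not taken from this thread):

import tensorflow as tf

boxes = tf.placeholder(tf.float32, [None, 1917, 1, 4])   # [batch, num_boxes, q, 4]
scores = tf.placeholder(tf.float32, [None, 1917, 90])    # [batch, num_boxes, num_classes]
nmsed_boxes, nmsed_scores, nmsed_classes, valid_detections = \
    tf.image.combined_non_max_suppression(
        boxes, scores,
        max_output_size_per_class=100,
        max_total_size=100,
        iou_threshold=0.6,
        score_threshold=0.3)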

taorui-plus commented 4 years ago

Do you have the code you used to generate the TF-TRT version of your model? In your optimized graph, do you have any TRTEngineOp node?

len([1 for n in frozen_graph.node if str(n.op)=='TRTEngineOp'])

I also encountered the same problem. After conversion, the number of TRTEngineOp nodes is 0:

time: {'loading_frozen_graph': 0.7235217094421387, 'trt_conversion': 11.104097127914429}
num_nodes: {'tftrt_total': 789, 'loaded_frozen_graph': 985, 'trt_only': 0}
graph_sizes: {'loaded_frozen_graph': 233293316, 'trt': 425277533}

taorui-plus commented 4 years ago

For NMS, if you can use combined_non_max_suppression in your graph, then you get a much better speedup, especially because TF-TRT optimizes that op.

If you use the object detection API, you can use the submodule of tensorflow/models to get combined_nms as follows:

  • The config file that you need to change for NMS is pipeline.config.
  • In the post_processing section of the config file, there is batch_non_max_suppression that specifies NMS configurations. Add this new field to the NMS config: combined_nms: true

Hello @pooyadavoodi: I recently wanted to try deploying a model with TensorRT (see https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#matrix), using:

    converter = trt.TrtGraphConverter(input_graph_def=frozen_graph,
                                      nodes_blacklist=['time_distributed_1/Reshape_1'],
                                      max_batch_size=1,
                                      precision_mode='FP16',
                                      is_dynamic_op=False,
                                      max_workspace_size_bytes=1<<32)

But the converted graph is bigger:

time:{'loading_frozen_graph': 0.7235217094421387, 'trt_conversion': 11.104097127914429}
num_nodes:{'tftrt_total': 789, 'loaded_frozen_graph': 985, 'trt_only': 0}
graph_sizes:{'loaded_frozen_graph': 233293316, 'trt': 425277533}

I tried adjusting the above parameters, but the size of the graph and the number of nodes did not change. What should I do next, and what documents should I read? What is wrong with my current usage?

pooyadavoodi commented 4 years ago

trt_only: 0 suggests no TensorRT node was created. It's impossible to tell why without looking at the log.

Could you rerun the conversion with verbose logging and post the log? https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#verbose
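
For example, something like this (a sketch; convert.py stands in for your conversion script, and the guide above also describes finer-grained TF_CPP_VMODULE settings):

export TF_CPP_MIN_VLOG_LEVEL=2
python convert.py 2>&1 | tee conversion.log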

I suppose 'time_distributed_1/Reshape_1' is the output tensor of your model?

pooyadavoodi commented 4 years ago

Closing. Please reopen in case you still see the issue.

anuar12 commented 4 years ago

I didn't get any inference boost with FP16 conversion on a 2080 Ti either; the inference speed is the same. I used TF 1.14 and converted a Keras RetinaNet into a SavedModel. Here are some of the code and logs in case they're helpful:

minimum_segment_size = 2
maximum_cached_engines = 100
precision_mode = "FP16"
converter = trt.TrtGraphConverter(
        input_saved_model_dir=saved_model_dir,
        precision_mode=precision_mode,
        minimum_segment_size=minimum_segment_size,
        is_dynamic_op=True,
        max_batch_size=32,
        max_workspace_size_bytes=7000000000,
        maximum_cached_engines=maximum_cached_engines)
frozen_graph = converter.convert()
pciBusID: 0000:09:00.0
2019-10-17 11:10:26.445500: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-17 11:10:26.445513: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-10-17 11:10:26.445524: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-10-17 11:10:26.445534: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-10-17 11:10:26.445545: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-10-17 11:10:26.445555: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-10-17 11:10:26.445566: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-17 11:10:26.445626: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-17 11:10:26.446358: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-17 11:10:26.447040: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-10-17 11:10:26.447065: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-17 11:10:26.447072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-10-17 11:10:26.447078: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2019-10-17 11:10:26.447212: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-17 11:10:26.447948: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-17 11:10:26.449164: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8961 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:09:00.0, compute capability: 7.5)
2019-10-17 11:10:27.924940: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:716] Optimization results for grappler item: tf_graph
2019-10-17 11:10:27.924977: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 2433 nodes (-214), 3289 edges (-273), time = 735.86ms.
2019-10-17 11:10:27.924983: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   layout: Graph size after: 2490 nodes (57), 3343 edges (54), time = 141.214ms.
2019-10-17 11:10:27.924988: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 2474 nodes (-16), 3343 edges (0), time = 290.99ms.
graph_size(MB)(trt): 221.7
num_nodes(tftrt_total): 2474
num_nodes(trt_only): 0
time(s) (trt_conversion): 5.4649

It would be great if there were better documentation with a simple example (especially since the API has changed) so that we can debug on our own. :)
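
As a starting point, a minimal latency harness might look like this (a sketch; every path and tensor name below is a placeholder, and the warmup loop matters because dynamic TRT engines are built on first use):

import time
import numpy as np
import tensorflow as tf

with tf.compat.v1.Session(graph=tf.Graph()) as sess:
    tf.compat.v1.saved_model.loader.load(sess, ['serve'], 'saved_model_trt')
    image = np.random.rand(1, 512, 512, 3).astype(np.float32)
    feed = {'input_1:0': image}    # placeholder input tensor name
    fetch = 'output_boxes:0'       # placeholder output tensor name
    for _ in range(20):            # warmup: dynamic TRT engines get built here
        sess.run(fetch, feed_dict=feed)
    start = time.time()
    for _ in range(200):
        sess.run(fetch, feed_dict=feed)
    print('avg latency: %.2f ms' % ((time.time() - start) / 200 * 1000))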

Programmerwyl commented 4 years ago

I found that the reason there was no TRTEngineOp was not the code but the hardware platform. I ran the same code on a PC with mobilenet_v2: after optimization, the TRTEngineOp count was 0 with 426 total nodes. But when I ran it on a TX2, the optimized graph had only 3 nodes and was much faster.

Programmerwyl commented 4 years ago

On TX2, the log info is:

2019-10-18 10:21:59.917225: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741] constant folding: Graph size after: 427 nodes (-262), 436 edges (-262), time = 339.122ms.
2019-10-18 10:21:59.917289: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741] layout: Graph size after: 435 nodes (8), 438 edges (2), time = 68.517ms.
2019-10-18 10:21:59.917336: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741] constant folding: Graph size after: 429 nodes (-6), 438 edges (0), time = 118.121ms.
2019-10-18 10:21:59.917401: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741] TensorRTOptimizer: Graph size after: 3 nodes (-426), 2 edges (-436), time = 57619.9141ms.