I do have some experience in using INT8 mode and DLA cores (on Xavier NX) for TensorRT engines. So it shouldn't be too difficult for me to put together a sample to demonstrate that. The only problem is that I have to find time to do it.
Supporting DeepStream is a different story, though. My CUDA kernel implementation of the "yolo_layer" plugin (source code here) utilizes the GPU to do more work in parallel, compared to NVIDIA's deepstream_reference_apps/yolo implementation. I think my implementation is better (runs faster overall on Jetsons). Each of my "yolo_layer" plugins outputs "7 * num_anchors * num_grids" float32 numbers, as explained here. That is indeed different from what DeepStream expects. I don't think I have the time/energy to port and test the code for DeepStream in the near future...
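If someone wants to experiment with bridging that gap, here is a rough numpy sketch of how the plugin output could be consumed, assuming a column layout of [x, y, w, h, box_confidence, class_id, class_prob] per detection (please verify against the plugin source for the exact order):

```python
import numpy as np

def decode_yolo_plugin_output(trt_outputs, conf_th=0.3):
    """Turn the flat "7 * num_anchors * num_grids" plugin outputs into
    (boxes, scores, classes). The column layout is an assumption here:
    [x, y, w, h, box_confidence, class_id, class_prob]."""
    # One row per candidate detection, 7 floats per row.
    dets = np.concatenate([o.reshape(-1, 7) for o in trt_outputs], axis=0)
    # Final score = objectness * class probability.
    scores = dets[:, 4] * dets[:, 6]
    keep = scores >= conf_th
    return dets[keep, 0:4], scores[keep], dets[keep, 5].astype(np.int32)
```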
Thanks for the quick response! I will check out the implementation of your yolo_layer and see how difficult it would be to make it work with DeepStream. I'm already getting around 8/25/41 FPS for FP32/FP16/INT8 modes on the Jetson Xavier running Yolov4 with 704x704 input, so I'm curious what numbers I could achieve with your implementation.
The INT8 sample would be great, as my entropy calibrator leads to no outputs at all during inference, but I'm not sure whether it's due to the different ONNX model format, my calibrator or DeepStream.
Hi, it would be great to get INT8 and NVDLA used on the inference side for Yolov4. Sorry to be a pest, but I have no experience with how to do it, so I'm hoping you can find some time to give it a try. Thanks!
Thanks for this great repo! INT8 support for Yolov4 would be much appreciated!
I think my INT8 and DLA core implementation is done. It resides in the "int8" branch of this repository: https://github.com/jkjung-avt/tensorrt_demos/tree/int8.
There are some known issues, as documented in the README. If I get answers/updates from NVIDIA, I will update the code and the README.
Otherwise, I plan to merge the "int8" branch into "master" soon.
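For anyone writing their own calibrator in the meantime, the skeleton of an INT8 entropy calibrator in the TensorRT Python API looks roughly like the sketch below. This is a generic outline, not the exact code in the "int8" branch; note that a very common cause of "no detections after INT8 calibration" is preprocessing the calibration images differently from what the engine sees at inference time.

```python
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401 (creates a CUDA context)
import tensorrt as trt

class YOLOEntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds preprocessed calibration batches to the TensorRT builder."""

    def __init__(self, batches, cache_file='calib.bin'):
        # batches: list of (N, 3, H, W) float32 arrays, preprocessed
        # exactly the same way as at inference time.
        super().__init__()
        self.batches = batches
        self.idx = 0
        self.cache_file = cache_file
        self.dev_input = cuda.mem_alloc(batches[0].nbytes)

    def get_batch_size(self):
        return self.batches[0].shape[0]

    def get_batch(self, names):
        if self.idx >= len(self.batches):
            return None                       # no more batches: done
        cuda.memcpy_htod(self.dev_input,
                         np.ascontiguousarray(self.batches[self.idx]))
        self.idx += 1
        return [int(self.dev_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, 'rb') as f:
                return f.read()
        except FileNotFoundError:
            return None                       # no cache yet: calibrate

    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)
```

The calibrator is then attached to the builder config with `config.set_flag(trt.BuilderFlag.INT8)` and `config.int8_calibrator = YOLOEntropyCalibrator(batches)`.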
Thanks for your work @jkjung-avt! I will take a look at it after the weekend and try INT8 myself.
One question regarding the DLA core test: I also tried running Yolov4 on the DLA cores, but I could only run around 7% of the layers on them due to unsupported layers or layers not fitting in memory (tested under TensorRT 7.1.3 and 7.2). Did you experience something similar? I was hoping I might be able to run one model on the DLA to free up GPU resources.
@Unfixab1e When building the TensorRT engines for the DLA core, I did see TensorRT log quite a few warnings of "layers not supported on DLA core, falling back to GPU". I'm not exactly sure how much of the engine runs on the DLA core and how much on the GPU. I think we could monitor GPU usage with tegrastats when running the DLA TensorRT engine.
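For reference, enabling a DLA core with GPU fallback in the TensorRT 7.x Python API looks roughly like this (a minimal sketch, assuming `network` is an already-populated INetworkDefinition, e.g. parsed from ONNX):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
config = builder.create_builder_config()

config.max_workspace_size = 1 << 30
config.set_flag(trt.BuilderFlag.FP16)           # DLA requires FP16 or INT8
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)   # unsupported layers go to GPU
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0                             # 0 or 1 on Xavier

# engine = builder.build_engine(network, config)
```

The GPU_FALLBACK flag is what produces the "Layers running on DLA / GPU" breakdown in the build log below.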
The following is what I saw when I tried to build a DLA0 TensorRT engine for the "yolov4-dla0-608" model on Jetson Xavier NX. Judging from this log, the majority of layers indeed still run on the GPU...
[TensorRT] INFO: --------------- Layers running on DLA:
[TensorRT] INFO: {002_convolutional_tanh,002_convolutional_mish,003_convolutional,003_convolutional_bn,005_convolutional,005_convolutional_bn},
{003_convolutional_tanh,003_convolutional_mish,005_convolutional_tanh,005_convolutional_mish,006_convolutional,006_convolutional_bn},
{012_convolutional_tanh,012_convolutional_mish,013_convolutional,013_convolutional_bn,015_convolutional,015_convolutional_bn},
{013_convolutional_tanh,013_convolutional_mish,015_convolutional_tanh,015_convolutional_mish,016_convolutional,016_convolutional_bn},
{025_convolutional_tanh,025_convolutional_mish,026_convolutional,026_convolutional_bn,028_convolutional,028_convolutional_bn},
{026_convolutional_tanh,026_convolutional_mish,028_convolutional_tanh,028_convolutional_mish,029_convolutional,029_convolutional_bn},
{055_convolutional_tanh,055_convolutional_mish,056_convolutional,056_convolutional_bn,131_convolutional,131_convolutional_bn},
{056_convolutional_tanh,056_convolutional_mish,057_convolutional,057_convolutional_bn,059_convolutional,059_convolutional_bn},
[TensorRT] INFO: --------------- Layers running on GPU:
[TensorRT] INFO: 001_convolutional, 001_convolutional_softplus, 001_convolutional_tanh, 001_convolutional_mish, 002_convolutional, 002_convolutional_softplus,
003_convolutional_softplus, 005_convolutional_softplus, 006_convolutional_softplus, 006_convolutional_tanh, 006_convolutional_mish,
007_convolutional, 007_convolutional_softplus, 007_convolutional_tanh, 007_convolutional_mish, 008_shortcut,
009_convolutional, 009_convolutional_softplus, 009_convolutional_tanh, 009_convolutional_mish, 003_convolutional_mish copy,
011_convolutional, 011_convolutional_softplus, 011_convolutional_tanh, 011_convolutional_mish, 012_convolutional, 012_convolutional_softplus,
013_convolutional_softplus, 015_convolutional_softplus, 016_convolutional_softplus, 016_convolutional_tanh, 016_convolutional_mish,
017_convolutional, 017_convolutional_softplus, 017_convolutional_tanh, 017_convolutional_mish, 018_shortcut,
019_convolutional, 019_convolutional_softplus, 019_convolutional_tanh, 019_convolutional_mish,
020_convolutional, 020_convolutional_softplus, 020_convolutional_tanh, 020_convolutional_mish, 021_shortcut,
022_convolutional, 022_convolutional_softplus, 022_convolutional_tanh, 022_convolutional_mish, 013_convolutional_mish copy,
024_convolutional, 024_convolutional_softplus, 024_convolutional_tanh, 024_convolutional_mish,
025_convolutional, 025_convolutional_softplus, 026_convolutional_softplus, 028_convolutional_softplus,
029_convolutional_softplus, 029_convolutional_tanh, 029_convolutional_mish,
030_convolutional, 030_convolutional_softplus, 030_convolutional_tanh, 030_convolutional_mish, 031_shortcut,
032_convolutional, 032_convolutional_softplus, 032_convolutional_tanh, 032_convolutional_mish,
033_convolutional, 033_convolutional_softplus, 033_convolutional_tanh, 033_convolutional_mish, 034_shortcut,
035_convolutional, 035_convolutional_softplus, 035_convolutional_tanh, 035_convolutional_mish,
036_convolutional, 036_convolutional_softplus, 036_convolutional_tanh, 036_convolutional_mish, 037_shortcut,
038_convolutional, 038_convolutional_softplus, 038_convolutional_tanh, 038_convolutional_mish,
039_convolutional, 039_convolutional_softplus, 039_convolutional_tanh, 039_convolutional_mish, 040_shortcut,
041_convolutional, 041_convolutional_softplus, 041_convolutional_tanh, 041_convolutional_mish,
042_convolutional, 042_convolutional_softplus, 042_convolutional_tanh, 042_convolutional_mish, 043_shortcut,
044_convolutional, 044_convolutional_softplus, 044_convolutional_tanh, 044_convolutional_mish,
045_convolutional, 045_convolutional_softplus, 045_convolutional_tanh, 045_convolutional_mish, 046_shortcut,
047_convolutional, 047_convolutional_softplus, 047_convolutional_tanh, 047_convolutional_mish,
048_convolutional, 048_convolutional_softplus, 048_convolutional_tanh, 048_convolutional_mish, 049_shortcut,
050_convolutional, 050_convolutional_softplus, 050_convolutional_tanh, 050_convolutional_mish,
051_convolutional, 051_convolutional_softplus, 051_convolutional_tanh, 051_convolutional_mish, 052_shortcut,
053_convolutional, 053_convolutional_softplus, 053_convolutional_tanh, 053_convolutional_mish, 026_convolutional_mish copy,
055_convolutional, 055_convolutional_softplus, 056_convolutional_softplus, 131_convolutional_lrelu,
057_convolutional_softplus, 059_convolutional_softplus, 057_convolutional_tanh, 057_convolutional_mish, 059_convolutional_tanh, 059_convolutional_mish,
060_convolutional, 060_convolutional_softplus, 060_convolutional_tanh, 060_convolutional_mish,
061_convolutional, 061_convolutional_softplus, 061_convolutional_tanh, 061_convolutional_mish, 062_shortcut,
063_convolutional, 063_convolutional_softplus, 063_convolutional_tanh, 063_convolutional_mish,
064_convolutional, 064_convolutional_softplus, 064_convolutional_tanh, 064_convolutional_mish, 065_shortcut,
066_convolutional, 066_convolutional_softplus, 066_convolutional_tanh, 066_convolutional_mish,
067_convolutional, 067_convolutional_softplus, 067_convolutional_tanh, 067_convolutional_mish, 068_shortcut,
069_convolutional, 069_convolutional_softplus, 069_convolutional_tanh, 069_convolutional_mish,
070_convolutional, 070_convolutional_softplus, 070_convolutional_tanh, 070_convolutional_mish, 071_shortcut,
072_convolutional, 072_convolutional_softplus, 072_convolutional_tanh, 072_convolutional_mish,
073_convolutional, 073_convolutional_softplus, 073_convolutional_tanh, 073_convolutional_mish, 074_shortcut,
075_convolutional, 075_convolutional_softplus, 075_convolutional_tanh, 075_convolutional_mish,
076_convolutional, 076_convolutional_softplus, 076_convolutional_tanh, 076_convolutional_mish, 077_shortcut,
078_convolutional, 078_convolutional_softplus, 078_convolutional_tanh, 078_convolutional_mish,
079_convolutional, 079_convolutional_softplus, 079_convolutional_tanh, 079_convolutional_mish, 080_shortcut,
081_convolutional, 081_convolutional_softplus, 081_convolutional_tanh, 081_convolutional_mish,
082_convolutional, 082_convolutional_softplus, 082_convolutional_tanh, 082_convolutional_mish, 083_shortcut,
084_convolutional, 084_convolutional_softplus, 084_convolutional_tanh, 084_convolutional_mish,
086_convolutional, 086_convolutional_softplus, 086_convolutional_tanh, 086_convolutional_mish,
087_convolutional, 121_convolutional, 087_convolutional_softplus, 121_convolutional_lrelu, 087_convolutional_tanh, 087_convolutional_mish,
088_convolutional, 090_convolutional, 088_convolutional_softplus, 090_convolutional_softplus, 088_convolutional_tanh, 088_convolutional_mish, 090_convolutional_tanh, 090_convolutional_mish,
091_convolutional, 091_convolutional_softplus, 091_convolutional_tanh, 091_convolutional_mish,
092_convolutional, 092_convolutional_softplus, 092_convolutional_tanh, 092_convolutional_mish, 093_shortcut,
094_convolutional, 094_convolutional_softplus, 094_convolutional_tanh, 094_convolutional_mish,
095_convolutional, 095_convolutional_softplus, 095_convolutional_tanh, 095_convolutional_mish, 096_shortcut,
097_convolutional, 097_convolutional_softplus, 097_convolutional_tanh, 097_convolutional_mish,
098_convolutional, 098_convolutional_softplus, 098_convolutional_tanh, 098_convolutional_mish, 099_shortcut,
100_convolutional, 100_convolutional_softplus, 100_convolutional_tanh, 100_convolutional_mish,
101_convolutional, 101_convolutional_softplus, 101_convolutional_tanh, 101_convolutional_mish, 102_shortcut,
103_convolutional, 103_convolutional_softplus, 103_convolutional_tanh, 103_convolutional_mish,
105_convolutional, 105_convolutional_softplus, 105_convolutional_tanh, 105_convolutional_mish,
106_convolutional, 106_convolutional_lrelu, 107_convolutional, 107_convolutional_lrelu, 108_convolutional, 108_convolutional_lrelu,
111_maxpool, 113_maxpool, 109_maxpool, 108_convolutional_lrelu copy,
115_convolutional, 115_convolutional_lrelu, 116_convolutional, 116_convolutional_lrelu, 117_convolutional, 117_convolutional_lrelu, 118_convolutional, 118_convolutional_lrelu,
119_upsample, 119_upsample copy,
123_convolutional, 123_convolutional_lrelu, 124_convolutional, 124_convolutional_lrelu, 125_convolutional, 125_convolutional_lrelu, 126_convolutional, 126_convolutional_lrelu, 127_convolutional, 127_convolutional_lrelu, 128_convolutional, 128_convolutional_lrelu,
129_upsample, 129_upsample copy,
133_convolutional, 133_convolutional_lrelu, 134_convolutional, 134_convolutional_lrelu, 135_convolutional, 135_convolutional_lrelu, 136_convolutional, 136_convolutional_lrelu, 137_convolutional, 137_convolutional_lrelu,
138_convolutional, 142_convolutional, 138_convolutional_lrelu, 142_convolutional_lrelu, 139_convolutional, 144_convolutional, 144_convolutional_lrelu, (Unnamed Layer* 506) [PluginV2IOExt],
145_convolutional, 145_convolutional_lrelu, 146_convolutional, 146_convolutional_lrelu, 147_convolutional, 147_convolutional_lrelu, 148_convolutional, 148_convolutional_lrelu,
149_convolutional, 153_convolutional, 149_convolutional_lrelu, 153_convolutional_lrelu, 150_convolutional, 155_convolutional, 155_convolutional_lrelu, (Unnamed Layer* 507) [PluginV2IOExt],
156_convolutional, 156_convolutional_lrelu, 157_convolutional, 157_convolutional_lrelu, 158_convolutional, 158_convolutional_lrelu, 159_convolutional, 159_convolutional_lrelu, 160_convolutional, 160_convolutional_lrelu,
161_convolutional, (Unnamed Layer* 508) [PluginV2IOExt],
Yes, it looks nearly the same as my TensorRT output. I took a quick look at GPU utilization with the DLA model versus the regular GPU model, and it seems to be about the same. So I guess there is no reason to offload a few layers to DLA, at least with the current state of support.
My output:
--------------- Layers running on DLA:
[09/30/2020-13:12:43] [I] [TRT] {Tanh_3,Mul_4,Conv_5,BatchNormalization_6,scale_operand_of_Sub_705,scale_operand_of_Sub_1073,scale_operand_of_Sub_1441},
{Tanh_8,Mul_9,Conv_10,Conv_15,BatchNormalization_11,BatchNormalization_16}, {Tanh_13,Tanh_18,Mul_14,Mul_19,Conv_20,BatchNormalization_21},
{Tanh_45,Mul_46,Conv_47,Conv_52,BatchNormalization_48,BatchNormalization_53}, {Tanh_50,Tanh_55,Mul_51,Mul_56,Conv_57,BatchNormalization_58},
{Tanh_93,Mul_94,Conv_95,Conv_100,BatchNormalization_96,BatchNormalization_101}, {Tanh_98,Tanh_103,Mul_99,Mul_104,Conv_105,BatchNormalization_106},
{Tanh_202,Mul_203,Conv_204,Conv_567,BatchNormalization_205,BatchNormalization_568},
[09/30/2020-13:12:43] [I] [TRT] --------------- Layers running on GPU:
[09/30/2020-13:12:43] [I] [TRT] Conv_0, 2253[Constant], 1865[Constant], 1477[Constant],
(Unnamed Layer* 692) [Constant] + (Unnamed Layer* 693) [Shuffle], (Unnamed Layer* 970) [Constant] + (Unnamed Layer* 971) [Shuffle], (Unnamed Layer* 1248) [Constant] + (Unnamed Layer* 1249) [Shuffle],
Softplus_2, Cast_802, Cast_807, (Unnamed Layer* 750) [Shuffle], (Unnamed Layer* 756) [Shuffle], Cast_1170, Cast_1175, (Unnamed Layer* 1028) [Shuffle], (Unnamed Layer* 1034) [Shuffle], Cast_1538, Cast_1543, (Unnamed Layer* 1306) [Shuffle], (Unnamed Layer* 1312) [Shuffle],
Softplus_7, Softplus_12, Softplus_17, PWN(PWN(Softplus_22, Tanh_23), Mul_24), Conv_25, PWN(PWN(PWN(Softplus_27, Tanh_28), Mul_29), Add_30), Conv_31, PWN(PWN(Softplus_33, Tanh_34), Mul_35), 663 copy, Conv_37, PWN(PWN(Softplus_39, Tanh_40), Mul_41), Conv_42,
Softplus_44, Softplus_49, Softplus_54, PWN(PWN(Softplus_59, Tanh_60), Mul_61), Conv_62, PWN(PWN(PWN(Softplus_64, Tanh_65), Mul_66), Add_67), Conv_68, PWN(PWN(Softplus_70, Tanh_71), Mul_72), Conv_73, PWN(PWN(PWN(Softplus_75, Tanh_76), Mul_77), Add_78), Conv_79, PWN(PWN(Softplus_81, Tanh_82), Mul_83), 700 copy, Conv_85, PWN(PWN(Softplus_87, Tanh_88), Mul_89), Conv_90,
Softplus_92, Softplus_97, Softplus_102, PWN(PWN(Softplus_107, Tanh_108), Mul_109), Conv_110, PWN(PWN(PWN(Softplus_112, Tanh_113), Mul_114), Add_115), Conv_116, PWN(PWN(Softplus_118, Tanh_119), Mul_120), Conv_121, PWN(PWN(PWN(Softplus_123, Tanh_124), Mul_125), Add_126), Conv_127, PWN(PWN(Softplus_129, Tanh_130), Mul_131), Conv_132, PWN(PWN(PWN(Softplus_134, Tanh_135), Mul_136), Add_137), Conv_138, PWN(PWN(Softplus_140, Tanh_141), Mul_142), Conv_143, PWN(PWN(PWN(Softplus_145, Tanh_146), Mul_147), Add_148), Conv_149, PWN(PWN(Softplus_151, Tanh_152), Mul_153), Conv_154, PWN(PWN(PWN(Softplus_156, Tanh_157), Mul_158), Add_159), Conv_160, PWN(PWN(Softplus_162, Tanh_163), Mul_164), Conv_165, PWN(PWN(PWN(Softplus_167, Tanh_168), Mul_169), Add_170), Conv_171, PWN(PWN(Softplus_173, Tanh_174), Mul_175), Conv_176, PWN(PWN(PWN(Softplus_178, Tanh_179), Mul_180), Add_181), Conv_182, PWN(PWN(Softplus_184, Tanh_185), Mul_186), Conv_187, PWN(PWN(PWN(Softplus_189, Tanh_190), Mul_191), Add_192), Conv_193, PWN(PWN(Softplus_195, Tanh_196), Mul_197), 748 copy, Conv_199,
Softplus_201, LeakyRelu_569, PWN(PWN(Softplus_206, Tanh_207), Mul_208), Conv_209 || Conv_214, PWN(PWN(Softplus_211, Tanh_212), Mul_213), PWN(PWN(Softplus_216, Tanh_217), Mul_218), Conv_219, PWN(PWN(Softplus_221, Tanh_222), Mul_223), Conv_224, PWN(PWN(PWN(Softplus_226, Tanh_227), Mul_228), Add_229), Conv_230, PWN(PWN(Softplus_232, Tanh_233), Mul_234), Conv_235, PWN(PWN(PWN(Softplus_237, Tanh_238), Mul_239), Add_240), Conv_241, PWN(PWN(Softplus_243, Tanh_244), Mul_245), Conv_246, PWN(PWN(PWN(Softplus_248, Tanh_249), Mul_250),
Add_251), Conv_252, PWN(PWN(Softplus_254, Tanh_255), Mul_256), Conv_257, PWN(PWN(PWN(Softplus_259, Tanh_260), Mul_261), Add_262), Conv_263, PWN(PWN(Softplus_265, Tanh_266), Mul_267), Conv_268, PWN(PWN(PWN(Softplus_270, Tanh_271), Mul_272), Add_273), Conv_274, PWN(PWN(Softplus_276, Tanh_277), Mul_278), Conv_279, PWN(PWN(PWN(Softplus_281, Tanh_282), Mul_283), Add_284), Conv_285, PWN(PWN(Softplus_287, Tanh_288), Mul_289), Conv_290, PWN(PWN(PWN(Softplus_292, Tanh_293), Mul_294), Add_295), Conv_296, PWN(PWN(Softplus_298, Tanh_299), Mul_300), Conv_301, PWN(PWN(PWN(Softplus_303, Tanh_304), Mul_305), Add_306), Conv_307, PWN(PWN(Softplus_309, Tanh_310), Mul_311), Conv_313, PWN(PWN(Softplus_315, Tanh_316), Mul_317), Conv_318, Conv_479, LeakyRelu_481, PWN(PWN(Softplus_320, Tanh_321), Mul_322), Conv_323 || Conv_328, PWN(PWN(Softplus_325, Tanh_326), Mul_327), PWN(PWN(Softplus_330, Tanh_331), Mul_332), Conv_333, PWN(PWN(Softplus_335, Tanh_336), Mul_337), Conv_338, PWN(PWN(PWN(Softplus_340, Tanh_341), Mul_342), Add_343), Conv_344, PWN(PWN(Softplus_346, Tanh_347), Mul_348), Conv_349, PWN(PWN(PWN(Softplus_351, Tanh_352), Mul_353), Add_354), Conv_355, PWN(PWN(Softplus_357, Tanh_358), Mul_359), Conv_360, PWN(PWN(PWN(Softplus_362, Tanh_363), Mul_364), Add_365), Conv_366, PWN(PWN(Softplus_368, Tanh_369), Mul_370), Conv_371, PWN(PWN(PWN(Softplus_373, Tanh_374), Mul_375), Add_376), Conv_377, PWN(PWN(Softplus_379, Tanh_380), Mul_381), Conv_383, PWN(PWN(Softplus_385, Tanh_386), Mul_387), Conv_388, LeakyRelu_390, Conv_391, LeakyRelu_393, Conv_394, LeakyRelu_396, MaxPool_398, MaxPool_399, MaxPool_397, 1045 copy, Conv_401, LeakyRelu_403, Conv_404, LeakyRelu_406, Conv_407, LeakyRelu_409, Conv_410, LeakyRelu_412, Reshape_430, Expand_456, Reshape_478, Conv_483, LeakyRelu_485, Conv_486, LeakyRelu_488, Conv_489, LeakyRelu_491, Conv_492, LeakyRelu_494, Conv_495, LeakyRelu_497, Conv_498, LeakyRelu_500, Reshape_518, Expand_544, Reshape_566, Conv_571, LeakyRelu_573, Conv_574, LeakyRelu_576, Conv_577, LeakyRelu_579, Conv_580, LeakyRelu_582, Conv_583, LeakyRelu_585, Conv_586, Conv_935, LeakyRelu_588, LeakyRelu_937, Conv_589, Conv_939, Slice_594, Slice_599, Slice_604, Slice_609, Slice_614, Slice_619, Slice_624, Slice_629, Slice_634, Slice_639, Slice_644, Slice_649, LeakyRelu_941, Conv_942, PWN(PWN(Sigmoid_701, (Unnamed Layer* 689) [Constant] + (Unnamed Layer* 690) [Shuffle] + Mul_703), Sub_705), Exp_706, Slice_727, Slice_734, Slice_755, Slice_762, Slice_783, Slice_790, Slice_713, Slice_720, Slice_741, Slice_748, Slice_769, Slice_776, Reshape_683 + Transpose_684, LeakyRelu_944, Reshape_668, Sigmoid_707, Reshape_700, Reshape_933, PWN(Sigmoid_708, Mul_934), (Unnamed Layer* 705) [Constant] + (Unnamed Layer* 706) [Shuffle] + Mul_729, (Unnamed Layer* 709) [Constant] + (Unnamed Layer* 710) [Shuffle] + Mul_736, (Unnamed Layer* 719) [Constant] + (Unnamed Layer* 720) [Shuffle] + Mul_757, (Unnamed Layer* 723) [Constant] + (Unnamed Layer* 724) [Shuffle] + Mul_764, (Unnamed Layer* 733) [Constant] + (Unnamed Layer* 734) [Shuffle] + Mul_785, (Unnamed Layer* 737) [Constant] + (Unnamed Layer* 738) [Shuffle] + Mul_792, (Unnamed Layer* 699) [Constant] + Add_715, (Unnamed Layer* 702) [Constant] + Add_722, (Unnamed Layer* 713) [Constant] + Add_743, (Unnamed Layer* 716) [Constant] + Add_750, (Unnamed Layer* 727) [Constant] + Add_771, (Unnamed Layer* 730) [Constant] + Add_778, Conv_945, 1464 copy, 1466 copy, 1465 copy, 1467 copy, LeakyRelu_947, Div_803, Div_808, Slice_813, Slice_855, Slice_834, Slice_876, Reshape_829, Reshape_871, Reshape_850, 
Reshape_892, PWN(PWN((Unnamed Layer* 830) [Constant] + (Unnamed Layer* 831) [Shuffle], Mul_894), Sub_895), PWN(PWN((Unnamed Layer* 834) [Constant] + (Unnamed Layer* 835) [Shuffle], Mul_897), Sub_898), Add_899, Add_900, 1574 copy, 1577 copy, 1578 copy, 1579 copy, Reshape_917, Conv_948, LeakyRelu_950, Conv_951, LeakyRelu_953, Conv_954, Conv_1303, LeakyRelu_956, LeakyRelu_1305, Conv_957, 2010 copy, 1058 copy, Conv_1307, Slice_962, Slice_967, Slice_972, Slice_977, Slice_982, Slice_987, Slice_992, Slice_997, Slice_1002, Slice_1007, Slice_1012, Slice_1017, LeakyRelu_1309, 1647 copy, 1667 copy, 1687 copy, 1652 copy, 1672 copy, 1692 copy, 1657 copy, 1677 copy, 1697 copy, 1662 copy, 1682 copy, 1702 copy, Conv_1310, PWN(PWN(Sigmoid_1069, (Unnamed Layer* 967) [Constant] + (Unnamed Layer* 968) [Shuffle] + Mul_1071), Sub_1073), Exp_1074, Slice_1095, Slice_1102, Slice_1123, Slice_1130, Slice_1151, Slice_1158, Slice_1081, Slice_1088, Slice_1109, Slice_1116, Slice_1137, Slice_1144, Reshape_1051 + Transpose_1052, LeakyRelu_1312, Reshape_1036, Sigmoid_1075, Reshape_1068, Reshape_1301, PWN(Sigmoid_1076, Mul_1302), (Unnamed Layer* 983) [Constant] + (Unnamed Layer* 984) [Shuffle] + Mul_1097, (Unnamed Layer* 987) [Constant] + (Unnamed Layer* 988) [Shuffle] + Mul_1104, (Unnamed Layer* 997) [Constant] + (Unnamed Layer* 998) [Shuffle] + Mul_1125, (Unnamed Layer* 1001) [Constant] + (Unnamed Layer* 1002) [Shuffle] + Mul_1132, (Unnamed Layer* 1011) [Constant] + (Unnamed Layer* 1012) [Shuffle] + Mul_1153, (Unnamed Layer* 1015) [Constant] + (Unnamed Layer* 1016) [Shuffle] + Mul_1160, 1788 copy, 1816 copy, 1844 copy, 1795 copy, 1823 copy, 1851 copy, (Unnamed Layer* 977) [Constant] + Add_1083, (Unnamed Layer* 980) [Constant] + Add_1090, (Unnamed Layer* 991) [Constant] + Add_1111, (Unnamed Layer* 994) [Constant] + Add_1118, (Unnamed Layer* 1005) [Constant] + Add_1139, (Unnamed Layer* 1008) [Constant] + Add_1146, Conv_1313, 1774 copy, 1802 copy, 1830 copy, 1781 copy, 1809 copy, 1837 copy, 1852 copy, 1854 copy, 1853 copy, 1855 copy, LeakyRelu_1315, Div_1171, Div_1176, Slice_1181, Slice_1223, Slice_1202, Slice_1244, Reshape_1197, Reshape_1239, Reshape_1218, Reshape_1260, PWN(PWN((Unnamed Layer* 1108) [Constant] + (Unnamed Layer* 1109) [Shuffle], Mul_1262), Sub_1263), PWN(PWN((Unnamed Layer* 1112) [Constant] + (Unnamed Layer* 1113) [Shuffle], Mul_1265), Sub_1266), Add_1267, Add_1268, 1962 copy, 1965 copy, 1966 copy, 1967 copy, Reshape_1285, Conv_1316, LeakyRelu_1318, Conv_1319, LeakyRelu_1321, Conv_1322, LeakyRelu_1324, Conv_1325, Slice_1330, Slice_1335, Slice_1340, Slice_1345, Slice_1350, Slice_1355, Slice_1360, Slice_1365, Slice_1370, Slice_1375, Slice_1380, Slice_1385, 2035 copy, 2055 copy, 2075 copy, 2040 copy, 2060 copy, 2080 copy, 2045 copy, 2065 copy, 2085 copy, 2050 copy, 2070 copy, 2090 copy, PWN(PWN(Sigmoid_1437, (Unnamed Layer* 1245) [Constant] + (Unnamed Layer* 1246) [Shuffle] + Mul_1439), Sub_1441), Exp_1442, Slice_1463, Slice_1470, Slice_1491, Slice_1498, Slice_1519, Slice_1526, Slice_1449, Slice_1456, Slice_1477, Slice_1484, Slice_1505, Slice_1512, Reshape_1419 + Transpose_1420, Reshape_1404, Sigmoid_1443, Reshape_1436, Reshape_1669, PWN(Sigmoid_1444, Mul_1670), 1619 copy, 2007 copy, 2395 copy, (Unnamed Layer* 1261) [Constant] + (Unnamed Layer* 1262) [Shuffle] + Mul_1465, (Unnamed Layer* 1265) [Constant] + (Unnamed Layer* 1266) [Shuffle] + Mul_1472, (Unnamed Layer* 1275) [Constant] + (Unnamed Layer* 1276) [Shuffle] + Mul_1493, (Unnamed Layer* 1279) [Constant] + (Unnamed Layer* 1280) [Shuffle] + Mul_1500, 
(Unnamed Layer* 1289) [Constant] + (Unnamed Layer* 1290) [Shuffle] + Mul_1521, (Unnamed Layer* 1293) [Constant] + (Unnamed Layer* 1294) [Shuffle] + Mul_1528, 2176 copy, 2204 copy, 2232 copy, 2183 copy, 2211 copy, 2239 copy, (Unnamed Layer* 1255) [Constant] + Add_1451, (Unnamed Layer* 1258) [Constant] + Add_1458, (Unnamed Layer* 1269) [Constant] + Add_1479, (Unnamed Layer* 1272) [Constant] + Add_1486, (Unnamed Layer* 1283) [Constant] + Add_1507, (Unnamed Layer* 1286) [Constant] + Add_1514, 2162 copy, 2190 copy, 2218 copy, 2169 copy, 2197 copy, 2225 copy, 2240 copy, 2242 copy, 2241 copy, 2243 copy, Div_1539, Div_1544, Slice_1549, Slice_1591, Slice_1570, Slice_1612, Reshape_1565, Reshape_1607, Reshape_1586, Reshape_1628, PWN(PWN((Unnamed Layer* 1386) [Constant] + (Unnamed Layer* 1387) [Shuffle], Mul_1630), Sub_1631), PWN(PWN((Unnamed Layer* 1390) [Constant] + (Unnamed Layer* 1391) [Shuffle], Mul_1633), Sub_1634), Add_1635, Add_1636, 2350 copy, 2353 copy, 2354 copy, 2355 copy, Reshape_1653, 1600 copy, 1988 copy, 2376 copy,
I don't have anything to share for DeepStream at the moment. I'll close this issue since the INT8 implementation is done.
@jkjung-avt I am sorry for opening this again. I am very impressed with your work, especially the latest addition of INT8 inference and DLA support. How hard is it to make the model run on both DLAs and the GPU at the same time? Thanks.
@MohammadKassemZein: I did some more profiling on DLA+GPU inference for Yolov4 object detection and DeepLab semantic segmentation, and found that in both cases the memory footprint goes up slightly while speed goes down significantly. So even if you run inference partly on the DLA core, you don't free up resources with these two networks. You could try a smaller network like Tiny YOLO or a small ResNet and see if you can execute it entirely on the DLA cores, but for the two models listed there is no benefit in running them on the DLA cores.
@Unfixab1e What I mean is that NVIDIA's benchmark for YOLOv3-tiny is ~550 FPS (on the Jetson Xavier NX), given that the model runs on the 2 DLAs and the GPU at the highest clock frequencies. Link: https://github.com/NVIDIA-AI-IOT/jetson_benchmarks However, they do not explain how to actually arrive at such a model. So, since @jkjung-avt was able to build the models for both DLA cores, I was wondering how I can achieve a speed near the benchmark provided by NVIDIA.
@MohammadKassemZein I concur with @Unfixab1e. I tested yolov4-608 and ran 3 instances (3 processes) of trt_yolo.py on my Jetson Xavier NX, using DLA0, DLA1 and the GPU respectively. But the aggregate FPS is actually lower than just running 1 engine on the GPU alone. To make good use of the DLA cores, you'd need to design the model so that it can run completely on a DLA core.
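For anyone reproducing this test: which DLA core a process uses can be selected on the runtime before deserializing the engine. A minimal sketch (the engine file name here is hypothetical):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
with open('yolov4-dla1-608.trt', 'rb') as f:   # hypothetical file name
    engine_data = f.read()

runtime = trt.Runtime(logger)
runtime.DLA_core = 1   # run this engine's DLA layers on the second DLA core
engine = runtime.deserialize_cuda_engine(engine_data)
```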
Reference: https://forums.developer.nvidia.com/t/yolov3-fps-on-tensorrt/109982/23
@MohammadKassemZein Referring to https://github.com/jkjung-avt/tensorrt_demos/issues/200#issuecomment-703056235, you can see that most of the yolov4 layers still run on the GPU (not the DLA core).
@jkjung-avt Thanks! I see. So, what do you suggest if I want to increase my inference speed to beyond 100 FPS (regardless of whether it is yolov3-tiny or yolov4-tiny)?
I measured how much time "inference_fn()" takes on my Jetson Xavier NX DevKit. For the "yolov4-tiny-416" (FP16) and "yolov4-tiny-int8-416" (INT8) engines, it only takes roughly 6.4 ms and 5.1 ms respectively. Comparing against the FPS numbers in my README.md, I conclude that a significant portion of the time is spent on image preprocessing, postprocessing and display.
TensorRT engine | FP16 (FPS) | INT8 (FPS)
---|---|---
yolov4-tiny-416 | 57 | 60
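For reference, those per-inference timings can be reproduced with a simple loop like the sketch below; it assumes inference_fn() blocks until the GPU finishes, which is the case when the CUDA stream is synchronized inside the call:

```python
import time

def measure_ms(inference_fn, img, n_warmup=10, n_runs=100):
    # Warm-up runs exclude one-time CUDA/engine initialization cost.
    for _ in range(n_warmup):
        inference_fn(img)
    t0 = time.time()
    for _ in range(n_runs):
        inference_fn(img)
    return (time.time() - t0) / n_runs * 1000.0  # average ms per inference
```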
So I think if you use C++ to parallelize/pipeline preprocessing, inference and postprocessing, you could speed up the whole video pipeline quite a bit and easily achieve >100 FPS with "yolov4-tiny-int8-416". One good way to achieve that is to use the NVIDIA DeepStream SDK...
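As a rough illustration of the pipelining idea in Python, here is a minimal sketch that overlaps capture/preprocessing with inference via a worker thread; preprocess(), inference_fn() and postprocess() are placeholders for the corresponding stages:

```python
import threading
import queue

def pipelined_loop(cap, preprocess, inference_fn, postprocess, num_frames):
    """Overlap capture/preprocessing with TensorRT inference: while the
    GPU runs inference on frame N, the CPU already prepares frame N+1."""
    q = queue.Queue(maxsize=2)       # small queue keeps latency bounded

    def producer():
        for _ in range(num_frames):
            ret, frame = cap.read()
            if not ret:
                break
            q.put((frame, preprocess(frame)))
        q.put(None)                  # sentinel: no more frames

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is None:
            break
        frame, tensor = item
        detections = inference_fn(tensor)      # GPU-bound
        yield postprocess(frame, detections)   # CPU-bound
```

In pure Python this only helps because OpenCV/numpy/TensorRT calls release the GIL; a C++ pipeline or DeepStream takes the same idea much further.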
By the way, I've also discussed my own opinion about how to achieve the best FPS performance on Jetson platforms here and here. Please take a look.
@jkjung-avt Thank you very much for your help!
> I'm already getting around 8/25/41 FPS for FP32/FP16/INT8 modes on the Jetson Xavier running Yolov4 with 704x704 input [...]
Xavier NX or Xavier AGX?
Hey @jkjung-avt, I didn't see any code for INT8 inference, so I tried to implement it myself and managed to get the calibrator working for yolov4/INT8. Do you have any plans to add INT8 support to your eval_yolo code? It would be very useful!
Also, I managed to deploy the FP16 and FP32 models using the DeepStream SDK, but only after replacing your yolo_to_onnx.py code. I saw some people struggling with DeepStream integration, so my advice is to check out this repo and grab the Darknet2ONNX code from there.
Sadly, my INT8 calibration has some issues (no objects detected at all), so I'm wondering if the ONNX model I created from the second repo is the cause...
I saw that DeepStream expects this kind of YOLO network output:

INFO: [Implicit Engine Info]: layers num: 3
0 INPUT kFLOAT input 3x704x704
1 OUTPUT kFLOAT boxes 30492x1x4
2 OUTPUT kFLOAT confs 30492x2

But your ONNX generator creates this output:

INFO: [Implicit Engine Info]: layers num: 4
0 INPUT kFLOAT 000_net 3x704x704
1 OUTPUT kFLOAT 139_convolutional 21x88x88
2 OUTPUT kFLOAT 150_convolutional 21x44x44
3 OUTPUT kFLOAT 161_convolutional 21x22x22

(The numbers are consistent with each other: 21 = 3 anchors × (5 + 2 classes) per grid cell, and 30492 = 3 × (88×88 + 44×44 + 22×22) candidate boxes, so the difference is only in how the raw head outputs are decoded into boxes and confidences.)
I would like to make it work with your conversion code, but I'm stuck. Anyway, great work, and your blog posts were very helpful; I managed to run FP32 and FP16 at least 👍