Issue from ylab604 (closed):

> Thank you for the great work! I have a question: what is the purpose of specifying `--infStreams` when creating an engine with trtexec? From my experience, setting `--infStreams=4` on a 4090 did not result in any speed improvement. Why is this argument used?
My recommended commands are based on benchmarking I did a while ago; I tested a lot of combinations and picked the fastest one for the README. From the trtexec help:

> `--infStreams=N`  Instantiate N engines to run inference concurrently (default = 1)
I tested it again with TensorRT 9.3 on an RTX 4090 at 1080p.
Engine 1 (`--infStreams=1`):

```bash
trtexec --bf16 --fp16 --onnx=2x_AnimeJaNai_HD_V3_Sharp1_UltraCompact_425k_clamp_fp16_op18_onnxslim.onnx --minShapes=input:1x3x8x8 --optShapes=input:1x3x720x1280 --maxShapes=input:1x3x1080x1920 --saveEngine=2x_AnimeJaNai_HD_V3_Sharp1_UltraCompact_425k_clamp_fp16_op18_onnxslim_infStreams1.engine --tacticSources=+CUDNN,-CUBLAS,-CUBLAS_LT --skipInference --useCudaGraph --noDataTransfers --builderOptimizationLevel=5 --infStreams=1
```

Engine 2 (`--infStreams=4`):

```bash
trtexec --bf16 --fp16 --onnx=2x_AnimeJaNai_HD_V3_Sharp1_UltraCompact_425k_clamp_fp16_op18_onnxslim.onnx --minShapes=input:1x3x8x8 --optShapes=input:1x3x720x1280 --maxShapes=input:1x3x1080x1920 --saveEngine=2x_AnimeJaNai_HD_V3_Sharp1_UltraCompact_425k_clamp_fp16_op18_onnxslim_infStreams4.engine --tacticSources=+CUDNN,-CUBLAS,-CUBLAS_LT --skipInference --useCudaGraph --noDataTransfers --builderOptimizationLevel=5 --infStreams=4
```
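Since the two builds differ only in `--infStreams`, they can also be generated in one loop; a sketch using the same flags as the commands above:

```bash
# Build one engine per infStreams value for an A/B comparison.
for N in 1 4; do
  trtexec --bf16 --fp16 \
    --onnx=2x_AnimeJaNai_HD_V3_Sharp1_UltraCompact_425k_clamp_fp16_op18_onnxslim.onnx \
    --minShapes=input:1x3x8x8 \
    --optShapes=input:1x3x720x1280 \
    --maxShapes=input:1x3x1080x1920 \
    --tacticSources=+CUDNN,-CUBLAS,-CUBLAS_LT \
    --skipInference --useCudaGraph --noDataTransfers \
    --builderOptimizationLevel=5 \
    --infStreams="$N" \
    --saveEngine="2x_AnimeJaNai_HD_V3_Sharp1_UltraCompact_425k_clamp_fp16_op18_onnxslim_infStreams${N}.engine"
done
```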
Both engines were loaded in VapourSynth the same way, with `num_streams=4`:

```python
import vapoursynth as vs

core = vs.core

clip = core.trt.Model(
    clip,
    engine_path="/workspace/tensorrt/engine.engine",
    num_streams=4,  # num_streams=4 was used for both engine builds
)
```
| infStreams | Result |
|---|---|
| 1 | Output 2210 frames in 29.37 seconds (75.24 fps) |
| 4 | Output 2210 frames in 29.01 seconds (76.19 fps) |
So the `--infStreams=4` build does seem to be slightly faster here (about 1%), but the benefit may depend on the model or other factors.
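For reference, throughput lines in this format are what vspipe prints when a script is rendered with its output discarded; a minimal sketch, assuming the script above is saved as `upscale.vpy`:

```bash
# Render every frame, discard the frames (output target "."), and show progress;
# on completion vspipe reports "Output N frames in X seconds (Y fps)".
vspipe -p upscale.vpy .
```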