Open zjujh1995 opened 2 years ago
Hardware: V100 Model: Swin-tiny
TRT engine size: 115 MB (fp16) vs 116 MB (fp32); throughput: 184 (fp16) vs 178 (fp32)
All steps were kept the same as in your description. Any suggestions? Many thanks!
Sorry for the late reply; I hope you have solved this issue.
Please verify the speedup on Ampere or Turing GPUs, such as A100, A10, A30, T4, etc. If you see the same performance there, please let me know. Thanks.
Hardware: NVIDIA GeForce RTX 3090 Model: Swin-tiny
TRT model path: ./weights/swin_tiny_patch4_window7_224_batch32_fp32.engine
[11/14/2023-08:42:58] [TRT] [I] Loaded engine size: 110 MiB
[11/14/2023-08:42:58] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +109, now: CPU 0, GPU 109 (MiB)
[11/14/2023-08:43:02] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +837, now: CPU 0, GPU 946 (MiB)
[11/14/2023-08:42:58] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
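As a side note, the lazy-loading warning above can be silenced (and device memory usage reduced) by setting CUDA_MODULE_LOADING before any CUDA context is created. A minimal sketch, assuming the evaluation script is Python:

```python
import os

# Must be set before TensorRT/CUDA initializes a context,
# i.e. before `import tensorrt` or any CUDA runtime call.
os.environ["CUDA_MODULE_LOADING"] = "LAZY"
```

Alternatively, export the variable in the shell before launching the script.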
Processed 4000 images.
Processed 8000 images.
Processed 12000 images.
Processed 16000 images.
Processed 20000 images.
Processed 24000 images.
Processed 28000 images.
Processed 32000 images.
Processed 36000 images.
Processed 40000 images.
Processed 44000 images.
Processed 48000 images.
Evaluation of TRT model on 49984 images: 0.8118798015364916, fps: 409.2781068788955
Duration: 122.12722635269165
TRT model path: ./weights/swin_tiny_patch4_window7_224_batch32_fp16.engine
[11/14/2023-08:38:54] [TRT] [I] Loaded engine size: 57 MiB
[11/14/2023-08:38:55] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +54, now: CPU 0, GPU 54 (MiB)
[11/14/2023-08:38:58] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +419, now: CPU 0, GPU 473 (MiB)
[11/14/2023-08:38:58] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
Processed 4000 images.
Processed 8000 images.
Processed 12000 images.
Processed 16000 images.
Processed 20000 images.
Processed 24000 images.
Processed 28000 images.
Processed 32000 images.
Processed 36000 images.
Processed 40000 images.
Processed 44000 images.
Processed 48000 images.
Evaluation of TRT model on 49984 images: 0.8118798015364916, fps: 376.44877312701504
Duration: 132.7776939868927
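The reported fps values are consistent with images/duration, and they show fp16 actually running slower than fp32 in these runs. A quick arithmetic check of the two logs above:

```python
n_images = 49984

# fp32 run
fp32_duration = 122.12722635269165
fp32_fps = n_images / fp32_duration  # ~409.28, matches the log

# fp16 run
fp16_duration = 132.7776939868927
fp16_fps = n_images / fp16_duration  # ~376.45, matches the log

# fp16 throughput is ~8% lower than fp32 on this RTX 3090 run
ratio = fp16_fps / fp32_fps  # ~0.92
print(f"fp32: {fp32_fps:.2f} fps, fp16: {fp16_fps:.2f} fps, ratio: {ratio:.3f}")
```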
Is this caused by the difference in GPU types?