Open zjujh1995 opened 2 years ago
Hardware: V100 Model: Swin-tiny
TRT engine size: 115 MB (fp16) vs 116 MB (fp32); throughput: 184 (fp16) vs 178 (fp32)
All steps were kept the same as in your description. Any suggestions? Many thanks!
Sorry for the late reply; I hope you have solved this issue.
Please verify the speedup on Ampere or Turing GPUs, such as A100, A10, A30, T4, etc. If you see the same performance there, please let me know. Thanks.
Hardware: NVIDIA GeForce RTX 3090 Model: Swin-tiny
TRT model path: ./weights/swin_tiny_patch4_window7_224_batch32_fp32.engine
[11/14/2023-08:42:58] [TRT] [I] Loaded engine size: 110 MiB
[11/14/2023-08:42:58] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +109, now: CPU 0, GPU 109 (MiB)
[11/14/2023-08:43:02] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +837, now: CPU 0, GPU 946 (MiB)
[11/14/2023-08:42:58] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
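As a side note, the lazy-loading warning above can be silenced (and device memory usage reduced) by setting CUDA_MODULE_LOADING before any CUDA context is created. A minimal sketch, assuming the evaluation script is Python:

```python
import os

# Must be set before TensorRT/CUDA initializes a context,
# i.e. before `import tensorrt` or any CUDA runtime call.
os.environ["CUDA_MODULE_LOADING"] = "LAZY"
```

Alternatively, export the variable in the shell before launching the script.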
Processed 4000 images.
Processed 8000 images.
Processed 12000 images.
Processed 16000 images.
Processed 20000 images.
Processed 24000 images.
Processed 28000 images.
Processed 32000 images.
Processed 36000 images.
Processed 40000 images.
Processed 44000 images.
Processed 48000 images.
Evaluation of TRT model on 49984 images: 0.8118798015364916, fps: 409.2781068788955
Duration: 122.12722635269165
TRT model path: ./weights/swin_tiny_patch4_window7_224_batch32_fp16.engine
[11/14/2023-08:38:54] [TRT] [I] Loaded engine size: 57 MiB
[11/14/2023-08:38:55] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +54, now: CPU 0, GPU 54 (MiB)
[11/14/2023-08:38:58] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +419, now: CPU 0, GPU 473 (MiB)
[11/14/2023-08:38:58] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
Processed 4000 images.
Processed 8000 images.
Processed 12000 images.
Processed 16000 images.
Processed 20000 images.
Processed 24000 images.
Processed 28000 images.
Processed 32000 images.
Processed 36000 images.
Processed 40000 images.
Processed 44000 images.
Processed 48000 images.
Evaluation of TRT model on 49984 images: 0.8118798015364916, fps: 376.44877312701504
Duration: 132.7776939868927
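The reported fps values are consistent with images/duration, and they show fp16 actually running slower than fp32 in these runs. A quick arithmetic check of the two logs above:

```python
n_images = 49984

# fp32 run
fp32_duration = 122.12722635269165
fp32_fps = n_images / fp32_duration  # ~409.28, matches the log

# fp16 run
fp16_duration = 132.7776939868927
fp16_fps = n_images / fp16_duration  # ~376.45, matches the log

# fp16 throughput is ~8% lower than fp32 on this RTX 3090 run
ratio = fp16_fps / fp32_fps  # ~0.92
print(f"fp32: {fp32_fps:.2f} fps, fp16: {fp16_fps:.2f} fps, ratio: {ratio:.3f}")
```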
Is this caused by the difference in GPU types?