open-mmlab / mmdeploy

OpenMMLab Model Deployment Framework
https://mmdeploy.readthedocs.io/en/latest/
Apache License 2.0

<Interval inference> slower than <continuous inference> when using a Faster R-CNN TRT model with the C++ mmdeploy inference server #859

Open Joyphy opened 2 years ago

Joyphy commented 2 years ago

When I use a Faster R-CNN TRT model in an inference server, no error is reported and it works well. But I found a strange phenomenon: when I send a series of pictures to the model back to back, it costs about 75 ms/img. When I send the pictures with an interval of about 2 s between them, the time consumption becomes 180-200 ms/img. Is there any problem with the model I built?

ps. I changed the mmdeploy C++ object_detection.exe demo into a server that waits for images without releasing the model handle.

Joyphy commented 2 years ago

[Image: NormalvsSlow — NVIDIA Visual Profiler result]

Joyphy commented 2 years ago

[Image: normaltimevsslowtime — inference time comparison]

Joyphy commented 2 years ago

Simple reproduction example (the interface is a Python ctypes wrapper):

import os, sys, time

os.chdir(os.path.dirname(os.path.realpath(__file__)))
sys.path.append(os.path.dirname(os.path.realpath(__file__)))

from mmdetC_interface.mmdet_od_interface import *

import cv2

image = cv2.imread('demo.jpg', 1)

load_model("./models/win10_1660ti_fp16", 0)

# Continuous inference; uncomment the two lines below to insert a 2 s interval
# between requests and reproduce the slowdown.
for i in range(15):
    t1 = time.time()
    inference(image)
    t2 = time.time()
    print(f"inference time: {(t2 - t1) * 1000:.2f}ms\n")
    # time.sleep(2)
    # print("sleep 2s...\n")
Joyphy commented 2 years ago

2022-08-03 17:32:20,828 - mmdeploy - INFO - **Environmental information**
fatal: not a git repository (or any of the parent directories): .git
2022-08-03 17:33:03,993 - mmdeploy - INFO - sys.platform: win32
2022-08-03 17:33:03,993 - mmdeploy - INFO - Python: 3.8.8 (tags/v3.8.8:024d805, Feb 19 2021, 13:18:16) [MSC v.1928 64 bit (AMD64)]
2022-08-03 17:33:03,993 - mmdeploy - INFO - CUDA available: True
2022-08-03 17:33:03,993 - mmdeploy - INFO - GPU 0: NVIDIA GeForce GTX 1660 Ti
2022-08-03 17:33:03,993 - mmdeploy - INFO - CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3
2022-08-03 17:33:03,993 - mmdeploy - INFO - NVCC: Cuda compilation tools, release 11.3, V11.3.58
2022-08-03 17:33:03,993 - mmdeploy - INFO - MSVC: Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30136 for x64
2022-08-03 17:33:03,994 - mmdeploy - INFO - GCC: n/a
2022-08-03 17:33:03,994 - mmdeploy - INFO - PyTorch: 1.11.0+cu113
2022-08-03 17:33:03,994 - mmdeploy - INFO - PyTorch compiling details: PyTorch built with:

2022-08-03 17:33:03,994 - mmdeploy - INFO - TorchVision: 0.12.0+cu113
2022-08-03 17:33:03,995 - mmdeploy - INFO - OpenCV: 4.6.0
2022-08-03 17:33:03,996 - mmdeploy - INFO - MMCV: 1.5.3
2022-08-03 17:33:03,996 - mmdeploy - INFO - MMCV Compiler: MSVC 192930140
2022-08-03 17:33:03,997 - mmdeploy - INFO - MMCV CUDA Compiler: 11.3
2022-08-03 17:33:03,997 - mmdeploy - INFO - MMDeploy: 0.5.0+

2022-08-03 17:33:03,998 - mmdeploy - INFO - **Backend information**
2022-08-03 17:33:04,765 - mmdeploy - INFO - onnxruntime: 1.8.1 ops_is_avaliable : True
2022-08-03 17:33:04,807 - mmdeploy - INFO - tensorrt: 8.2.3.0 ops_is_avaliable : True
2022-08-03 17:33:04,839 - mmdeploy - INFO - ncnn: None ops_is_avaliable : False
2022-08-03 17:33:04,848 - mmdeploy - INFO - pplnn_is_avaliable: False
2022-08-03 17:33:04,856 - mmdeploy - INFO - openvino_is_avaliable: False

2022-08-03 17:33:04,856 - mmdeploy - INFO - **Codebase information**
2022-08-03 17:33:04,864 - mmdeploy - INFO - mmdet: 2.25.0
2022-08-03 17:33:04,864 - mmdeploy - INFO - mmseg: None
2022-08-03 17:33:04,864 - mmdeploy - INFO - mmcls: None
2022-08-03 17:33:04,864 - mmdeploy - INFO - mmocr: None
2022-08-03 17:33:04,865 - mmdeploy - INFO - mmedit: None
2022-08-03 17:33:04,865 - mmdeploy - INFO - mmdet3d: None
2022-08-03 17:33:04,866 - mmdeploy - INFO - mmpose: None
2022-08-03 17:33:04,866 - mmdeploy - INFO - mmrotate: None

lzhangzz commented 2 years ago

This is why there are warm-up iterations in speed benchmarks. The GPU frequency is lower when it's in an idle state. You may try to lock the GPU performance level and see how it goes.
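For example, a minimal sketch (assuming nvidia-smi is on PATH and the process has permission to change clocks; the exact supported clocks depend on the GPU and driver) that queries the maximum SM/memory clocks and locks them before starting the server:

import subprocess

def query_max_clocks(gpu_id=0):
    # Ask the driver for the maximum SM and memory clocks of the given GPU.
    out = subprocess.check_output([
        "nvidia-smi", "-i", str(gpu_id),
        "--query-gpu=clocks.max.sm,clocks.max.memory",
        "--format=csv,noheader,nounits",
    ], text=True)
    sm_max, mem_max = (int(v) for v in out.strip().split(","))
    return sm_max, mem_max

def lock_clocks(gpu_id=0):
    # Lock the graphics and memory clocks to their maximum so they cannot
    # drop back to idle levels between requests.
    sm_max, mem_max = query_max_clocks(gpu_id)
    subprocess.run(["nvidia-smi", "-i", str(gpu_id), "-lgc", f"{sm_max},{sm_max}"], check=True)
    # -lmc is only accepted on newer GPUs/drivers; older cards may reject it.
    subprocess.run(["nvidia-smi", "-i", str(gpu_id), "-lmc", f"{mem_max},{mem_max}"], check=True)

if __name__ == "__main__":
    lock_clocks(0)

The clocks can be released afterwards with nvidia-smi -rgc / -rmc.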

Joyphy commented 2 years ago

Thanks for your advice. I suspected the same problem at first, but it didn't help when I tried "nvidia-smi -lgc" and "nvidia-smi -lmc", or even other means of pushing the GPU into a high-frequency state. As for warm-up, I already discard the inference times of the first several runs. In my experience, other model inference servers (including some TRT model servers) no longer need warm-up. This is indeed a seemingly incredible problem, but after exhausting all the debugging and monitoring methods I know, I still cannot find a reasonable explanation. I would be grateful if you could help me answer my doubts or solve the problem. Also, if you want to reproduce it, I can provide more detailed code.

lzhangzz commented 2 years ago

In my test results, inserting a 2 second interval incurs 10-30% more latency.

Joyphy commented 2 years ago

Your results look much better than mine. Testing with my model, the latency increase is usually around 110%.

Some more test results: with the FP32 Faster R-CNN model, a 1.9 s sleep is enough to trigger the slowdown; with the FP16 Faster R-CNN model, a 0.07 s sleep slows inference down by about 110%, and a 0.05 s sleep by about 50%.

Maybe you could increase the sleep time and try again. I feel that my results have a lot to do with the resources the model occupies. Do you think this could be related to library versions? For example, my TensorRT version is 8.2.3.

Joyphy commented 2 years ago

So far, the observed behavior feels related to resource scheduling and release strategies. But I think we should try our best to solve this problem, or at least minimize the deviation, because it greatly affects the stability of a deployed model service.

Joyphy commented 2 years ago

More test results!

I completely solved this problem on a Tesla T4 and on 30-series GPUs. The reason is simple: 30-series GPUs with a recent driver support both nvidia-smi -lgc and nvidia-smi -lmc, and there is almost no deviation in inference time once I lock the core and memory clocks to their maximum. I didn't succeed before because my laptop's 1660 Ti only supports nvidia-smi -lgc; its memory clock still ramps up gradually when inference begins, which leaves a large number of compute kernels running at a low clock.
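To see the ramp-up described above on your own card, a small sketch (assuming nvidia-smi is on PATH; clocks.sm / clocks.mem are the current-clock query fields on recent drivers) that samples the clocks once per second while the reproduction script runs in another terminal:

import subprocess
import time

def sample_clocks(gpu_id=0, seconds=15):
    # Print the current SM and memory clocks once per second; on a card that
    # only supports -lgc you can watch the memory clock climb gradually.
    for _ in range(seconds):
        out = subprocess.check_output([
            "nvidia-smi", "-i", str(gpu_id),
            "--query-gpu=clocks.sm,clocks.mem",
            "--format=csv,noheader",
        ], text=True)
        print(out.strip())
        time.sleep(1)

if __name__ == "__main__":
    sample_clocks(0)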

Joyphy commented 2 years ago

A better way?

Although on many modern GPU platforms the inference-time deviation can be removed by locking the clocks, can we find a better way to quickly switch the GPU to its maximum clock before inference, instead of relying on the GPU's automatic scheduling policy? For a compact model, that policy can leave most kernels running at low clocks during a cold boot.
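For the cold-boot case mentioned here, one workaround is a short warm-up pass right after the model is loaded. A minimal sketch (reusing the inference wrapper from the reproduction script above; the input shape and iteration count are assumptions) — note this does not help with the idle-interval slowdown, which so far seems to need locked clocks:

import time
import numpy as np
from mmdetC_interface.mmdet_od_interface import inference  # ctypes wrapper from the script above

def warm_up(num_iters=5, shape=(800, 1333, 3)):
    # Run a few throwaway inferences on a dummy image right after load_model()
    # so the clocks have already ramped up before the first real request.
    dummy = np.zeros(shape, dtype=np.uint8)
    for i in range(num_iters):
        t0 = time.time()
        inference(dummy)
        print(f"warm-up {i}: {(time.time() - t0) * 1000:.2f} ms")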

hosea7456 commented 1 year ago

> More test results!
>
> I completely solved this problem on a Tesla T4 and on 30-series GPUs. The reason is simple: 30-series GPUs with a recent driver support both nvidia-smi -lgc and nvidia-smi -lmc, and there is almost no deviation in inference time once I lock the core and memory clocks to their maximum. I didn't succeed before because my laptop's 1660 Ti only supports nvidia-smi -lgc; its memory clock still ramps up gradually when inference begins, which leaves a large number of compute kernels running at a low clock.

Hello, I am facing the same problem. I have set nvidia-smi -lgc and nvidia-smi -lmc to the maximum, but interval inference still takes double the time compared to continuous inference. I wonder if you have any suggestions? Thank you!