Open Joyphy opened 2 years ago
NVIDIA Visual Profiler result
Inference time comparison
import os, sys, time

# Run from the script's own directory so the relative model/image paths resolve.
os.chdir(os.path.dirname(os.path.realpath(__file__)))
sys.path.append(os.path.dirname(os.path.realpath(__file__)))
from mmdetC_interface.mmdet_od_interface import *
import cv2

image = cv2.imread('demo.jpg', 1)
load_model("./models/win10_1660ti_fp16", 0)  # fp16 TensorRT engine on GPU 0

for i in range(15):
    t1 = time.time()
    inference(image)
    t2 = time.time()
    print(f"inference time: {(t2 - t1)*1000:.2f}ms\n")
    # Uncomment to reproduce the slowdown: an idle gap between requests.
    # time.sleep(2)
    # print(f"sleep 2s...\n")
2022-08-03 17:32:20,828 - mmdeploy - INFO - **Environmental information**
fatal: not a git repository (or any of the parent directories): .git
2022-08-03 17:33:03,993 - mmdeploy - INFO - sys.platform: win32
2022-08-03 17:33:03,993 - mmdeploy - INFO - Python: 3.8.8 (tags/v3.8.8:024d805, Feb 19 2021, 13:18:16) [MSC v.1928 64 bit (AMD64)]
2022-08-03 17:33:03,993 - mmdeploy - INFO - CUDA available: True
2022-08-03 17:33:03,993 - mmdeploy - INFO - GPU 0: NVIDIA GeForce GTX 1660 Ti
2022-08-03 17:33:03,993 - mmdeploy - INFO - CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3
2022-08-03 17:33:03,993 - mmdeploy - INFO - NVCC: Cuda compilation tools, release 11.3, V11.3.58
2022-08-03 17:33:03,993 - mmdeploy - INFO - MSVC: Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30136 for x64
2022-08-03 17:33:03,994 - mmdeploy - INFO - GCC: n/a
2022-08-03 17:33:03,994 - mmdeploy - INFO - PyTorch: 1.11.0+cu113
2022-08-03 17:33:03,994 - mmdeploy - INFO - PyTorch compiling details: PyTorch built with:
2022-08-03 17:33:03,994 - mmdeploy - INFO - TorchVision: 0.12.0+cu113
2022-08-03 17:33:03,995 - mmdeploy - INFO - OpenCV: 4.6.0
2022-08-03 17:33:03,996 - mmdeploy - INFO - MMCV: 1.5.3
2022-08-03 17:33:03,996 - mmdeploy - INFO - MMCV Compiler: MSVC 192930140
2022-08-03 17:33:03,997 - mmdeploy - INFO - MMCV CUDA Compiler: 11.3
2022-08-03 17:33:03,997 - mmdeploy - INFO - MMDeploy: 0.5.0+
2022-08-03 17:33:03,997 - mmdeploy - INFO -
2022-08-03 17:33:03,998 - mmdeploy - INFO - **Backend information**
2022-08-03 17:33:04,765 - mmdeploy - INFO - onnxruntime: 1.8.1 ops_is_avaliable : True
2022-08-03 17:33:04,807 - mmdeploy - INFO - tensorrt: 8.2.3.0 ops_is_avaliable : True
2022-08-03 17:33:04,839 - mmdeploy - INFO - ncnn: None ops_is_avaliable : False
2022-08-03 17:33:04,848 - mmdeploy - INFO - pplnn_is_avaliable: False
2022-08-03 17:33:04,856 - mmdeploy - INFO - openvino_is_avaliable: False
2022-08-03 17:33:04,856 - mmdeploy - INFO -
2022-08-03 17:33:04,856 - mmdeploy - INFO - **Codebase information**
2022-08-03 17:33:04,864 - mmdeploy - INFO - mmdet: 2.25.0
2022-08-03 17:33:04,864 - mmdeploy - INFO - mmseg: None
2022-08-03 17:33:04,864 - mmdeploy - INFO - mmcls: None
2022-08-03 17:33:04,864 - mmdeploy - INFO - mmocr: None
2022-08-03 17:33:04,865 - mmdeploy - INFO - mmedit: None
2022-08-03 17:33:04,865 - mmdeploy - INFO - mmdet3d: None
2022-08-03 17:33:04,866 - mmdeploy - INFO - mmpose: None
2022-08-03 17:33:04,866 - mmdeploy - INFO - mmrotate: None
This is why there are warm-up iterations in speed benchmarks. The GPU frequency is lower when it's in an idle state. You may try to lock the GPU performance level and see how it goes.
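For reference, a minimal warm-up pattern might look like the sketch below. It reuses the load_model()/inference() wrapper and the image from the script above purely for illustration, and simply discards the first few runs while the GPU clocks and the runtime settle before measuring.

# Sketch only: warm-up runs are executed but not reported, assuming the same
# inference()/image names from the script above.
import time

WARMUP_ITERS = 10
MEASURE_ITERS = 50

for _ in range(WARMUP_ITERS):
    inference(image)                       # let the GPU clocks and runtime settle

latencies = []
for _ in range(MEASURE_ITERS):
    t1 = time.time()
    inference(image)
    latencies.append((time.time() - t1) * 1000)

print(f"mean: {sum(latencies)/len(latencies):.2f} ms, "
      f"min: {min(latencies):.2f} ms, max: {max(latencies):.2f} ms")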
Thanks for your advice. I suspected the same problem at first, but it didn't work when I tried "nvidia-smi -lgc" and "nvidia-smi -lmc", or even other ways of pushing the GPU into a high-frequency state. As for warm-up, I already discard the inference times of the first several runs. In my experience, other model inference servers (including some TensorRT model servers) no longer need warm-up once they are running. This really is a strange problem; after exhausting all the debugging and monitoring methods I know, I still cannot find a reasonable explanation. I would be grateful if you could help answer my doubts or solve the problem. Also, if you want to reproduce it, I can provide more detailed code.
In my test results, inserting a 2-second interval incurs 10-30% more latency.
It seems your results are better than mine. With the model I am testing, the latency is usually about 110% higher.
I can share more test results. With the fp32 faster-rcnn model, the slowdown suddenly appears at a 1.9 s sleep. With the fp16 faster-rcnn model, a 0.07 s sleep slows inference down by about 110%, and a 0.05 s sleep slows it down by about 50%.
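A small sweep along these lines can reproduce the effect. The sketch below is only illustrative: it reuses the inference()/image names from the script above and varies the idle gap between requests.

# Sketch only: per-request latency as a function of the idle gap between
# requests, reusing the inference()/image names from the script above.
import time

for gap in [0.0, 0.05, 0.07, 0.5, 1.9, 2.0]:
    times = []
    for _ in range(10):
        t1 = time.time()
        inference(image)
        times.append((time.time() - t1) * 1000)
        time.sleep(gap)                    # idle gap lets the GPU clocks drop again
    print(f"gap {gap:>4}s -> mean {sum(times)/len(times):.1f} ms")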
Maybe you can increase the sleep time and try again. I feel that my results have a lot to do with the resources the model occupies. Do you think this has anything to do with the library versions? For example, my TensorRT version is 8.2.3.
So far, everything observed feels related to resource scheduling and release strategies. But I think we should try our best to solve this problem, or at least minimize the deviation, because it greatly affects the stability of a model deployment service.
I completely solved this problem on a Tesla T4 and on 30-series GPUs. The reason is simple: 30-series GPUs with a recent driver support both nvidia-smi -lgc and nvidia-smi -lmc, and there is almost no deviation in inference time once I lock the core clock and the memory clock to their maximum.
I didn't succeed earlier because my laptop's 1660 Ti only supports nvidia-smi -lgc. Its memory clock still ramps up gradually when inference begins, so a very large number of compute kernels run at a low clock.
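For anyone who wants to script this, a rough sketch of locking the clocks from Python is below. The clock values are placeholders, whether -lgc/-lmc are accepted depends on the GPU and driver, and the commands normally need administrator/root privileges.

# Sketch only: query supported clocks, then lock GPU/memory clocks via nvidia-smi.
# The MHz values below are placeholders; pick values from the SUPPORTED_CLOCKS output.
import subprocess

# List the clock combinations the GPU actually supports.
print(subprocess.run(["nvidia-smi", "-q", "-d", "SUPPORTED_CLOCKS"],
                     capture_output=True, text=True).stdout)

MAX_GRAPHICS_MHZ = "1770"   # placeholder: maximum graphics clock reported above
MAX_MEMORY_MHZ = "7001"     # placeholder: maximum memory clock reported above

# Lock the core and memory clocks (min,max pairs) so they cannot drop when idle.
subprocess.run(["nvidia-smi", "-lgc", f"{MAX_GRAPHICS_MHZ},{MAX_GRAPHICS_MHZ}"], check=True)
subprocess.run(["nvidia-smi", "-lmc", f"{MAX_MEMORY_MHZ},{MAX_MEMORY_MHZ}"], check=True)

# To restore the default behaviour afterwards:
# subprocess.run(["nvidia-smi", "-rgc"])
# subprocess.run(["nvidia-smi", "-rmc"])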
Although on many modern GPU platforms the inference-time deviation can be fixed by locking the clocks, can we find a better way to quickly switch the GPU to its maximum clock just before inference, instead of relying on the GPU's automatic scheduling policy? With a small model, that policy can leave most kernels running at a low clock during a cold start.
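One workaround in that spirit, shown only as a sketch and not as an mmdeploy feature, is to keep the GPU busy with a tiny dummy workload whenever no real request has arrived recently, so the clocks never fall back to idle. Whether this actually holds the clocks up depends on the driver's power-management policy, and it does cost idle power.

# Sketch only: a background "keep-warm" thread that issues a tiny GPU workload
# when the server has been idle, so the clocks stay high between requests.
import threading, time
import torch

_last_request = time.time()
_dummy = torch.randn(256, 256, device="cuda")

def keep_warm(idle_threshold=0.5, period=0.1):
    while True:
        if time.time() - _last_request > idle_threshold:
            torch.matmul(_dummy, _dummy)   # tiny kernel to keep the GPU active
            torch.cuda.synchronize()
        time.sleep(period)

threading.Thread(target=keep_warm, daemon=True).start()

# In the request handler, refresh the timestamp before each real inference:
#   global _last_request
#   _last_request = time.time()
#   inference(image)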
More test results!
Hello, I am facing the same problem. I have set nvidia-smi -lgc and nvidia-smi -lmc to the maximum, but it still takes double the time compared to continuous inference. I wonder if you have any suggestions? Thank you!
When I use a faster-rcnn TensorRT model inference server, no error is reported and it works well. But I found a strange phenomenon: when I send a series of pictures to the model back to back, it takes about 75 ms/img to process them, but when I send the pictures with an interval of about 2 s between them, the time rises to 180-200 ms/img. Is there any problem with the model I built?
P.S. I changed the mmdeploy C++ object_detection.exe demo into a server that waits for images without releasing the model handle.