chenxinfeng4 opened 1 year ago
I am actually getting 64 ms latency (end2end.engine) with the same model as above (s), running from the Release Dockerfile on the dev-1.x branch. I have tried other configurations as well. Not sure if it's my graphics card, drivers, or something else. mmdetection was installed through mim install mmdet.
Python: 3.8.10 (default, Mar 13 2023, 10:26:41) [GCC 9.4.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.10.0+cu113
PyTorch compiling details: PyTorch built with:
TorchVision: 0.11.0+cu113
OpenCV: 4.5.4
MMEngine: 0.7.4
MMDetection: 3.0.0+ecac3a
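Side note for anyone debugging similar numbers: before trusting a benchmark it can help to rule out a throttling or failing card. A minimal sketch using NVML via the pynvml bindings (assuming the nvidia-ml-py package is installed):

```python
# Sanity-check the GPU before benchmarking: name, SM clock, temperature.
# Assumes the nvidia-ml-py package (pynvml) is installed.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print("GPU:", pynvml.nvmlDeviceGetName(handle))
print("SM clock (MHz):", pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM))
print("Temperature (C):", pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU))
pynvml.nvmlShutdown()
```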
Please disregard my last comment. The large latency was due to a GPU hardware failure. After re-running everything I get the following for the small model at 640x640:

+------------+---------+
| batch size | 1       |
| shape      | 640x640 |
| iterations | 100     |
| warmup     | 10      |
+------------+---------+

----- Results:

+--------+------------+---------+
| Stats  | Latency/ms | FPS     |
+--------+------------+---------+
| Mean   | 8.102      | 123.432 |
| Median | 8.095      | 123.531 |
| Min    | 8.000      | 125.007 |
| Max    | 9.525      | 104.983 |
+--------+------------+---------+
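For cross-checking numbers like these outside the profiler, here is a minimal sketch of a CUDA-event timing loop; `model` is a hypothetical callable wrapping the TensorRT engine (the exact wrapper API is an assumption):

```python
# Minimal sketch of an independent latency check with CUDA events.
# `model` is assumed to be a callable running the TensorRT engine.
import torch

def measure_latency(model, shape=(1, 3, 640, 640), iterations=100, warmup=10):
    x = torch.randn(shape, device="cuda")
    # Warm-up runs so context creation and kernel autotuning don't skew timings.
    for _ in range(warmup):
        model(x)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for _ in range(iterations):
        start.record()
        model(x)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))  # milliseconds
    times.sort()
    print(f"mean {sum(times) / len(times):.3f} ms, "
          f"median {times[len(times) // 2]:.3f} ms, "
          f"min {times[0]:.3f} ms, max {times[-1]:.3f} ms")
```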
@mchaniotakis That's better than 64 ms, but still much, much slower than the 1.22 ms latency for the small model shown in Table 3 of the original research paper.
I'd like to know more about the machine used for inference. Did they super-cool it with liquid nitrogen? Was it running some custom Linux kernel? What was the CPU?
Or, if possible, could anybody provide some suggestions for getting inference speed a bit closer to the claims made in Table 3 of the RTMDet research paper?
I think a possible explanation is that the RTMDet export includes the post-processing in the TensorRT model, while the documented numbers are for the version without post-processing.
Evidence: Mask R-CNN's engine returns masks at a 64x64 feature size (GPU), which are then transformed to the HxW image shape in post-processing (CPU). The RTMDet TensorRT engine, however, returns masks already at HxW.
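To get a feel for what that extra upsampling step alone can cost, a rough sketch timing bilinear upsampling of 100 masks from 64x64 to 640x640 (the mask count and sizes are assumptions, purely for illustration):

```python
# Rough cost estimate of upsampling instance masks inside the engine:
# 100 hypothetical 64x64 masks resized to 640x640 on the GPU.
import torch
import torch.nn.functional as F

masks = torch.randn(100, 1, 64, 64, device="cuda")
# Warm-up so kernel launch and allocator overheads don't skew the timing.
for _ in range(10):
    F.interpolate(masks, size=(640, 640), mode="bilinear", align_corners=False)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    F.interpolate(masks, size=(640, 640), mode="bilinear", align_corners=False)
end.record()
torch.cuda.synchronize()
print(f"upsampling 100 masks: {start.elapsed_time(end) / 100:.3f} ms per call")
```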
I also tried benchmarking on an RTX 3090 Ti and couldn't get below 8 ms, even at larger batch sizes.
Describe the issue

RTMDet-Ins fp16 inference is slower than the documentation says. In my case I only get a throughput of 104.89 qps, which works out to about 9.5 ms latency (1000 / 104.89 ≈ 9.53), not the 1.93 ms latency given in the documentation.

Reproduction

Test the speed of RTMDet-Ins-s TensorRT fp16 and check the output. My platform is also an RTX 3090, but I got 104.89 qps, about 9.5 ms latency, where the documentation says it should be 1.93 ms.

Environment
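For completeness, environment info like the listing earlier in the thread can be gathered with mmengine's collect_env helper (a sketch, assuming an mmengine-based install; mmdeploy also ships a tools/check_env.py script that prints similar information):

```python
# Print environment info similar to what the issue template asks for.
# Assumes an mmengine-based install.
from mmengine.utils.dl_utils import collect_env

for name, value in collect_env().items():
    print(f"{name}: {value}")
```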