Prerequisite

[X] I have searched the existing and past issues but cannot get the expected help.
[X] I have read the FAQ documentation but cannot get the expected help.
[X] The bug has not been fixed in the latest version.

💬 Describe the reimplementation questions

hi, I found you have got 4.86ms TRT-FP16-Latency in DOTA dataset with RTMDet-s, and you write "The inference speed here is measured on an NVIDIA 2080Ti GPU with TensorRT 8.4.3, cuDNN 8.2.0, FP16, batch size=1, and with NMS" in README, I have tested this for same 2080Ti with TensorRT 8.2.3, cuDNN 8, FP16, batch size=1, and with NMS but I only get 24ms Latency.

The following is my test_config and convert TRT-FP16 steps:

test_pipeline = [ dict(backend_args=None, type='LoadImageFromFile'), dict(scale=( 1024, 1024, ), type='YOLOv5KeepRatioResize'), dict( allow_scale_up=False, pad_val=dict(img=114), scale=( 1024, 1024, ), type='LetterResize'), dict( meta_keys=( 'img_id', 'img_path', 'ori_shape', 'img_shape', 'scale_factor', 'pad_param', ), type='mmdet.PackDetInputs')]

` simplify = True fp16 = True register_all_modules() backend = MMYOLOBackend('tensorrt8') postprocess_cfg = ConfigDict( pre_top_k=1000, keep_top_k=100, iou_threshold=0.65, score_threshold=0.1)
output_names = ['num_dets', 'boxes', 'scores', 'labels'] baseModel = build_model_from_cfg(config_path, model_path, device)

deploy_model = DeployModel(baseModel=baseModel, backend=backend, postprocess_cfg=postprocess_cfg)
deploy_model.eval()

img_size = (1, 3, 1024, 1024)
fake_input = torch.randn(img_size).to(device)
deploy_model(fake_input)

save_onnx_path = os.path.join(
    work_dirs,
    os.path.basename(model_path).replace('pth', 'onnx'))
# export onnx
with BytesIO() as f:
    torch.onnx.export(
        deploy_model,
        fake_input,
        f,
        input_names=['images'],
        output_names=output_names,
        opset_version=11)
    f.seek(0)
    onnx_model = onnx.load(f)
    onnx.checker.check_model(onnx_model)

if simplify:    
    try:
        import onnxsim
        onnx_model, check = onnxsim.simplify(onnx_model)
        assert check, 'assert check failed'
    except Exception as e:
        print_log(f'Simplify failure: {e}')
onnx.save(onnx_model, save_onnx_path)
logger.info(f'ONNX export success, save into {save_onnx_path}')

#TensorRT engine convert 
input_scales = '[[1,3,1024,1024],[1,3,800,800],[1,3,1200,1200]]'   #Input scales for build dynamic input shape engin
try:
    scales = eval(input_scales)
except Exception:
    print('Input scales is not a python variable')
    print('Set scales default None')
    scales = None
builder = EngineBuilder(save_onnx_path, img_size, device)    
builder.build(scales, fp16=fp16)

The TRT-convert steps is following projects/easydeploy/export_onnx, and This is my test codes

` img_h, img_w = input_image.shape[0], input_image.shape[1]

get model class name

class_names = handle.class_names
max_det = 1000  # maximum detections per image

# inference
# preprocess
test_pipeline = handle.test_pipeline
data, samples = test_pipeline(dict(img=input_image, img_id=0)).values()    
pad_param = samples.get('pad_param',
                        np.array([0, 0, 0, 0], dtype=np.float32))
pad_param = torch.tensor(
    [pad_param[2], pad_param[0], pad_param[2], pad_param[0]],
    device=device)
scale_factor = samples.get('scale_factor', [1., 1])
scale_factor = torch.tensor(scale_factor * 2, device=device)    
mean = torch.tensor([103.53, 116.28, 123.675], dtype=torch.float32).reshape(1, 3, 1, 1)
std = torch.tensor([57.375, 57.12, 58.395], dtype=torch.float32).reshape(1, 3, 1, 1)
data = data[None].float()
data -= mean
data /= std
data = data.to(device)

#inference
result = handle(data)
num_dets, bboxes, scores, labels = result
scores = scores[0, :num_dets]
bboxes = bboxes[0, :num_dets]
labels = labels[0, :num_dets]
bboxes -= pad_param
bboxes /= scale_factor

bboxes[:, 0::2].clamp_(0, img_w)
bboxes[:, 1::2].clamp_(0, img_h)
bboxes = bboxes.round().int()

` I also following demo/img_demo.py to inference, and don't add time for reading images, so i don't know why my latency is so slow, can you give me some advice or release your codes for testing TRT-FP16-Latency

Environment

CUDA 11.1 Cudnn 8 TRT 8.2.3

Expected results

No response

Additional information