open-mmlab / mmdetection3d

OpenMMLab's next-generation platform for general 3D object detection.
https://mmdetection3d.readthedocs.io/en/latest/
Apache License 2.0

CenterPoint inference using `inference_detector` is extremely slow, ~2.6 s/it on an A6000 #2629

MarvinKlemp opened 1 year ago

MarvinKlemp commented 1 year ago

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment

I am using the provided Dockerfile

sys.platform: linux
Python: 3.7.10 (default, Feb 26 2021, 18:47:35) [GCC 7.3.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0: NVIDIA RTX A6000
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.1, V11.1.105
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.9.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.2-Product Build 20210312 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.10.0
OpenCV: 4.7.0
MMEngine: 0.7.4
MMDetection: 3.0.0
MMDetection3D: 1.1.0+
spconv2.0: False

Reproduces the problem - code sample

detect.py

from argparse import ArgumentParser
import glob
import os
import time

from mmdet3d.apis import inference_detector, init_model
from tqdm import tqdm
from pypcd_imp import pypcd
import numpy as np

def parse_args():
    parser = ArgumentParser()
    parser.add_argument('data_dir', type=str, help='Directory of pointcloud files')
    parser.add_argument('output_dir', type=str, help='Directory to store results')
    parser.add_argument('config', help='Config file')
    parser.add_argument('checkpoint', help='Checkpoint file')
    parser.add_argument('--device', default='cuda:0', help='Device used for inference')

    args = parser.parse_args()
    return args

LABELS_WAYMO = [
    'car', 'pedestrian', 'cyclist'
]

def main(args):
    model = init_model(args.config, args.checkpoint, device=args.device)

    pcd_data = []
    pcd_files = glob.glob(f"{args.data_dir}/*.pcd")
    pcd_files = sorted(pcd_files)

    for pcd_bin in tqdm(pcd_files):
        pcd = pypcd.PointCloud.from_path(pcd_bin)
        points = np.stack(
            (
                pcd.pc_data['x'],
                pcd.pc_data['y'],
                pcd.pc_data['z'],
                pcd.pc_data['intensity'],
                np.zeros_like(pcd.pc_data['x']) # ring_idx
                ),
            axis=-1
        )

        points = points[points[:,0] < 51.2]
        points = points[points[:,0] > -51.2]
        points = points[points[:,1] < 51.2]
        points = points[points[:,1] > -51.2]
        points = points[points[:,2] < 3.0]
        points = points[points[:,2] > -5.0]
        pcd_data.append(points)

    for idx, _ in tqdm(enumerate(pcd_data)):
        points = pcd_data[idx]
        t1 = time.monotonic()
        result, data = inference_detector(model, points)
        t2 = time.monotonic()

        print(t2 - t1)

if __name__ == '__main__':
    args = parse_args()
    main(args)
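
For reference, here is a variant of the timing loop that warms the model up first and synchronizes CUDA before reading the clock, so the numbers reflect the full inference rather than one-off startup costs. It reuses `model` and `pcd_data` from the script above; `timed_inference` is just an illustrative helper:

import time

import torch

from mmdet3d.apis import inference_detector

def timed_inference(model, points):
    # Synchronize before and after so the wall-clock delta covers all queued CUDA work.
    torch.cuda.synchronize()
    t1 = time.monotonic()
    result, data = inference_detector(model, points)
    torch.cuda.synchronize()
    t2 = time.monotonic()
    return result, t2 - t1

# Warm-up pass: the first call pays one-off CUDA context / pipeline setup costs.
_ = inference_detector(model, pcd_data[0])

for points in pcd_data:
    _, dt = timed_inference(model, points)
    print(f'{dt:.3f} s')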

Reproduces the problem - command or script

Using the official weights:

python detect.py \
    some_input_pcds \
    some_output_dir \
    path_to/centerpoint_pillar02_second_secfpn_head-dcn_8xb4-cyclic-20e_nus-3d.py \
    path_to/centerpoint_02pillar_second_secfpn_dcn_4x8_cyclic_20e_nus_20220811_045458-808e69ad.pth

Reproduces the problem - error message

There is no error message; inference is just extremely slow, at roughly 2.6 s per `inference_detector` call.

Additional information

I expected inference to be significantly faster.

Any ideas?

AV-adrian commented 1 year ago

It is because `inference_detector()` re-composes the test pipeline every time it is called. You can instead prepare the data yourself and call `model.test_step()` directly. See https://github.com/open-mmlab/mmdetection3d/blob/main/mmdet3d/apis/inference.py
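
Roughly, that means mirroring what `inference_detector` does internally, but building the pipeline once outside the loop. A minimal sketch based on the logic in that file (`config_path`, `checkpoint_path` and `detect` are placeholder names, and the exact dict keys and transform names may differ between mmdet3d versions):

from copy import deepcopy

import numpy as np
import torch
from mmengine.dataset import Compose, pseudo_collate

from mmdet3d.apis import init_model
from mmdet3d.structures import get_box_type

# config_path / checkpoint_path: the same files passed to detect.py above (placeholders).
model = init_model(config_path, checkpoint_path, device='cuda:0')

cfg = model.cfg
# Points are passed in as arrays, so the first transform should load from a dict
# rather than from a file (this mirrors what inference_detector does).
cfg.test_dataloader.dataset.pipeline[0].type = 'LoadPointsFromDict'

# Build the test pipeline a single time instead of on every call.
test_pipeline = Compose(deepcopy(cfg.test_dataloader.dataset.pipeline))
box_type_3d, box_mode_3d = get_box_type(cfg.test_dataloader.dataset.box_type_3d)

def detect(points):
    # points: (N, 5) array prepared as in detect.py above
    data = dict(
        points=points,
        timestamp=1,
        axis_align_matrix=np.eye(4),
        box_type_3d=box_type_3d,
        box_mode_3d=box_mode_3d,
    )
    data = test_pipeline(data)
    collated = pseudo_collate([data])
    with torch.no_grad():
        results = model.test_step(collated)
    return results[0]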