Checklist

[x] I have searched related issues but cannot get the expected help.
[x] I have read the FAQ documentation but cannot get the expected help.
[x] The bug has not been fixed in the latest version.

Describe the bug

I ran several common detection models (Deformable DETR, DINO, etc.) on 8 AMD MI250 GPUs. Both training and inference are extremely slow: each training iteration takes about 30 minutes, and GPU utilization is low and unstable.

However, when I run the same models with a larger backbone, such as Swin-L or ViT-L, the speed is normal.

A log from one example run, training Deformable DETR's encoder, is attached: 20240612_113140.log
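To put a number on the per-iteration cost outside the full training loop, a small timing helper along these lines might help. This is only a sketch: `time_iterations`, `step_fn`, and `sync` are hypothetical names for illustration, not part of mmdet or MMEngine. On a GPU, passing `torch.cuda.synchronize` as `sync` (it also works on ROCm builds of PyTorch) should make sure asynchronously launched kernels are included in the measurement.

```python
import time
from statistics import mean


def time_iterations(step_fn, n_iters=5, warmup=1, sync=None):
    """Return per-iteration wall-clock times in seconds for step_fn.

    step_fn: hypothetical zero-argument callable that runs one iteration.
    sync:    optional zero-argument callable invoked before and after each
             step, e.g. torch.cuda.synchronize when timing GPU work.
    """
    timings = []
    for i in range(warmup + n_iters):
        if sync is not None:
            sync()  # drain pending work so it is not billed to this step
        start = time.perf_counter()
        step_fn()
        if sync is not None:
            sync()  # wait for this step's work to actually finish
        elapsed = time.perf_counter() - start
        if i >= warmup:  # discard warmup iterations
            timings.append(elapsed)
    return timings


if __name__ == "__main__":
    # Stand-in CPU workload; replace with one forward/backward pass.
    times = time_iterations(lambda: sum(range(100_000)), n_iters=3)
    print(f"mean {mean(times):.4f}s, max {max(times):.4f}s")
```

Comparing the numbers for a ResNet-50 backbone against a Swin-L backbone with the same helper would show whether the slowdown is tied to the smaller backbone.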

Reproduction

1. What command or script did you run?

Run any model, such as Deformable DETR, with the default settings in a ROCm environment. For example:

python -m torch.distributed.launch --nnodes 1 --nproc_per_node 8 --node_rank ${RANK} --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} --use_env \
    train.py \
    --config configs/deformable_detr/deformable-detr-refine-twostage_r50_16xb2-50e_coco.py \
    --resume \
    --launcher pytorch

2. Did you make any modifications on the code or config? Did you understand what you have modified?

No, I ran the unmodified mmdet code.

3. What dataset did you use?

COCO2017

Environment

Here is my runtime environment:

System environment:
    sys.platform: linux
    Python: 3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0]
    CUDA available: True
    numpy_random_seed: 681328528
    GPU 0,1,2,3,4,5,6,7: AMD Instinct MI250X/MI250
    CUDA_HOME: /opt/rocm-5.6.1
    NVCC: Not Available
    GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
    PyTorch: 2.0.1+rocm5.6.1
    PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - HIP Runtime 5.6.31062
  - MIOpen 2.20.0
  - Magma 2.6.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.1, USE_CUDA=OFF, USE_CUDNN=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=ON, 

    TorchVision: 0.15.2+rocm5.6.1
    OpenCV: 4.8.1
    MMEngine: 0.9.1

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 681328528
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 8

open-mmlab / mmdetection

mmdet runs extremely slow on ROCm/AMD #11791