[Bug] Bounding Box Loss Always Zero with Rotated RTMDet Model on New Dataset

Prerequisite

[X] I have searched Issues and Discussions but cannot get the expected help.
[X] I have read the FAQ documentation but cannot get the expected help.
[X] The bug has not been fixed in the latest version (master) or latest version (1.x).

Task

I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.

Branch

1.x branch https://github.com/open-mmlab/mmrotate/tree/1.x

Environment

sys.platform: win32
Python: 3.8.19 (default, Mar 20 2024, 19:55:45) [MSC v.1916 64 bit (AMD64)]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8
NVCC: Cuda compilation tools, release 11.8, V11.8.89
MSVC: Microsoft (R) C/C++ Optimizing Compiler versione 19.39.33523 per x64
GCC: n/a
PyTorch: 1.8.0+cu111
PyTorch compiling details: PyTorch built with:

C++ Version: 199711
MSVC 192829337
Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
OpenMP 2019
CPU capability usage: AVX2
CUDA Runtime 11.1
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
CuDNN 8.0.5
Magma 2.5.4
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=C:/w/b/windows/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -DNDEBUG -DUSE_FBGEMM -DUSE_XNNPACK, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON,

TorchVision: 0.9.0+cu111
OpenCV: 4.10.0
MMEngine: 0.10.4
MMRotate: 1.0.0rc1+1dc8d77

Reproduces the problem - code sample

I tried to train the rotated_rtmdet_l-coco_pretrain-3x-x model on a new dataset after adjusting the number of classes in the config. The bounding box loss (bbloss) is always zero. While debugging, I found that the model predicts background for every sample. Those predictions are discarded, so no boxes contribute to the bounding box loss and it stays at zero.
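To make the mechanism concrete, here is a minimal sketch of how a box regression loss that is averaged only over positive (non-background) samples collapses to zero when no sample is positive. This is a simplified illustration, not MMRotate's implementation: the function names, the L1 distance, and the `(cx, cy, w, h, angle)` box layout are all assumptions made for the example.

```python
from typing import List, Sequence


def l1_box_distance(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of absolute differences between two (cx, cy, w, h, angle) boxes."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def bbox_loss(preds: List[Sequence[float]],
              targets: List[Sequence[float]],
              pos_mask: List[bool]) -> float:
    """Average box distance over positive samples only (hypothetical helper).

    Samples marked as background (pos_mask False) are skipped entirely.
    """
    pos = [i for i, is_pos in enumerate(pos_mask) if is_pos]
    if not pos:
        # No positive samples: nothing contributes, so the loss is zero.
        return 0.0
    total = sum(l1_box_distance(preds[i], targets[i]) for i in pos)
    return total / len(pos)


preds = [(10.0, 10.0, 4.0, 2.0, 0.1), (30.0, 5.0, 6.0, 3.0, 0.0)]
targets = [(11.0, 10.0, 4.0, 2.0, 0.0), (30.0, 6.0, 6.0, 3.0, 0.0)]

loss_with_pos = bbox_loss(preds, targets, [True, False])
loss_no_pos = bbox_loss(preds, targets, [False, False])
print(loss_with_pos, loss_no_pos)
```

If the real training loop behaves like this sketch, a constant zero means that no samples are being treated as positives, which points at the labels or the class setup rather than at the regression branch itself.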

Steps to Reproduce:

Load the rotated_rtmdet_l-coco_pretrain-3x-x model.
Modify the number of classes in the configuration to match the new dataset.
Start training the model on the new dataset.
Observe the bounding box loss during training.

Observed Behavior:

The bounding box loss remains at zero throughout the training.
The model predicts all classes as background, leading to a loss of zero.

Expected Behavior:

The model should correctly predict the classes in the new dataset, and the bounding box loss should reflect the predictions and ground truth differences.

Hyperparameters Tried:

Different learning rates
Various batch sizes
Adjusted weight decay and momentum parameters

Additional Information:

I am confident that I am using the pre-trained weights from the COCO dataset.
The class annotations in the new dataset have been verified to be correct.

Possible Causes Considered:

Incorrect number of classes specified in the configuration
Pre-trained weights not being properly loaded
Data augmentation or preprocessing issues
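One way to check the first possible cause is to compare the class names that appear in the annotation files against the class list used in the config. The sketch below assumes DOTA-style text annotations (eight polygon coordinates, then the class name, then a difficulty flag); the issue does not state the dataset format, so this layout, the helper name `count_classes`, and the sample data are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def count_classes(annotation_lines: Iterable[str],
                  known_classes: Sequence[str]) -> Tuple[Counter, Counter]:
    """Count annotated instances per class, split into known and unknown names.

    Assumes each instance line is 'x1 y1 x2 y2 x3 y3 x4 y4 class difficulty'.
    """
    known: Counter = Counter()
    unknown: Counter = Counter()
    for line in annotation_lines:
        parts = line.split()
        if len(parts) < 9:
            continue  # skip header or malformed lines
        name = parts[8]
        if name in known_classes:
            known[name] += 1
        else:
            unknown[name] += 1
    return known, unknown


# Note the trailing comma rule: ('ship') is a str, ('ship',) is a tuple.
CLASSES = ('ship', 'plane')

sample_lines = [
    'imagesource:hypothetical',
    'gsd:0.5',
    '10 10 50 10 50 30 10 30 ship 0',
    '60 60 90 60 90 80 60 80 Ship 0',   # wrong capitalisation
    '20 20 40 20 40 35 20 35 plane 1',
]

known, unknown = count_classes(sample_lines, CLASSES)
print('known:', dict(known))
print('unknown:', dict(unknown))
```

Any name that lands in `unknown` (for example because of different capitalisation or a stray space) would not match the configured classes, which is worth ruling out before looking at the weights or the augmentation pipeline.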

Reproduces the problem - command or script

python tools/train.py --config configs/rotated_rtmdet/rotated_rtmdet_l-coco_pretrain-3x-dataset3.py
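For reference, a config that changes the number of classes might look roughly like the fragment below. This is a hedged sketch in the usual MMEngine config style, not the actual contents of rotated_rtmdet_l-coco_pretrain-3x-dataset3.py: the `_base_` path and the class names are placeholders, and it assumes the head reads `num_classes` from `model.bbox_head` and the dataset takes its class names from `metainfo`.

```python
# Hypothetical override config; the base file and class names are placeholders.
_base_ = './rotated_rtmdet_l-coco_pretrain-3x-dota_ms.py'

# Class names of the new dataset (assumed values for illustration only).
classes = ('class_a', 'class_b', 'class_c')

# Keep the head's class count in sync with the class tuple.
model = dict(bbox_head=dict(num_classes=len(classes)))

# Pass the same class names to the datasets so that labels map consistently.
train_dataloader = dict(dataset=dict(metainfo=dict(classes=classes)))
val_dataloader = dict(dataset=dict(metainfo=dict(classes=classes)))
```

Deriving `num_classes` from the class tuple avoids the two values drifting apart when the class list is edited later.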

Reproduces the problem - error message

bbloss == 0

Additional information

No response
