SCNet model output is empty when the num_classes is changed

buncybunny commented 2 years ago

Checklist

I have searched related issues but cannot get the expected help.
I have read the FAQ documentation but cannot get the expected help.
The bug has not been fixed in the latest version.

Describe the bug

SCNet model output is empty when the num_classes is changed

Reproduction

What command or script did you run?

sudo env "PATH=$PATH" bash tools/dist_train.sh configs/scnet/wo_semantic_head/scnet_r50_fpn_1x_coco_wo_semantic_head.py 4 --work-dir work_dirs/scnet_r50_fpn_1x_coco_wo_semantic_head/all_Classes

Did you make any modifications on the code or config? Did you understand what you have modified?

I removed the semantic_head and semantic_roi_extractor from the config and changed num_classes to 60.

```python
_base_ = '../../htc/htc_without_semantic_r50_fpn_1x_coco.py'
# model settings
model = dict(
    type='SCNet',
    roi_head=dict(
        _delete_=True,
        type='SCNetRoIHead',
        num_stages=3,
        stage_loss_weights=[1, 0.5, 0.25],
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32],
            freeze=False),
        bbox_head=[
            dict(
                type='SCNetBBoxHead',
                num_shared_fcs=2,
                in_channels=256,
                fc_out_channels=1024,
                roi_feat_size=7,
                num_classes=60,  # modified
                bbox_coder=dict(
                    type='DeltaXYWHBBoxCoder',
                    target_means=[0., 0., 0., 0.],
                    target_stds=[0.1, 0.1, 0.2, 0.2]),
                reg_class_agnostic=True,
                freeze=False,
                loss_cls=dict(
                    type='CrossEntropyLoss',
                    use_sigmoid=False,
                    loss_weight=1.0),
                loss_bbox=dict(type='SmoothL1Loss', beta=1.0,
                               loss_weight=1.0)),
            dict(
                type='SCNetBBoxHead',
                num_shared_fcs=2,
                in_channels=256,
                fc_out_channels=1024,
                roi_feat_size=7,
                num_classes=60,  # modified
                bbox_coder=dict(
                    type='DeltaXYWHBBoxCoder',
                    target_means=[0., 0., 0., 0.],
                    target_stds=[0.05, 0.05, 0.1, 0.1]),
                reg_class_agnostic=True,
                freeze=False,
                loss_cls=dict(
                    type='CrossEntropyLoss',
                    use_sigmoid=False,
                    loss_weight=1.0),
                loss_bbox=dict(type='SmoothL1Loss', beta=1.0,
                               loss_weight=1.0)),
            dict(
                type='SCNetBBoxHead',
                num_shared_fcs=2,
                in_channels=256,
                fc_out_channels=1024,
                roi_feat_size=7,
                num_classes=60,  # modified
                bbox_coder=dict(
                    type='DeltaXYWHBBoxCoder',
                    target_means=[0., 0., 0., 0.],
                    target_stds=[0.033, 0.033, 0.067, 0.067]),
                reg_class_agnostic=True,
                freeze=False,
                loss_cls=dict(
                    type='CrossEntropyLoss',
                    use_sigmoid=False,
                    loss_weight=1.0),
                loss_bbox=dict(type='SmoothL1Loss', beta=1.0, loss_weight=1.0))
        ],
        mask_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=14, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32], freeze=False),
        mask_head=dict(
            type='SCNetMaskHead',
            num_convs=12,
            in_channels=256,
            conv_out_channels=256,
            num_classes=60,  # modified
            conv_to_res=True,
            freeze=False,
            loss_mask=dict(
                type='CrossEntropyLoss', use_mask=True, loss_weight=1.0)),
        glbctx_head=dict(
            type='GlobalContextHead',
            num_convs=4,
            in_channels=256,
            conv_out_channels=256,
            num_classes=60,  # modified
            loss_weight=3.0,
            conv_to_res=True,
            freeze=False),
        feat_relay_head=dict(
            type='FeatureRelayHead',
            in_channels=1024,
            out_conv_channels=256,
            roi_feat_size=7,
            scale_factor=2,
            freeze=False)))
```

What dataset did you use? COCO

Environment

Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.


fatal: not a git repository (or any of the parent directories): .git
sys.platform: linux
Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3: TITAN Xp
CUDA_HOME: /usr/local/cuda-10.1
NVCC: Cuda compilation tools, release 10.1, V10.1.168
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.8.1
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) oneAPI Math Kernel Library Version 2021.3-Product Build 20210617 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
- CuDNN 7.6.3
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.1, CUDNN_VERSION=7.6.3, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.9.1
OpenCV: 4.5.3
MMCV: 1.3.14
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMDetection: 2.17.0+

2. You may add additional information that may be helpful for locating the problem, such as
    - How you installed PyTorch [e.g., pip, conda, source]
    - Other environment variables that may be related (such as `$PATH`, `$LD_LIBRARY_PATH`, `$PYTHONPATH`, etc.)

**Error traceback**
If applicable, paste the error traceback here.

```none
2021-11-24 06:55:34,695 - mmdet - INFO - Saving checkpoint at 1 epochs
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 5000/5000, 18.2 task/s, elapsed: 274s, ETA:     0s

Traceback (most recent call last):
  File "tools/train.py", line 189, in <module>
    main()
  File "tools/train.py", line 185, in main
    meta=meta)
  File "/home/pr04/tmp/tmp/mmdetection/mmdet/apis/train.py", line 174, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/pr04/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/pr04/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_epoch')
  File "/home/pr04/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/pr04/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/hooks/evaluation.py", line 237, in after_train_epoch
    self._do_evaluate(runner)
  File "/home/pr04/tmp/tmp/mmdetection/mmdet/core/evaluation/eval_hooks.py", line 58, in _do_evaluate
    key_score = self.evaluate(runner, results)
  File "/home/pr04/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/hooks/evaluation.py", line 325, in evaluate
    results, logger=runner.logger, **self.eval_kwargs)
  File "/home/pr04/tmp/tmp/mmdetection/mmdet/datasets/coco.py", line 433, in evaluate
    result_files, tmp_dir = self.format_results(results, jsonfile_prefix)
  File "/home/pr04/tmp/tmp/mmdetection/mmdet/datasets/coco.py", line 378, in format_results
    result_files = self.results2json(results, jsonfile_prefix)
  File "/home/pr04/tmp/tmp/mmdetection/mmdet/datasets/coco.py", line 315, in results2json
    json_results = self._segm2json(results)
  File "/home/pr04/tmp/tmp/mmdetection/mmdet/datasets/coco.py", line 266, in _segm2json
    data['category_id'] = self.cat_ids[label]
IndexError: list index out of range
Killing subprocess 19824
Killing subprocess 19825
Killing subprocess 19826
Killing subprocess 19827
Traceback (most recent call last):
  File "/home/pr04/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/pr04/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/pr04/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/pr04/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/pr04/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/pr04/anaconda3/envs/open-mmlab/bin/python', '-u', 'tools/train.py', '--local_rank=3', 'configs_TFA/_fewshot_scnet/base/scnet/wo_semantic_head/scnet_r50_fpn_1x_coco_wo_semantic_head.py', '--launcher', 'pytorch']' returned non-zero exit status 1.
```

I stepped into _segm2json in coco.py and found that the results argument (the output of the trained model) contained only empty tensors. I suspect this is related to num_classes, since everything worked properly when I left num_classes at the default setting (even without the semantic head).

What should I do if I want to change the num_classes of SCNet?

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

AronLin commented 2 years ago

The num_classes of the model must be equal to the num_classes of the dataset. You should reset the CLASSES of the datasets too. Here is the example: https://github.com/open-mmlab/mmdetection/blob/master/docs/tutorials/customize_dataset.md#1-modify-the-config-file-for-using-the-customized-dataset.
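As a rough illustration of keeping the two settings in sync, here is a minimal sketch in the style of an MMDetection 2.x config. The class names are placeholders rather than the reporter's actual classes, and only the mask head is shown; every head that takes `num_classes` would need the same value.

```python
# Hypothetical class names; use the exact names from the annotation file.
classes = ('person', 'traffic light', 'stop sign')

# Dataset side: pass the same tuple to every split.
data = dict(
    train=dict(classes=classes),
    val=dict(classes=classes),
    test=dict(classes=classes))

# Model side: derive num_classes from the tuple so the two values
# cannot drift apart. Only one head is shown in this sketch.
model = dict(
    roi_head=dict(
        mask_head=dict(num_classes=len(classes))))
```

Deriving `num_classes` from `len(classes)` is a design choice of this sketch, not something the tutorial requires.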

buncybunny commented 2 years ago

@AronLin Thanks for the reply, but unfortunately this doesn't address the issue. If the num_classes of the model and the number of classes in the dataset were different, as you suggest, an assertion error like the one below would be raised.

The `num_classes` (80) in SCNetBBoxHead of MMDistributedDataParallel does not matches the length of `CLASSES` 60) in CocoDataset

There were no assertion errors like this, since I had already set the num_classes of the model (in the config) and of the dataset (in mmdet/datasets/coco.py) to the same value. The model trained fine for one epoch, and the error occurred in the after_train_epoch hook. I debugged the code and found that the model output was empty.

AronLin commented 2 years ago

OK, got it. According to this line:

File "/home/pr04/tmp/tmp/mmdetection/mmdet/datasets/coco.py", line 266, in _segm2json
    data['category_id'] = self.cat_ids[label]

The predicted label may be out of range; you should check that first.

The model may also not be well trained, which would make the output empty. You can compare your log against the official one to check whether the loss is decreasing normally.
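One way to run the out-of-range check is sketched below. It assumes each entry of `results` is a `(bbox_results, segm_results)` pair whose elements are per-class lists, so the list length is the number of labels the model can emit. The helper name is hypothetical and not part of MMDetection.

```python
def find_label_overflow(results, cat_ids):
    """Return indices of images whose per-class lists are longer than cat_ids.

    A longer list means some label index has no entry in cat_ids, which
    would raise IndexError on a lookup like cat_ids[label].
    """
    overflowing = []
    for idx, (det, seg) in enumerate(results):
        if len(det) > len(cat_ids) or len(seg) > len(cat_ids):
            overflowing.append(idx)
    return overflowing


# Toy data: a model with 3 classes, but only 2 category ids were resolved.
toy_results = [([[], [], []], [[], [], []])]
toy_cat_ids = [1, 2]
print(find_label_overflow(toy_results, toy_cat_ids))  # [0]
```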

buncybunny commented 2 years ago

@AronLin I solved the problem by replacing the underscores in the class names with spaces (' '). Thanks
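This fix is consistent with the traceback if class names are matched against the annotation file by exact string: a name such as `traffic_light` would not match the COCO category `traffic light`, so fewer category ids would be resolved than the model has classes. A minimal sketch of a pre-training check under that assumption (the helper name and sample data are illustrative only):

```python
def unmatched_class_names(classes, categories):
    """Return the names in `classes` with no exact match in `categories`.

    `categories` is assumed to follow the COCO annotation layout:
    a list of dicts that each carry a 'name' key.
    """
    known = {cat['name'] for cat in categories}
    return [name for name in classes if name not in known]


# Sample data in COCO annotation style.
categories = [
    {'id': 10, 'name': 'traffic light'},
    {'id': 13, 'name': 'stop sign'},
]
classes = ('traffic_light', 'stop sign')

missing = unmatched_class_names(classes, categories)
print(missing)  # ['traffic_light']
```

Running a check like this on the `categories` field of the annotation JSON before training would surface the mismatch earlier than the evaluation hook does.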

open-mmlab / mmdetection

SCNet model output is empty when the num_classes is changed #6583