open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Discobox evaluation failed with custom dataset - implementation based on mmdet #8170

Open ameyparanjape opened 2 years ago

ameyparanjape commented 2 years ago

Type of reimplementation: 3) Reimplement a custom model, but all the components are implemented in MMDetection (DiscoBox).

Checklist

  1. I have searched related issues but could not get the expected help. - Done
  2. The issue has not been fixed in the latest version. - Not Sure

Describe the issue

Training DiscoBox on a custom COCO-style dataset starts fine, but evaluation fails: the test process is killed partway through. Reproduction details and the full error are below.

Reproduction

  1. What command or script did you run?
bash tools/dist_train.sh configs/discobox/custom_discobox_solov2_r50_fpn_3x.py 2
  2. What config did you run?
fp16 = dict(loss_scale=512.)
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
model = dict(
    type='DiscoBoxSOLOv2',
    pretrained='torchvision://resnet50',
    train_cfg=dict(),
    test_cfg = dict(
        nms_pre=500,
        score_thr=0.1,
        mask_thr=0.4,
        update_thr=0.05,
        kernel='gaussian',  # gaussian/linear
        sigma=2.0,
        max_per_img=100),
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3), # C2, C3, C4, C5
        frozen_stages=1,
        style='pytorch'),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=0,
        num_outs=5),
    bbox_head=dict(
        type='DiscoBoxSOLOv2Head',
        num_classes=6,
        in_channels=256,
        stacked_convs=4,
        seg_feat_channels=512,
        strides=[8, 8, 16, 32, 32],
        scale_ranges=((1, 96), (48, 192), (96, 384), (192, 768), (384, 2048)),
        sigma=0.2,
        num_grids=[40, 36, 24, 16, 12],
        ins_out_channels=256,
        loss_ins=dict(
            type='DiceLoss',
            use_sigmoid=True,
            loss_weight=1.0),
        loss_ts=dict(
            type='DiceLoss',
            momentum=0.999,
            use_ind_teacher=True,
            loss_weight=1.0,
            kernel=3,
            max_iter=10,
            alpha0=2.0,
            theta0=0.5,
            theta1=30.0,
            theta2=20.0,
            base=0.10,
            crf_height=28,
            crf_width=28,
        ),
        loss_cate=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),
        loss_corr=dict(
            type='InfoNCE',
            loss_weight=1.0,
            corr_exp=1.0,
            corr_eps=0.05,
            gaussian_filter_size=3,
            low_score=0.3,
            corr_num_iter=10,
            corr_num_smooth_iter=1,
            save_corr_img=False,
            dist_kernel=9,
            obj_bank=dict(
                img_norm_cfg=img_norm_cfg,
                len_object_queues=100,
                fg_iou_thresh=0.7,
                bg_iou_thresh=0.7,
                ratio_range=[0.9, 1.2],
                appear_thresh=0.7,
                min_retrieval_objs=2,
                max_retrieval_objs=5,
                feat_height=7,
                feat_width=7,
                mask_height=28,
                mask_width=28,
                img_height=200,
                img_width=200,
                min_size=32,
                num_gpu_bank=20,
            )
        )
    ),
    mask_feat_head=dict(
            type='DiscoBoxMaskFeatHead',
            in_channels=256,
            out_channels=128,
            start_level=0,
            end_level=3,
            num_classes=256,
            norm_cfg=dict(type='GN', num_groups=32, requires_grad=True)),
    )

# dataset settings
dataset_type = 'CocoDataset'
data_root = 'data/'
classes = ('a','b','c','d','e','f',)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='GenerateBoxMask'),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    samples_per_gpu=8,
    workers_per_gpu=0,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/train.json',
        img_prefix=data_root + 'train/',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/val.json',
        img_prefix=data_root + 'val/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/test.json',
        img_prefix=data_root + 'test/',
        pipeline=test_pipeline))
# optimizer
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
# learning policy
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=2000,
    warmup_ratio=0.01,
    step=[8, 9])
checkpoint_config = dict(interval=1)
# yapf:disable
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        # dict(type='TensorboardLoggerHook')
    ])
# yapf:enable
# runtime settings
runner = dict(type='EpochBasedRunner', max_epochs=20)
evaluation = dict(interval=1, metric=['bbox', 'segm'])
device_ids = range(8)
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = './work_dirs/exp1'
load_from = None
resume_from = None
workflow = [('train', 1)]
  3. Did you make any modifications on the code or config? Did you understand what you have modified? - I modified the config for a custom COCO-style dataset with 6 classes (see the note below).

  4. What dataset did you use? - A custom COCO-style dataset with bbox and segmentation annotations.
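
Note: the classes tuple in the config above is defined but not passed into the dataset dicts; as far as I understand, a custom CocoDataset needs it explicitly or it falls back to the default 80 COCO category names. The train split would then look roughly like the sketch below (val and test would take the same classes entry):

# Sketch only: explicitly hand the 6 custom class names to the dataset split,
# reusing data_root and train_pipeline from the config above.
classes = ('a', 'b', 'c', 'd', 'e', 'f')
data = dict(
    samples_per_gpu=8,
    workers_per_gpu=0,
    train=dict(
        type='CocoDataset',
        classes=classes,  # explicit 6-class list for the custom dataset
        ann_file=data_root + 'annotations/train.json',
        img_prefix=data_root + 'train/',
        pipeline=train_pipeline))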

Environment

  1. Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.
sys.platform: linux
Python: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:21) [GCC 9.4.0]
CUDA available: True
GPU 0,1: Tesla P100-PCIE-16GB
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.3.r11.3/compiler.29920130_0
GCC: gcc (Debian 8.3.0-6) 8.3.0
PyTorch: 1.6.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.5.0 (Git Hash e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.3
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

TorchVision: 0.7.0
OpenCV: 4.6.0
MMCV: 1.3.17
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMDetection: 2.25.0+5bdb4ee
  2. You may add additional information that may be helpful for locating the problem, such as:
    1. How you installed PyTorch [e.g., pip, conda, source] - conda

Results

If applicable, paste the related results here, e.g., what you expect and what you get.

Expected behaviour: training and evaluation both run to completion. What I get instead: an error during evaluation.

  [>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                     ] 166/279, 1.3 task/s, elapsed: 124s, ETA:    85s
Traceback (most recent call last):
  File "/opt/conda/envs/open-mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/envs/open-mmlab/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/opt/conda/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/envs/open-mmlab/bin/python', '-u', 'tools/test.py', '--local_rank=0', 'configs/discobox/custom_solov2_r50_fpn_3x.py', 'work_dirs/roboflow_data/epoch_1.pth', '--launcher', 'pytorch', '--eval', 'bbox', 'segm']' died with <Signals.SIGKILL: 9>.

Any help will be appreciated!

chhluo commented 2 years ago

The process died with <Signals.SIGKILL: 9>.

There may be a resource problem during training or evaluation, such as running out of memory or CPUs.

Besides, there is a related issue: see https://github.com/open-mmlab/mmdetection/issues/3907.
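
If it is indeed host memory, the predicted masks collected during evaluation are often what grows. A rough mitigation sketch (values are only illustrative, and keeping fewer detections will also shift the metrics a bit) is to tighten the test_cfg that is already in your config:

test_cfg = dict(
    nms_pre=200,       # was 500: keep fewer mask candidates before Matrix NMS
    score_thr=0.3,     # was 0.1: drop low-confidence masks earlier
    mask_thr=0.4,
    update_thr=0.05,
    kernel='gaussian',
    sigma=2.0,
    max_per_img=50)    # was 100: store at most 50 masks per image

Checking dmesg for OOM-killer messages right after the crash, or watching free -h while evaluation runs, would also confirm whether memory is the cause.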