open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Re-training MaskRCNN on custom dataset with multiple classes only trains with a single class, ignoring the others #10578

Open emvollmer opened 1 year ago

emvollmer commented 1 year ago

Hi there,

I previously used MMDet v2 and have now switched to v3 to retrain a MaskRCNN model on a custom dataset. Some background info about my dataset and the adaptations made to customize things:

Procedure

I used the following commands to call train.py and test.py:

python train.py /.../configs/mmdet/mask_rcnn/mask-rcnn_r50_fpn_1x_coco.py \
    --work-dir /.../model/outputs/

# the model from the last epoch is evaluated (here: epoch 60)
python test.py /.../configs/mmdet/mask_rcnn/mask-rcnn_r50_fpn_1x_coco.py /.../model/outputs/epoch_60.pth \
    --work-dir /.../model/outputs/eval/ \
    --out /.../model/outputs/eval/predictions_epoch-60.pickle \
    --show-dir /.../model/outputs/eval/plots/

# second script call with classwise evaluation
python test.py /.../configs/mmdet/mask_rcnn/mask-rcnn_r50_fpn_1x_coco.py /.../model/outputs/epoch_60.pth \
    --work-dir /.../model/outputs/eval_classwise/ \
    --out /.../model/outputs/eval_classwise/predictions_epoch-60.pickle \
    --show-dir /.../model/outputs/eval_classwise/plots/ \
    --cfg-options test_evaluator.classwise=True

Expected results

Model trained to identify instances in images of all 11 classes.

Actual results

Everything runs without errors, but the test.py outputs cover only one of the 11 classes: the plots display only that class, and metrics are reported only for it.

I can't figure out where things are going wrong - if there's an issue with training or just in the display of the results. The class that is shown ("person") is the 9th out of 11, but is the last class to occur in both train and test datasets going by order of images, so maybe the outputs are being overwritten so only the last one remains?
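To check how the classes are actually distributed across images, one option is to count the distinct images per category directly from the COCO annotation file. The following is a minimal sketch using only the standard library; the `example` dict is a tiny hypothetical stand-in for `thermal_annotations_coco.json`, and the function name is an assumption made for illustration.

```python
import json
from collections import defaultdict


def images_per_category(coco: dict) -> dict:
    """Map each category name to the number of distinct images annotated with it."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    images = defaultdict(set)
    for ann in coco["annotations"]:
        images[id_to_name[ann["category_id"]]].add(ann["image_id"])
    # Report categories without any annotation as 0 so gaps stay visible.
    return {name: len(images[name]) for name in id_to_name.values()}


# Hypothetical miniature annotation file (not the real dataset).
example = {
    "categories": [
        {"id": 1, "name": "building"},
        {"id": 2, "name": "person"},
        {"id": 3, "name": "car (cold)"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1},
        {"id": 11, "image_id": 1, "category_id": 2},
        {"id": 12, "image_id": 2, "category_id": 2},
        {"id": 13, "image_id": 2, "category_id": 2},
    ],
}

counts = images_per_category(example)
print(counts)

# For a real file, load it first:
#     with open(ann_path) as f:      # ann_path: path to the COCO json
#         counts = images_per_category(json.load(f))
```

If the count for one class matches the number of images the training loop actually iterates over, that points toward the other classes being dropped at data-loading time rather than at visualization time.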

Thanks in advance for any help, ideas or assistance you can provide! I've added more details below.

Details

Below you'll find config excerpts from the log, which in this case is the same for train.py and test.py. For evaluation, I used the standard CocoMetric and, through the DetLocalVisualizer, DumpDetResults. Important, relevant changes are in bold.

2023/06/28 10:02:20 - mmengine - INFO -
------------------------------------------------------------
System environment:
    sys.platform: linux
    Python: 3.8.12 (default, Sep 16 2021, 10:46:05) [GCC 8.5.0 20210514 (Red Hat 8.5.0-3)]
    CUDA available: True
    numpy_random_seed: 42
    GPU 0: NVIDIA A100-PCIE-40GB
    CUDA_HOME: /.../cuda/11.8
    NVCC: Cuda compilation tools, release 11.8, V11.8.89
    GCC: gcc (GCC) 11.2.0
    PyTorch: 2.0.1+cu118
    PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.7
  - Magma 2.6.1
  - Build settings: [...]

    TorchVision: 0.15.2+cu118
    OpenCV: 4.7.0
    MMEngine: 0.7.4

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 42
    Distributed launcher: none
    Distributed training: False
    GPU number: 1
------------------------------------------------------------

2023/06/28 10:02:21 - mmengine - INFO - Config:
img_norm_cfg = dict(
    mean=[18.72, 19.515, 18.903, 130.248], std=[19.375, 21.03, 21.674, 43.674])
model = dict(
    type='MaskRCNN',
    data_preprocessor=dict(
        type='CustomDataPreprocessor',
        mean=[18.72, 19.515, 18.903, 130.248],
        std=[19.375, 21.03, 21.674, 43.674],
        pad_mask=True,
        pad_size_divisor=32),
    backbone=dict(
        type='ResNet',
        depth=50,
        in_channels=4,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=False,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5),
    rpn_head=dict(
        type='RPNHead',
        [...]),
    roi_head=dict(
        type='StandardRoIHead',
        [...]),
    train_cfg=dict([...]),
    test_cfg=dict([...])
)
data_root = '/.../model/data/'
train_img_prefix = 'train/'
val_img_prefix = 'test/'
dataset_type = 'CocoDataset'
metainfo = dict(
    CLASSES=('building', 'car (cold)', 'car (warm)', 'manhole (round) cold',
             'manhole (round) warm', 'manhole (square) cold',
             'manhole (square) warm', 'miscellaneous', 'person',
             'street lamp cold', 'street lamp warm'))
train_ann_file = 'train/annotations/thermal_annotations_coco.json'
val_ann_file = 'test/annotations/thermal_annotations_coco.json'
test_ann_file = 'test/annotations/thermal_annotations_coco.json'

train_pipeline = [
    dict(type='LoadNumpyImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    dict(type='Resize', scale=(3750, 3000), keep_ratio=True),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PackDetInputs')
]
test_pipeline = [
    dict(type='LoadNumpyImageFromFile'),
    dict(type='Resize', scale=(3750, 3000), keep_ratio=True),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    dict(
        type='PackDetInputs',
        meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                   'scale_factor'))
]
train_dataloader = dict(
    batch_size=2,
    num_workers=2,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    batch_sampler=dict(type='AspectRatioBatchSampler'),
    dataset=dict(
        type='CocoDataset',
        data_root='/.../model/data/',
        ann_file='/.../model/data/train/annotations/thermal_annotations_coco.json',
        data_prefix=dict(img='train/'),
        filter_cfg=dict(filter_empty_gt=True, min_size=32),
        pipeline=[
            dict(type='LoadNumpyImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
            dict(type='Resize', scale=(3750, 3000), keep_ratio=True),
            dict(type='RandomFlip', prob=0.5),
            dict(type='PackDetInputs')
        ],
        metainfo=dict(
            CLASSES=('building', 'car (cold)', 'car (warm)',
                     'manhole (round) cold', 'manhole (round) warm',
                     'manhole (square) cold', 'manhole (square) warm',
                     'miscellaneous', 'person', 'street lamp cold',
                     'street lamp warm'))))
val_dataloader = dict(
    batch_size=2,
    num_workers=2,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type='CocoDataset',
        data_root='/.../model/data/',
        ann_file='/.../model/data/test/annotations/thermal_annotations_coco.json',
        data_prefix=dict(img='test/'),
        test_mode=True,
        pipeline=[
            dict(type='LoadNumpyImageFromFile'),
            dict(type='Resize', scale=(3750, 3000), keep_ratio=True),
            dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
            dict(
                type='PackDetInputs',
                meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                           'scale_factor'))
        ],
        metainfo=dict(
            CLASSES=('building', 'car (cold)', 'car (warm)',
                     'manhole (round) cold', 'manhole (round) warm',
                     'manhole (square) cold', 'manhole (square) warm',
                     'miscellaneous', 'person', 'street lamp cold',
                     'street lamp warm'))))
test_dataloader = dict(
    batch_size=2,
    num_workers=2,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type='CocoDataset',
        data_root='/.../model/data/',
        ann_file='/.../model/data/test/annotations/thermal_annotations_coco.json',
        data_prefix=dict(img='test/'),
        test_mode=True,
        pipeline=[
            dict(type='LoadNumpyImageFromFile'),
            dict(type='Resize', scale=(3750, 3000), keep_ratio=True),
            dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
            dict(
                type='PackDetInputs',
                meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                           'scale_factor'))
        ],
        metainfo=dict(
            CLASSES=('building', 'car (cold)', 'car (warm)',
                     'manhole (round) cold', 'manhole (round) warm',
                     'manhole (square) cold', 'manhole (square) warm',
                     'miscellaneous', 'person', 'street lamp cold',
                     'street lamp warm'))))

val_evaluator = dict(
    type='CocoMetric',
    ann_file='/.../model/data/test/annotations/thermal_annotations_coco.json',
    metric=['bbox', 'segm'],
    format_only=False)
test_evaluator = dict(
    type='CocoMetric',
    ann_file='/.../model/data/test/annotations/thermal_annotations_coco.json',
    metric=['bbox', 'segm'],
    format_only=False)

train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=60, val_interval=1)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')
param_scheduler = [
    dict(
        type='LinearLR', start_factor=0.001, by_epoch=False, begin=0, end=500),
    dict(
        type='MultiStepLR',
        begin=0,
        end=12,
        by_epoch=True,
        milestones=[8, 11],
        gamma=0.1)
]
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001))
auto_scale_lr = dict(enable=False, base_batch_size=16)
custom_imports = dict(
    imports=['numpy_loader', 'data_preprocessor'], allow_failed_imports=False)
randomness = dict(seed=42)
default_scope = 'mmdet'
default_hooks = dict(
    timer=dict(type='IterTimerHook'),
    logger=dict(type='LoggerHook', interval=50),
    param_scheduler=dict(type='ParamSchedulerHook'),
    checkpoint=dict(type='CheckpointHook', interval=1),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    visualization=dict(type='NumpyDetVisualizationHook'))
env_cfg = dict(
    cudnn_benchmark=False,
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    dist_cfg=dict(backend='nccl'))
vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
    type='DetLocalVisualizer',
    vis_backends=[dict(type='LocalVisBackend')],
    name='visualizer')
log_processor = dict(type='LogProcessor', window_size=50, by_epoch=True)
log_level = 'INFO'
[...]
launcher = 'none'
work_dir = '/.../model/outputs/'

2023/06/28 10:35:17 - mmengine - INFO - Evaluating bbox...
2023/06/28 10:35:17 - mmengine - INFO - bbox_mAP_copypaste: 0.404 0.685 0.407 0.252 0.419 0.200
2023/06/28 10:35:17 - mmengine - INFO - Evaluating segm...
2023/06/28 10:35:17 - mmengine - INFO - segm_mAP_copypaste: 0.351 0.676 0.301 0.076 0.361 0.300
2023/06/28 10:35:17 - mmengine - INFO - Epoch(val) [60][80/80] coco/bbox_mAP: 0.4040 coco/bbox_mAP_50: 0.6850 coco/bbox_mAP_75: 0.4070 coco/bbox_mAP_s: 0.2520 coco/bbox_mAP_m: 0.4190 coco/bbox_mAP_l: 0.2000 coco/segm_mAP: 0.3510 coco/segm_mAP_50: 0.6760 coco/segm_mAP_75: 0.3010 coco/segm_mAP_s: 0.0760 coco/segm_mAP_m: 0.3610 coco/segm_mAP_l: 0.3000 data_time: 0.0538 time: 0.4626


- the first `test.py` call results in the following metrics:

2023/06/30 10:16:52 - mmengine - WARNING - The prefix is not set in metric class DumpDetResults.
2023/06/30 10:16:54 - mmengine - INFO - Load checkpoint from /.../model/outputs/epoch_60.pth
2023/06/30 10:20:37 - mmengine - INFO - Epoch(test) [ 50/159] eta: 0:07:51 time: 4.3254 data_time: 3.8272 memory: 3966
2023/06/30 10:24:13 - mmengine - INFO - Epoch(test) [100/159] eta: 0:04:16 time: 4.3728 data_time: 4.0667 memory: 3966
2023/06/30 10:27:51 - mmengine - INFO - Epoch(test) [150/159] eta: 0:00:39 time: 4.3717 data_time: 4.0541 memory: 3966
2023/06/30 10:28:27 - mmengine - INFO - Evaluating bbox...
2023/06/30 10:28:27 - mmengine - INFO - bbox_mAP_copypaste: 0.404 0.685 0.407 0.252 0.419 0.200
2023/06/30 10:28:27 - mmengine - INFO - Evaluating segm...
2023/06/30 10:28:27 - mmengine - INFO - segm_mAP_copypaste: 0.350 0.675 0.301 0.076 0.361 0.300
2023/06/30 10:28:27 - mmengine - INFO - Results has been saved to /.../model/outputs/eval/predictions_epoch-60.pickle.
2023/06/30 10:28:27 - mmengine - INFO - Epoch(test) [159/159] coco/bbox_mAP: 0.4040 coco/bbox_mAP_50: 0.6850 coco/bbox_mAP_75: 0.4070 coco/bbox_mAP_s: 0.2520 coco/bbox_mAP_m: 0.4190 coco/bbox_mAP_l: 0.2000 coco/segm_mAP: 0.3500 coco/segm_mAP_50: 0.6750 coco/segm_mAP_75: 0.3010 coco/segm_mAP_s: 0.0760 coco/segm_mAP_m: 0.3610 coco/segm_mAP_l: 0.3000 data_time: 3.9675 time: 4.3337


- the second `test.py` call, with the additional `classwise=True` in `test_evaluator`, shows only a single class (person), and its metrics are identical to the overall ones.

2023/06/30 14:05:50 - mmengine - WARNING - The prefix is not set in metric class DumpDetResults.
2023/06/30 14:05:57 - mmengine - INFO - Load checkpoint from /.../model/outputs/epoch_60.pth
2023/06/30 14:09:53 - mmengine - INFO - Epoch(test) [ 50/159] eta: 0:08:19 time: 4.5861 data_time: 3.8918 memory: 3966
2023/06/30 14:13:29 - mmengine - INFO - Epoch(test) [100/159] eta: 0:04:24 time: 4.3830 data_time: 4.0796 memory: 3966
2023/06/30 14:17:09 - mmengine - INFO - Epoch(test) [150/159] eta: 0:00:40 time: 4.3968 data_time: 4.0065 memory: 3966
2023/06/30 14:17:44 - mmengine - INFO - Evaluating bbox...
2023/06/30 14:17:44 - mmengine - INFO -
+----------+-------+--------+--------+-------+-------+-------+
| category | mAP   | mAP_50 | mAP_75 | mAP_s | mAP_m | mAP_l |
+----------+-------+--------+--------+-------+-------+-------+
| person   | 0.404 | 0.685  | 0.407  | 0.252 | 0.419 | 0.2   |
+----------+-------+--------+--------+-------+-------+-------+
2023/06/30 14:17:44 - mmengine - INFO - bbox_mAP_copypaste: 0.404 0.685 0.407 0.252 0.419 0.200
2023/06/30 14:17:44 - mmengine - INFO - Evaluating segm...
2023/06/30 14:17:45 - mmengine - INFO -
+----------+------+--------+--------+-------+-------+-------+
| category | mAP  | mAP_50 | mAP_75 | mAP_s | mAP_m | mAP_l |
+----------+------+--------+--------+-------+-------+-------+
| person   | 0.35 | 0.675  | 0.301  | 0.076 | 0.361 | 0.3   |
+----------+------+--------+--------+-------+-------+-------+
2023/06/30 14:17:45 - mmengine - INFO - segm_mAP_copypaste: 0.350 0.675 0.301 0.076 0.361 0.300
2023/06/30 14:17:45 - mmengine - INFO - Results has been saved to /.../model/outputs/eval_classwise/predictions_epoch-60.pickle.
2023/06/30 14:17:45 - mmengine - INFO - Epoch(test) [159/159] coco/person_precision: 0.3500 coco/bbox_mAP: 0.4040 coco/bbox_mAP_50: 0.6850 coco/bbox_mAP_75: 0.4070 coco/bbox_mAP_s: 0.2520 coco/bbox_mAP_m: 0.4190 coco/bbox_mAP_l: 0.2000 coco/segm_mAP: 0.3500 coco/segm_mAP_50: 0.6750 coco/segm_mAP_75: 0.3010 coco/segm_mAP_s: 0.0760 coco/segm_mAP_m: 0.3610 coco/segm_mAP_l: 0.3000 data_time: 3.9770 time: 4.4266
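The observation that the single classwise row equals the overall metrics can be verified mechanically instead of by eye. The sketch below parses a `*_mAP_copypaste` log line and a classwise table row (both copied from the bbox log output above) and compares the values; the two helper names are hypothetical and not part of MMDetection.

```python
def parse_copypaste(line: str) -> list:
    """Extract the floats that follow '<metric>_mAP_copypaste:' in a log line."""
    _, _, values = line.partition("_copypaste:")
    return [float(v) for v in values.split()]


def parse_table_row(row: str) -> tuple:
    """Split a '| name | v1 | v2 | ... |' table row into (name, [floats])."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    return cells[0], [float(c) for c in cells[1:]]


overall = parse_copypaste(
    "2023/06/30 14:17:44 - mmengine - INFO - "
    "bbox_mAP_copypaste: 0.404 0.685 0.407 0.252 0.419 0.200"
)
name, per_class = parse_table_row(
    "| person | 0.404 | 0.685 | 0.407 | 0.252 | 0.419 | 0.2 |"
)

# True means the overall mAP is computed from this one class alone.
same = overall == per_class
print(name, same)
```

Identical values are what one would expect if the evaluator only ever sees one category, since the mean over a single class is that class's own score.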

emvollmer commented 1 year ago

I haven't been able to solve the problem but it's definitely an issue with training.

I have 634 training images and 159 test images. The logs shown above come from running train.py with batch_size=2. That means only 39 * 2 = 78 images are used in training, which happens to be exactly the number of images that contain a person annotation. All others are apparently ignored?
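The arithmetic behind this can be spelled out. The sketch below assumes that 39 is the number of training iterations per epoch reported in the training log (that part of the log is not shown in this issue), and that a dataloader without `drop_last` yields `ceil(num_images / batch_size)` batches.

```python
import math


def iters_per_epoch(num_images: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches one pass over the dataset should produce."""
    if drop_last:
        return num_images // batch_size
    return math.ceil(num_images / batch_size)


expected_train_iters = iters_per_epoch(634, 2)  # all training images used
val_iters = iters_per_epoch(159, 2)             # compare with "[80/80]" in the val log
images_seen = 39 * 2                            # assumed 39 iterations at batch_size=2

print(expected_train_iters, val_iters, images_seen)
```

With all 634 images in play, an epoch should take 317 iterations rather than 39, while the 80 validation iterations do match 159 test images. One thing that may be worth checking (an untested guess) is whether `filter_cfg=dict(filter_empty_gt=True, min_size=32)` in the train dataset removes every image that has no annotation in a class the dataset recognizes; the test datasets have no such filter, which would be consistent with the full 159 images appearing there.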

I've double-checked both the MMDetection InstanceSeg_Tutorial and general demo to ensure I have the configs adapted correctly for retraining - which was the case. In particular, both cfg.model.roi_head.bbox_head.num_classes and cfg.model.roi_head.mask_head.num_classes are set to match the number of classes I have.
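A small consistency check over the class-related settings might help narrow this down further. The sketch below rests on an assumption that is not confirmed in this thread: that MMDetection 3.x reads class names from a lowercase `classes` key in `metainfo` (as its custom-dataset documentation examples use), whereas the config above passes `CLASSES`. The function name and the shortened class tuples are hypothetical.

```python
def check_class_config(metainfo: dict, num_classes: int, ann_names: list) -> list:
    """Return human-readable warnings; an empty list means nothing looked off."""
    warnings = []
    if "classes" not in metainfo:
        # Assumption: only the lowercase key is honored by the dataset class.
        warnings.append(
            f"metainfo has no lowercase 'classes' key (keys found: {list(metainfo)})"
        )
    names = tuple(metainfo.get("classes", ()))
    if names and len(names) != num_classes:
        warnings.append(f"num_classes={num_classes} but {len(names)} class names given")
    missing = sorted(set(ann_names) - set(names))
    if names and missing:
        warnings.append(f"annotation categories not listed in metainfo: {missing}")
    return warnings


ann_names = ["building", "person"]  # shortened, hypothetical category list

as_in_issue = check_class_config(dict(CLASSES=("building", "person")), 2, ann_names)
lowercase = check_class_config(dict(classes=("building", "person")), 2, ann_names)

print(as_in_issue)
print(lowercase)
```

If the key casing does matter, the dataset would fall back to its built-in class list, and only annotation categories whose names also appear in that list would be loaded.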

I've compared all my configs with their original counterparts here (see mask-rcnn_r50_fpn_1x_coco.py and the files it inherits from) and am using the current train.py.

As I'm at a loss as to where else this could be coming from, any help would be greatly appreciated!

(FYI: Another peculiar thing is that the statement ----FORWARD IS DONE----- gets printed dozens of times in between log messages. That never happened with previous versions of MMDet, and I haven't seen it mentioned here anywhere either. It's probably unrelated, but I still wanted to mention it.)