open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0

WHY DOES THE VALIDATION ALWAYS ADOPT THE ITERATION NO.500 WEIGHTS? #596

Closed: yan-hao-tian closed 2 years ago

yan-hao-tian commented 3 years ago

I am running a customized model with this awesome repo. My problem is that I have run it on two machines, but the performance difference is really big (5%). After a careful comparison, I believe the network architecture, the training settings, and the data are all the same in the two training runs; only the mmsegmentation version differs (one is 0.12.0+adca7b9, the other is 0.14.0+5d46314). However, I found a small gap in their logs. Each validation in the 0.12.0 log is reported like this:

Iter(val) [28000] mIoU: 0.7254, mAcc: 0.8161, aAcc: 0.9539, IoU.road: 0.9795, IoU.sidewalk: 0.8321, IoU.building: 0.9145, IoU.wall: 0.4059, IoU.fence: 0.5777, IoU.pole: 0.6191, IoU.traffic light: 0.6997, IoU.traffic sign: 0.7788, IoU.vegetation: 0.9139, IoU.terrain: 0.5787, IoU.sky: 0.9350, IoU.person: 0.7334, IoU.rider: 0.4358, IoU.car: 0.9459, IoU.truck: 0.6914, IoU.bus: 0.8188, IoU.train: 0.5697, IoU.motorcycle: 0.5911, IoU.bicycle: 0.7610, Acc.road: 0.9872, Acc.sidewalk: 0.9331, Acc.building: 0.9582, Acc.wall: 0.4476, Acc.fence: 0.7119, Acc.pole: 0.7431, Acc.traffic light: 0.8470, Acc.traffic sign: 0.8604, Acc.vegetation: 0.9662, Acc.terrain: 0.6621, Acc.sky: 0.9660, Acc.person: 0.8298, Acc.rider: 0.8479, Acc.car: 0.9748, Acc.truck: 0.7268, Acc.bus: 0.8473, Acc.train: 0.5773, Acc.motorcycle: 0.7610, Acc.bicycle: 0.8574

In contrast, the 0.14.0 log shows:

Iter [500/80000] lr: 5.405e-03, eta: 16:21:04, time: 1.354, data_time: 0.016, memory: 23274, aAcc: 0.9440, mIoU: 0.6553, mAcc: 0.7293, IoU.road: 0.9750, IoU.sidewalk: 0.7975, IoU.building: 0.8978, IoU.wall: 0.3486, IoU.fence: 0.4156, IoU.pole: 0.4608, IoU.traffic light: 0.6247, IoU.traffic sign: 0.7361, IoU.vegetation: 0.8999, IoU.terrain: 0.5254, IoU.sky: 0.9263, IoU.person: 0.7639, IoU.rider: 0.5327, IoU.car: 0.9082, IoU.truck: 0.4573, IoU.bus: 0.5918, IoU.train: 0.3723, IoU.motorcycle: 0.4882, IoU.bicycle: 0.7293, Acc.road: 0.9862, Acc.sidewalk: 0.8986, Acc.building: 0.9649, Acc.wall: 0.3756, Acc.fence: 0.4411, Acc.pole: 0.5404, Acc.traffic light: 0.7365, Acc.traffic sign: 0.8005, Acc.vegetation: 0.9608, Acc.terrain: 0.5934, Acc.sky: 0.9735, Acc.person: 0.8987, Acc.rider: 0.6601, Acc.car: 0.9760, Acc.truck: 0.4810, Acc.bus: 0.6360, Acc.train: 0.3780, Acc.motorcycle: 0.7376, Acc.bicycle: 0.8183, decode.loss_seg: 0.1664, decode.acc_seg: 89.6364, aux.loss_seg: 0.1033, aux.acc_seg: 86.9771, loss: 0.2697

The highlighted part, Iter [500/80000], shows the difference (each validation in 0.14.0 reports iteration No. 500), and the same pattern also appears in the log.json file:
0.14.0{"mode": "train", "epoch": 98, "iter": 500, "lr": 0.00135, "memory": 23289, "aAcc": 0.953, "mIoU": 0.7449, "mAcc": 0.8197, "IoU.road": 0.9779, "IoU.sidewalk": 0.8264, "IoU.building": 0.9115, "IoU.wall": 0.4454, "IoU.fence": 0.5855, "IoU.pole": 0.4794, "IoU.traffic light": 0.6638, "IoU.traffic sign": 0.7489, "IoU.vegetation": 0.9104, "IoU.terrain": 0.6164, "IoU.sky": 0.9332, "IoU.person": 0.7783, "IoU.rider": 0.5789, "IoU.car": 0.9402, "IoU.truck": 0.7853, "IoU.bus": 0.8431, "IoU.train": 0.7209, "IoU.motorcycle": 0.6521, "IoU.bicycle": 0.7561, "Acc.road": 0.9874, "Acc.sidewalk": 0.9095, "Acc.building": 0.9633, "Acc.wall": 0.4982, "Acc.fence": 0.6653, "Acc.pole": 0.5509, "Acc.traffic light": 0.7726, "Acc.traffic sign": 0.8229, "Acc.vegetation": 0.968, "Acc.terrain": 0.6844, "Acc.sky": 0.9658, "Acc.person": 0.9114, "Acc.rider": 0.7198, "Acc.car": 0.9772, "Acc.truck": 0.8697, "Acc.bus": 0.8978, "Acc.train": 0.7599, "Acc.motorcycle": 0.7674, "Acc.bicycle": 0.8831, "data_time": 0.6854, "decode.loss_seg": 0.12487, "decode.acc_seg": 89.96166, "aux.loss_seg": 0.07605, "aux.acc_seg": 88.0737, "loss": 0.20093, "time": 2.21966} 0.12.0{"mode": "val", "epoch": 33, "iter": 12000, "lr": 5e-05, "mIoU": 0.4241, "mAcc": 0.524, "aAcc": 0.8841, "IoU.road": 0.9151, "IoU.sidewalk": 0.6107, "IoU.building": 0.784, "IoU.wall": 0.2785, "IoU.fence": 0.2254, "IoU.pole": 0.2819, "IoU.traffic light": 0.047, "IoU.traffic sign": 0.3331, "IoU.vegetation": 0.8538, "IoU.terrain": 0.4476, "IoU.sky": 0.8661, "IoU.person": 0.4775, "IoU.rider": 0.0895, "IoU.car": 0.8109, "IoU.truck": 0.0517, "IoU.bus": 0.3099, "IoU.train": 0.1933, "IoU.motorcycle": 0.03, "IoU.bicycle": 0.4519, "Acc.road": 0.934, "Acc.sidewalk": 0.8033, "Acc.building": 0.907, "Acc.wall": 0.3544, "Acc.fence": 0.2874, "Acc.pole": 0.3629, "Acc.traffic light": 0.0478, "Acc.traffic sign": 0.3821, "Acc.vegetation": 0.9434, "Acc.terrain": 0.5618, "Acc.sky": 0.9637, "Acc.person": 0.7612, "Acc.rider": 0.0988, "Acc.car": 0.9088, "Acc.truck": 0.0532, "Acc.bus": 0.4091, "Acc.train": 0.3907, "Acc.motorcycle": 0.0306, "Acc.bicycle": 0.7553}

I can work around the problem by using the 0.12.0 version all the time. But I would still like to know: how do I validate correctly with 0.14.0? Or is there actually no problem in the validation of 0.14.0? Or am I unfortunately using a truly broken version?

Thanks a lot.
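For anyone checking the same thing, below is a minimal sketch of pulling the metric entries out of a log.json file. The path is a placeholder, and it assumes (as in the excerpts above) that each line of an mmseg 0.x log.json is one standalone JSON object:

import json

def val_entries(path):
    # Yield (mode, iter, mIoU) for every entry that carries validation metrics.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            if 'mIoU' in entry:
                yield entry.get('mode'), entry.get('iter'), entry['mIoU']

# In a healthy log the metrics arrive in "val"-mode entries whose iter counter
# advances (4000, 8000, ...); in the 0.14.0 excerpt above they ride on "train"
# entries whose iter counter is stuck at 500.
for mode, it, miou in val_entries('work_dirs/<exp>/<timestamp>.log.json'):
    print(f'mode={mode} iter={it} mIoU={miou}')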

xiexinch commented 3 years ago

Hi @yan-hao-tian, sorry for the late reply. Could you show the settings of your evaluation field and your training command? In your 0.14.0 log, the eval_iter_num field is missing; I want to check whether the evaluation code was executed.

yan-hao-tian commented 3 years ago

The training command is:

CUDA_VISIBLE_DEVICES=0,1 ./tools/dist_train.sh configs/deeplabv3/deeplabv3_r50-d8_512x1024_40k_cityscapes.py 2

For the settings of the evaluation field, is the entire config dump (from the log) okay?

norm_cfg = dict(type='SyncBN', requires_grad=True)
model = dict(
    type='EncoderDecoder',
    pretrained='open-mmlab://resnet50_v1c',
    backbone=dict(
        type='ResNetV1c',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        dilations=(1, 1, 2, 4),
        strides=(1, 2, 1, 1),
        norm_cfg=dict(type='SyncBN', requires_grad=True),
        norm_eval=False,
        style='pytorch',
        contract_dilation=True),
    decode_head=dict(
        type='ASPPHead',
        in_channels=2048,
        in_index=3,
        channels=512,
        dilations=(1, 12, 24, 36),
        dropout_ratio=0.1,
        num_classes=19,
        norm_cfg=dict(type='SyncBN', requires_grad=True),
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
    auxiliary_head=dict(
        type='FCNHead',
        in_channels=1024,
        in_index=2,
        channels=256,
        num_convs=1,
        concat_input=False,
        dropout_ratio=0.1,
        num_classes=19,
        norm_cfg=dict(type='SyncBN', requires_grad=True),
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
    train_cfg=dict(),
    test_cfg=dict(mode='whole'))
dataset_type = 'CityscapesDataset'
data_root = 'data/cityscapes/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (512, 1024)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=(2048, 1024), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=(512, 1024), cat_max_ratio=0.75),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size=(512, 1024), pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2048, 1024),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type='CityscapesDataset',
        data_root='data/cityscapes/',
        img_dir='leftImg8bit/train',
        ann_dir='gtFine/train',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations'),
            dict(
                type='Resize', img_scale=(2048, 1024), ratio_range=(0.5, 2.0)),
            dict(type='RandomCrop', crop_size=(512, 1024), cat_max_ratio=0.75),
            dict(type='RandomFlip', prob=0.5),
            dict(type='PhotoMetricDistortion'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size=(512, 1024), pad_val=0, seg_pad_val=255),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_semantic_seg'])
        ]),
    val=dict(
        type='CityscapesDataset',
        data_root='data/cityscapes/',
        img_dir='leftImg8bit/val',
        ann_dir='gtFine/val',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(2048, 1024),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='CityscapesDataset',
        data_root='data/cityscapes/',
        img_dir='leftImg8bit/val',
        ann_dir='gtFine/val',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(2048, 1024),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
log_config = dict(
    interval=50, hooks=[dict(type='TextLoggerHook', by_epoch=False)])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(policy='poly', power=0.9, min_lr=0.0001, by_epoch=False)
runner = dict(type='IterBasedRunner', max_iters=40000)
checkpoint_config = dict(by_epoch=False, interval=4000)
evaluation = dict(interval=4000, metric='mIoU')
work_dir = './work_dirs/deeplabv3_r50-d8_512x1024_40k_cityscapes'
gpu_ids = range(0, 1)
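As an independent cross-check of the training-time numbers, a saved checkpoint can also be evaluated offline with tools/test.py from the 0.x series; the checkpoint filename below is an assumption based on the interval=4000 checkpoint_config in this config:

python tools/test.py configs/deeplabv3/deeplabv3_r50-d8_512x1024_40k_cityscapes.py work_dirs/deeplabv3_r50-d8_512x1024_40k_cityscapes/iter_4000.pth --eval mIoU

If the offline mIoU disagrees with what the training-time validation printed, the issue is in the evaluation/logging hook rather than in the model itself.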
xiexinch commented 2 years ago

Hi @yan-hao-tian, MMSegmentation 1.x has been released, and this problem is solved there.
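For reference, validation in 1.x is configured through an explicit loop and evaluator instead of the old evaluation dict. A minimal sketch of the corresponding fields, following the style of the released 1.x configs:

# 1.x-style validation setup (field names as in the released 1.x configs).
train_cfg = dict(type='IterBasedTrainLoop', max_iters=40000, val_interval=4000)
val_cfg = dict(type='ValLoop')
val_evaluator = dict(type='IoUMetric', iou_metrics=['mIoU'])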

Closing the issue, as there has been no activity for a while. We hope your issue has been resolved. If not, please feel free to open a new one.