open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0

Inconsistent evaluation results #2594

Open GuoSicen opened 1 year ago

GuoSicen commented 1 year ago

I use "tools/test.py --eval" to test the test set of the results, and I also save the predicted pictures after the test, then they are compared with the ground truth to get the iou, fscore result. Two result is not consistent, if my config file on the test set is wrong, which due to the different result, the config file is as follows.

norm_cfg = dict(type='BN', requires_grad=True)
model = dict(
    type='EncoderDecoder',
    pretrained='open-mmlab://resnet50_v1c',
    backbone=dict(
        type='ResNetV1c',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        dilations=(1, 1, 2, 4),
        strides=(1, 2, 1, 1),
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=False,
        style='pytorch',
        contract_dilation=True),
    decode_head=dict(
        type='ANNHead',
        in_channels=[1024, 2048],
        in_index=[2, 3],
        channels=512,
        project_channels=256,
        query_scales=(1, ),
        key_pool_scales=(1, 3, 6, 8),
        dropout_ratio=0.1,
        num_classes=21,
        norm_cfg=dict(type='BN', requires_grad=True),
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
    auxiliary_head=dict(
        type='FCNHead',
        in_channels=1024,
        in_index=2,
        channels=256,
        num_convs=1,
        concat_input=False,
        dropout_ratio=0.1,
        num_classes=21,
        norm_cfg=dict(type='BN', requires_grad=True),
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
    train_cfg=dict(),
    test_cfg=dict(mode='whole'))
dataset_type = 'PascalVOCDataset'
data_root = 'data/VOC'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (512, 512)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2048, 512),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=4,
    train=dict(
        type='PascalVOCDataset',
        data_root='data/VOC',
        img_dir='JPEGImages',
        ann_dir='SegmentationClassPNG',#
        split=['ImageSets/Segmentation/train.txt'],#
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations'),
            dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
            dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
            dict(type='RandomFlip', prob=0.5),
            dict(type='PhotoMetricDistortion'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_semantic_seg'])
        ]),
    val=dict(
        type='PascalVOCDataset',
        data_root='data/VOC',#
        img_dir='JPEGImages',
        ann_dir='SegmentationClassPNG',#
        split='ImageSets/Segmentation/val.txt',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(2048, 512),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='PascalVOCDataset',
        data_root='data/VOC',#
        img_dir='JPEGImages',
        ann_dir='SegmentationClassPNG',#
        split='ImageSets/Segmentation/test.txt',#
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(2048, 512),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=True),
        dict(type='TensorboardLoggerHook')
    ])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(policy='poly', power=0.9, min_lr=0.0001, by_epoch=False)
runner = dict(type='IterBasedRunner', max_iters=4000)
checkpoint_config = dict(by_epoch=False, interval=100)
evaluation = dict(interval=100, metric='mIoU', pre_eval=True)
work_dir = './work_dirs/ann_r50-d8_512x512_20k_voc12aug/pretrain2'
gpu_ids = [0]
auto_resume = False
Rowan-L commented 1 year ago

I also encountered this problem and later found out that it was due to the way IoU is calculated. mmseg computes the IoU over the whole dataset, not as the average of the per-image IoUs.

GuoSicen commented 1 year ago

If there are 1000 images, 600 are used for training, 100 for validation, and 300 for testing, what does "for the whole dataset" mean here? Does it mean dividing by the total number of images, not the number of images in each set? I can't understand. Could you please explain it in detail? Thank you. And could you please share a suitable solution?

Rowan-L commented 1 year ago

If there are 1000 images, 600 are used for training, 100 for validation, and 300 for testing, what does "for the whole dataset" mean here? Does it mean dividing by the total number of images, not the number of images in each set? I can't understand. Could you please explain it in detail? Thank you. And could you please share a suitable solution?

I have noticed from some open-source projects that two ways of calculating these metrics exist. The first, used by mmseg, is to compute the intersection and union of labels and predictions over the entire test set; you can imagine stitching the whole test set into one big image and then computing the IoU of that big image. The second, which I have seen in other projects, is to compute the IoU of each individual sample in the test set, then sum the per-sample IoUs and divide by the number of samples. These two calculations lead to different results.
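To make the difference concrete, here is a minimal NumPy sketch of the two conventions (hypothetical helper names, not mmseg's actual implementation; preds and gts are lists of integer label maps):

import numpy as np

def miou_dataset_level(preds, gts, num_classes, ignore_index=255):
    """mmseg-style: accumulate intersection/union over the whole set, then divide."""
    inter = np.zeros(num_classes)
    union = np.zeros(num_classes)
    for pred, gt in zip(preds, gts):
        mask = gt != ignore_index
        pred, gt = pred[mask], gt[mask]
        for c in range(num_classes):
            inter[c] += np.sum((pred == c) & (gt == c))
            union[c] += np.sum((pred == c) | (gt == c))
    # classes absent from the whole set give nan and are skipped by nanmean
    return np.nanmean(inter / union)

def miou_image_averaged(preds, gts, num_classes, ignore_index=255):
    """Alternative: compute a per-image mIoU first, then average over images."""
    scores = []
    for pred, gt in zip(preds, gts):
        mask = gt != ignore_index
        pred, gt = pred[mask], gt[mask]
        ious = []
        for c in range(num_classes):
            union = np.sum((pred == c) | (gt == c))
            if union > 0:
                ious.append(np.sum((pred == c) & (gt == c)) / union)
        scores.append(np.mean(ious))
    return np.mean(scores)

In general the two numbers differ, because the dataset-level version weights every pixel equally while the per-image version weights every image equally.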

GuoSicen commented 1 year ago

But before calculating the metrics, the images are resized to the same size (see the config: img_scale=(2048, 512)). In this situation, the two ways you mentioned above will lead to the same results. I don't understand.

Rowan-L commented 1 year ago

But before calculating the metrics, the images are resized to the same size (see the config: img_scale=(2048, 512)). In this situation, the two ways you mentioned above will lead to the same results. I don't understand.

In fact, the two approaches yield different results, or else we are not talking about the same issue.

GuoSicen commented 1 year ago

When all images in the dataset have the same size, the results of the two methods you mentioned should be the same. Because the pipeline already contains code that resizes the images, I don't think that is the reason for the different results.

GuoSicen commented 1 year ago

@xiexinch Could you please help me take a look at where the problem is? I'm still a bit confused.

GuoSicen commented 1 year ago

@xiexinch @Rowan-L I think I may have found the problem: Resize is followed by the parameter keep_ratio=True. If it is True, the img_scale passed to Resize is not the final resized size but a max/min range; see https://zhuanlan.zhihu.com/p/381117525 . But there is one question: if that is the case, are the test results still correct? And if they are correct, how can I get the resized prediction images instead of predictions at the original image size?
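To make that concrete, here is a small standalone sketch of how the scale factor is chosen when keep_ratio=True (it mirrors mmcv-style rescaling, where img_scale acts as an upper bound of (long edge, short edge) rather than a literal output size; the example numbers are only illustrative):

def rescaled_size(w, h, img_scale=(2048, 512)):
    # With keep_ratio=True, img_scale is treated as (max long edge, max short edge)
    # and the aspect ratio is preserved.
    max_long, max_short = max(img_scale), min(img_scale)
    scale = min(max_long / max(w, h), max_short / min(w, h))
    return int(w * scale + 0.5), int(h * scale + 0.5)

# e.g. a 500x375 VOC image is scaled by min(2048/500, 512/375) ≈ 1.365,
# so it becomes roughly 683x512 rather than 2048x512.
print(rescaled_size(500, 375))  # -> (683, 512)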

MeowZheng commented 1 year ago

keep_ratio=True just keeps the length-to-width ratio the same as before the resize. The prediction of the model will be resized back to the original size of the image. If you want to test with the original image, just modify the test pipeline as follows:

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='Collect', keys=['img'])
]
GuoSicen commented 1 year ago

With the same config file, and regardless of the keep_ratio=True behavior, the result of the evaluation using tools/test.py should be the same as the IoU computed between the saved predicted images and the ground truth, so why is it not the same?

MeowZheng commented 1 year ago

Would you like to tell me the specific mIoU values that differ? Please check that there is no randomness in the model, that the model is in eval mode, and that the checkpoint you use is consistent.
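If it helps, here is a minimal sketch of that kind of sanity check, assuming the mmseg 0.x Python API (init_segmentor / inference_segmentor); the config, checkpoint, and image paths below are placeholders:

import torch
from mmseg.apis import init_segmentor, inference_segmentor

config_file = 'ann_voc_config.py'  # placeholder: the config posted above, saved to a file
checkpoint_file = './work_dirs/ann_r50-d8_512x512_20k_voc12aug/pretrain2/latest.pth'  # placeholder

# Use exactly the same checkpoint as tools/test.py and keep the model in eval mode,
# so BatchNorm statistics and dropout behave deterministically.
model = init_segmentor(config_file, checkpoint_file, device='cuda:0')
model.eval()
torch.backends.cudnn.benchmark = False  # avoid nondeterministic autotuning while debugging

result = inference_segmentor(model, 'example.jpg')  # placeholder image; returns a list of label maps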

GuoSicen commented 1 year ago

The IoU result obtained using the test.py file is 0.7593. The result obtained by running the model, saving the predicted images (in PNG format), and comparing them with the ground truth is 0.6960. There should be no randomness in the model, as the same result is obtained several times using the test.py file.
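For what it's worth, here is a hedged sketch of one way to redo the offline comparison so that it follows the same dataset-level convention as test.py: accumulate a confusion matrix over all saved PNGs, ignore the 255 label, and only divide at the end. The directories and file naming are placeholders, and it assumes the predictions were saved as single-channel label maps (not color overlays) at the original image size:

import os
import numpy as np
from PIL import Image

pred_dir = 'preds'                        # placeholder: directory of saved prediction PNGs
gt_dir = 'data/VOC/SegmentationClassPNG'  # ground-truth directory from the config
num_classes, ignore_index = 21, 255

hist = np.zeros((num_classes, num_classes), dtype=np.int64)
for name in sorted(os.listdir(pred_dir)):
    pred = np.array(Image.open(os.path.join(pred_dir, name))).astype(np.int64)
    gt = np.array(Image.open(os.path.join(gt_dir, name))).astype(np.int64)
    assert pred.shape == gt.shape, f'size mismatch for {name}'
    mask = gt != ignore_index  # 255 pixels are excluded from evaluation
    hist += np.bincount(num_classes * gt[mask] + pred[mask],
                        minlength=num_classes ** 2).reshape(num_classes, num_classes)

inter = np.diag(hist)
union = hist.sum(axis=0) + hist.sum(axis=1) - inter
print('mIoU:', np.nanmean(inter / union))

If this still gives a number close to 0.6960 while test.py reports 0.7593, the remaining likely suspects are how the prediction PNGs were saved (palette/color visualization vs. raw labels) and whether they are at the same resolution as the ground truth.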

FlorinAndrei commented 1 year ago

@GuoSicen I have not examined your report in detail, so I can't be sure, but the magnitude of the difference suggests it may be related to this:

https://github.com/open-mmlab/mmsegmentation/issues/2655