'loss log variables are different across GPUs!' HowTo prevent?

julled commented 2 years ago

Hi,

I am using mmdet v2.20. I have a custom COCO dataset of about 2k images with ground-truth (GT) labels. I am training a Faster R-CNN, and when I use only images with GT labels, training runs smoothly. When I add about 0.5k images without GT labels, training aborts at random epochs with the error 'loss log variables are different across GPUs!'. The check that raises this message was added to prevent GPU hangs, see https://github.com/open-mmlab/mmsegmentation/pull/1035.

Can you give me a hint how to prevent this?

Is this because a GPU sometimes produces no loss when its batch contains only images without GT labels?
Would an unshuffled dataset help, ordered so that every batch always contains both an image with GT labels and one without?
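
For context, here is a minimal sketch of how such a consistency check can trip. It is a simulation only: plain dicts stand in for the per-GPU loss dicts, `sum()` stands in for the all-reduce across ranks, and the key names and values are made up for illustration. As far as I understand it, the real check likewise compares only the number of logged entries per rank.

```python
# Simulation only: no GPUs or torch.distributed involved.

def log_vars_consistent(per_rank_log_vars):
    """Return True if every rank reports the same number of loss entries."""
    world_size = len(per_rank_log_vars)
    # Stand-in for an all-reduce (SUM) of len(log_vars) over all ranks.
    total_entries = sum(len(v) for v in per_rank_log_vars)
    # Each rank compares the global total with its own count * world_size.
    return all(total_entries == len(v) * world_size for v in per_rank_log_vars)


# Hypothetical loss dicts; the keys and values are illustrative.
full = {"loss_rpn_cls": 0.7, "loss_rpn_bbox": 0.1,
        "loss_cls": 0.5, "acc": 90.0, "loss_bbox": 0.3}
missing = {"loss_rpn_cls": 0.7, "loss_rpn_bbox": 0.0,
           "loss_cls": 0.5, "acc": 100.0}  # one entry fewer

print(log_vars_consistent([full, full, full, full]))     # True
print(log_vars_consistent([full, full, missing, full]))  # False
```

A single rank with one entry fewer is enough to fail the comparison on every rank, which would match the error appearing only at random iterations.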

Here is my config:


interval = 2
checkpoint_config = dict(interval=2)
evaluation = dict(interval=2, metric='bbox', iou_thrs=[0.1, 0.5, 0.9])
log_config = dict(
    interval=20,
    hooks=[dict(type='TextLoggerHook'),
           dict(type='TensorboardLoggerHook')])
custom_hooks = [dict(type='NumClassCheckHook')]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
dataset_type = 'CocoDataset'
classes = ('boat', )
img_norm_cfg = dict(
    mean=[91.622, 120.867, 166.518], std=[51.634, 39.997, 42.519], to_rgb=True)
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1640, 1232),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[91.622, 120.867, 166.518],
                std=[51.634, 39.997, 42.519],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1640, 1232), keep_ratio=True),
    dict(
        type='RandomFlip',
        flip_ratio=0.5,
        direction=['horizontal', 'vertical', 'diagonal']),
    dict(
        type='Normalize',
        mean=[91.622, 120.867, 166.518],
        std=[51.634, 39.997, 42.519],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]

valdataset = dict(
    type='CocoDataset',
    data_root='/data',
    classes=('boat', ),
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(
            type='MultiScaleFlipAug',
            img_scale=(1640, 1232),
            flip=False,
            transforms=[
                dict(type='Resize', keep_ratio=True),
                dict(type='RandomFlip'),
                dict(
                    type='Normalize',
                    mean=[91.622, 120.867, 166.518],
                    std=[51.634, 39.997, 42.519],
                    to_rgb=True),
                dict(type='Pad', size_divisor=32),
                dict(type='ImageToTensor', keys=['img']),
                dict(type='Collect', keys=['img'])
            ])
    ],
    ann_file=[
        '.json'
    ],
    filter_empty_gt=False)
testdataset = dict(
    type='CocoDataset',
    data_root='/data',
    classes=('boat', ),
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(
            type='MultiScaleFlipAug',
            img_scale=(1640, 1232),
            flip=False,
            transforms=[
                dict(type='Resize', keep_ratio=True),
                dict(type='RandomFlip'),
                dict(
                    type='Normalize',
                    mean=[91.622, 120.867, 166.518],
                    std=[51.634, 39.997, 42.519],
                    to_rgb=True),
                dict(type='Pad', size_divisor=32),
                dict(type='ImageToTensor', keys=['img']),
                dict(type='Collect', keys=['img'])
            ])
    ],
    ann_file=[
        '.....json'
    ],
    filter_empty_gt=False)
traindataset = dict(
    type='CocoDataset',
    data_root='/data',
    classes=('boat', ),
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(type='LoadAnnotations', with_bbox=True),
        dict(type='Resize', img_scale=(1640, 1232), keep_ratio=True),
        dict(
            type='RandomFlip',
            flip_ratio=0.5,
            direction=['horizontal', 'vertical', 'diagonal']),
        dict(
            type='Normalize',
            mean=[91.622, 120.867, 166.518],
            std=[51.634, 39.997, 42.519],
            to_rgb=True),
        dict(type='Pad', size_divisor=32),
        dict(type='DefaultFormatBundle'),
        dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
    ],
    ann_file=[
        '....json'
    ],
    filter_empty_gt=False)

data = dict(
    samples_per_gpu=2,
    workers_per_gpu=4,
    shuffle=True,
    train=dict(
        type='CocoDataset',
        data_root='/data',
        classes=('boat', ),
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(type='Resize', img_scale=(1640, 1232), keep_ratio=True),
            dict(
                type='RandomFlip',
                flip_ratio=0.5,
                direction=['horizontal', 'vertical', 'diagonal']),
            dict(
                type='Normalize',
                mean=[91.622, 120.867, 166.518],
                std=[51.634, 39.997, 42.519],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ],
        ann_file=[
            '...json'
        ],
        filter_empty_gt=False),
    test=dict(
        type='CocoDataset',
        data_root='/data',
        classes=('boat', ),
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1640, 1232),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[91.622, 120.867, 166.518],
                        std=[51.634, 39.997, 42.519],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        ann_file=
        '.json',
        filter_empty_gt=False),
    val=dict(
        type='CocoDataset',
        data_root='/data',
        classes=('boat', ),
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1640, 1232),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[91.622, 120.867, 166.518],
                        std=[51.634, 39.997, 42.519],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        ann_file=[
            '.json'
        ],
        filter_empty_gt=False))
model = dict(
    type='FasterRCNN',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5),
    rpn_head=dict(
        type='RPNHead',
        in_channels=256,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            scales=[1.5, 3.0, 6.0],
            ratios=[0.5, 1.0, 2.0],
            strides=[4, 8, 16, 32, 64]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[0.0, 0.0, 0.0, 0.0],
            target_stds=[1.0, 1.0, 1.0, 1.0]),
        loss_cls=dict(
            type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
        loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
    roi_head=dict(
        type='StandardRoIHead',
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        bbox_head=dict(
            type='Shared2FCBBoxHead',
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=1,
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[0.0, 0.0, 0.0, 0.0],
                target_stds=[0.1, 0.1, 0.2, 0.2]),
            reg_class_agnostic=False,
            loss_cls=dict(
                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
            loss_bbox=dict(type='L1Loss', loss_weight=1.0))),
    train_cfg=dict(
        rpn=dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.7,
                neg_iou_thr=0.3,
                min_pos_iou=0.3,
                match_low_quality=True,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=256,
                pos_fraction=0.5,
                neg_pos_ub=-1,
                add_gt_as_proposals=False),
            allowed_border=-1,
            pos_weight=-1,
            debug=False),
        rpn_proposal=dict(
            nms_pre=2000,
            max_per_img=1000,
            nms=dict(type='nms', iou_threshold=0.7),
            min_bbox_size=0),
        rcnn=dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.5,
                neg_iou_thr=0.5,
                min_pos_iou=0.5,
                match_low_quality=False,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=512,
                pos_fraction=0.25,
                neg_pos_ub=-1,
                add_gt_as_proposals=True),
            pos_weight=-1,
            debug=False)),
    test_cfg=dict(
        rpn=dict(
            nms_pre=1000,
            max_per_img=1000,
            nms=dict(type='nms', iou_threshold=0.7),
            min_bbox_size=0),
        rcnn=dict(
            score_thr=0.01,
            nms=dict(type='nms', iou_threshold=0.7),
            max_per_img=100)))
optimizer = dict(
    type='SGD',
    lr=0.015,
    momentum=0.9,
    weight_decay=0.0005,
    nesterov=True,
    paramwise_cfg=dict(norm_decay_mult=0.0, bias_decay_mult=0.0))
optimizer_config = dict(grad_clip=None)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[35, 45])
runner = dict(type='EpochBasedRunner', max_epochs=50)
work_dir = './work_dirs/train'
auto_resume = False
gpu_ids = range(0, 4)

RangiLyu commented 2 years ago

Try returning a zero loss for classification and regression when there is no GT in the image. You need to keep the keys in the loss dict the same for normal images and images without GT.
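
A minimal sketch of the "same keys" idea, not mmdetection code: the helper `with_fixed_keys` and the key list are hypothetical, and plain floats stand in for loss tensors. In real training code the fill value would presumably be a zero tensor that is still connected to the network outputs (for example a prediction tensor summed and multiplied by 0) and not a Python float; that detail is an assumption here.

```python
# Hypothetical key list; adapt it to the losses your model actually returns.
EXPECTED_KEYS = ("loss_rpn_cls", "loss_rpn_bbox", "loss_cls", "loss_bbox")


def with_fixed_keys(losses, expected_keys=EXPECTED_KEYS, zero=0.0):
    """Return a copy of `losses` that contains every expected key.

    Entries that are missing (e.g. for an image without GT) are filled
    with `zero`, so every GPU logs the same set of loss variables.
    """
    fixed = dict(losses)
    for key in expected_keys:
        fixed.setdefault(key, zero)
    return fixed


normal = {"loss_rpn_cls": 0.7, "loss_rpn_bbox": 0.1,
          "loss_cls": 0.5, "loss_bbox": 0.3}
no_gt = {"loss_rpn_cls": 0.4, "loss_rpn_bbox": 0.0, "loss_cls": 0.2}

# Both dicts now expose the same keys, whatever the image content.
assert with_fixed_keys(no_gt).keys() == with_fixed_keys(normal).keys()
```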

julled commented 2 years ago

Hi @RangiLyu and thanks for your reply!

I tried setting filter_empty_gt=False in my datasets to keep the examples without GT. Isn't this option supposed to guarantee this?

wywywy01 commented 2 years ago

> Try returning a zero loss for classification and regression when there is no GT in the image. You need to keep the keys in the loss dict the same for normal images and images without GT.

Why not define all the loss terms every time and have each loss function return 0 when there is no GT? For example: https://github.com/open-mmlab/mmdetection/blob/HEAD/mmdet/models/losses/smooth_l1_loss.py#L47

mZhenz commented 2 years ago

> Hi @RangiLyu and thanks for your reply!
>
> I tried setting filter_empty_gt=False in my datasets to keep the examples without GT. Isn't this option supposed to guarantee this?

I have the same question. Did adding filter_empty_gt=False solve the problem for you?

julled commented 2 years ago

> > Hi @RangiLyu and thanks for your reply! I tried setting filter_empty_gt=False in my datasets to keep the examples without GT. Isn't this option supposed to guarantee this?
>
> I have the same question. Did adding filter_empty_gt=False solve the problem for you?

No, this didn't help. I think someone needs to implement the changes proposed by @RangiLyu or @wywywy01.

julled commented 2 years ago

I thought about this again: setting the loss to 0 would mean there is no benefit from using the images without GT.

The idea is to use those empty images to reduce false positives, so we need some kind of loss for them.
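
A small worked example of where such a loss could come from, under stated assumptions. In a two-stage detector the classification branch is also trained on negative (background) proposals, so an image without GT can still contribute a classification loss while the box-regression term has no positives to regress. The probabilities below are made up, and the class layout (index 0 = 'boat', index 1 = background) is an assumption for illustration.

```python
import math

BACKGROUND = 1  # assumed layout: index 0 = 'boat', index 1 = background


def cross_entropy(probs, label):
    """Cross-entropy of one prediction given its target class index."""
    return -math.log(probs[label])


# Predicted (boat, background) probabilities for three sampled proposals
# on an image without GT boxes. Every target is background.
proposals = [(0.05, 0.95), (0.10, 0.90), (0.80, 0.20)]  # last: false positive

losses = [cross_entropy(p, BACKGROUND) for p in proposals]
loss_cls = sum(losses) / len(losses)
loss_bbox = 0.0  # no positive proposals, so nothing to regress

# The confident false positive contributes most of the classification loss.
assert losses[2] > losses[0] + losses[1]
print(f"loss_cls={loss_cls:.3f}, loss_bbox={loss_bbox:.3f}")
```

So only the regression term needs to be zero on empty images; the classification term can still penalize false positives.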

yiyexy commented 1 year ago

I also hit this problem after 2 epochs. Can anyone fix it?

julled commented 1 year ago

> I also hit this problem after 2 epochs. Can anyone fix it?

It could be that your data is corrupt; check this. That may be what solved my problem.
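
What "corrupt" means here is not specified. As a rough sketch, the helper below (a hypothetical name, not part of mmdetection) scans a COCO-style annotation dict for a few common problems: annotations that point to unknown images, boxes with non-positive size, and boxes outside the image. It assumes the usual COCO layout with `bbox = [x, y, width, height]`.

```python
def find_annotation_problems(coco):
    """Return a list of (annotation_id, reason) for suspicious annotations."""
    sizes = {img["id"]: (img["width"], img["height"]) for img in coco["images"]}
    problems = []
    for ann in coco["annotations"]:
        if ann["image_id"] not in sizes:
            problems.append((ann["id"], "unknown image_id"))
            continue
        x, y, w, h = ann["bbox"]
        img_w, img_h = sizes[ann["image_id"]]
        if w <= 0 or h <= 0:
            problems.append((ann["id"], "non-positive box size"))
        elif x < 0 or y < 0 or x + w > img_w or y + h > img_h:
            problems.append((ann["id"], "box outside image"))
    return problems


# Tiny made-up example: one valid box and three broken ones.
example = {
    "images": [{"id": 1, "width": 100, "height": 80}],
    "annotations": [
        {"id": 10, "image_id": 1, "bbox": [10, 10, 20, 20]},  # fine
        {"id": 11, "image_id": 1, "bbox": [10, 10, 0, 20]},   # zero width
        {"id": 12, "image_id": 1, "bbox": [90, 10, 20, 20]},  # leaves image
        {"id": 13, "image_id": 2, "bbox": [0, 0, 5, 5]},      # unknown image
    ],
}
for ann_id, reason in find_annotation_problems(example):
    print(ann_id, reason)
```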

liutianen0904 commented 1 year ago

@RangiLyu, is there a solution for this now?

amor-volastra commented 1 year ago

Is there any update on this? @julled, were you able to resolve this issue?
I ran into the same issue using PyTorch 1.13.0+cu117 and mmdetection 2.27.0. The majority of images in my dataset don't include any GT bboxes. The code runs fine with the same data on a single GPU, but I get the error when running on a 4-GPU cluster. I also tried find_unused_parameters=True, but it didn't help.

julled commented 1 year ago

@amor-volastra Not really, but now that you mention it, I remember that my data also had a lot of examples without any GT bboxes. I actually resolved my problem by lowering the number of examples without GT bboxes. So if this is possible for you, you could give it a try.

It would be better if mmdetection could actually handle this, though. For me, without deeper knowledge, this is a hard bug to track down.
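
The workaround of lowering the number of examples without GT bboxes could look roughly like this. It is a sketch that operates on a COCO-style dict already loaded from the annotation file; the function name and the `max_empty_fraction` parameter are made up for illustration.

```python
import random


def limit_empty_images(coco, max_empty_fraction, seed=0):
    """Keep all annotated images and only a limited number of empty ones.

    `max_empty_fraction` is relative to the number of annotated images,
    e.g. 0.25 keeps at most one empty image per four annotated ones.
    """
    annotated_ids = {ann["image_id"] for ann in coco["annotations"]}
    annotated = [img for img in coco["images"] if img["id"] in annotated_ids]
    empty = [img for img in coco["images"] if img["id"] not in annotated_ids]

    keep = min(len(empty), int(len(annotated) * max_empty_fraction))
    kept_empty = random.Random(seed).sample(empty, keep)

    result = dict(coco)  # shallow copy; annotations stay unchanged
    result["images"] = annotated + kept_empty
    return result


# Made-up example: images 1-4 have one box each, images 5-10 are empty.
coco = {
    "images": [{"id": i} for i in range(1, 11)],
    "annotations": [{"id": 100 + i, "image_id": i, "bbox": [0, 0, 10, 10]}
                    for i in range(1, 5)],
}
reduced = limit_empty_images(coco, max_empty_fraction=0.25)
print(len(reduced["images"]))  # 5
```

The reduced dict can then be written back to a JSON file and used as `ann_file`.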

yimeng436 commented 8 months ago

Hello, I am a novice. My teacher asked me to try calculating the loss for each GT box in the image separately and then summing the results. I ran into this problem when training on multiple GPUs: I only added a loop over gt_bbox and gt_list, and then this error occurred. Could you please tell me how to solve it?
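
It is not clear from the description how the per-GT losses are stored. One possible cause, shown here purely as an assumption, is that the loop adds one entry per GT box to the loss dict, so the number of entries depends on how many boxes each GPU sees. The sketch below uses plain floats in place of loss tensors and hypothetical key names.

```python
def per_gt_losses_variable(gt_losses):
    """One dict entry per GT box: the dict length varies with the image."""
    return {f"loss_gt_{i}": value for i, value in enumerate(gt_losses)}


def per_gt_losses_fixed(gt_losses):
    """Sum all per-GT values into a single entry, even for zero GT boxes."""
    return {"loss_gt": sum(gt_losses, 0.0)}


gpu0 = [0.4, 0.1, 0.3]  # three GT boxes on this GPU
gpu1 = []               # no GT boxes on this GPU

# Variable keys: the two GPUs log a different number of loss variables.
print(len(per_gt_losses_variable(gpu0)), len(per_gt_losses_variable(gpu1)))  # 3 0

# Fixed key: both GPUs log exactly one entry.
print(len(per_gt_losses_fixed(gpu0)), len(per_gt_losses_fixed(gpu1)))  # 1 1
```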

open-mmlab / mmdetection

'loss log variables are different across GPUs!' HowTo prevent? #7116