open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Unbalanced GPU memory usage when training #2665

Closed Tianlock closed 4 years ago

Tianlock commented 4 years ago

Hello, I ran into a problem when training Faster R-CNN. Training config: 8x 2080Ti, 2 imgs per GPU, and a 600:800 img scale. I find that 1 GPU may use 11 GB of memory while the others use 5 GB, so I think the GPUs are not being used fully. But if I increase the number of imgs per GPU to 4, or increase the img scale, an OOM error occurs on 1 GPU while the other 7 GPUs still have enough memory. Could you help me with this question? Thanks.

Tianlock commented 4 years ago

The memory reported in the log is different from the nvidia-smi memory usage.
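
That difference is expected: the logged value is (as far as I know) PyTorch's peak allocated memory, while nvidia-smi additionally counts the CUDA context and the memory PyTorch's caching allocator has reserved but not freed, so nvidia-smi is normally higher. A minimal sketch, independent of mmdetection, for inspecting both views on one GPU:

```python
import torch

def report_gpu_memory(device=0):
    # PyTorch's own accounting; nvidia-smi additionally includes the CUDA
    # context and memory the caching allocator holds but has not released.
    mb = 1024 ** 2
    print(f"allocated:     {torch.cuda.memory_allocated(device) / mb:.0f} MB")
    print(f"max allocated: {torch.cuda.max_memory_allocated(device) / mb:.0f} MB")
    print(f"reserved:      {torch.cuda.memory_reserved(device) / mb:.0f} MB")

if torch.cuda.is_available():
    x = torch.randn(2048, 2048, device="cuda")  # allocate something so the numbers are non-zero
    report_gpu_memory()
```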

ZwwWayne commented 4 years ago

Hi @Tianlock, could you provide more details about your training? For example, do you use slurm_train.sh or dist_train.sh? Which config are you using? We do not have 2080Ti GPUs for now, but we will try to run the same config on 1080Ti to see whether we meet the same problem.

Tianlock commented 4 years ago

> Hi @Tianlock, could you provide more details about your training? For example, do you use slurm_train.sh or dist_train.sh? Which config are you using?

Hello @ZwwWayne, I use dist_train.sh with the faster_rcnn_ohem_r50_fpn_1x.py config.

Tianlock commented 4 years ago

And I found another interesting thing: the memory in the log when training Faster R-CNN with R-50 is 10098, and with R-101 it is 10056. The memory cost of R-50 is the same as that of R-101.

ZwwWayne commented 4 years ago

OK, and can you try faster_rcnn_r50_fpn_1x.py to see whether the memory usage is normal? On 1080Ti, faster_rcnn_r50_fpn_1x.py can run with a batch size of 4, so if you can do that too, then we may be able to locate the bug, which should be in OHEM.

Tianlock commented 4 years ago

> OK, and can you try faster_rcnn_r50_fpn_1x.py to see whether the memory usage is normal? On 1080Ti, faster_rcnn_r50_fpn_1x.py can run with a batch size of 4, so if you can do that too, then we may be able to locate the bug, which should be in OHEM.

@ZwwWayne I have tried faster_rcnn_r50_fpn_1x.py with 2 imgs per GPU, and the usage is about 4-5 GB on 7 GPUs and 11 GB on 1 GPU. I tried several times: with 2 imgs per GPU it hits OOM about 50% of the time, and with 4 imgs per GPU it hits OOM 100% of the time. The OOM always occurs on 1 GPU.
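
One way to confirm which rank is the outlier is to print every rank's peak allocated memory from inside the training process. A hypothetical debugging helper (not part of mmdetection), assuming the default process group has already been initialized by dist_train.sh:

```python
import torch
import torch.distributed as dist

def log_peak_memory_per_rank(prefix=""):
    """Print this rank's peak GPU memory; call it every few hundred iterations."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    peak_mb = torch.cuda.max_memory_allocated() / 1024 ** 2
    print(f"{prefix}[rank {rank}] peak allocated: {peak_mb:.0f} MB")
```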

Tianlock commented 4 years ago

@ZwwWayne Sometimes I can't even train faster_rcnn_r50_fpn_1x with 2 imgs per GPU and an 800:1000 img scale. The OOM is very strange.

Tianlock commented 4 years ago

@ZwwWayne I have also tried moving the GTs to CPU, so too many GTs can't be the reason.

ZwwWayne commented 4 years ago

OK, thanks for your report. I am trying to reproduce this phenomenon on my machine.

ZwwWayne commented 4 years ago

Hi @Tianlock, what pre-trained model are you using? Is it from torchvision or from mmcv? Sometimes the pre-trained models are GPU models, which cost extra GPU memory when loading them.
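
As a side note, torch.load without map_location restores tensors onto the device they were saved from, so a checkpoint saved from GPU tensors can silently occupy memory on GPU 0. A minimal sketch for sanity-checking a checkpoint file on CPU (the file names are placeholders, and this is not mmdetection's own loading code):

```python
import torch

# Load the checkpoint onto CPU so it does not touch GPU 0 while the model is built.
ckpt = torch.load("resnet50_pretrained.pth", map_location="cpu")
state_dict = ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt

# After map_location="cpu", every tensor should report device 'cpu'.
print({v.device.type for v in state_dict.values() if torch.is_tensor(v)})

# Optionally re-save a CPU-only copy of the weights.
torch.save(state_dict, "resnet50_pretrained_cpu.pth")
```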

Tianlock commented 4 years ago

> Hi @Tianlock, what pre-trained model are you using? Is it from torchvision or from mmcv? Sometimes the pre-trained models are GPU models, which cost extra GPU memory when loading them.

@ZwwWayne I use the pre-trained model from torchvision. Could that be the reason for the OOM?

ZwwWayne commented 4 years ago

You can check that.

Tianlock commented 4 years ago

> You can check that.

Hello @ZwwWayne, I have tried the pre-trained model from open-mmlab and it also hits the OOM error. The report is: RuntimeError: CUDA out of memory. Tried to allocate 8.27 GiB (GPU 0; 10.76 GiB total capacity; 1.61 GiB already allocated; 8.27 GiB free; 1.63 GiB reserved in total by PyTorch)

Tianlock commented 4 years ago

And with the faster_rcnn_r101_fpn_1x.py config, an 800:1000 img scale, and 2 imgs per GPU, I can't even train the model for 1 iteration.

ZwwWayne commented 4 years ago

Hi @Tianlock, could you provide the config to reproduce the OOM?

YAOYI626 commented 4 years ago

> Hi @Tianlock, could you provide the config to reproduce the OOM?

I met the same OOM problem with my four 1080Ti GPUs. Here is my config, with only 2 imgs per GPU:

# fp16 settings
fp16 = dict(loss_scale=512.)

norm_cfg = dict(type='SyncBN', requires_grad=True)
# model settings
model = dict(
    type='FasterRCNN',
    # pretrained='torchvision://resnet50',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=-1,
        norm_cfg=norm_cfg,
        norm_eval=False,
        style='pytorch'),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        norm_cfg=norm_cfg,
        num_outs=5),
    rpn_head=dict(
        type='RPNHead',
        in_channels=256,
        feat_channels=256,
        anchor_scales=[8],
        anchor_ratios=[0.5, 1.0, 2.0],
        anchor_strides=[4, 8, 16, 32, 64],
        target_means=[.0, .0, .0, .0],
        target_stds=[1.0, 1.0, 1.0, 1.0],
        loss_cls=dict(
            type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
        loss_bbox=dict(type='SmoothL1Loss', beta=1.0 / 9.0, loss_weight=1.0)),
    bbox_roi_extractor=dict(
        type='SingleRoIExtractor',
        roi_layer=dict(type='RoIAlign', out_size=7, sample_num=2),
        out_channels=256,
        featmap_strides=[4, 8, 16, 32]),
    bbox_head=dict(
        type='SharedFCBBoxHead',
        num_fcs=2,
        in_channels=256,
        fc_out_channels=1024,
        roi_feat_size=7,
        num_classes=2,
        target_means=[0., 0., 0., 0.],
        target_stds=[0.1, 0.1, 0.2, 0.2],
        reg_class_agnostic=False,
        loss_cls=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
        loss_bbox=dict(type='SmoothL1Loss', beta=1.0, loss_weight=1.0)))
# model training and testing settings
train_cfg = dict(
    rpn=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.7,
            neg_iou_thr=0.3,
            min_pos_iou=0.3,
            ignore_iof_thr=-1),
        sampler=dict(
            type='RandomSampler',
            num=256,
            pos_fraction=0.5,
            neg_pos_ub=-1,
            add_gt_as_proposals=False),
        allowed_border=0,
        pos_weight=-1,
        debug=False),
    rpn_proposal=dict(
        nms_across_levels=False,
        nms_pre=2000,
        nms_post=2000,
        max_num=2000,
        nms_thr=0.7,
        min_bbox_size=0),
    rcnn=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.5,
            neg_iou_thr=0.5,
            min_pos_iou=0.5,
            ignore_iof_thr=-1),
        sampler=dict(
            type='RandomSampler',
            num=512,
            pos_fraction=0.25,
            neg_pos_ub=-1,
            add_gt_as_proposals=True),
        pos_weight=-1,
        debug=False))
test_cfg = dict(
    rpn=dict(
        nms_across_levels=False,
        nms_pre=1000,
        nms_post=1000,
        max_num=1000,
        nms_thr=0.7,
        min_bbox_size=0),
    rcnn=dict(
        score_thr=0.05, nms=dict(type='nms', iou_thr=0.5), max_per_img=100)
    # soft-nms is also supported for rcnn testing
    # e.g., nms=dict(type='soft_nms', iou_thr=0.5, min_score=0.05)
)
# dataset settings
dataset_type = 'WIDERFaceDataset'
data_root = 'data/WIDERFace/'
img_norm_cfg = dict(mean=[123.675, 116.28, 103.53], std=[1, 1, 1], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    imgs_per_gpu=2,
    workers_per_gpu=0,
    train=dict(
        type='RepeatDataset',
        times=2,
        dataset=dict(
            type=dataset_type,
            ann_file=data_root + 'train.txt',
            img_prefix=data_root + 'WIDER_train/',
            # min_size=17,
            pipeline=train_pipeline)),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'val.txt',
        img_prefix=data_root + 'WIDER_val/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'val.txt',
        img_prefix=data_root + 'WIDER_val/',
        pipeline=test_pipeline))
# optimizer
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
# learning policy
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=1.0 / 3,
    step=[8, 11])
checkpoint_config = dict(interval=2)
# yapf:disable
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        # dict(type='TensorboardLoggerHook')
    ])
# yapf:enable
# runtime settings
total_epochs = 12
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = './work_dirs/wider_face_faster_rcnn_r50_fpn_1x'
load_from = None
resume_from = None
workflow = [('train', 1)]

Thanks!

YAOYI626 commented 4 years ago

@ZwwWayne @hellock I'm sure it's a bug, because the OOM still occurs even with imgs_per_gpu = 1.

hellock commented 4 years ago

@yhcao6 Please check whether dist_train.sh works as expected.

yhcao6 commented 4 years ago

Sorry, I can't reproduce your error. I tried running faster_rcnn_r50 with the command ./tools/dist_train.sh configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py 2; here is the nvidia-smi screenshot: [image]

yhcao6 commented 4 years ago

> I met the same OOM problem with my four 1080Ti GPUs. Here is my config, with only 2 imgs per GPU: […]

Hi @YAOYI626, I can't run your config with the latest mmdet. Could you provide a config that is consistent with the latest branch? I notice that you are using fp16 training. I just tried faster_rcnn_r50_fp16 and it is normal. Here is the screenshot: [image] Could you also give it a try to see if it causes OOM?

YAOYI626 commented 4 years ago

@yhcao6 Thanks for the reply!

I have to say sorry, because I just found that the bug is related to the WIDER Face dataset: there are too many small bboxes in a batch if we don't limit the minimum bbox size, which causes OOM when calculating IoU. It works well now, and there is no imbalance problem for me anymore.
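
For anyone hitting the same problem: the blow-up is easy to estimate, because the assigner builds an IoU matrix between every anchor and every GT box, and a crowded WIDER Face image can contain on the order of a thousand tiny faces. A rough back-of-the-envelope sketch based on the config above (all numbers are illustrative, not measured):

```python
# Anchor count for a ~1333x800 input padded to 1344x800, with FPN strides
# (4, 8, 16, 32, 64) and 3 anchors per location, as in the config above.
strides = (4, 8, 16, 32, 64)
h, w = 800, 1344
num_anchors = sum((h // s) * (w // s) * 3 for s in strides)  # roughly 270k

num_gts = 1000          # a very crowded WIDER Face image, illustrative
bytes_per_float = 4
iou_gb = num_anchors * num_gts * bytes_per_float / 1024 ** 3
print(f"{num_anchors} anchors x {num_gts} GTs -> ~{iou_gb:.1f} GB for one IoU matrix")
```

The intermediate tensors used to compute the overlaps are several times larger again, which is consistent in scale with the multi-GiB allocation in the error reported earlier. The commented-out min_size=17 line in the dataset config above looks like the intended switch for filtering such tiny boxes.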