open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Extra memory copy during evaluation in colab environment? #5358

Closed · lantudou closed this issue 3 years ago

lantudou commented 3 years ago

I tried to run mmdetection in a Colab environment and built the latest versions of mmcv and mmdet using the following commands: !pip install mmcv-full

Collecting mmcv-full
  Downloading https://files.pythonhosted.org/packages/d1/e7/19e84e2223fa997dfb9a0b5fbbb0c91889577ef029acca9448d8e7cc6d74/mmcv-full-1.3.6.tar.gz (309kB)
     |████████████████████████████████| 317kB 23.7MB/s 
Collecting addict
  Downloading https://files.pythonhosted.org/packages/6a/00/b08f23b7d7e1e14ce01419a467b583edbb93c6cdb8654e54a9cc579cd61f/addict-2.4.0-py3-none-any.whl
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from mmcv-full) (1.19.5)
Requirement already satisfied: Pillow in /usr/local/lib/python3.7/dist-packages (from mmcv-full) (7.1.2)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.7/dist-packages (from mmcv-full) (3.13)
Collecting yapf
  Downloading https://files.pythonhosted.org/packages/5f/0d/8814e79eb865eab42d95023b58b650d01dec6f8ea87fc9260978b1bf2167/yapf-0.31.0-py2.py3-none-any.whl (185kB)
     |████████████████████████████████| 194kB 40.1MB/s 
Building wheels for collected packages: mmcv-full
  Building wheel for mmcv-full (setup.py) ... done
  Created wheel for mmcv-full: filename=mmcv_full-1.3.6-cp37-cp37m-linux_x86_64.whl size=27274918 sha256=2b2b44255ae4cbe0e6d4317531192078df21db4d98b8bd36a7e0e47ccba58025
  Stored in directory: /root/.cache/pip/wheels/2e/3d/3e/2ea1242e99b67cc9b86d1cec48c7545ecdc12ab906d1f45836
Successfully built mmcv-full
Installing collected packages: addict, yapf, mmcv-full
Successfully installed addict-2.4.0 mmcv-full-1.3.6 yapf-0.31.0

!git clone https://github.com/open-mmlab/mmdetection.git
%cd mmdetection
!pip install -r requirements/build.txt
!pip install -v -e .  # or "python setup.py develop"
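As a quick sanity check after the build (a minimal sketch, nothing mmdetection-specific), the installed packages import cleanly and the Colab GPU is visible:

import torch
import mmcv
import mmdet

# Confirm the freshly built packages load and the GPU runtime is active.
print('torch:', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('mmcv:', mmcv.__version__)
print('mmdet:', mmdet.__version__)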
Next, I used the following configuration file to train and evaluate on my own dataset:
Config:
checkpoint_config = dict(interval=3)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
custom_hooks = [dict(type='NumClassCheckHook')]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = '/content/mmdetection/checkpointss/cornernet_hourglass104_mstest_32x3_210e_coco_20200819_203110-1efaea91.pth'
resume_from = None
workflow = [('train', 1)]
dataset_type = 'CocoDataset'
data_root = 'data/coco/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile', to_float32=True),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(
        type='PhotoMetricDistortion',
        brightness_delta=32,
        contrast_range=(0.5, 1.5),
        saturation_range=(0.5, 1.5),
        hue_delta=18),
    dict(
        type='RandomCenterCropPad',
        crop_size=(511, 511),
        ratios=(0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3),
        test_mode=False,
        test_pad_mode=None,
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Resize', img_scale=(511, 511), keep_ratio=False),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile', to_float32=True),
    dict(
        type='MultiScaleFlipAug',
        scale_factor=1.0,
        flip=True,
        transforms=[
            dict(type='Resize'),
            dict(
                type='RandomCenterCropPad',
                crop_size=None,
                ratios=None,
                border=None,
                test_mode=True,
                test_pad_mode=['logical_or', 127],
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(
                type='Collect',
                keys=['img'],
                meta_keys=('filename', 'ori_shape', 'img_shape', 'pad_shape',
                           'scale_factor', 'flip', 'img_norm_cfg', 'border'))
        ])
]
data = dict(
    samples_per_gpu=5,
    workers_per_gpu=3,
    train=dict(
        type='CocoDataset',
        ann_file='/content/instances_train.json',
        img_prefix='/content/3_images/',
        pipeline=[
            dict(type='LoadImageFromFile', to_float32=True),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(
                type='PhotoMetricDistortion',
                brightness_delta=32,
                contrast_range=(0.5, 1.5),
                saturation_range=(0.5, 1.5),
                hue_delta=18),
            dict(
                type='RandomCenterCropPad',
                crop_size=(511, 511),
                ratios=(0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3),
                test_mode=False,
                test_pad_mode=None,
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Resize', img_scale=(511, 511), keep_ratio=False),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ],
        classes=('badge', 'offground', 'ground', 'safebelt')),
    val=dict(
        type='CocoDataset',
        ann_file='/content/instances_validation.json',
        img_prefix='/content/3_images/',
        pipeline=[
            dict(type='LoadImageFromFile', to_float32=True),
            dict(
                type='MultiScaleFlipAug',
                scale_factor=1.0,
                flip=True,
                transforms=[
                    dict(type='Resize'),
                    dict(
                        type='RandomCenterCropPad',
                        crop_size=None,
                        ratios=None,
                        border=None,
                        test_mode=True,
                        test_pad_mode=['logical_or', 127],
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(
                        type='Collect',
                        keys=['img'],
                        meta_keys=('filename', 'ori_shape', 'img_shape',
                                   'pad_shape', 'scale_factor', 'flip',
                                   'img_norm_cfg', 'border'))
                ])
        ],
        classes=('badge', 'offground', 'ground', 'safebelt')),
    test=dict(
        type='CocoDataset',
        ann_file='/content/instances_validation.json',
        img_prefix='/content/3_images/',
        pipeline=[
            dict(type='LoadImageFromFile', to_float32=True),
            dict(
                type='MultiScaleFlipAug',
                scale_factor=1.0,
                flip=True,
                transforms=[
                    dict(type='Resize'),
                    dict(
                        type='RandomCenterCropPad',
                        crop_size=None,
                        ratios=None,
                        border=None,
                        test_mode=True,
                        test_pad_mode=['logical_or', 127],
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(
                        type='Collect',
                        keys=['img'],
                        meta_keys=('filename', 'ori_shape', 'img_shape',
                                   'pad_shape', 'scale_factor', 'flip',
                                   'img_norm_cfg', 'border'))
                ])
        ],
        classes=('badge', 'offground', 'ground', 'safebelt')))
evaluation = dict(interval=1, metric='mAP')
model = dict(
    type='CornerNet',
    backbone=dict(
        type='HourglassNet',
        downsample_times=5,
        num_stacks=2,
        stage_channels=[256, 256, 384, 384, 384, 512],
        stage_blocks=[2, 2, 2, 2, 2, 4],
        norm_cfg=dict(type='BN', requires_grad=True)),
    neck=None,
    bbox_head=dict(
        type='CornerHead',
        num_classes=4,
        in_channels=256,
        num_feat_levels=2,
        corner_emb_channels=1,
        loss_heatmap=dict(
            type='GaussianFocalLoss', alpha=2.0, gamma=4.0, loss_weight=1),
        loss_embedding=dict(
            type='AssociativeEmbeddingLoss', pull_weight=0.1, push_weight=0.1),
        loss_offset=dict(type='SmoothL1Loss', beta=1.0, loss_weight=1)),
    train_cfg=None,
    test_cfg=dict(
        corner_topk=100,
        local_maximum_kernel=3,
        distance_threshold=0.5,
        score_thr=0.05,
        max_per_img=100,
        nms=dict(type='soft_nms', iou_threshold=0.5, method='gaussian')))
optimizer = dict(type='Adam', lr=0.0005)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
lr_config = None
runner = dict(type='EpochBasedRunner', max_epochs=3)
seed = 0
gpu_ids = range(0, 1)
work_dir = './tutorial_exps'

The training process looks OK, but it is always killed by a CUDA out-of-memory error during the evaluation step. This really confuses me, because the memory cost of evaluation should be smaller than that of training. Could you tell me where the problem is? My training log is as follows:

loading annotations into memory...
Done (t=0.03s)
creating index...
index created!
[
CocoDataset Train dataset with number of images 2359, and instance counts: 
+-----------+-------+---------------+-------+------------+-------+--------------+-------+---------------+-------+
| category  | count | category      | count | category   | count | category     | count | category      | count |
+-----------+-------+---------------+-------+------------+-------+--------------+-------+---------------+-------+
| 0 [badge] | 534   | 1 [offground] | 2297  | 2 [ground] | 1973  | 3 [safebelt] | 1585  | -1 background | 0     |
+-----------+-------+---------------+-------+------------+-------+--------------+-------+---------------+-------+]
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 3 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
2021-06-15 03:03:19,187 - mmdet - INFO - load checkpoint from /content/mmdetection/checkpointss/cornernet_hourglass104_mstest_32x3_210e_coco_20200819_203110-1efaea91.pth
2021-06-15 03:03:19,188 - mmdet - INFO - Use load_from_local loader
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
2021-06-15 03:03:19,870 - mmdet - WARNING - The model and loaded state dict do not match exactly

size mismatch for bbox_head.tl_heat.0.1.conv.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([4, 256, 1, 1]).
size mismatch for bbox_head.tl_heat.0.1.conv.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([4]).
size mismatch for bbox_head.tl_heat.1.1.conv.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([4, 256, 1, 1]).
size mismatch for bbox_head.tl_heat.1.1.conv.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([4]).
size mismatch for bbox_head.br_heat.0.1.conv.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([4, 256, 1, 1]).
size mismatch for bbox_head.br_heat.0.1.conv.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([4]).
size mismatch for bbox_head.br_heat.1.1.conv.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([4, 256, 1, 1]).
size mismatch for bbox_head.br_heat.1.1.conv.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([4]).
2021-06-15 03:03:19,892 - mmdet - INFO - Start running, host: root@98248b24902f, work_dir: /content/mmdetection/tutorial_exps
2021-06-15 03:03:19,896 - mmdet - INFO - workflow: [('train', 1)], max: 3 epochs
2021-06-15 03:05:50,274 - mmdet - INFO - Epoch [1][50/472]  lr: 5.000e-04, eta: 1:08:27, time: 3.007, data_time: 0.528, memory: 12353, det_loss: 8992.2734, off_loss: 0.0547, pull_loss: 0.1482, push_loss: 0.1249, loss: 8992.6011, grad_norm: 58807.2402
2021-06-15 03:08:06,811 - mmdet - INFO - Epoch [1][100/472] lr: 5.000e-04, eta: 1:02:55, time: 2.731, data_time: 0.168, memory: 12353, det_loss: 8350.7865, off_loss: 0.0566, pull_loss: 0.1095, push_loss: 0.1087, loss: 8351.0612, grad_norm: 284771.1289
2021-06-15 03:10:32,646 - mmdet - INFO - Epoch [1][150/472] lr: 5.000e-04, eta: 1:00:52, time: 2.917, data_time: 0.326, memory: 12353, det_loss: 4.4549, off_loss: 0.0563, pull_loss: 0.0388, push_loss: 0.1400, loss: 4.6899, grad_norm: 39.0529
2021-06-15 03:12:57,682 - mmdet - INFO - Epoch [1][200/472] lr: 5.000e-04, eta: 0:58:32, time: 2.901, data_time: 0.317, memory: 12353, det_loss: 2.9102, off_loss: 0.0559, pull_loss: 0.0209, push_loss: 0.1547, loss: 3.1417, grad_norm: 13.5121
2021-06-15 03:15:22,587 - mmdet - INFO - Epoch [1][250/472] lr: 5.000e-04, eta: 0:56:10, time: 2.898, data_time: 0.298, memory: 12353, det_loss: 2.4731, off_loss: 0.0550, pull_loss: 0.0265, push_loss: 0.1893, loss: 2.7440, grad_norm: 11.7897
2021-06-15 03:17:43,095 - mmdet - INFO - Epoch [1][300/472] lr: 5.000e-04, eta: 0:53:31, time: 2.810, data_time: 0.190, memory: 12353, det_loss: 2.5827, off_loss: 0.0588, pull_loss: 0.0278, push_loss: 0.1520, loss: 2.8213, grad_norm: 12.1198
2021-06-15 03:20:04,435 - mmdet - INFO - Epoch [1][350/472] lr: 5.000e-04, eta: 0:50:59, time: 2.827, data_time: 0.222, memory: 12353, det_loss: 2.3108, off_loss: 0.0573, pull_loss: 0.0262, push_loss: 0.1266, loss: 2.5208, grad_norm: 11.8928
2021-06-15 03:22:26,408 - mmdet - INFO - Epoch [1][400/472] lr: 5.000e-04, eta: 0:48:32, time: 2.839, data_time: 0.218, memory: 12353, det_loss: 2.2667, off_loss: 0.0566, pull_loss: 0.0189, push_loss: 0.1403, loss: 2.4825, grad_norm: 11.1605
2021-06-15 03:24:44,320 - mmdet - INFO - Epoch [1][450/472] lr: 5.000e-04, eta: 0:45:57, time: 2.758, data_time: 0.159, memory: 12353, det_loss: 2.3864, off_loss: 0.0589, pull_loss: 0.0260, push_loss: 0.1231, loss: 2.5944, grad_norm: 11.4065
[                                                  ] 0/187, elapsed: 0s, ETA:
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-21-af7c7813795f> in <module>()
     16 # Create work_dir
     17 mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
---> 18 train_detector(model, datasets, cfg, distributed=False, validate=True)

25 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py in _conv_forward(self, input, weight, bias)
    394                             _pair(0), self.dilation, self.groups)
    395         return F.conv2d(input, weight, bias, self.stride,
--> 396                         self.padding, self.dilation, self.groups)
    397 
    398     def forward(self, input: Tensor) -> Tensor:

RuntimeError: CUDA out of memory. Tried to allocate 1.12 GiB (GPU 0; 14.76 GiB total capacity; 11.34 GiB already allocated; 1.01 GiB free; 12.71 GiB reserved in total by PyTorch)
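If it helps with debugging, I can also print PyTorch's allocator statistics right before the evaluation loop starts (a small check using the standard torch.cuda.memory_summary() call):

import torch
# Prints allocated vs. reserved (cached) CUDA memory for the device, which
# shows how much the finished training epoch is still holding on to.
print(torch.cuda.memory_summary(device=0))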

If necessary, I would be happy to share my Colab notebook here to reproduce the bug. Kind regards!

hhaAndroid commented 3 years ago

Hi @lantudou, CornerNet is very memory-consuming. You can consider resizing the input images or reducing the batch size, and then observe whether the same problem occurs.
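For example, a minimal sketch of such changes applied to the config object from your notebook (the numbers are illustrative, not tuned values):

# Illustrative memory-saving tweaks; the exact values are examples only.
cfg.data.samples_per_gpu = 2    # was 5: fewer images per batch
cfg.data.workers_per_gpu = 2    # Colab suggests at most 2 dataloader workers

# Optionally train at a smaller resolution as well (pipeline indices follow
# the config posted above: [3] is RandomCenterCropPad, [4] is Resize).
cfg.data.train.pipeline[3]['crop_size'] = (383, 383)   # was (511, 511)
cfg.data.train.pipeline[4]['img_scale'] = (383, 383)   # was (511, 511)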

lantudou commented 3 years ago

> Hi @lantudou, CornerNet is very memory-consuming. You can consider resizing the input images or reducing the batch size, and then observe whether the same problem occurs.

Yeah... but I just want to know whether the evaluation process needs extra CUDA memory during training. In other words, is this normal behaviour or a code bug?

hhaAndroid commented 3 years ago

> > Hi @lantudou, CornerNet is very memory-consuming. You can consider resizing the input images or reducing the batch size, and then observe whether the same problem occurs.
>
> Yeah... but I just want to know whether the evaluation process needs extra CUDA memory during training. In other words, is this normal behaviour or a code bug?

This is not easy to judge, but we have no problem training and validating on a V100.
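If the cached training allocations are what push evaluation over the limit, one workaround you could try (a sketch, not something we have verified on Colab) is mmcv's EmptyCacheHook, which calls torch.cuda.empty_cache() around epoch boundaries so the training cache is released before validation:

custom_hooks = [
    dict(type='NumClassCheckHook'),
    # Frees PyTorch's cached CUDA memory after each training epoch;
    # whether it runs before the EvalHook depends on hook priority.
    dict(type='EmptyCacheHook', after_epoch=True)
]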

ousinkou commented 3 years ago

@hhaAndroid The V100 has a large memory, so of course there is no problem there. Evaluation should work like it does in detectron2: when there is not much GPU memory left, the evaluation process should be transferred to the CPU.
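For reference, detectron2 implements this with a retry_if_cuda_oom wrapper. A minimal sketch of the same idea (simplified, and not part of mmdetection):

import torch

def retry_if_cuda_oom(func):
    # Simplified sketch of detectron2's utility of the same name:
    # run func; on CUDA OOM, free the cache and retry once on the CPU.
    def wrapped(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except RuntimeError as e:
            if 'CUDA out of memory' not in str(e):
                raise
            torch.cuda.empty_cache()
            cpu_args = [a.cpu() if torch.is_tensor(a) else a for a in args]
            cpu_kwargs = {k: v.cpu() if torch.is_tensor(v) else v
                          for k, v in kwargs.items()}
            return func(*cpu_args, **cpu_kwargs)
    return wrapped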