open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

yolact Out of memory during training #7148

Open Masterzhuior opened 2 years ago

Masterzhuior commented 2 years ago

I am training YOLACT. When an epoch ends, the run goes out of memory (OOM) in the validation phase.

My environment is:

The GPU is an RTX 3090 (24 GB VRAM).

I only modified num_classes in the file yolact_r50_1x8_coco.py. The data config is:

data = dict(
    samples_per_gpu=8, 
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/train2017.json',
        img_prefix=data_root + 'train2017/',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline))

Then I started training on my own dataset with python tools/train.py configs/yolact/yolact_r50_1x8_coco.py.

My dataset:

But when an epoch ends, the validation phase runs out of memory:

2022-02-13 14:38:01,355 - mmdet - INFO - Saving checkpoint at 1 epochs
[                                                  ] 26/1611, 0.6 task/s, elapsed: 43s, ETA:  2636s
Traceback (most recent call last):
  File "tools/train.py", line 195, in <module>
    main()
  File "tools/train.py", line 191, in main
    meta=meta)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/apis/train.py", line 209, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_epoch')
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmcv/runner/hooks/evaluation.py", line 267, in after_train_epoch
    self._do_evaluate(runner)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/core/evaluation/eval_hooks.py", line 56, in _do_evaluate
    results = single_gpu_test(runner.model, self.dataloader, show=False)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/apis/test.py", line 28, in single_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 50, in forward
    return super().forward(*inputs, **kwargs)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
    return old_func(*args, **kwargs)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 174, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 147, in forward_test
    return self.simple_test(imgs[0], img_metas[0], **kwargs)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/models/detectors/yolact.py", line 113, in simple_test
    rescale=rescale)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/models/dense_heads/yolact_head.py", line 999, in simple_test
    img_metas[i], rescale)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/models/dense_heads/yolact_head.py", line 869, in get_seg_masks
    align_corners=False).squeeze(0) > 0.5
RuntimeError: CUDA out of memory. Tried to allocate 4.69 GiB (GPU 0; 23.70 GiB total capacity; 19.18 GiB already allocated; 276.56 MiB free; 21.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

After I swap the training set and the validation set, the error disappears:

data = dict(
    samples_per_gpu=8, 
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/train2017.json',
        img_prefix=data_root + 'train2017/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline))

Is there a way to solve this issue? Thanks.

hhaAndroid commented 2 years ago

@Masterzhuior It is possible that other programs occupy the GPU memory.

Masterzhuior commented 2 years ago

> @Masterzhuior It is possible that other programs occupy the GPU memory.

No, I'm sure no other program is occupying the GPU memory.

ZwwWayne commented 2 years ago

The program fails at inference time. It might be because the images are too big; you could try resizing them to a smaller size.
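
As a rough check of this hypothesis, here is a minimal sketch of the arithmetic. It assumes that the failing interpolate call in get_seg_masks upsamples one float32 mask per detection to the original image resolution (an assumption about the code, not something confirmed in this thread). The detection count and image size below are hypothetical values, since the thread does not state them.

```python
GIB = 1024 ** 3


def mask_upsample_bytes(num_masks, height, width, bytes_per_element=4):
    """Bytes needed for a dense (num_masks, height, width) float32 tensor."""
    return num_masks * height * width * bytes_per_element


# Hypothetical values: 100 detections on a 4096x3072 image.
estimate = mask_upsample_bytes(num_masks=100, height=3072, width=4096)
print(f"{estimate / GIB:.2f} GiB")  # prints "4.69 GiB"
```

Under these assumed values the estimate is the same size as the 4.69 GiB allocation in the error message, which would be consistent with large original images being the cause. The real image size and detection count should be checked before relying on this.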

Masterzhuior commented 2 years ago

> The program fails at inference time. It might be because the images are too big; you could try resizing them to a smaller size.

img_size = 550 

I tried changing this parameter in the model config from 550 to 400, but the problem is still not solved.

mikaizhu commented 2 years ago

@ZwwWayne Hi, how do I resize the images?

ZwwWayne commented 2 years ago

Modify the Resize module in the pipeline and make the image scale smaller.
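
A minimal sketch of what that could look like, assuming the test pipeline follows the usual MMDetection 2.x layout with MultiScaleFlipAug wrapping the Resize step. The value 400 is only illustrative, and the normalization values are assumed; copy the real ones from your own base config.

```python
# Hypothetical smaller evaluation size; 400 is only an illustrative value.
img_size = 400

# Assumed normalization settings; copy the real ones from your base config.
img_norm_cfg = dict(
    mean=[123.68, 116.78, 103.94], std=[58.40, 57.12, 57.38], to_rgb=True)

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(img_size, img_size),  # target size used by Resize below
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=False),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ]),
]
```

The val and test entries in data already reference test_pipeline, so they would pick up the smaller scale. Note that changing img_size from 550 to 400 was reported earlier in this thread not to fix the error, so this change may not be sufficient on its own.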