openvinotoolkit / training_extensions

Train, Evaluate, Optimize, Deploy Computer Vision Models via OpenVINO™
https://openvinotoolkit.github.io/training_extensions/
Apache License 2.0

CUDA out of memory #749

Closed · dariocf1 closed this issue 2 years ago

dariocf1 commented 2 years ago

Hi, I've been trying to train a custom object detector, but I got the following error:

RuntimeError: CUDA out of memory. Tried to allocate 768.00 MiB (GPU 0; 3.95 GiB total capacity; 2.61 GiB already allocated; 286.69 MiB free; 2.64 GiB reserved in total by PyTorch)

The command I ran:
python train.py    --load-weights ${WORK_DIR}/snapshot.pth    --train-ann-files ${TRAIN_ANN_FILE}    --train-data-roots ${TRAIN_IMG_ROOT}    --val-ann-files ${VAL_ANN_FILE}    --val-data-roots ${VAL_IMG_ROOT}    --save-checkpoints-to ${WORK_DIR}/outputs    --classes ${CLASSES}
WARNING:root:Set of classes that will be used in current training does not equal to classes stored in snapshot: ['damage'] vs []
INFO:root:Commandline:
train.py --load-weights /tmp/damages/snapshot.pth --train-ann-files /home/user/Documents/damages/dataset/train/annotations/instances_default.json --train-data-roots /home/user/Documents/damages/dataset/train/images --val-ann-files /home/user/Documents/damages/dataset/validation/annotations/instances_default.json --val-data-roots /home/user/Documents/damages/dataset/validation/images --save-checkpoints-to /tmp/damages/outputs --classes damage
INFO:root:Training started ...
INFO:root:Training on GPUs started ...
WARNING:root:available_gpu_num < args.gpu_num: 1 < 3
WARNING:root:decreased number of gpu to: 1
fatal: not a git repository (or any of the parent directories): .git
2021-12-01 12:50:03,091 - mmdet - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.6.9 (default, Jan 26 2021, 15:33:00) [GCC 8.4.0]
CUDA available: True
GPU 0: NVIDIA GeForce GTX 1050
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.2, V10.2.89
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.8.1+cu102
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.9.1+cu102
OpenCV: 4.5.3-openvino
MMCV: 1.3.0
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 10.2
MMDetection: 2.9.0+
MMDetection Compiler: GCC 7.5
MMDetection CUDA Compiler: 10.2
NNCF: 1.7.0
ONNX: 1.10.2
ONNXRuntime: 1.9.0
OpenVINO MO: 2021.4.1-3926-14e67d86634-releases/2021/4
OpenVINO IE: 2021.4.1-3926-14e67d86634-releases/2021/4
------------------------------------------------------------

2021-12-01 12:50:03,985 - mmdet - INFO - Distributed training: True
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
Use load_from_local loader
2021-12-01 12:50:04,862 - mmdet - INFO - Config:
input_size = 512
image_width = 512
image_height = 512
width_mult = 1.0
model = dict(
    type='SingleStageDetector',
    backbone=dict(
        type='mobilenetv2_w1',
        out_indices=(4, 5),
        frozen_stages=-1,
        norm_eval=False,
        pretrained=True),
    neck=None,
    bbox_head=dict(
        type='SSDHead',
        num_classes=1,
        in_channels=(96, 320),
        anchor_generator=dict(
            type='SSDAnchorGeneratorClustered',
            strides=(16, 32),
            widths=[[
                23.554248425206367, 54.312675122672, 156.8199838472748,
                85.79076150022739
            ],
                    [
                        126.29684895774292, 230.92962052918818,
                        426.98291390718117, 276.4491073812946,
                        469.60729751113075
                    ]],
            heights=[[
                29.534106270311696, 90.99895689425296, 91.96346785149395,
                197.3348624823917
            ],
                     [
                         354.49167554782616, 221.60634559442957,
                         191.70668631632822, 413.72951531676006,
                         440.6051718003978
                     ]]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=(0.0, 0.0, 0.0, 0.0),
            target_stds=(0.1, 0.1, 0.2, 0.2)),
        depthwise_heads=True,
        depthwise_heads_activations='relu',
        loss_balancing=True),
    train_cfg=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.4,
            neg_iou_thr=0.4,
            min_pos_iou=0.0,
            ignore_iof_thr=-1,
            gt_max_assign_all=False),
        smoothl1_beta=1.0,
        use_giou=False,
        use_focal=False,
        allowed_border=-1,
        pos_weight=-1,
        neg_pos_ratio=3,
        debug=False),
    test_cfg=dict(
        nms=dict(type='nms', iou_threshold=0.45),
        min_bbox_size=0,
        score_thr=0.02,
        max_per_img=200,
        nms_pre_classwise=200))
cudnn_benchmark = True
dataset_type = 'CocoDataset'
img_norm_cfg = dict(mean=[0, 0, 0], std=[255, 255, 255], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile', to_float32=True),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(
        type='PhotoMetricDistortion',
        brightness_delta=32,
        contrast_range=(0.5, 1.5),
        saturation_range=(0.5, 1.5),
        hue_delta=18),
    dict(
        type='MinIoURandomCrop',
        min_ious=(0.1, 0.3, 0.5, 0.7, 0.9),
        min_crop_size=0.1),
    dict(type='Resize', img_scale=(512, 512), keep_ratio=False),
    dict(type='Normalize', mean=[0, 0, 0], std=[255, 255, 255], to_rgb=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(512, 512),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=False),
            dict(
                type='Normalize',
                mean=[0, 0, 0],
                std=[255, 255, 255],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=32,
    workers_per_gpu=4,
    train=dict(
        type='RepeatDataset',
        times=5,
        dataset=dict(
            type='CocoDataset',
            ann_file=
            '/home/user/Documents/damages/dataset/train/annotations/instances_default.json',
            img_prefix='/home/user/Documents/damages/dataset/train/images',
            pipeline=[
                dict(type='LoadImageFromFile', to_float32=True),
                dict(type='LoadAnnotations', with_bbox=True),
                dict(
                    type='PhotoMetricDistortion',
                    brightness_delta=32,
                    contrast_range=(0.5, 1.5),
                    saturation_range=(0.5, 1.5),
                    hue_delta=18),
                dict(
                    type='MinIoURandomCrop',
                    min_ious=(0.1, 0.3, 0.5, 0.7, 0.9),
                    min_crop_size=0.1),
                dict(type='Resize', img_scale=(512, 512), keep_ratio=False),
                dict(
                    type='Normalize',
                    mean=[0, 0, 0],
                    std=[255, 255, 255],
                    to_rgb=True),
                dict(type='RandomFlip', flip_ratio=0.5),
                dict(type='DefaultFormatBundle'),
                dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
            ],
            classes=['damage'])),
    val=dict(
        type='CocoDataset',
        ann_file=
        '/home/user/Documents/damages/dataset/validation/annotations/instances_default.json',
        img_prefix='/home/user/Documents/damages/dataset/validation/images',
        test_mode=True,
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(512, 512),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=False),
                    dict(
                        type='Normalize',
                        mean=[0, 0, 0],
                        std=[255, 255, 255],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        classes=['damage']),
    test=dict(
        type='CocoDataset',
        ann_file='data/coco/annotations/instances_val2017.json',
        img_prefix='data/coco/val2017',
        test_mode=True,
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(512, 512),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=False),
                    dict(
                        type='Normalize',
                        mean=[0, 0, 0],
                        std=[255, 255, 255],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        classes=['damage']))
optimizer = dict(type='SGD', lr=0.05, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(
    policy='CosineAnnealing',
    min_lr=1e-05,
    warmup='linear',
    warmup_iters=100,
    warmup_ratio=0.1)
checkpoint_config = dict(interval=1)
log_config = dict(
    interval=10,
    hooks=[dict(type='TextLoggerHook'),
           dict(type='TensorboardLoggerHook')])
total_epochs = 15
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = '/tmp/damages/outputs'
load_from = '/tmp/damages/snapshot.pth'
resume_from = ''
workflow = [('train', 1)]
gpu_ids = range(0, 1)

fatal: not a git repository (or any of the parent directories): .git
The model and loaded state dict do not match exactly

size mismatch for bbox_head.cls_convs.0.3.weight: copying a param with shape torch.Size([324, 96, 1, 1]) from checkpoint, the shape in current model is torch.Size([8, 96, 1, 1]).
size mismatch for bbox_head.cls_convs.0.3.bias: copying a param with shape torch.Size([324]) from checkpoint, the shape in current model is torch.Size([8]).
size mismatch for bbox_head.cls_convs.1.3.weight: copying a param with shape torch.Size([405, 320, 1, 1]) from checkpoint, the shape in current model is torch.Size([10, 320, 1, 1]).
size mismatch for bbox_head.cls_convs.1.3.bias: copying a param with shape torch.Size([405]) from checkpoint, the shape in current model is torch.Size([10]).
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
2021-12-01 12:50:08,908 - mmdet - INFO - Start running, host: user@BraingineNitro, work_dir: /tmp/damages/outputs
2021-12-01 12:50:08,908 - mmdet - INFO - workflow: [('train', 1)], max: 15 epochs
Traceback (most recent call last):
  File "/home/user/Documents/damages/training_extensions/external/mmdetection/tools/train.py", line 339, in <module>
    main()
  File "/home/user/Documents/damages/training_extensions/external/mmdetection/tools/train.py", line 335, in main
    meta=meta)
  File "/home/user/Documents/damages/training_extensions/external/mmdetection/mmdet/apis/train.py", line 220, in train_detector
    runner.run(data_loaders, cfg.workflow, compression_ctrl=compression_ctrl)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/mmcv/parallel/distributed.py", line 51, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/user/Documents/damages/training_extensions/external/mmdetection/mmdet/models/detectors/base.py", line 365, in train_step
    losses = self(**data)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/nncf/dynamic_graph/wrappers.py", line 106, in wrapped
    return module_call(self, *args, **kwargs)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
    return old_func(*args, **kwargs)
  File "/home/user/Documents/damages/training_extensions/external/mmdetection/mmdet/models/detectors/base.py", line 212, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/home/user/Documents/damages/training_extensions/external/mmdetection/mmdet/models/detectors/single_stage.py", line 96, in forward_train
    x = self.extract_feat(img)
  File "/home/user/Documents/damages/training_extensions/external/mmdetection/mmdet/models/detectors/single_stage.py", line 57, in extract_feat
    x = self.backbone(img)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/nncf/dynamic_graph/wrappers.py", line 106, in wrapped
    return module_call(self, *args, **kwargs)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/Documents/damages/training_extensions/external/mmdetection/mmdet/models/backbones/imgclsmob.py", line 41, in multioutput_forward
    y = stage(y)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/nncf/dynamic_graph/wrappers.py", line 106, in wrapped
    return module_call(self, *args, **kwargs)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/nncf/dynamic_graph/wrappers.py", line 106, in wrapped
    return module_call(self, *args, **kwargs)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/pytorchcv/models/mobilenetv2.py", line 55, in forward
    x = self.conv1(x)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/nncf/dynamic_graph/wrappers.py", line 106, in wrapped
    return module_call(self, *args, **kwargs)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/pytorchcv/models/common.py", line 265, in forward
    x = self.bn(x)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/nncf/dynamic_graph/wrappers.py", line 106, in wrapped
    return module_call(self, *args, **kwargs)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 140, in forward
    self.weight, self.bias, bn_training, exponential_average_factor, self.eps)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/nncf/dynamic_graph/wrappers.py", line 44, in wrapped
    op1 = operator(*args, **kwargs)
  File "/home/user/Documents/damages/training_extensions/models/object_detection/venv/lib/python3.6/site-packages/torch/nn/functional.py", line 2150, in batch_norm
    input, weight, bias, running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 768.00 MiB (GPU 0; 3.95 GiB total capacity; 2.61 GiB already allocated; 286.69 MiB free; 2.64 GiB reserved in total by PyTorch)

Terminated because of: CUDA out of memory
Killing subprocess 4795
Main process received SIGTERM, exiting
Terminated
morkovka1337 commented 2 years ago

Hi,

This error means that you do not have enough GPU memory to train with this batch size. Please decrease the samples_per_gpu value in the config file, for example to samples_per_gpu=16, and see whether training runs. If it does, you can increase the value a bit until you hit the same error; if it does not, decrease it further (8, 6, 4, etc.). A sketch of the change follows.
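
For reference, below is a minimal sketch of the relevant block from the config dump above with the suggested change applied; the values are the ones from this thread, and only samples_per_gpu changes:

```python
# Relevant block of the generated mmdet-style config (values taken from
# the config dump in this issue). GPU memory use grows roughly linearly
# with samples_per_gpu, so halving it is a reasonable first step.
data = dict(
    samples_per_gpu=16,  # was 32; halve again (8, 4, ...) if the OOM persists
    workers_per_gpu=4,   # CPU dataloader workers; do not affect GPU memory
    # train=..., val=..., test=... stay exactly as generated
)
```

Note that, as established later in this thread, for an instantiated template this value must be changed through batch_size in template.yaml rather than in model.py.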

dariocf1 commented 2 years ago

Thank you,

I already modified the model.py file, changing samples_per_gpu, but the value does not seem to be picked up; training still runs with 32. Where do I find the config file that is actually loaded for training?

I have also changed the value in models/object_detection/model_templates/custom-object-detection/mobilenet_v2-2s_ssd-512x512/model.py, and there is no change either.

morkovka1337 commented 2 years ago

Once you have instantiated the template, there should be a $WORK_DIR containing the train.py file. The model.py file in that same folder is the one that needs to be changed.

I already modified the model.py file, changing samples_per_gpu, but the value does not seem to be picked up; training still runs with 32. Where do I find the config file that is actually loaded for training?

Have you changed the model.py that is in the template folder?
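
For orientation, here is a sketch of the instantiated ${WORK_DIR} contents, listing only the files named in this thread; the real folder may contain more generated files:

```bash
# Hypothetical listing; only these four files are mentioned in the thread.
ls ${WORK_DIR}
# model.py  snapshot.pth  template.yaml  train.py
```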

dariocf1 commented 2 years ago

Yes, I have changed the value both in that path and in the file in my training folder; I also changed every samples_per_gpu in the cloned repository. After changing them all, training still runs with 32.

morkovka1337 commented 2 years ago

I have just instantiated the template and tried to change the batch size. I can confirm the problem: the value is not being picked up. I will investigate.

morkovka1337 commented 2 years ago

OK, I've got it. Instead of changing samples_per_gpu in model.py, you need to change batch_size in template.yaml. Sorry for confusing you.

I also changed every samples_per_gpu in the cloned repository. After changing them all, training still runs with 32.

There is no need to do that. Once the template is instantiated, all values are read from WORK_DIR; the configs in the source repo have no effect.
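
To summarize the resolution as a sketch: edit template.yaml in the instantiated ${WORK_DIR}, not model.py. Only the batch_size key name is confirmed by this thread; the exact layout around it is an assumption:

```yaml
# ${WORK_DIR}/template.yaml -- illustrative excerpt; per this thread,
# batch_size here is the value that actually takes effect at training time.
batch_size: 16   # was 32; lower further (8, 4, ...) if the OOM persists
```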