Closed: Sri0712 closed this issue 2 years ago.
Sorry for the late reply. The most likely cause is memory usage. It is recommended to reduce `workers_per_gpu` and `img_scale`. BTW, I saw your config uses a batch size of 2 to train YOLOv3. This is not suitable; YOLOv3 requires a large batch size to train well.
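For concreteness, below is a minimal sketch of the kind of overrides being suggested, assuming the usual MMDetection 2.x config-inheritance mechanism. The base file path and the concrete numbers are illustrative, not tuned recommendations.

```python
# Illustrative memory-oriented overrides; the values are assumptions, not tuned numbers.
_base_ = './car_dent_yolov3.py'  # hypothetical path to the config posted below

img_scale = (608, 608)  # smaller than the original (1333, 800) -> smaller tensors per image

# Same augmentations as the posted config; only the Resize scale changes.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Expand', mean=[0, 0, 0], to_rgb=True, ratio_range=(1, 2)),
    dict(type='MinIoURandomCrop',
         min_ious=(0.4, 0.5, 0.6, 0.7, 0.8, 0.9), min_crop_size=0.3),
    dict(type='Resize', img_scale=img_scale, keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(type='Normalize', mean=[0, 0, 0], std=[255.0, 255.0, 255.0], to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]

data = dict(
    # YOLOv3 is normally trained with a much larger (effective) batch than 2;
    # raise this only as far as the available memory actually allows.
    samples_per_gpu=8,
    # Fewer dataloader workers -> fewer prefetched batches held in host RAM.
    workers_per_gpu=2,
    train=dict(pipeline=train_pipeline),
)
```

The test pipeline's `MultiScaleFlipAug` `img_scale` would need the same change, and raising `samples_per_gpu` only helps if the machine has the headroom; on a memory-limited CPU-only container, keeping the batch small and accepting slower convergence is the safer trade-off.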
Prerequisite
Describe the bug
Hello, I am trying to train a YOLOv3 model on a custom dataset on a CPU-only server, and the training keeps getting killed after the first epoch without a proper traceback. Is it because the memory flag is set to 24 GB when running the Docker container? If yes, is there any way to estimate beforehand how much memory the whole training process will require, given the model and dataset? I have already opened an issue regarding this (https://github.com/open-mmlab/mmdetection/issues/8831). If not, what is the reason for this unknown issue?
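For reference, one rough way to answer the "how much memory will it need" question is to run a short dry run (a few hundred iterations) and watch the combined resident memory of the training process and its dataloader workers from a second terminal. The helper below is hypothetical (the function name, the 5-second interval, and the use of psutil are my own choices, not part of MMDetection), but the API calls are standard psutil.

```python
# Hypothetical helper (not part of MMDetection): sample the combined resident memory
# of tools/train.py and its dataloader workers during a short dry run, to estimate
# how much RAM a full training run would need. Requires `pip install psutil`.
import sys
import time

import psutil


def watch_peak_memory(pid: int, interval: float = 5.0) -> int:
    """Print and return the peak RSS (bytes) of `pid` plus all of its children."""
    proc = psutil.Process(pid)
    peak = 0
    try:
        while True:
            rss = 0
            for p in [proc] + proc.children(recursive=True):
                try:
                    rss += p.memory_info().rss
                except psutil.NoSuchProcess:
                    continue  # a dataloader worker exited between listing and sampling
            peak = max(peak, rss)
            available = psutil.virtual_memory().available
            print(f'rss={rss / 2**20:.0f} MB  peak={peak / 2**20:.0f} MB  '
                  f'available={available / 2**20:.0f} MB')
            time.sleep(interval)
    except psutil.NoSuchProcess:
        pass  # the training process itself exited (or was killed)
    print(f'peak over the dry run: {peak / 2**20:.0f} MB')
    return peak


if __name__ == '__main__':
    watch_peak_memory(int(sys.argv[1]))  # pass the PID of the running tools/train.py
```

If the process disappears with only "Killed" in the terminal (as in the log below), the Linux OOM killer is the usual suspect; `dmesg` on the host typically shows an "Out of memory: Killed process ..." entry around the time of the crash, which would confirm whether the 24 GB container limit is being hit.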
Environment
Environment variables set manually inside the container: OMP_NUM_THREADS=6 CUDA_VISIBLE_DEVICES=-1
2022-09-22 11:55:41,567 - mmdet - INFO - Environment info:
sys.platform: linux
Python: 3.7.7 (default, May 7 2020, 21:25:33) [GCC 7.3.0]
CUDA available: False
GCC: gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
PyTorch: 1.6.0
PyTorch compiling details: PyTorch built with:
TorchVision: 0.7.0
OpenCV: 4.6.0
MMCV: 1.4.4
MMCV Compiler: GCC 7.4
MMCV CUDA Compiler: not available
MMDetection: 2.20.0+
2022-09-22 11:56:14,271 - mmdet - INFO - Config: model = dict( type='YOLOV3', backbone=dict( type='Darknet', depth=53, out_indices=(3, 4, 5), init_cfg=dict(type='Pretrained', checkpoint='open-mmlab://darknet53')), neck=dict( type='YOLOV3Neck', num_scales=3, in_channels=[1024, 512, 256], out_channels=[512, 256, 128]), bbox_head=dict( type='YOLOV3Head', num_classes=23, in_channels=[512, 256, 128], out_channels=[1024, 512, 256], anchor_generator=dict( type='YOLOAnchorGenerator', base_sizes=[[(116, 90), (156, 198), (373, 326)], [(30, 61), (62, 45), (59, 119)], [(10, 13), (16, 30), (33, 23)]], strides=[32, 16, 8]), bbox_coder=dict(type='YOLOBBoxCoder'), featmap_strides=[32, 16, 8], loss_cls=dict( type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0, reduction='sum'), loss_conf=dict( type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0, reduction='sum'), loss_xy=dict( type='CrossEntropyLoss', use_sigmoid=True, loss_weight=2.0, reduction='sum'), loss_wh=dict(type='MSELoss', loss_weight=2.0, reduction='sum')), train_cfg=dict( assigner=dict( type='GridAssigner', pos_iou_thr=0.5, neg_iou_thr=0.5, min_pos_iou=0)), test_cfg=dict( nms_pre=1000, min_bbox_size=0, score_thr=0.05, conf_thr=0.005, nms=dict(type='nms', iou_threshold=0.45), max_per_img=100)) dataset_type = 'KandulaDataset' data_root = 'data/car_dent/' img_norm_cfg = dict(mean=[0, 0, 0], std=[255.0, 255.0, 255.0], to_rgb=True) train_pipeline = [ dict(type='LoadImageFromFile'), dict(type='LoadAnnotations', with_bbox=True), dict(type='Expand', mean=[0, 0, 0], to_rgb=True, ratio_range=(1, 2)), dict( type='MinIoURandomCrop', min_ious=(0.4, 0.5, 0.6, 0.7, 0.8, 0.9), min_crop_size=0.3), dict(type='Resize', img_scale=(1333, 800), keep_ratio=True), dict(type='RandomFlip', flip_ratio=0.5), dict(type='PhotoMetricDistortion'), dict( type='Normalize', mean=[0, 0, 0], std=[255.0, 255.0, 255.0], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='DefaultFormatBundle'), dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']) ] test_pipeline = [ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(1333, 800), flip=False, transforms=[ dict(type='Resize', keep_ratio=True), dict(type='RandomFlip'), dict( type='Normalize', mean=[0, 0, 0], std=[255.0, 255.0, 255.0], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='ImageToTensor', keys=['img']), dict(type='Collect', keys=['img']) ]) ] data = dict( samples_per_gpu=2, workers_per_gpu=6, train=dict( type='KandulaDataset', ann_file='data/car_dent/anno.json', img_prefix='data/car_dent/imgs/', img_filerns='data/car_dent/train.json', pipeline=[ dict(type='LoadImageFromFile'), dict(type='LoadAnnotations', with_bbox=True), dict( type='Expand', mean=[0, 0, 0], to_rgb=True, ratio_range=(1, 2)), dict( type='MinIoURandomCrop', min_ious=(0.4, 0.5, 0.6, 0.7, 0.8, 0.9), min_crop_size=0.3), dict(type='Resize', img_scale=(1333, 800), keep_ratio=True), dict(type='RandomFlip', flip_ratio=0.5), dict(type='PhotoMetricDistortion'), dict( type='Normalize', mean=[0, 0, 0], std=[255.0, 255.0, 255.0], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='DefaultFormatBundle'), dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']) ]), val=dict( type='KandulaDataset', ann_file='data/car_dent/anno.json', img_prefix='data/car_dent/imgs/', img_filerns='data/car_dent/val.json', pipeline=[ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(1333, 800), flip=False, transforms=[ dict(type='Resize', keep_ratio=True), dict(type='RandomFlip'), dict( 
type='Normalize', mean=[0, 0, 0], std=[255.0, 255.0, 255.0], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='ImageToTensor', keys=['img']), dict(type='Collect', keys=['img']) ]) ]), test=dict( type='KandulaDataset', ann_file='data/car_dent/anno.json', img_prefix='data/car_dent/imgs/', img_filerns='data/car_dent/test.json', pipeline=[ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(1333, 800), flip=False, transforms=[ dict(type='Resize', keep_ratio=True), dict(type='RandomFlip'), dict( type='Normalize', mean=[0, 0, 0], std=[255.0, 255.0, 255.0], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='ImageToTensor', keys=['img']), dict(type='Collect', keys=['img']) ]) ])) optimizer = dict(type='SGD', lr=0.001, momentum=0.9, weight_decay=0.0005) optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2)) lr_config = dict( policy='step', warmup='linear', warmup_iters=2000, warmup_ratio=0.1, step=[218, 246]) runner = dict(type='EpochBasedRunner', max_epochs=273) evaluation = dict(interval=1, metric=['mAP'], save_best='mAP') checkpoint_config = dict(interval=1) log_config = dict(interval=5, hooks=[dict(type='TextLoggerHook')]) custom_hooks = [ dict(type='NumClassCheckHook'), dict(type='MemoryProfilerHook', interval=5) ] dist_params = dict(backend='nccl') log_level = 'INFO' load_from = None resume_from = None workflow = [('train', 1)] seed = 0 gpu_ids = range(0, 1) work_dir = './work_dirs/car_dent_yolov3'
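One note on the config above, prompted by the warning repeated in the log below ("Please set CLASSES in the KandulaDataset ..."): a custom dataset in MMDetection 2.x normally declares its class names so the runner can check them against the head's `num_classes` (23 here). A minimal sketch of what the existing `KandulaDataset` definition would gain is shown below; subclassing `CocoDataset` and the placeholder names are assumptions for illustration, since the real class is not shown in this report.

```python
# Minimal sketch of declaring CLASSES on the custom dataset so MMDetection can
# check it against bbox_head.num_classes (23 in the config above). Subclassing
# CocoDataset and the placeholder class names are assumptions, not the real code.
from mmdet.datasets import CocoDataset
from mmdet.datasets.builder import DATASETS


@DATASETS.register_module()
class KandulaDataset(CocoDataset):
    # 23 names, one per category id in anno.json, matching num_classes=23 in the head.
    CLASSES = tuple(f'class_{i}' for i in range(23))
```

For datasets built on `CustomDataset`/`CocoDataset`, the names can alternatively be passed via a `classes=` entry in the data config, provided the dataset's `__init__` forwards it. The evaluation table further down printing bare class indices (0-22) instead of names is consistent with `CLASSES` not being set.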
2022-09-22 11:56:16,109 - mmdet - INFO - Start running, host: root@14c9d1e8ddb0, work_dir: /detection/mmdetection/work_dirs/car_dent_yolov3
2022-09-22 11:56:16,112 - mmdet - INFO - Hooks will be executed in the following order:
before_run: (VERY_HIGH ) StepLrUpdaterHook
(NORMAL ) CheckpointHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
before_train_epoch: (VERY_HIGH ) StepLrUpdaterHook
(NORMAL ) NumClassCheckHook
(LOW ) IterTimerHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
before_train_iter: (VERY_HIGH ) StepLrUpdaterHook
(LOW ) IterTimerHook
(LOW ) EvalHook
after_train_iter: (ABOVE_NORMAL) OptimizerHook
(NORMAL ) CheckpointHook
(NORMAL ) MemoryProfilerHook
(LOW ) IterTimerHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
after_train_epoch: (NORMAL ) CheckpointHook
(NORMAL ) MemoryProfilerHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
before_val_epoch: (NORMAL ) NumClassCheckHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
before_val_iter: (LOW ) IterTimerHook
after_val_iter: (NORMAL ) MemoryProfilerHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
after_val_epoch: (NORMAL ) MemoryProfilerHook
(VERY_LOW ) TextLoggerHook
after_run: (VERY_LOW ) TextLoggerHook
2022-09-22 11:56:16,113 - mmdet - INFO - workflow: [('train', 1)], max: 273 epochs
2022-09-22 11:56:16,114 - mmdet - INFO - Checkpoints will be saved to /detection/mmdetection/work_dirs/car_dent_yolov3 by HardDiskBackend.
2022-09-22 11:56:16,150 - mmdet - WARNING - Please set `CLASSES` in the KandulaDataset and check if it is consistent with the `num_classes` of head
2022-09-22 11:57:25,806 - mmdet - INFO - Memory information available_memory: 40880 MB, used_memory: 21945 MB, memory_utilization: 36.4 %, available_swap_memory: 203696 MB, used_swap_memory: 23188 MB, swap_memory_utilization: 10.2 %, current_process_memory: 4564 MB
2022-09-22 11:57:25,914 - mmdet - INFO - Epoch [1][5/993] lr: 1.018e-04, eta: 43 days, 17:02:55, time: 13.931, data_time: 0.540, memory: 4564 MB, loss_cls: 42.6773, loss_conf: 84019.2328, loss_xy: 7.6482, loss_wh: 5.8629, loss: 84075.4250, grad_norm: 262871.7531
... (a few lines were cut because of the issue body length limit) ...
2022-09-22 14:40:56,284 - mmdet - INFO - Memory information available_memory: 37127 MB, used_memory: 25689 MB, memory_utilization: 42.2 %, available_swap_memory: 203376 MB, used_swap_memory: 23508 MB, swap_memory_utilization: 10.4 %, current_process_memory: 7722 MB
2022-09-22 14:40:56,387 - mmdet - INFO - Epoch [1][955/993] lr: 5.293e-04, eta: 32 days, 8:18:37, time: 11.271, data_time: 0.035, memory: 7722 MB, loss_cls: 10.4817, loss_conf: 21.1211, loss_xy: 8.6036, loss_wh: 2.5599, loss: 42.7663, grad_norm: 117.3042
2022-09-22 14:41:41,901 - mmdet - INFO - Memory information available_memory: 37104 MB, used_memory: 25715 MB, memory_utilization: 42.3 %, available_swap_memory: 203376 MB, used_swap_memory: 23508 MB, swap_memory_utilization: 10.4 %, current_process_memory: 7709 MB
2022-09-22 14:41:42,005 - mmdet - INFO - Epoch [1][960/993] lr: 5.315e-04, eta: 32 days, 7:49:06, time: 9.124, data_time: 0.034, memory: 7709 MB, loss_cls: 19.8658, loss_conf: 33.0042, loss_xy: 10.9944, loss_wh: 9.6036, loss: 73.4680, grad_norm: 254.7602
2022-09-22 14:42:34,565 - mmdet - INFO - Memory information available_memory: 37106 MB, used_memory: 25712 MB, memory_utilization: 42.2 %, available_swap_memory: 203376 MB, used_swap_memory: 23508 MB, swap_memory_utilization: 10.4 %, current_process_memory: 7714 MB
2022-09-22 14:42:34,669 - mmdet - INFO - Epoch [1][965/993] lr: 5.338e-04, eta: 32 days, 7:52:45, time: 10.533, data_time: 0.033, memory: 7714 MB, loss_cls: 7.6567, loss_conf: 14.9974, loss_xy: 5.5317, loss_wh: 2.8979, loss: 31.0837, grad_norm: 117.0246
2022-09-22 14:43:21,999 - mmdet - INFO - Memory information available_memory: 37091 MB, used_memory: 25718 MB, memory_utilization: 42.3 %, available_swap_memory: 203376 MB, used_swap_memory: 23508 MB, swap_memory_utilization: 10.4 %, current_process_memory: 7708 MB
2022-09-22 14:43:22,103 - mmdet - INFO - Epoch [1][970/993] lr: 5.360e-04, eta: 32 days, 7:32:05, time: 9.487, data_time: 0.033, memory: 7708 MB, loss_cls: 9.4729, loss_conf: 20.0545, loss_xy: 7.0759, loss_wh: 2.7079, loss: 39.3112, grad_norm: 116.0406
2022-09-22 14:44:10,590 - mmdet - INFO - Memory information available_memory: 37096 MB, used_memory: 25717 MB, memory_utilization: 42.3 %, available_swap_memory: 203376 MB, used_swap_memory: 23508 MB, swap_memory_utilization: 10.4 %, current_process_memory: 7709 MB
2022-09-22 14:44:10,694 - mmdet - INFO - Epoch [1][975/993] lr: 5.383e-04, eta: 32 days, 7:16:58, time: 9.718, data_time: 0.033, memory: 7709 MB, loss_cls: 8.8870, loss_conf: 17.5904, loss_xy: 6.9740, loss_wh: 2.7824, loss: 36.2338, grad_norm: 132.9686
2022-09-22 14:45:00,243 - mmdet - INFO - Memory information available_memory: 37144 MB, used_memory: 25728 MB, memory_utilization: 42.2 %, available_swap_memory: 203376 MB, used_swap_memory: 23508 MB, swap_memory_utilization: 10.4 %, current_process_memory: 7716 MB
2022-09-22 14:45:00,347 - mmdet - INFO - Epoch [1][980/993] lr: 5.406e-04, eta: 32 days, 7:06:52, time: 9.931, data_time: 0.033, memory: 7716 MB, loss_cls: 9.5101, loss_conf: 19.6238, loss_xy: 7.0440, loss_wh: 3.4792, loss: 39.6572, grad_norm: 124.5163
2022-09-22 14:45:55,099 - mmdet - INFO - Memory information available_memory: 37262 MB, used_memory: 25719 MB, memory_utilization: 42.0 %, available_swap_memory: 203376 MB, used_swap_memory: 23508 MB, swap_memory_utilization: 10.4 %, current_process_memory: 7717 MB
2022-09-22 14:45:55,203 - mmdet - INFO - Epoch [1][985/993] lr: 5.428e-04, eta: 32 days, 7:20:38, time: 10.971, data_time: 0.034, memory: 7717 MB, loss_cls: 11.4901, loss_conf: 22.3278, loss_xy: 8.5331, loss_wh: 3.9176, loss: 46.2687, grad_norm: 132.7615
2022-09-22 14:46:47,540 - mmdet - INFO - Memory information available_memory: 37361 MB, used_memory: 25729 MB, memory_utilization: 41.9 %, available_swap_memory: 203376 MB, used_swap_memory: 23508 MB, swap_memory_utilization: 10.4 %, current_process_memory: 7714 MB
2022-09-22 14:46:47,644 - mmdet - INFO - Epoch [1][990/993] lr: 5.450e-04, eta: 32 days, 7:23:17, time: 10.488, data_time: 0.035, memory: 7714 MB, loss_cls: 10.8853, loss_conf: 20.0091, loss_xy: 7.9695, loss_wh: 3.6433, loss: 42.5073, grad_norm: 131.9860
2022-09-22 14:47:14,792 - mmdet - INFO - Saving checkpoint at 1 epochs
2022-09-22 14:57:27,539 - mmdet - INFO -
+-------+-----+------+--------+-------+
| class | gts | dets | recall | ap    |
+-------+-----+------+--------+-------+
| 0     | 1   | 0    | 0.000  | 0.000 |
| 1     | 2   | 0    | 0.000  | 0.000 |
| 2     | 81  | 510  | 0.086  | 0.002 |
| 3     | 1   | 0    | 0.000  | 0.000 |
| 4     | 1   | 0    | 0.000  | 0.000 |
| 5     | 113 | 505  | 0.018  | 0.000 |
| 6     | 73  | 460  | 0.000  | 0.000 |
| 7     | 149 | 518  | 0.047  | 0.001 |
| 8     | 22  | 16   | 0.000  | 0.000 |
| 9     | 43  | 452  | 0.000  | 0.000 |
| 10    | 2   | 0    | 0.000  | 0.000 |
| 11    | 1   | 0    | 0.000  | 0.000 |
| 12    | 7   | 0    | 0.000  | 0.000 |
| 13    | 35  | 491  | 0.114  | 0.002 |
| 14    | 81  | 484  | 0.025  | 0.000 |
| 15    | 38  | 428  | 0.158  | 0.004 |
| 16    | 37  | 299  | 0.054  | 0.000 |
| 17    | 14  | 32   | 0.000  | 0.000 |
| 18    | 27  | 475  | 0.000  | 0.000 |
| 19    | 9   | 1    | 0.000  | 0.000 |
| 20    | 1   | 0    | 0.000  | 0.000 |
| 21    | 34  | 459  | 0.000  | 0.000 |
| 22    | 2   | 0    | 0.000  | 0.000 |
+-------+-----+------+--------+-------+
| mAP   |     |      |        | 0.000 |
+-------+-----+------+--------+-------+
2022-09-22 14:57:28,776 - mmdet - INFO - Now best checkpoint is saved as best_mAP_epoch_1.pth.
2022-09-22 14:57:28,777 - mmdet - INFO - Best mAP is 0.0005 at 1 epoch.
2022-09-22 14:57:28,885 - mmdet - INFO - Exp name: car_dent_yolov3.py
2022-09-22 14:57:28,886 - mmdet - INFO - Epoch(val) [1][543] memory: 7714 MB, AP50: 0.0000, mAP: 0.0005
2022-09-22 14:57:28,942 - mmdet - WARNING - Please set `CLASSES` in the KandulaDataset and check if it is consistent with the `num_classes` of head
Killed

Additional information
No response