Closed: Sri0712 closed this issue 2 years ago.
Sorry for the late reply. The most likely cause is memory usage. It is recommended to reduce `workers_per_gpu` and `img_scale`. BTW, I saw your config uses a batch size of 2 to train YOLOv3. This is not suitable; YOLOv3 requires a large batch size to train well.
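For concreteness, below is a minimal sketch of the kind of overrides being suggested, assuming the usual MMDetection 2.x config-inheritance mechanism. The base file path and the concrete numbers are illustrative, not tuned recommendations.

```python
# Illustrative memory-oriented overrides; the values are assumptions, not tuned numbers.
_base_ = './car_dent_yolov3.py'  # hypothetical path to the config posted below

img_scale = (608, 608)  # smaller than the original (1333, 800) -> smaller tensors per image

# Same augmentations as the posted config; only the Resize scale changes.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Expand', mean=[0, 0, 0], to_rgb=True, ratio_range=(1, 2)),
    dict(type='MinIoURandomCrop',
         min_ious=(0.4, 0.5, 0.6, 0.7, 0.8, 0.9), min_crop_size=0.3),
    dict(type='Resize', img_scale=img_scale, keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(type='Normalize', mean=[0, 0, 0], std=[255.0, 255.0, 255.0], to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]

data = dict(
    # YOLOv3 is normally trained with a much larger (effective) batch than 2;
    # raise this only as far as the available memory actually allows.
    samples_per_gpu=8,
    # Fewer dataloader workers -> fewer prefetched batches held in host RAM.
    workers_per_gpu=2,
    train=dict(pipeline=train_pipeline),
)
```

The test pipeline's `MultiScaleFlipAug` `img_scale` would need the same change, and raising `samples_per_gpu` only helps if the machine has the headroom; on a memory-limited CPU-only container, keeping the batch small and accepting slower convergence is the safer trade-off.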
Prerequisite
Describe the bug
Hello, I am trying to train a YOLOv3 model on a custom dataset on a CPU-only server, and the training keeps getting killed after the first epoch without a proper traceback. Is it because the memory flag is set to 24 GB when running the Docker container? If yes, is there any way to estimate beforehand how much memory the whole training process will require, given the model and dataset? I have already opened an issue regarding this (https://github.com/open-mmlab/mmdetection/issues/8831). If not, what is the reason for this unknown issue?
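For reference, one rough way to answer the "how much memory will it need" question is to run a short dry run (a few hundred iterations) and watch the combined resident memory of the training process and its dataloader workers from a second terminal. The helper below is hypothetical (the function name, the 5-second interval, and the use of psutil are my own choices, not part of MMDetection), but the API calls are standard psutil.

```python
# Hypothetical helper (not part of MMDetection): sample the combined resident memory
# of tools/train.py and its dataloader workers during a short dry run, to estimate
# how much RAM a full training run would need. Requires `pip install psutil`.
import sys
import time

import psutil


def watch_peak_memory(pid: int, interval: float = 5.0) -> int:
    """Print and return the peak RSS (bytes) of `pid` plus all of its children."""
    proc = psutil.Process(pid)
    peak = 0
    try:
        while True:
            rss = 0
            for p in [proc] + proc.children(recursive=True):
                try:
                    rss += p.memory_info().rss
                except psutil.NoSuchProcess:
                    continue  # a dataloader worker exited between listing and sampling
            peak = max(peak, rss)
            available = psutil.virtual_memory().available
            print(f'rss={rss / 2**20:.0f} MB  peak={peak / 2**20:.0f} MB  '
                  f'available={available / 2**20:.0f} MB')
            time.sleep(interval)
    except psutil.NoSuchProcess:
        pass  # the training process itself exited (or was killed)
    print(f'peak over the dry run: {peak / 2**20:.0f} MB')
    return peak


if __name__ == '__main__':
    watch_peak_memory(int(sys.argv[1]))  # pass the PID of the running tools/train.py
```

If the process disappears with only "Killed" in the terminal (as in the log below), the Linux OOM killer is the usual suspect; `dmesg` on the host typically shows an "Out of memory: Killed process ..." entry around the time of the crash, which would confirm whether the 24 GB container limit is being hit.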
Environment
Environment variables set manually inside the container: OMP_NUM_THREADS=6 CUDA_VISIBLE_DEVICES=-1
2022-09-22 11:55:41,567 - mmdet - INFO - Environment info:
sys.platform: linux
Python: 3.7.7 (default, May 7 2020, 21:25:33) [GCC 7.3.0]
CUDA available: False
GCC: gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
PyTorch: 1.6.0
PyTorch compiling details: PyTorch built with:
TorchVision: 0.7.0
OpenCV: 4.6.0
MMCV: 1.4.4
MMCV Compiler: GCC 7.4
MMCV CUDA Compiler: not available
MMDetection: 2.20.0+
2022-09-22 11:56:14,271 - mmdet - INFO - Config: model = dict( type='YOLOV3', backbone=dict( type='Darknet', depth=53, out_indices=(3, 4, 5), init_cfg=dict(type='Pretrained', checkpoint='open-mmlab://darknet53')), neck=dict( type='YOLOV3Neck', num_scales=3, in_channels=[1024, 512, 256], out_channels=[512, 256, 128]), bbox_head=dict( type='YOLOV3Head', num_classes=23, in_channels=[512, 256, 128], out_channels=[1024, 512, 256], anchor_generator=dict( type='YOLOAnchorGenerator', base_sizes=[[(116, 90), (156, 198), (373, 326)], [(30, 61), (62, 45), (59, 119)], [(10, 13), (16, 30), (33, 23)]], strides=[32, 16, 8]), bbox_coder=dict(type='YOLOBBoxCoder'), featmap_strides=[32, 16, 8], loss_cls=dict( type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0, reduction='sum'), loss_conf=dict( type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0, reduction='sum'), loss_xy=dict( type='CrossEntropyLoss', use_sigmoid=True, loss_weight=2.0, reduction='sum'), loss_wh=dict(type='MSELoss', loss_weight=2.0, reduction='sum')), train_cfg=dict( assigner=dict( type='GridAssigner', pos_iou_thr=0.5, neg_iou_thr=0.5, min_pos_iou=0)), test_cfg=dict( nms_pre=1000, min_bbox_size=0, score_thr=0.05, conf_thr=0.005, nms=dict(type='nms', iou_threshold=0.45), max_per_img=100)) dataset_type = 'KandulaDataset' data_root = 'data/car_dent/' img_norm_cfg = dict(mean=[0, 0, 0], std=[255.0, 255.0, 255.0], to_rgb=True) train_pipeline = [ dict(type='LoadImageFromFile'), dict(type='LoadAnnotations', with_bbox=True), dict(type='Expand', mean=[0, 0, 0], to_rgb=True, ratio_range=(1, 2)), dict( type='MinIoURandomCrop', min_ious=(0.4, 0.5, 0.6, 0.7, 0.8, 0.9), min_crop_size=0.3), dict(type='Resize', img_scale=(1333, 800), keep_ratio=True), dict(type='RandomFlip', flip_ratio=0.5), dict(type='PhotoMetricDistortion'), dict( type='Normalize', mean=[0, 0, 0], std=[255.0, 255.0, 255.0], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='DefaultFormatBundle'), dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']) ] test_pipeline = [ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(1333, 800), flip=False, transforms=[ dict(type='Resize', keep_ratio=True), dict(type='RandomFlip'), dict( type='Normalize', mean=[0, 0, 0], std=[255.0, 255.0, 255.0], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='ImageToTensor', keys=['img']), dict(type='Collect', keys=['img']) ]) ] data = dict( samples_per_gpu=2, workers_per_gpu=6, train=dict( type='KandulaDataset', ann_file='data/car_dent/anno.json', img_prefix='data/car_dent/imgs/', img_filerns='data/car_dent/train.json', pipeline=[ dict(type='LoadImageFromFile'), dict(type='LoadAnnotations', with_bbox=True), dict( type='Expand', mean=[0, 0, 0], to_rgb=True, ratio_range=(1, 2)), dict( type='MinIoURandomCrop', min_ious=(0.4, 0.5, 0.6, 0.7, 0.8, 0.9), min_crop_size=0.3), dict(type='Resize', img_scale=(1333, 800), keep_ratio=True), dict(type='RandomFlip', flip_ratio=0.5), dict(type='PhotoMetricDistortion'), dict( type='Normalize', mean=[0, 0, 0], std=[255.0, 255.0, 255.0], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='DefaultFormatBundle'), dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']) ]), val=dict( type='KandulaDataset', ann_file='data/car_dent/anno.json', img_prefix='data/car_dent/imgs/', img_filerns='data/car_dent/val.json', pipeline=[ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(1333, 800), flip=False, transforms=[ dict(type='Resize', keep_ratio=True), dict(type='RandomFlip'), dict( 
type='Normalize', mean=[0, 0, 0], std=[255.0, 255.0, 255.0], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='ImageToTensor', keys=['img']), dict(type='Collect', keys=['img']) ]) ]), test=dict( type='KandulaDataset', ann_file='data/car_dent/anno.json', img_prefix='data/car_dent/imgs/', img_filerns='data/car_dent/test.json', pipeline=[ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(1333, 800), flip=False, transforms=[ dict(type='Resize', keep_ratio=True), dict(type='RandomFlip'), dict( type='Normalize', mean=[0, 0, 0], std=[255.0, 255.0, 255.0], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='ImageToTensor', keys=['img']), dict(type='Collect', keys=['img']) ]) ])) optimizer = dict(type='SGD', lr=0.001, momentum=0.9, weight_decay=0.0005) optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2)) lr_config = dict( policy='step', warmup='linear', warmup_iters=2000, warmup_ratio=0.1, step=[218, 246]) runner = dict(type='EpochBasedRunner', max_epochs=273) evaluation = dict(interval=1, metric=['mAP'], save_best='mAP') checkpoint_config = dict(interval=1) log_config = dict(interval=5, hooks=[dict(type='TextLoggerHook')]) custom_hooks = [ dict(type='NumClassCheckHook'), dict(type='MemoryProfilerHook', interval=5) ] dist_params = dict(backend='nccl') log_level = 'INFO' load_from = None resume_from = None workflow = [('train', 1)] seed = 0 gpu_ids = range(0, 1) work_dir = './work_dirs/car_dent_yolov3'
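One note on the config above, prompted by the warning repeated in the log below ("Please set CLASSES in the KandulaDataset ..."): a custom dataset in MMDetection 2.x normally declares its class names so the runner can check them against the head's `num_classes` (23 here). A minimal sketch of what the existing `KandulaDataset` definition would gain is shown below; subclassing `CocoDataset` and the placeholder names are assumptions for illustration, since the real class is not shown in this report.

```python
# Minimal sketch of declaring CLASSES on the custom dataset so MMDetection can
# check it against bbox_head.num_classes (23 in the config above). Subclassing
# CocoDataset and the placeholder class names are assumptions, not the real code.
from mmdet.datasets import CocoDataset
from mmdet.datasets.builder import DATASETS


@DATASETS.register_module()
class KandulaDataset(CocoDataset):
    # 23 names, one per category id in anno.json, matching num_classes=23 in the head.
    CLASSES = tuple(f'class_{i}' for i in range(23))
```

For datasets built on `CustomDataset`/`CocoDataset`, the names can alternatively be passed via a `classes=` entry in the data config, provided the dataset's `__init__` forwards it. The evaluation table further down printing bare class indices (0-22) instead of names is consistent with `CLASSES` not being set.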
2022-09-22 11:56:16,109 - mmdet - INFO - Start running, host: root@14c9d1e8ddb0, work_dir: /detection/mmdetection/work_dirs/car_dent_yolov3
2022-09-22 11:56:16,112 - mmdet - INFO - Hooks will be executed in the following order:
before_run: (VERY_HIGH ) StepLrUpdaterHook
(NORMAL ) CheckpointHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
before_train_epoch: (VERY_HIGH ) StepLrUpdaterHook
(NORMAL ) NumClassCheckHook
(LOW ) IterTimerHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
before_train_iter: (VERY_HIGH ) StepLrUpdaterHook
(LOW ) IterTimerHook
(LOW ) EvalHook
after_train_iter: (ABOVE_NORMAL) OptimizerHook
(NORMAL ) CheckpointHook
(NORMAL ) MemoryProfilerHook
(LOW ) IterTimerHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
after_train_epoch: (NORMAL ) CheckpointHook
(NORMAL ) MemoryProfilerHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
before_val_epoch: (NORMAL ) NumClassCheckHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
before_val_iter: (LOW ) IterTimerHook
after_val_iter: (NORMAL ) MemoryProfilerHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
after_val_epoch: (NORMAL ) MemoryProfilerHook
(VERY_LOW ) TextLoggerHook
after_run: (VERY_LOW ) TextLoggerHook
2022-09-22 11:56:16,113 - mmdet - INFO - workflow: [('train', 1)], max: 273 epochs
2022-09-22 11:56:16,114 - mmdet - INFO - Checkpoints will be saved to /detection/mmdetection/work_dirs/car_dent_yolov3 by HardDiskBackend.
2022-09-22 11:56:16,150 - mmdet - WARNING - Please set `CLASSES` in the KandulaDataset and check if it is consistent with the `num_classes` of head
2022-09-22 11:57:25,806 - mmdet - INFO - Memory information available_memory: 40880 MB, used_memory: 21945 MB, memory_utilization: 36.4 %, available_swap_memory: 203696 MB, used_swap_memory: 23188 MB, swap_memory_utilization: 10.2 %, current_process_memory: 4564 MB
2022-09-22 11:57:25,914 - mmdet - INFO - Epoch [1][5/993] lr: 1.018e-04, eta: 43 days, 17:02:55, time: 13.931, data_time: 0.540, memory: 4564 MB, loss_cls: 42.6773, loss_conf: 84019.2328, loss_xy: 7.6482, loss_wh: 5.8629, loss: 84075.4250, grad_norm: 262871.7531
... (a few lines were cut because of the issue body length limit) ...
2022-09-22 14:40:56,284 - mmdet - INFO - Memory information available_memory: 37127 MB, used_memory: 25689 MB, memory_utilization: 42.2 %, available_swap_memory: 203376 MB, used_swap_memory: 23508 MB, swap_memory_utilization: 10.4 %, current_process_memory: 7722 MB
2022-09-22 14:40:56,387 - mmdet - INFO - Epoch [1][955/993] lr: 5.293e-04, eta: 32 days, 8:18:37, time: 11.271, data_time: 0.035, memory: 7722 MB, loss_cls: 10.4817, loss_conf: 21.1211, loss_xy: 8.6036, loss_wh: 2.5599, loss: 42.7663, grad_norm: 117.3042
2022-09-22 14:41:41,901 - mmdet - INFO - Memory information available_memory: 37104 MB, used_memory: 25715 MB, memory_utilization: 42.3 %, available_swap_memory: 203376 MB, used_swap_memory: 23508 MB, swap_memory_utilization: 10.4 %, current_process_memory: 7709 MB
2022-09-22 14:41:42,005 - mmdet - INFO - Epoch [1][960/993] lr: 5.315e-04, eta: 32 days, 7:49:06, time: 9.124, data_time: 0.034, memory: 7709 MB, loss_cls: 19.8658, loss_conf: 33.0042, loss_xy: 10.9944, loss_wh: 9.6036, loss: 73.4680, grad_norm: 254.7602
2022-09-22 14:42:34,565 - mmdet - INFO - Memory information available_memory: 37106 MB, used_memory: 25712 MB, memory_utilization: 42.2 %, available_swap_memory: 203376 MB, used_swap_memory: 23508 MB, swap_memory_utilization: 10.4 %, current_process_memory: 7714 MB
2022-09-22 14:42:34,669 - mmdet - INFO - Epoch [1][965/993] lr: 5.338e-04, eta: 32 days, 7:52:45, time: 10.533, data_time: 0.033, memory: 7714 MB, loss_cls: 7.6567, loss_conf: 14.9974, loss_xy: 5.5317, loss_wh: 2.8979, loss: 31.0837, grad_norm: 117.0246
2022-09-22 14:43:21,999 - mmdet - INFO - Memory information available_memory: 37091 MB, used_memory: 25718 MB, memory_utilization: 42.3 %, available_swap_memory: 203376 MB, used_swap_memory: 23508 MB, swap_memory_utilization: 10.4 %, current_process_memory: 7708 MB
2022-09-22 14:43:22,103 - mmdet - INFO - Epoch [1][970/993] lr: 5.360e-04, eta: 32 days, 7:32:05, time: 9.487, data_time: 0.033, memory: 7708 MB, loss_cls: 9.4729, loss_conf: 20.0545, loss_xy: 7.0759, loss_wh: 2.7079, loss: 39.3112, grad_norm: 116.0406
2022-09-22 14:44:10,590 - mmdet - INFO - Memory information available_memory: 37096 MB, used_memory: 25717 MB, memory_utilization: 42.3 %, available_swap_memory: 203376 MB, used_swap_memory: 23508 MB, swap_memory_utilization: 10.4 %, current_process_memory: 7709 MB
2022-09-22 14:44:10,694 - mmdet - INFO - Epoch [1][975/993] lr: 5.383e-04, eta: 32 days, 7:16:58, time: 9.718, data_time: 0.033, memory: 7709 MB, loss_cls: 8.8870, loss_conf: 17.5904, loss_xy: 6.9740, loss_wh: 2.7824, loss: 36.2338, grad_norm: 132.9686
2022-09-22 14:45:00,243 - mmdet - INFO - Memory information available_memory: 37144 MB, used_memory: 25728 MB, memory_utilization: 42.2 %, available_swap_memory: 203376 MB, used_swap_memory: 23508 MB, swap_memory_utilization: 10.4 %, current_process_memory: 7716 MB
2022-09-22 14:45:00,347 - mmdet - INFO - Epoch [1][980/993] lr: 5.406e-04, eta: 32 days, 7:06:52, time: 9.931, data_time: 0.033, memory: 7716 MB, loss_cls: 9.5101, loss_conf: 19.6238, loss_xy: 7.0440, loss_wh: 3.4792, loss: 39.6572, grad_norm: 124.5163
2022-09-22 14:45:55,099 - mmdet - INFO - Memory information available_memory: 37262 MB, used_memory: 25719 MB, memory_utilization: 42.0 %, available_swap_memory: 203376 MB, used_swap_memory: 23508 MB, swap_memory_utilization: 10.4 %, current_process_memory: 7717 MB
2022-09-22 14:45:55,203 - mmdet - INFO - Epoch [1][985/993] lr: 5.428e-04, eta: 32 days, 7:20:38, time: 10.971, data_time: 0.034, memory: 7717 MB, loss_cls: 11.4901, loss_conf: 22.3278, loss_xy: 8.5331, loss_wh: 3.9176, loss: 46.2687, grad_norm: 132.7615
2022-09-22 14:46:47,540 - mmdet - INFO - Memory information available_memory: 37361 MB, used_memory: 25729 MB, memory_utilization: 41.9 %, available_swap_memory: 203376 MB, used_swap_memory: 23508 MB, swap_memory_utilization: 10.4 %, current_process_memory: 7714 MB
2022-09-22 14:46:47,644 - mmdet - INFO - Epoch [1][990/993] lr: 5.450e-04, eta: 32 days, 7:23:17, time: 10.488, data_time: 0.035, memory: 7714 MB, loss_cls: 10.8853, loss_conf: 20.0091, loss_xy: 7.9695, loss_wh: 3.6433, loss: 42.5073, grad_norm: 131.9860
2022-09-22 14:47:14,792 - mmdet - INFO - Saving checkpoint at 1 epochs
2022-09-22 14:57:27,539 - mmdet - INFO -
+-------+-----+------+--------+-------+
| class | gts | dets | recall | ap    |
+-------+-----+------+--------+-------+
| 0     | 1   | 0    | 0.000  | 0.000 |
| 1     | 2   | 0    | 0.000  | 0.000 |
| 2     | 81  | 510  | 0.086  | 0.002 |
| 3     | 1   | 0    | 0.000  | 0.000 |
| 4     | 1   | 0    | 0.000  | 0.000 |
| 5     | 113 | 505  | 0.018  | 0.000 |
| 6     | 73  | 460  | 0.000  | 0.000 |
| 7     | 149 | 518  | 0.047  | 0.001 |
| 8     | 22  | 16   | 0.000  | 0.000 |
| 9     | 43  | 452  | 0.000  | 0.000 |
| 10    | 2   | 0    | 0.000  | 0.000 |
| 11    | 1   | 0    | 0.000  | 0.000 |
| 12    | 7   | 0    | 0.000  | 0.000 |
| 13    | 35  | 491  | 0.114  | 0.002 |
| 14    | 81  | 484  | 0.025  | 0.000 |
| 15    | 38  | 428  | 0.158  | 0.004 |
| 16    | 37  | 299  | 0.054  | 0.000 |
| 17    | 14  | 32   | 0.000  | 0.000 |
| 18    | 27  | 475  | 0.000  | 0.000 |
| 19    | 9   | 1    | 0.000  | 0.000 |
| 20    | 1   | 0    | 0.000  | 0.000 |
| 21    | 34  | 459  | 0.000  | 0.000 |
| 22    | 2   | 0    | 0.000  | 0.000 |
+-------+-----+------+--------+-------+
| mAP   |     |      |        | 0.000 |
+-------+-----+------+--------+-------+
2022-09-22 14:57:28,776 - mmdet - INFO - Now best checkpoint is saved as best_mAP_epoch_1.pth.
2022-09-22 14:57:28,777 - mmdet - INFO - Best mAP is 0.0005 at 1 epoch.
2022-09-22 14:57:28,885 - mmdet - INFO - Exp name: car_dent_yolov3.py
2022-09-22 14:57:28,886 - mmdet - INFO - Epoch(val) [1][543] memory: 7714 MB, AP50: 0.0000, mAP: 0.0005
2022-09-22 14:57:28,942 - mmdet - WARNING - Please set `CLASSES` in the KandulaDataset and check if it is consistent with the `num_classes` of head
Killed

Additional information
No response