open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

In mmdetection 3.0, memory keeps increasing fast during the training of DETR-like object detectors, while in mmdetection 2.25.2 the memory increases slowly. #10310

Open wzmw-zr opened 1 year ago

wzmw-zr commented 1 year ago

When I train DETR-like object detectors (e.g. DETR, DINO, ...) in mmdetection 3.0, RAM usage grows quickly, so the training process is eventually killed once no free RAM is left. However, when I switch to mmdetection 2.25.2, RAM usage grows only slowly.
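For reference, the per-process RAM numbers reported in the logs below (MemoryProfilerHook output) can also be reproduced with a small `psutil` helper. This is only an illustrative sketch for tracking the growth independently of MMDetection; `log_process_memory` is not an MMDetection API:

```python
# Minimal sketch (not part of the original report) for tracking host-RAM growth
# during training. Call it every N iterations and watch whether the RSS of the
# training process keeps climbing across epochs.
import os
import psutil

def log_process_memory(tag: str = "") -> float:
    """Print and return the resident set size (RSS) of the current process in MB."""
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
    vm = psutil.virtual_memory()
    print(f"[{tag}] process RSS: {rss_mb:.0f} MB, "
          f"system used: {vm.used / 1024 ** 2:.0f} MB ({vm.percent:.1f} %)")
    return rss_mb
```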

In mmdetection 2.25.2, the RAM usage and other information during DETR training are as follows:

2023-05-10 20:10:52,595 - mmdet - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.9.12 (main, Apr  5 2022, 06:56:58) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.6, V11.6.124
GCC: gcc (Ubuntu 7.5.0-6ubuntu2) 7.5.0
PyTorch: 1.12.1
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.6
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.3.2  (built against CUDA 11.5)
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.13.1
OpenCV: 4.6.0
MMCV: 1.7.0
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.6
MMDetection: 2.25.2+9d3e162
------------------------------------------------------------

2023-05-10 20:10:55,044 - mmdet - INFO - Distributed training: True
2023-05-10 20:10:57,363 - mmdet - INFO - Config:
dataset_type = 'CocoDataset'
data_root = 'data/coco/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='AutoAugment',
        policies=[[{
            'type':
            'Resize',
            'img_scale': [(480, 1333), (512, 1333), (544, 1333), (576, 1333),
                          (608, 1333), (640, 1333), (672, 1333), (704, 1333),
                          (736, 1333), (768, 1333), (800, 1333)],
            'multiscale_mode':
            'value',
            'keep_ratio':
            True
        }],
                  [{
                      'type': 'Resize',
                      'img_scale': [(400, 1333), (500, 1333), (600, 1333)],
                      'multiscale_mode': 'value',
                      'keep_ratio': True
                  }, {
                      'type': 'RandomCrop',
                      'crop_type': 'absolute_range',
                      'crop_size': (384, 600),
                      'allow_negative_crop': True
                  }, {
                      'type':
                      'Resize',
                      'img_scale': [(480, 1333), (512, 1333), (544, 1333),
                                    (576, 1333), (608, 1333), (640, 1333),
                                    (672, 1333), (704, 1333), (736, 1333),
                                    (768, 1333), (800, 1333)],
                      'multiscale_mode':
                      'value',
                      'override':
                      True,
                      'keep_ratio':
                      True
                  }]]),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=1),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=1),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type='CocoDataset',
        ann_file='data/coco/annotations/instances_train2017.json',
        img_prefix='data/coco/train2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(
                type='AutoAugment',
                policies=[[{
                    'type':
                    'Resize',
                    'img_scale': [(480, 1333), (512, 1333), (544, 1333),
                                  (576, 1333), (608, 1333), (640, 1333),
                                  (672, 1333), (704, 1333), (736, 1333),
                                  (768, 1333), (800, 1333)],
                    'multiscale_mode':
                    'value',
                    'keep_ratio':
                    True
                }],
                          [{
                              'type': 'Resize',
                              'img_scale': [(400, 1333), (500, 1333),
                                            (600, 1333)],
                              'multiscale_mode': 'value',
                              'keep_ratio': True
                          }, {
                              'type': 'RandomCrop',
                              'crop_type': 'absolute_range',
                              'crop_size': (384, 600),
                              'allow_negative_crop': True
                          }, {
                              'type':
                              'Resize',
                              'img_scale': [(480, 1333), (512, 1333),
                                            (544, 1333), (576, 1333),
                                            (608, 1333), (640, 1333),
                                            (672, 1333), (704, 1333),
                                            (736, 1333), (768, 1333),
                                            (800, 1333)],
                              'multiscale_mode':
                              'value',
                              'override':
                              True,
                              'keep_ratio':
                              True
                          }]]),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=1),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ]),
    val=dict(
        type='CocoDataset',
        ann_file='data/coco/annotations/instances_val2017.json',
        img_prefix='data/coco/val2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=1),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='CocoDataset',
        ann_file='data/coco/annotations/instances_val2017.json',
        img_prefix='data/coco/val2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=1),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
evaluation = dict(interval=1, metric='bbox')
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
custom_hooks = [dict(type='MemoryProfilerHook', interval=50)]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
auto_scale_lr = dict(enable=False, base_batch_size=16)
model = dict(
    type='DETR',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(3, ),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=False),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    bbox_head=dict(
        type='DETRHead',
        num_classes=80,
        in_channels=2048,
        transformer=dict(
            type='Transformer',
            encoder=dict(
                type='DetrTransformerEncoder',
                num_layers=6,
                transformerlayers=dict(
                    type='BaseTransformerLayer',
                    attn_cfgs=[
                        dict(
                            type='MultiheadAttention',
                            embed_dims=256,
                            num_heads=8,
                            dropout=0.1)
                    ],
                    feedforward_channels=2048,
                    ffn_dropout=0.1,
                    operation_order=('self_attn', 'norm', 'ffn', 'norm'))),
            decoder=dict(
                type='DetrTransformerDecoder',
                return_intermediate=True,
                num_layers=6,
                transformerlayers=dict(
                    type='DetrTransformerDecoderLayer',
                    attn_cfgs=dict(
                        type='MultiheadAttention',
                        embed_dims=256,
                        num_heads=8,
                        dropout=0.1),
                    feedforward_channels=2048,
                    ffn_dropout=0.1,
                    operation_order=('self_attn', 'norm', 'cross_attn', 'norm',
                                     'ffn', 'norm')))),
        positional_encoding=dict(
            type='SinePositionalEncoding', num_feats=128, normalize=True),
        loss_cls=dict(
            type='CrossEntropyLoss',
            bg_cls_weight=0.1,
            use_sigmoid=False,
            loss_weight=1.0,
            class_weight=1.0),
        loss_bbox=dict(type='L1Loss', loss_weight=5.0),
        loss_iou=dict(type='GIoULoss', loss_weight=2.0)),
    train_cfg=dict(
        assigner=dict(
            type='HungarianAssigner',
            cls_cost=dict(type='ClassificationCost', weight=1.0),
            reg_cost=dict(type='BBoxL1Cost', weight=5.0, box_format='xywh'),
            iou_cost=dict(type='IoUCost', iou_mode='giou', weight=2.0))),
    test_cfg=dict(max_per_img=100))
optimizer = dict(
    type='AdamW',
    lr=0.0001,
    weight_decay=0.0001,
    paramwise_cfg=dict(
        custom_keys=dict(backbone=dict(lr_mult=0.1, decay_mult=1.0))))
optimizer_config = dict(grad_clip=dict(max_norm=0.1, norm_type=2))
lr_config = dict(policy='step', step=[100])
runner = dict(type='EpochBasedRunner', max_epochs=150)
work_dir = './work_dirs/detr_r50_8x2_150e_coco'
auto_resume = False
gpu_ids = range(0, 4)

2023-05-10 20:10:57,363 - mmdet - INFO - Set random seed to 0, deterministic: False
2023-05-10 20:10:57,684 - mmdet - INFO - initialize ResNet with init_cfg {'type': 'Pretrained', 'checkpoint': 'torchvision://resnet50'}
2023-05-10 20:10:57,685 - mmcv - INFO - load model from: torchvision://resnet50
2023-05-10 20:10:57,685 - mmcv - INFO - load checkpoint from torchvision path: torchvision://resnet50
2023-05-10 20:10:59,315 - mmcv - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=13.04s)
creating index...
Done (t=13.22s)
creating index...
Done (t=13.16s)
creating index...
Done (t=13.19s)
creating index...
index created!
index created!
index created!
index created!
2023-05-10 20:11:17,691 - mmdet - INFO - Automatic scaling of learning rate (LR) has been disabled.
loading annotations into memory...loading annotations into memory...

loading annotations into memory...
loading annotations into memory...
Done (t=0.37s)
creating index...
Done (t=0.37s)
creating index...
Done (t=0.38s)
creating index...
Done (t=0.39s)
creating index...
index created!
index created!
index created!
index created!
2023-05-10 20:11:18,181 - mmdet - INFO - Start running, host: zhaorui@L1806-1, work_dir: /home/zhaorui/CV-Code/corner_case_mmdetection/work_dirs/detr_r50_8x2_150e_coco
2023-05-10 20:11:18,181 - mmdet - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH   ) StepLrUpdaterHook                  
(NORMAL      ) CheckpointHook                     
(LOW         ) DistEvalHook                       
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_train_epoch:
(VERY_HIGH   ) StepLrUpdaterHook                  
(NORMAL      ) DistSamplerSeedHook                
(LOW         ) IterTimerHook                      
(LOW         ) DistEvalHook                       
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_train_iter:
(VERY_HIGH   ) StepLrUpdaterHook                  
(LOW         ) IterTimerHook                      
(LOW         ) DistEvalHook                       
 -------------------- 
after_train_iter:
(ABOVE_NORMAL) OptimizerHook                      
(NORMAL      ) CheckpointHook                     
(NORMAL      ) MemoryProfilerHook                 
(LOW         ) IterTimerHook                      
(LOW         ) DistEvalHook                       
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
after_train_epoch:
(NORMAL      ) CheckpointHook                     
(LOW         ) DistEvalHook                       
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_val_epoch:
(NORMAL      ) DistSamplerSeedHook                
(LOW         ) IterTimerHook                      
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_val_iter:
(LOW         ) IterTimerHook                      
 -------------------- 
after_val_iter:
(NORMAL      ) MemoryProfilerHook                 
(LOW         ) IterTimerHook                      
 -------------------- 
after_val_epoch:
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
after_run:
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
2023-05-10 20:11:18,181 - mmdet - INFO - workflow: [('train', 1)], max: 150 epochs
2023-05-10 20:11:18,181 - mmdet - INFO - Checkpoints will be saved to /home/zhaorui/CV-Code/corner_case_mmdetection/work_dirs/detr_r50_8x2_150e_coco by HardDiskBackend.
2023-05-10 20:11:24,608 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
2023-05-10 20:11:35,244 - mmdet - INFO - Memory information available_memory: 182824 MB, used_memory: 72203 MB, memory_utilization: 29.0 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7163 MB
2023-05-10 20:11:35,254 - mmdet - INFO - Epoch [1][50/14659]    lr: 1.000e-04, eta: 8 days, 16:24:28, time: 0.341, data_time: 0.067, memory: 4392, loss_cls: 2.1938, loss_bbox: 3.9803, loss_iou: 2.5871, d0.loss_cls: 2.2074, d0.loss_bbox: 3.9588, d0.loss_iou: 2.5466, d1.loss_cls: 2.1931, d1.loss_bbox: 3.9829, d1.loss_iou: 2.5745, d2.loss_cls: 2.1761, d2.loss_bbox: 3.9786, d2.loss_iou: 2.5937, d3.loss_cls: 2.1937, d3.loss_bbox: 3.9686, d3.loss_iou: 2.6020, d4.loss_cls: 2.1737, d4.loss_bbox: 3.9452, d4.loss_iou: 2.6070, loss: 52.4630, grad_norm: 102.9133
2023-05-10 20:11:46,068 - mmdet - INFO - Memory information available_memory: 182818 MB, used_memory: 72243 MB, memory_utilization: 29.0 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7159 MB
2023-05-10 20:11:46,072 - mmdet - INFO - Epoch [1][100/14659]   lr: 1.000e-04, eta: 7 days, 2:18:29, time: 0.216, data_time: 0.006, memory: 4392, loss_cls: 1.9114, loss_bbox: 3.1279, loss_iou: 2.3006, d0.loss_cls: 1.9293, d0.loss_bbox: 3.0290, d0.loss_iou: 2.1662, d1.loss_cls: 1.9299, d1.loss_bbox: 3.0307, d1.loss_iou: 2.1783, d2.loss_cls: 1.9355, d2.loss_bbox: 3.0487, d2.loss_iou: 2.2081, d3.loss_cls: 1.9106, d3.loss_bbox: 3.1001, d3.loss_iou: 2.2756, d4.loss_cls: 1.9057, d4.loss_bbox: 3.1560, d4.loss_iou: 2.3314, loss: 43.4750, grad_norm: 152.1985
2023-05-10 20:11:56,751 - mmdet - INFO - Memory information available_memory: 182715 MB, used_memory: 72352 MB, memory_utilization: 29.1 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7158 MB
2023-05-10 20:11:56,762 - mmdet - INFO - Epoch [1][150/14659]   lr: 1.000e-04, eta: 6 days, 13:01:41, time: 0.214, data_time: 0.006, memory: 4830, loss_cls: 2.0447, loss_bbox: 2.4544, loss_iou: 2.1047, d0.loss_cls: 2.0233, d0.loss_bbox: 2.4988, d0.loss_iou: 2.0763, d1.loss_cls: 2.0397, d1.loss_bbox: 2.4348, d1.loss_iou: 2.0512, d2.loss_cls: 2.0682, d2.loss_bbox: 2.4394, d2.loss_iou: 2.0628, d3.loss_cls: 2.0695, d3.loss_bbox: 2.4528, d3.loss_iou: 2.0829, d4.loss_cls: 2.0511, d4.loss_bbox: 2.4545, d4.loss_iou: 2.0862, loss: 39.4952, grad_norm: 259.5170
2023-05-10 20:12:07,251 - mmdet - INFO - Memory information available_memory: 182677 MB, used_memory: 72390 MB, memory_utilization: 29.1 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7158 MB
2023-05-10 20:12:07,263 - mmdet - INFO - Epoch [1][200/14659]   lr: 1.000e-04, eta: 6 days, 5:49:50, time: 0.210, data_time: 0.006, memory: 4830, loss_cls: 1.9450, loss_bbox: 2.2313, loss_iou: 1.8283, d0.loss_cls: 1.9235, d0.loss_bbox: 2.2384, d0.loss_iou: 1.8772, d1.loss_cls: 1.9414, d1.loss_bbox: 2.1388, d1.loss_iou: 1.8145, d2.loss_cls: 1.9658, d2.loss_bbox: 2.1521, d2.loss_iou: 1.7916, d3.loss_cls: 1.9551, d3.loss_bbox: 2.1625, d3.loss_iou: 1.7988, d4.loss_cls: 1.9459, d4.loss_bbox: 2.2229, d4.loss_iou: 1.8349, loss: 35.7679, grad_norm: 323.9576
2023-05-10 20:12:18,064 - mmdet - INFO - Memory information available_memory: 182642 MB, used_memory: 72402 MB, memory_utilization: 29.1 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7158 MB
2023-05-10 20:12:18,075 - mmdet - INFO - Epoch [1][250/14659]   lr: 1.000e-04, eta: 6 days, 2:16:32, time: 0.216, data_time: 0.006, memory: 4830, loss_cls: 2.0603, loss_bbox: 1.8460, loss_iou: 1.7571, d0.loss_cls: 2.0576, d0.loss_bbox: 1.7854, d0.loss_iou: 1.8114, d1.loss_cls: 2.0570, d1.loss_bbox: 1.7105, d1.loss_iou: 1.7802, d2.loss_cls: 2.0805, d2.loss_bbox: 1.7049, d2.loss_iou: 1.7625, d3.loss_cls: 2.0608, d3.loss_bbox: 1.7246, d3.loss_iou: 1.7406, d4.loss_cls: 2.0551, d4.loss_bbox: 1.7979, d4.loss_iou: 1.7672, loss: 33.5596, grad_norm: 339.6177
2023-05-10 20:12:28,839 - mmdet - INFO - Memory information available_memory: 182488 MB, used_memory: 72579 MB, memory_utilization: 29.1 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7161 MB
2023-05-10 20:12:28,850 - mmdet - INFO - Epoch [1][300/14659]   lr: 1.000e-04, eta: 5 days, 23:49:38, time: 0.215, data_time: 0.006, memory: 4830, loss_cls: 1.9402, loss_bbox: 1.5102, loss_iou: 1.7017, d0.loss_cls: 1.9407, d0.loss_bbox: 1.6412, d0.loss_iou: 1.7632, d1.loss_cls: 1.9474, d1.loss_bbox: 1.5363, d1.loss_iou: 1.7276, d2.loss_cls: 1.9236, d2.loss_bbox: 1.5126, d2.loss_iou: 1.7041, d3.loss_cls: 1.9230, d3.loss_bbox: 1.4893, d3.loss_iou: 1.6908, d4.loss_cls: 1.9429, d4.loss_bbox: 1.5004, d4.loss_iou: 1.6998, loss: 31.0952, grad_norm: 336.9932
2023-05-10 20:12:39,560 - mmdet - INFO - Memory information available_memory: 182534 MB, used_memory: 72560 MB, memory_utilization: 29.1 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7159 MB
2023-05-10 20:12:39,571 - mmdet - INFO - Epoch [1][350/14659]   lr: 1.000e-04, eta: 5 days, 21:58:59, time: 0.214, data_time: 0.006, memory: 4830, loss_cls: 1.9630, loss_bbox: 1.5236, loss_iou: 1.7866, d0.loss_cls: 1.9710, d0.loss_bbox: 1.5336, d0.loss_iou: 1.7642, d1.loss_cls: 1.9666, d1.loss_bbox: 1.4777, d1.loss_iou: 1.7624, d2.loss_cls: 1.9678, d2.loss_bbox: 1.5017, d2.loss_iou: 1.7573, d3.loss_cls: 1.9619, d3.loss_bbox: 1.5048, d3.loss_iou: 1.7810, d4.loss_cls: 1.9643, d4.loss_bbox: 1.4852, d4.loss_iou: 1.7721, loss: 31.4447, grad_norm: 262.3158
2023-05-10 20:12:50,185 - mmdet - INFO - Memory information available_memory: 182338 MB, used_memory: 72678 MB, memory_utilization: 29.2 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7159 MB
2023-05-10 20:12:50,196 - mmdet - INFO - Epoch [1][400/14659]   lr: 1.000e-04, eta: 5 days, 20:27:14, time: 0.213, data_time: 0.006, memory: 4830, loss_cls: 1.8691, loss_bbox: 1.4978, loss_iou: 1.7108, d0.loss_cls: 1.8743, d0.loss_bbox: 1.5374, d0.loss_iou: 1.7244, d1.loss_cls: 1.8739, d1.loss_bbox: 1.4835, d1.loss_iou: 1.6921, d2.loss_cls: 1.8913, d2.loss_bbox: 1.4706, d2.loss_iou: 1.6746, d3.loss_cls: 1.8762, d3.loss_bbox: 1.4604, d3.loss_iou: 1.6962, d4.loss_cls: 1.8708, d4.loss_bbox: 1.4814, d4.loss_iou: 1.6969, loss: 30.3819, grad_norm: 243.9912
2023-05-10 20:13:00,927 - mmdet - INFO - Memory information available_memory: 182314 MB, used_memory: 72738 MB, memory_utilization: 29.2 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7160 MB
2023-05-10 20:13:00,938 - mmdet - INFO - Epoch [1][450/14659]   lr: 1.000e-04, eta: 5 days, 19:25:18, time: 0.215, data_time: 0.006, memory: 4830, loss_cls: 1.8927, loss_bbox: 1.3934, loss_iou: 1.7035, d0.loss_cls: 1.9322, d0.loss_bbox: 1.4574, d0.loss_iou: 1.6973, d1.loss_cls: 1.9193, d1.loss_bbox: 1.3880, d1.loss_iou: 1.6914, d2.loss_cls: 1.9064, d2.loss_bbox: 1.3733, d2.loss_iou: 1.6775, d3.loss_cls: 1.9074, d3.loss_bbox: 1.3528, d3.loss_iou: 1.6721, d4.loss_cls: 1.9046, d4.loss_bbox: 1.3834, d4.loss_iou: 1.6851, loss: 29.9380, grad_norm: 233.0428
2023-05-10 20:13:11,514 - mmdet - INFO - Memory information available_memory: 182325 MB, used_memory: 72722 MB, memory_utilization: 29.2 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7166 MB
2023-05-10 20:13:11,531 - mmdet - INFO - Epoch [1][500/14659]   lr: 1.000e-04, eta: 5 days, 18:24:22, time: 0.212, data_time: 0.006, memory: 4830, loss_cls: 1.8838, loss_bbox: 1.3841, loss_iou: 1.7395, d0.loss_cls: 1.9165, d0.loss_bbox: 1.4486, d0.loss_iou: 1.7354, d1.loss_cls: 1.8909, d1.loss_bbox: 1.3882, d1.loss_iou: 1.7304, d2.loss_cls: 1.8895, d2.loss_bbox: 1.3480, d2.loss_iou: 1.7022, d3.loss_cls: 1.8834, d3.loss_bbox: 1.3319, d3.loss_iou: 1.6993, d4.loss_cls: 1.8815, d4.loss_bbox: 1.3598, d4.loss_iou: 1.7244, loss: 29.9373, grad_norm: 220.6434
2023-05-10 20:13:22,229 - mmdet - INFO - Memory information available_memory: 182335 MB, used_memory: 72731 MB, memory_utilization: 29.2 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7168 MB
2023-05-10 20:13:22,239 - mmdet - INFO - Epoch [1][550/14659]   lr: 1.000e-04, eta: 5 days, 17:43:01, time: 0.214, data_time: 0.007, memory: 4830, loss_cls: 1.9474, loss_bbox: 1.4369, loss_iou: 1.7117, d0.loss_cls: 1.9816, d0.loss_bbox: 1.5054, d0.loss_iou: 1.7683, d1.loss_cls: 1.9604, d1.loss_bbox: 1.4625, d1.loss_iou: 1.7264, d2.loss_cls: 1.9607, d2.loss_bbox: 1.4035, d2.loss_iou: 1.6959, d3.loss_cls: 1.9442, d3.loss_bbox: 1.4316, d3.loss_iou: 1.6960, d4.loss_cls: 1.9440, d4.loss_bbox: 1.4106, d4.loss_iou: 1.6962, loss: 30.6832, grad_norm: 191.9449
2023-05-10 20:13:32,960 - mmdet - INFO - Memory information available_memory: 182194 MB, used_memory: 72868 MB, memory_utilization: 29.3 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7169 MB
2023-05-10 20:13:32,971 - mmdet - INFO - Epoch [1][600/14659]   lr: 1.000e-04, eta: 5 days, 17:09:29, time: 0.215, data_time: 0.006, memory: 4830, loss_cls: 1.8308, loss_bbox: 1.3737, loss_iou: 1.6855, d0.loss_cls: 1.8474, d0.loss_bbox: 1.3880, d0.loss_iou: 1.6779, d1.loss_cls: 1.8536, d1.loss_bbox: 1.3698, d1.loss_iou: 1.6720, d2.loss_cls: 1.8556, d2.loss_bbox: 1.3603, d2.loss_iou: 1.6834, d3.loss_cls: 1.8424, d3.loss_bbox: 1.3493, d3.loss_iou: 1.6917, d4.loss_cls: 1.8346, d4.loss_bbox: 1.3686, d4.loss_iou: 1.7034, loss: 29.3879, grad_norm: 189.7734
2023-05-10 20:13:43,612 - mmdet - INFO - Memory information available_memory: 182169 MB, used_memory: 72878 MB, memory_utilization: 29.3 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7160 MB
2023-05-10 20:13:43,627 - mmdet - INFO - Epoch [1][650/14659]   lr: 1.000e-04, eta: 5 days, 16:36:41, time: 0.213, data_time: 0.006, memory: 4830, loss_cls: 1.8788, loss_bbox: 1.4147, loss_iou: 1.7250, d0.loss_cls: 1.8995, d0.loss_bbox: 1.4720, d0.loss_iou: 1.7449, d1.loss_cls: 1.8870, d1.loss_bbox: 1.4685, d1.loss_iou: 1.7655, d2.loss_cls: 1.9009, d2.loss_bbox: 1.4162, d2.loss_iou: 1.7115, d3.loss_cls: 1.8853, d3.loss_bbox: 1.3990, d3.loss_iou: 1.7044, d4.loss_cls: 1.8818, d4.loss_bbox: 1.3866, d4.loss_iou: 1.7156, loss: 30.2570, grad_norm: 180.2028

In mmdetection 3.0.0, the RAM usage and other information during DETR training are as follows:

05/10 20:31:25 - mmengine - INFO - 
------------------------------------------------------------
System environment:
    sys.platform: linux
    Python: 3.9.16 (main, Mar  8 2023, 14:00:05) [GCC 11.2.0]
    CUDA available: True
    numpy_random_seed: 1656719191
    GPU 0,1,2,3: NVIDIA GeForce RTX 3090
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 11.6, V11.6.124
    GCC: gcc (Ubuntu 7.5.0-6ubuntu2) 7.5.0
    PyTorch: 2.0.0.post200
    PyTorch compiling details: PyTorch built with:
  - GCC 10.4
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.8
  - Built with CUDA Runtime 11.2
  - NVCC architecture flags: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_86,code=compute_86
  - CuDNN 8.4.1  (built against CUDA 11.6)
  - Magma 2.7.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.2, CUDNN_VERSION=8.4.1, CXX_COMPILER=/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1680527322149/_build_env/bin/x86_64-conda-linux-gnu-c++, CXX_FLAGS=-std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/conda/feedstock_root/build_artifacts/pytorch-recipe_1680527322149/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1680527322149/work=/usr/local/src/conda/pytorch-2.0.0 -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1680527322149/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh=/usr/local/src/conda-prefix -isystem /usr/local/cuda/include -Wno-deprecated-declarations -D_GLIBCXX_USE_CXX11_ABI=1 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.0, USE_CUDA=1, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

    TorchVision: 0.13.1a0
    OpenCV: 4.7.0
    MMEngine: 0.7.0

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: None
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 4
------------------------------------------------------------

05/10 20:31:27 - mmengine - INFO - Config:
dataset_type = 'CocoDataset'
data_root = 'data/coco/'
backend_args = None
train_pipeline = [
    dict(type='LoadImageFromFile', backend_args=None),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='RandomFlip', prob=0.5),
    dict(
        type='RandomChoice',
        transforms=[[{
            'type':
            'RandomChoiceResize',
            'scales': [(480, 1333), (512, 1333), (544, 1333), (576, 1333),
                       (608, 1333), (640, 1333), (672, 1333), (704, 1333),
                       (736, 1333), (768, 1333), (800, 1333)],
            'keep_ratio':
            True
        }],
                    [{
                        'type': 'RandomChoiceResize',
                        'scales': [(400, 1333), (500, 1333), (600, 1333)],
                        'keep_ratio': True
                    }, {
                        'type': 'RandomCrop',
                        'crop_type': 'absolute_range',
                        'crop_size': (384, 600),
                        'allow_negative_crop': True
                    }, {
                        'type':
                        'RandomChoiceResize',
                        'scales': [(480, 1333), (512, 1333), (544, 1333),
                                   (576, 1333), (608, 1333), (640, 1333),
                                   (672, 1333), (704, 1333), (736, 1333),
                                   (768, 1333), (800, 1333)],
                        'keep_ratio':
                        True
                    }]]),
    dict(type='PackDetInputs')
]
test_pipeline = [
    dict(type='LoadImageFromFile', backend_args=None),
    dict(type='Resize', scale=(1333, 800), keep_ratio=True),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(
        type='PackDetInputs',
        meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                   'scale_factor'))
]
train_dataloader = dict(
    batch_size=2,
    num_workers=0,
    persistent_workers=False,
    sampler=dict(type='DefaultSampler', shuffle=True),
    batch_sampler=dict(type='AspectRatioBatchSampler'),
    dataset=dict(
        type='CocoDataset',
        data_root='data/coco/',
        ann_file='annotations/instances_train2017.json',
        data_prefix=dict(img='train2017/'),
        filter_cfg=dict(filter_empty_gt=True, min_size=32),
        pipeline=[
            dict(type='LoadImageFromFile', backend_args=None),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(type='RandomFlip', prob=0.5),
            dict(
                type='RandomChoice',
                transforms=[[{
                    'type':
                    'RandomChoiceResize',
                    'scales': [(480, 1333), (512, 1333), (544, 1333),
                               (576, 1333), (608, 1333), (640, 1333),
                               (672, 1333), (704, 1333), (736, 1333),
                               (768, 1333), (800, 1333)],
                    'keep_ratio':
                    True
                }],
                            [{
                                'type': 'RandomChoiceResize',
                                'scales': [(400, 1333), (500, 1333),
                                           (600, 1333)],
                                'keep_ratio': True
                            }, {
                                'type': 'RandomCrop',
                                'crop_type': 'absolute_range',
                                'crop_size': (384, 600),
                                'allow_negative_crop': True
                            }, {
                                'type':
                                'RandomChoiceResize',
                                'scales':
                                [(480, 1333), (512, 1333), (544, 1333),
                                 (576, 1333), (608, 1333), (640, 1333),
                                 (672, 1333), (704, 1333), (736, 1333),
                                 (768, 1333), (800, 1333)],
                                'keep_ratio':
                                True
                            }]]),
            dict(type='PackDetInputs')
        ],
        backend_args=None))
val_dataloader = dict(
    batch_size=1,
    num_workers=0,
    persistent_workers=False,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type='CocoDataset',
        data_root='data/coco/',
        ann_file='annotations/instances_val2017.json',
        data_prefix=dict(img='val2017/'),
        test_mode=True,
        pipeline=[
            dict(type='LoadImageFromFile', backend_args=None),
            dict(type='Resize', scale=(1333, 800), keep_ratio=True),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(
                type='PackDetInputs',
                meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                           'scale_factor'))
        ],
        backend_args=None))
test_dataloader = dict(
    batch_size=1,
    num_workers=0,
    persistent_workers=False,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type='CocoDataset',
        data_root='data/coco/',
        ann_file='annotations/instances_val2017.json',
        data_prefix=dict(img='val2017/'),
        test_mode=True,
        pipeline=[
            dict(type='LoadImageFromFile', backend_args=None),
            dict(type='Resize', scale=(1333, 800), keep_ratio=True),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(
                type='PackDetInputs',
                meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                           'scale_factor'))
        ],
        backend_args=None))
val_evaluator = dict(
    type='CocoMetric',
    ann_file='data/coco/annotations/instances_val2017.json',
    metric='bbox',
    format_only=False,
    backend_args=None)
test_evaluator = dict(
    type='CocoMetric',
    ann_file='data/coco/annotations/instances_val2017.json',
    metric='bbox',
    format_only=False,
    backend_args=None)
default_scope = 'mmdet'
default_hooks = dict(
    timer=dict(type='IterTimerHook'),
    logger=dict(type='LoggerHook', interval=50),
    param_scheduler=dict(type='ParamSchedulerHook'),
    checkpoint=dict(type='CheckpointHook', interval=1),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    visualization=dict(type='DetVisualizationHook'))
env_cfg = dict(
    cudnn_benchmark=False,
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    dist_cfg=dict(backend='nccl'))
vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
    type='DetLocalVisualizer',
    vis_backends=[dict(type='LocalVisBackend')],
    name='visualizer')
log_processor = dict(type='LogProcessor', window_size=50, by_epoch=True)
log_level = 'INFO'
load_from = None
resume = False
model = dict(
    type='DETR',
    num_queries=100,
    data_preprocessor=dict(
        type='DetDataPreprocessor',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        bgr_to_rgb=True,
        pad_size_divisor=1),
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(3, ),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=False),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='ChannelMapper',
        in_channels=[2048],
        kernel_size=1,
        out_channels=256,
        act_cfg=None,
        norm_cfg=None,
        num_outs=1),
    encoder=dict(
        num_layers=6,
        layer_cfg=dict(
            self_attn_cfg=dict(
                embed_dims=256, num_heads=8, dropout=0.1, batch_first=True),
            ffn_cfg=dict(
                embed_dims=256,
                feedforward_channels=2048,
                num_fcs=2,
                ffn_drop=0.1,
                act_cfg=dict(type='ReLU', inplace=True)))),
    decoder=dict(
        num_layers=6,
        layer_cfg=dict(
            self_attn_cfg=dict(
                embed_dims=256, num_heads=8, dropout=0.1, batch_first=True),
            cross_attn_cfg=dict(
                embed_dims=256, num_heads=8, dropout=0.1, batch_first=True),
            ffn_cfg=dict(
                embed_dims=256,
                feedforward_channels=2048,
                num_fcs=2,
                ffn_drop=0.1,
                act_cfg=dict(type='ReLU', inplace=True))),
        return_intermediate=True),
    positional_encoding=dict(num_feats=128, normalize=True),
    bbox_head=dict(
        type='DETRHead',
        num_classes=80,
        embed_dims=256,
        loss_cls=dict(
            type='CrossEntropyLoss',
            bg_cls_weight=0.1,
            use_sigmoid=False,
            loss_weight=1.0,
            class_weight=1.0),
        loss_bbox=dict(type='L1Loss', loss_weight=5.0),
        loss_iou=dict(type='GIoULoss', loss_weight=2.0)),
    train_cfg=dict(
        assigner=dict(
            type='HungarianAssigner',
            match_costs=[
                dict(type='ClassificationCost', weight=1.0),
                dict(type='BBoxL1Cost', weight=5.0, box_format='xywh'),
                dict(type='IoUCost', iou_mode='giou', weight=2.0)
            ])),
    test_cfg=dict(max_per_img=100))
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=0.0001, weight_decay=0.0001),
    clip_grad=dict(max_norm=0.1, norm_type=2),
    paramwise_cfg=dict(
        custom_keys=dict(backbone=dict(lr_mult=0.1, decay_mult=1.0))))
max_epochs = 150
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=150, val_interval=1)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')
param_scheduler = [
    dict(
        type='MultiStepLR',
        begin=0,
        end=150,
        by_epoch=True,
        milestones=[100],
        gamma=0.1)
]
auto_scale_lr = dict(base_batch_size=16)
custom_hooks = [dict(type='MemoryProfilerHook', interval=50)]
launcher = 'pytorch'
work_dir = './work_dirs/detr_r50_8xb2-150e_coco'

05/10 20:31:29 - mmengine - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH   ) RuntimeInfoHook                    
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
before_train:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_train_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(NORMAL      ) DistSamplerSeedHook                
 -------------------- 
before_train_iter:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_train_iter:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(NORMAL      ) MemoryProfilerHook                 
(BELOW_NORMAL) LoggerHook                         
(LOW         ) ParamSchedulerHook                 
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
after_train_epoch:
(NORMAL      ) IterTimerHook                      
(LOW         ) ParamSchedulerHook                 
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_val_epoch:
(NORMAL      ) IterTimerHook                      
 -------------------- 
before_val_iter:
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_val_iter:
(NORMAL      ) IterTimerHook                      
(NORMAL      ) DetVisualizationHook               
(NORMAL      ) MemoryProfilerHook                 
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
after_val_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) LoggerHook                         
(LOW         ) ParamSchedulerHook                 
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_test_epoch:
(NORMAL      ) IterTimerHook                      
 -------------------- 
before_test_iter:
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_test_iter:
(NORMAL      ) IterTimerHook                      
(NORMAL      ) DetVisualizationHook               
(NORMAL      ) MemoryProfilerHook                 
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
after_test_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
after_run:
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
loading annotations into memory...
loading annotations into memory...loading annotations into memory...

loading annotations into memory...
Done (t=13.93s)
creating index...
Done (t=14.04s)
creating index...
Done (t=14.10s)
creating index...
index created!
index created!
index created!
Done (t=14.40s)
creating index...
index created!
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=0.51s)
creating index...
index created!
Done (t=0.51s)
creating index...
index created!
Done (t=0.55s)
creating index...
index created!
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=0.53s)
creating index...
index created!
Done (t=0.53s)
creating index...
index created!
Done (t=0.56s)
creating index...
index created!
loading annotations into memory...
Done (t=0.50s)
creating index...
index created!
loading annotations into memory...
Done (t=0.50s)
creating index...
index created!
05/10 20:32:06 - mmengine - INFO - load model from: torchvision://resnet50
05/10 20:32:06 - mmengine - INFO - Loads checkpoint by torchvision backend from path: torchvision://resnet50
05/10 20:32:06 - mmengine - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

05/10 20:32:06 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
05/10 20:32:06 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
05/10 20:32:06 - mmengine - INFO - Checkpoints will be saved to /home/zhaorui/CV-Code/mmdetection/work_dirs/detr_r50_8xb2-150e_coco.
05/10 20:32:22 - mmengine - INFO - Memory information available_memory: 180559 MB, used_memory: 74663 MB, memory_utilization: 29.9 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 7954 MB
05/10 20:32:22 - mmengine - INFO - Epoch(train)   [1][   50/14659]  lr: 1.0000e-04  eta: 8 days, 3:16:02  time: 0.3197  data_time: 0.0194  memory: 3969  grad_norm: 103.5423  loss: 54.9896  loss_cls: 2.1978  loss_bbox: 4.1494  loss_iou: 2.8131  d0.loss_cls: 2.2200  d0.loss_bbox: 4.1326  d0.loss_iou: 2.8068  d1.loss_cls: 2.2382  d1.loss_bbox: 4.1817  d1.loss_iou: 2.7990  d2.loss_cls: 2.2061  d2.loss_bbox: 4.1539  d2.loss_iou: 2.8136  d3.loss_cls: 2.2067  d3.loss_bbox: 4.1143  d3.loss_iou: 2.8101  d4.loss_cls: 2.1981  d4.loss_bbox: 4.1250  d4.loss_iou: 2.8234
05/10 20:32:35 - mmengine - INFO - Memory information available_memory: 179166 MB, used_memory: 76056 MB, memory_utilization: 30.4 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 8327 MB
05/10 20:32:35 - mmengine - INFO - Epoch(train)   [1][  100/14659]  lr: 1.0000e-04  eta: 7 days, 8:54:15  time: 0.2596  data_time: 0.0212  memory: 3839  grad_norm: 220.0322  loss: 43.2984  loss_cls: 1.9355  loss_bbox: 2.9958  loss_iou: 2.3072  d0.loss_cls: 1.9224  d0.loss_bbox: 2.9365  d0.loss_iou: 2.2635  d1.loss_cls: 1.9287  d1.loss_bbox: 2.9846  d1.loss_iou: 2.2853  d2.loss_cls: 1.9232  d2.loss_bbox: 3.0551  d2.loss_iou: 2.3758  d3.loss_cls: 1.9075  d3.loss_bbox: 2.9419  d3.loss_iou: 2.2876  d4.loss_cls: 1.9244  d4.loss_bbox: 2.9972  d4.loss_iou: 2.3263
05/10 20:32:48 - mmengine - INFO - Memory information available_memory: 177814 MB, used_memory: 77409 MB, memory_utilization: 31.0 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 8696 MB
05/10 20:32:48 - mmengine - INFO - Epoch(train)   [1][  150/14659]  lr: 1.0000e-04  eta: 7 days, 2:31:53  time: 0.2584  data_time: 0.0214  memory: 3699  grad_norm: 504.9538  loss: 34.0111  loss_cls: 2.0356  loss_bbox: 2.0061  loss_iou: 1.6414  d0.loss_cls: 2.0125  d0.loss_bbox: 2.0345  d0.loss_iou: 1.6714  d1.loss_cls: 2.0591  d1.loss_bbox: 1.9648  d1.loss_iou: 1.6293  d2.loss_cls: 2.0408  d2.loss_bbox: 1.9693  d2.loss_iou: 1.6279  d3.loss_cls: 2.0374  d3.loss_bbox: 1.9823  d3.loss_iou: 1.6434  d4.loss_cls: 2.0304  d4.loss_bbox: 1.9886  d4.loss_iou: 1.6364
05/10 20:33:01 - mmengine - INFO - Memory information available_memory: 176527 MB, used_memory: 78695 MB, memory_utilization: 31.5 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 9010 MB
05/10 20:33:01 - mmengine - INFO - Epoch(train)   [1][  200/14659]  lr: 1.0000e-04  eta: 6 days, 23:42:32  time: 0.2608  data_time: 0.0210  memory: 3932  grad_norm: 452.9922  loss: 30.2156  loss_cls: 1.9281  loss_bbox: 1.6330  loss_iou: 1.5132  d0.loss_cls: 1.9302  d0.loss_bbox: 1.6076  d0.loss_iou: 1.5465  d1.loss_cls: 1.9314  d1.loss_bbox: 1.5242  d1.loss_iou: 1.5450  d2.loss_cls: 1.9316  d2.loss_bbox: 1.5356  d2.loss_iou: 1.5310  d3.loss_cls: 1.9375  d3.loss_bbox: 1.5483  d3.loss_iou: 1.5167  d4.loss_cls: 1.9255  d4.loss_bbox: 1.5909  d4.loss_iou: 1.5394
05/10 20:33:14 - mmengine - INFO - Memory information available_memory: 175384 MB, used_memory: 79839 MB, memory_utilization: 31.9 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 9274 MB
05/10 20:33:14 - mmengine - INFO - Epoch(train)   [1][  250/14659]  lr: 1.0000e-04  eta: 6 days, 21:07:34  time: 0.2535  data_time: 0.0206  memory: 3988  grad_norm: 400.0951  loss: 28.9441  loss_cls: 1.8622  loss_bbox: 1.4915  loss_iou: 1.4709  d0.loss_cls: 1.8810  d0.loss_bbox: 1.5435  d0.loss_iou: 1.5183  d1.loss_cls: 1.8855  d1.loss_bbox: 1.4720  d1.loss_iou: 1.4749  d2.loss_cls: 1.8688  d2.loss_bbox: 1.4429  d2.loss_iou: 1.4723  d3.loss_cls: 1.8796  d3.loss_bbox: 1.4478  d3.loss_iou: 1.4748  d4.loss_cls: 1.8615  d4.loss_bbox: 1.4564  d4.loss_iou: 1.4403
05/10 20:33:26 - mmengine - INFO - Memory information available_memory: 174363 MB, used_memory: 80859 MB, memory_utilization: 32.3 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 9527 MB
05/10 20:33:26 - mmengine - INFO - Epoch(train)   [1][  300/14659]  lr: 1.0000e-04  eta: 6 days, 18:58:01  time: 0.2492  data_time: 0.0205  memory: 3686  grad_norm: 338.3264  loss: 29.3254  loss_cls: 1.8864  loss_bbox: 1.4218  loss_iou: 1.6279  d0.loss_cls: 1.9069  d0.loss_bbox: 1.4351  d0.loss_iou: 1.5304  d1.loss_cls: 1.9050  d1.loss_bbox: 1.3700  d1.loss_iou: 1.5609  d2.loss_cls: 1.8926  d2.loss_bbox: 1.4070  d2.loss_iou: 1.6077  d3.loss_cls: 1.9016  d3.loss_bbox: 1.3791  d3.loss_iou: 1.6020  d4.loss_cls: 1.8913  d4.loss_bbox: 1.4098  d4.loss_iou: 1.5897
05/10 20:33:39 - mmengine - INFO - Memory information available_memory: 173251 MB, used_memory: 81971 MB, memory_utilization: 32.7 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 9795 MB
05/10 20:33:39 - mmengine - INFO - Epoch(train)   [1][  350/14659]  lr: 1.0000e-04  eta: 6 days, 17:44:55  time: 0.2529  data_time: 0.0215  memory: 3699  grad_norm: 339.1143  loss: 34.1119  loss_cls: 1.9958  loss_bbox: 1.5808  loss_iou: 1.9624  d0.loss_cls: 2.0374  d0.loss_bbox: 1.7025  d0.loss_iou: 2.0307  d1.loss_cls: 2.0513  d1.loss_bbox: 1.6896  d1.loss_iou: 2.0143  d2.loss_cls: 2.0113  d2.loss_bbox: 1.6945  d2.loss_iou: 2.0548  d3.loss_cls: 2.0207  d3.loss_bbox: 1.6315  d3.loss_iou: 2.0241  d4.loss_cls: 2.0149  d4.loss_bbox: 1.6224  d4.loss_iou: 1.9731
05/10 20:33:51 - mmengine - INFO - Memory information available_memory: 172380 MB, used_memory: 82842 MB, memory_utilization: 33.1 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 10014 MB
05/10 20:33:51 - mmengine - INFO - Epoch(train)   [1][  400/14659]  lr: 1.0000e-04  eta: 6 days, 16:20:14  time: 0.2464  data_time: 0.0211  memory: 3687  grad_norm: 275.3709  loss: 32.7396  loss_cls: 1.9504  loss_bbox: 1.6224  loss_iou: 1.8994  d0.loss_cls: 1.9714  d0.loss_bbox: 1.6036  d0.loss_iou: 1.9065  d1.loss_cls: 1.9478  d1.loss_bbox: 1.6063  d1.loss_iou: 1.8971  d2.loss_cls: 1.9491  d2.loss_bbox: 1.5973  d2.loss_iou: 1.8940  d3.loss_cls: 1.9593  d3.loss_bbox: 1.5845  d3.loss_iou: 1.9143  d4.loss_cls: 1.9638  d4.loss_bbox: 1.5691  d4.loss_iou: 1.9033
05/10 20:34:04 - mmengine - INFO - Memory information available_memory: 171473 MB, used_memory: 83749 MB, memory_utilization: 33.4 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 10255 MB
05/10 20:34:04 - mmengine - INFO - Epoch(train)   [1][  450/14659]  lr: 1.0000e-04  eta: 6 days, 15:17:40  time: 0.2472  data_time: 0.0215  memory: 3699  grad_norm: 227.7132  loss: 28.8479  loss_cls: 1.7894  loss_bbox: 1.4295  loss_iou: 1.6153  d0.loss_cls: 1.7946  d0.loss_bbox: 1.5115  d0.loss_iou: 1.6341  d1.loss_cls: 1.7866  d1.loss_bbox: 1.3678  d1.loss_iou: 1.5725  d2.loss_cls: 1.7963  d2.loss_bbox: 1.3832  d2.loss_iou: 1.5834  d3.loss_cls: 1.8030  d3.loss_bbox: 1.4149  d3.loss_iou: 1.5845  d4.loss_cls: 1.8074  d4.loss_bbox: 1.3846  d4.loss_iou: 1.5894
05/10 20:34:16 - mmengine - INFO - Memory information available_memory: 170569 MB, used_memory: 84653 MB, memory_utilization: 33.8 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 10492 MB
05/10 20:34:16 - mmengine - INFO - Epoch(train)   [1][  500/14659]  lr: 1.0000e-04  eta: 6 days, 14:26:44  time: 0.2470  data_time: 0.0212  memory: 3543  grad_norm: 230.8249  loss: 29.3416  loss_cls: 1.9113  loss_bbox: 1.4269  loss_iou: 1.5849  d0.loss_cls: 1.9161  d0.loss_bbox: 1.4668  d0.loss_iou: 1.5625  d1.loss_cls: 1.9111  d1.loss_bbox: 1.3832  d1.loss_iou: 1.5550  d2.loss_cls: 1.9230  d2.loss_bbox: 1.3968  d2.loss_iou: 1.5339  d3.loss_cls: 1.9316  d3.loss_bbox: 1.3564  d3.loss_iou: 1.5530  d4.loss_cls: 1.9373  d4.loss_bbox: 1.4169  d4.loss_iou: 1.5750
05/10 20:34:29 - mmengine - INFO - Memory information available_memory: 169724 MB, used_memory: 85498 MB, memory_utilization: 34.1 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 10657 MB
05/10 20:34:29 - mmengine - INFO - Epoch(train)   [1][  550/14659]  lr: 1.0000e-04  eta: 6 days, 13:59:01  time: 0.2512  data_time: 0.0205  memory: 3687  grad_norm: 200.3849  loss: 32.0705  loss_cls: 2.0306  loss_bbox: 1.5291  loss_iou: 1.8789  d0.loss_cls: 2.0774  d0.loss_bbox: 1.4849  d0.loss_iou: 1.8166  d1.loss_cls: 2.0511  d1.loss_bbox: 1.4676  d1.loss_iou: 1.8464  d2.loss_cls: 2.0474  d2.loss_bbox: 1.4644  d2.loss_iou: 1.7998  d3.loss_cls: 2.0448  d3.loss_bbox: 1.4416  d3.loss_iou: 1.7963  d4.loss_cls: 2.0288  d4.loss_bbox: 1.4541  d4.loss_iou: 1.8108
05/10 20:34:42 - mmengine - INFO - Memory information available_memory: 169022 MB, used_memory: 86200 MB, memory_utilization: 34.4 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 10780 MB
05/10 20:34:42 - mmengine - INFO - Epoch(train)   [1][  600/14659]  lr: 1.0000e-04  eta: 6 days, 14:24:42  time: 0.2672  data_time: 0.0204  memory: 3698  grad_norm: 203.3744  loss: 29.0109  loss_cls: 1.8426  loss_bbox: 1.3323  loss_iou: 1.6706  d0.loss_cls: 1.8724  d0.loss_bbox: 1.3356  d0.loss_iou: 1.6584  d1.loss_cls: 1.8487  d1.loss_bbox: 1.3364  d1.loss_iou: 1.6912  d2.loss_cls: 1.8483  d2.loss_bbox: 1.2971  d2.loss_iou: 1.6323  d3.loss_cls: 1.8533  d3.loss_bbox: 1.3045  d3.loss_iou: 1.6620  d4.loss_cls: 1.8347  d4.loss_bbox: 1.3091  d4.loss_iou: 1.6814
05/10 20:34:54 - mmengine - INFO - Memory information available_memory: 168241 MB, used_memory: 86981 MB, memory_utilization: 34.7 %, available_swap_memory: 74 MB, used_swap_memory: 1974 MB, swap_memory_utilization: 96.4 %, current_process_memory: 10965 MB
05/10 20:34:54 - mmengine - INFO - Epoch(train)   [1][  650/14659]  lr: 1.0000e-04  eta: 6 days, 13:33:11  time: 0.2412  data_time: 0.0206  memory: 3756  grad_norm: 192.4306  loss: 34.5331  loss_cls: 2.1073  loss_bbox: 1.7186  loss_iou: 2.0007  d0.loss_cls: 2.1383  d0.loss_bbox: 1.6822  d0.loss_iou: 1.9588  d1.loss_cls: 2.1329  d1.loss_bbox: 1.6375  d1.loss_iou: 1.9499  d2.loss_cls: 2.1282  d2.loss_bbox: 1.6828  d2.loss_iou: 1.9731  d3.loss_cls: 2.1248  d3.loss_bbox: 1.6090  d3.loss_iou: 1.9580  d4.loss_cls: 2.1133  d4.loss_bbox: 1.6562  d4.loss_iou: 1.9615
Majiawei commented 1 year ago

I have also discovered this problem. Have you solved it?

mypydl commented 1 year ago

I have also discovered this problem. Have you solved it? After 24 hours of training DETR on 8x 2080Ti, the memory usage was over 400 GB!

mypydl commented 1 year ago

I have also discovered this problem with PyTorch 2.0. Have you solved it? After 24 hours of training DETR on 8x 2080Ti, the memory usage was over 400 GB!

When I use PyTorch 1.13, the memory no longer overflows.
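If you try this workaround, a quick (purely illustrative) sanity check that the downgraded build is actually the one being imported:

```python
# Illustrative check that the environment really runs the downgraded PyTorch
# (the leak was reported under 2.0.0 but not under 1.13).
import torch

print("PyTorch:", torch.__version__)        # expect a 1.13.x build after downgrading
print("CUDA runtime:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
```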

jinlovespho commented 1 year ago

WOW!! thank you @mypydl !!

You're a life saver!! :D

developWmark commented 8 months ago

I found that the problem is RandomCrop: using RandomCrop in the data augmentation of DETR-like detectors causes a CPU memory leak. After I removed it, the issue went away.
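For anyone who wants to test this, a sketch of the 3.x DETR train_pipeline from the config above with the RandomCrop step dropped (an illustrative workaround, not an upstream fix):

```python
# Same train_pipeline as in the issue's 3.x config, but with the RandomCrop
# transform removed from the second RandomChoice branch, to check whether the
# reported CPU-memory growth disappears.
train_pipeline = [
    dict(type='LoadImageFromFile', backend_args=None),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='RandomFlip', prob=0.5),
    dict(
        type='RandomChoice',
        transforms=[
            [dict(
                type='RandomChoiceResize',
                scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
                        (608, 1333), (640, 1333), (672, 1333), (704, 1333),
                        (736, 1333), (768, 1333), (800, 1333)],
                keep_ratio=True)],
            [dict(
                type='RandomChoiceResize',
                scales=[(400, 1333), (500, 1333), (600, 1333)],
                keep_ratio=True),
             # RandomCrop removed here; developWmark reports it triggers the CPU leak
             dict(
                type='RandomChoiceResize',
                scales=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
                        (608, 1333), (640, 1333), (672, 1333), (704, 1333),
                        (736, 1333), (768, 1333), (800, 1333)],
                keep_ratio=True)],
        ]),
    dict(type='PackDetInputs')
]
train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
```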