open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

[Bug] build_dp will change the training performance. #9754

Open AArchLichKing opened 1 year ago

AArchLichKing commented 1 year ago

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

master branch https://github.com/open-mmlab/mmdetection

Environment

UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
sys.platform: linux
Python: 3.8.16 (default, Jan 17 2023, 23:13:24) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA RTX A5000
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.6, V11.6.124
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.13.1+cu116
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.6
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.3.2  (built against CUDA 11.5)
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.14.1+cu116
OpenCV: 4.7.0
MMCV: 1.7.1
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.6
MMDetection: 2.28.1+c14dd6c

Reproduces the problem - code sample

import copy
import torch
import os.path as osp

import mmcv
import numpy as np

from mmdet.datasets.builder import DATASETS
from mmdet.datasets.custom import CustomDataset

@DATASETS.register_module()
class KittiTinyDataset(CustomDataset):

    CLASSES = ('Car', 'Pedestrian', 'Cyclist')

    def load_annotations(self, ann_file):
        cat2label = {k: i for i, k in enumerate(self.CLASSES)}
        # load image list from file
        image_list = mmcv.list_from_file(self.ann_file)

        data_infos = []
        # convert annotations to middle format
        for image_id in image_list:
            filename = f'{self.img_prefix}/{image_id}.jpeg'
            image = mmcv.imread(filename)
            height, width = image.shape[:2]

            data_info = dict(filename=f'{image_id}.jpeg', width=width, height=height)

            # load annotations
            label_prefix = self.img_prefix.replace('image_2', 'label_2')
            lines = mmcv.list_from_file(osp.join(label_prefix, f'{image_id}.txt'))

            content = [line.strip().split(' ') for line in lines]
            bbox_names = [x[0] for x in content]
            bboxes = [[float(info) for info in x[4:8]] for x in content]

            gt_bboxes = []
            gt_labels = []
            gt_bboxes_ignore = []
            gt_labels_ignore = []

            # filter 'DontCare'
            for bbox_name, bbox in zip(bbox_names, bboxes):
                if bbox_name in cat2label:
                    gt_labels.append(cat2label[bbox_name])
                    gt_bboxes.append(bbox)
                else:
                    gt_labels_ignore.append(-1)
                    gt_bboxes_ignore.append(bbox)

            data_anno = dict(
                bboxes=np.array(gt_bboxes, dtype=np.float32).reshape(-1, 4),
                labels=np.array(gt_labels, dtype=np.int64),
                bboxes_ignore=np.array(gt_bboxes_ignore,
                                       dtype=np.float32).reshape(-1, 4),
                labels_ignore=np.array(gt_labels_ignore, dtype=np.int64))

            data_info.update(ann=data_anno)
            data_infos.append(data_info)

        return data_infos

from mmcv import Config
cfg = Config.fromfile('./configs/ssd/ssd300_coco.py')

from mmdet.apis import set_random_seed

# Modify dataset type and path
cfg.dataset_type = 'KittiTinyDataset'
cfg.data_root = '../data/kitti_tiny/'

img_norm_cfg = dict(
    mean=[103.530, 116.280, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False)

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(
        type='Resize',
        img_scale=[(1333, 640), (1333, 672), (1333, 704), (1333, 736),
                   (1333, 768), (1333, 800)],
        multiscale_mode='value',
        keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]

cfg.data = data = dict(
                        train=dict(pipeline=train_pipeline),
                        val=dict(pipeline=test_pipeline),
                        test=dict(pipeline=test_pipeline))

cfg.data.test.type = 'KittiTinyDataset'
cfg.data.test.data_root = '../data/kitti_tiny/'
cfg.data.test.ann_file = 'train.txt'
cfg.data.test.img_prefix = 'training/image_2'

cfg.data.train.type = 'KittiTinyDataset'
cfg.data.train.data_root = '../data/kitti_tiny/'
cfg.data.train.ann_file = 'train.txt'
cfg.data.train.img_prefix = 'training/image_2'

cfg.data.val.type = 'KittiTinyDataset'
cfg.data.val.data_root = '../data/kitti_tiny/'
cfg.data.val.ann_file = 'val.txt'
cfg.data.val.img_prefix = 'training/image_2'

# modify num classes of the model in box head
cfg.model.bbox_head.num_classes = 3
# If we need to finetune a model based on a pre-trained detector, we need to
# use load_from to set the path of checkpoints.

cfg.load_from = 'checkpoints/ssd300_coco_20210803_015428-d231a06e.pth'

# Set up working dir to save files and logs.
cfg.work_dir = './tutorial_exps'

# The original learning rate (LR) is set for 8-GPU training.
# We divide it by 8 since we only use one GPU.
cfg.optimizer = dict(type='SGD', lr=2e-3, momentum=0.9, weight_decay=5e-4)
cfg.optimizer.lr = 2e-3 / 8
cfg.lr_config.warmup = None
cfg.log_config.interval = 10

# Change the evaluation metric since we use customized dataset.
cfg.evaluation.metric = 'mAP'
# We can set the evaluation interval to reduce the evaluation times
cfg.evaluation.interval = 12
# We can set the checkpoint saving interval to reduce the storage cost
cfg.checkpoint_config.interval = 12

# Set seed thus the results are more reproducible
cfg.seed = 0
set_random_seed(0, deterministic=False)
cfg.device = 'cuda'
cfg.gpu_ids = range(6,7) #assign gpu ids

# We can also use tensorboard to log the training process
cfg.log_config.hooks = [
    dict(type='TextLoggerHook'),
    dict(type='TensorboardLoggerHook')]

# We can initialize the logger for training and have a look
# at the final config used for training
print(f'Config:\n{cfg.pretty_text}')

from mmdet.datasets import build_dataset, build_dataloader
from mmdet.models import build_detector
from mmdet.apis import train_detector
from mmdet.apis import single_gpu_test
from mmcv.runner.checkpoint import get_state_dict, weights_to_cpu, load_state_dict

# Build dataset
datasets = [build_dataset(cfg.data.train)]

# Build the detector
model = build_detector(cfg.model)
# Add an attribute for visualization convenience
model.CLASSES = datasets[0].CLASSES
# Create work_dir
mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))

print('='*86)
print('\nEvaluation of Init Model\n')
from mmdet.datasets import replace_ImageToTensor
# Replace 'ImageToTensor' to 'DefaultFormatBundle'
cfg.data.val.pipeline = replace_ImageToTensor(cfg.data.val.pipeline)

# ======== Make validation loader ==========
val_dataloader_default_args = dict(
    samples_per_gpu=1,
    workers_per_gpu=2,
    dist=False, 
    shuffle=False,
    persistent_workers=False)

val_dataloader_args = {
    **val_dataloader_default_args,
    **cfg.data.get('val_dataloader', {})
}

eval_cfg = cfg.get('evaluation', {})
eval_cfg['by_epoch'] = cfg.runner['type'] != 'IterBasedRunner'

val_dataset = build_dataset(cfg.data.val, dict(test_mode=True))
val_dataloader = build_dataloader(val_dataset, **val_dataloader_args)
# ==========================================

from mmdet.utils import build_dp, get_root_logger
model = build_dp(model, device='cuda', device_ids=[6])
logger = get_root_logger(log_level=cfg.log_level)
results = single_gpu_test(model, val_dataloader, show=False)
val_dataloader.dataset.evaluate(results, logger=logger, metric='mAP')

print('\n'+'='*86)
model.train()
train_detector(model, datasets, cfg, distributed=False, validate=True)

Reproduces the problem - command or script

python run.py

Reproduces the problem - error message

Training performs very badly when build_dp is used: the loss barely decreases. Here are the logs.

2023-02-10 13:18:28,821 - mmdet - INFO - Epoch [1][10/25]       lr: 2.500e-04, eta: 0:04:25, time: 0.450, data_time: 0.229, memory: 2112, loss_cls: 5.5951, loss_bbox: 2.7488, loss: 8.3439
2023-02-10 13:18:30,008 - mmdet - INFO - Epoch [1][20/25]       lr: 2.500e-04, eta: 0:02:45, time: 0.119, data_time: 0.012, memory: 2112, loss_cls: 5.5678, loss_bbox: 2.7802, loss: 8.3480
2023-02-10 13:18:34,091 - mmdet - INFO - Epoch [2][10/25]       lr: 2.500e-04, eta: 0:02:26, time: 0.337, data_time: 0.230, memory: 2112, loss_cls: 5.5473, loss_bbox: 2.7082, loss: 8.2555
2023-02-10 13:18:35,267 - mmdet - INFO - Epoch [2][20/25]       lr: 2.500e-04, eta: 0:02:06, time: 0.118, data_time: 0.011, memory: 2112, loss_cls: 5.5339, loss_bbox: 2.7852, loss: 8.3191
2023-02-10 13:18:39,422 - mmdet - INFO - Epoch [3][10/25]       lr: 2.500e-04, eta: 0:02:02, time: 0.340, data_time: 0.231, memory: 2112, loss_cls: 5.5162, loss_bbox: 2.7797, loss: 8.2960
2023-02-10 13:18:40,625 - mmdet - INFO - Epoch [3][20/25]       lr: 2.500e-04, eta: 0:01:52, time: 0.120, data_time: 0.011, memory: 2112, loss_cls: 5.5078, loss_bbox: 2.8012, loss: 8.3090
2023-02-10 13:18:44,716 - mmdet - INFO - Epoch [4][10/25]       lr: 2.500e-04, eta: 0:01:50, time: 0.339, data_time: 0.230, memory: 2112, loss_cls: 5.4929, loss_bbox: 2.9254, loss: 8.4183
2023-02-10 13:18:45,923 - mmdet - INFO - Epoch [4][20/25]       lr: 2.500e-04, eta: 0:01:43, time: 0.121, data_time: 0.011, memory: 2112, loss_cls: 5.4821, loss_bbox: 2.7380, loss: 8.2201
2023-02-10 13:18:50,096 - mmdet - INFO - Epoch [5][10/25]       lr: 2.500e-04, eta: 0:01:41, time: 0.338, data_time: 0.229, memory: 2112, loss_cls: 5.4683, loss_bbox: 2.7190, loss: 8.1873
2023-02-10 13:18:51,325 - mmdet - INFO - Epoch [5][20/25]       lr: 2.500e-04, eta: 0:01:36, time: 0.123, data_time: 0.012, memory: 2112, loss_cls: 5.4628, loss_bbox: 2.8459, loss: 8.3087
2023-02-10 13:18:55,487 - mmdet - INFO - Epoch [6][10/25]       lr: 2.500e-04, eta: 0:01:34, time: 0.344, data_time: 0.235, memory: 2112, loss_cls: 5.4438, loss_bbox: 2.6463, loss: 8.0901
2023-02-10 13:18:56,690 - mmdet - INFO - Epoch [6][20/25]       lr: 2.500e-04, eta: 0:01:29, time: 0.121, data_time: 0.012, memory: 2112, loss_cls: 5.4377, loss_bbox: 2.7409, loss: 8.1786
2023-02-10 13:19:00,872 - mmdet - INFO - Epoch [7][10/25]       lr: 2.500e-04, eta: 0:01:28, time: 0.336, data_time: 0.230, memory: 2112, loss_cls: 5.4210, loss_bbox: 2.8588, loss: 8.2798
2023-02-10 13:19:02,068 - mmdet - INFO - Epoch [7][20/25]       lr: 2.500e-04, eta: 0:01:24, time: 0.120, data_time: 0.013, memory: 2112, loss_cls: 5.4109, loss_bbox: 2.7249, loss: 8.1359
2023-02-10 13:19:06,202 - mmdet - INFO - Epoch [8][10/25]       lr: 2.500e-04, eta: 0:01:22, time: 0.336, data_time: 0.230, memory: 2112, loss_cls: 5.3946, loss_bbox: 2.7256, loss: 8.1203
2023-02-10 13:19:07,396 - mmdet - INFO - Epoch [8][20/25]       lr: 2.500e-04, eta: 0:01:18, time: 0.119, data_time: 0.012, memory: 2112, loss_cls: 5.3880, loss_bbox: 2.8321, loss: 8.2200
2023-02-10 13:19:11,618 - mmdet - INFO - Epoch [9][10/25]       lr: 2.500e-04, eta: 0:01:16, time: 0.344, data_time: 0.237, memory: 2112, loss_cls: 5.3743, loss_bbox: 2.8707, loss: 8.2449
2023-02-10 13:19:12,808 - mmdet - INFO - Epoch [9][20/25]       lr: 2.500e-04, eta: 0:01:13, time: 0.119, data_time: 0.013, memory: 2112, loss_cls: 5.3566, loss_bbox: 2.6674, loss: 8.0240

When the build_dp part is disabled, i.e., when running the following code, the logs are normal.

import copy
import torch
import os.path as osp

import mmcv
import numpy as np

from mmdet.datasets.builder import DATASETS
from mmdet.datasets.custom import CustomDataset

@DATASETS.register_module()
class KittiTinyDataset(CustomDataset):

    CLASSES = ('Car', 'Pedestrian', 'Cyclist')

    def load_annotations(self, ann_file):
        cat2label = {k: i for i, k in enumerate(self.CLASSES)}
        # load image list from file
        image_list = mmcv.list_from_file(self.ann_file)

        data_infos = []
        # convert annotations to middle format
        for image_id in image_list:
            filename = f'{self.img_prefix}/{image_id}.jpeg'
            image = mmcv.imread(filename)
            height, width = image.shape[:2]

            data_info = dict(filename=f'{image_id}.jpeg', width=width, height=height)

            # load annotations
            label_prefix = self.img_prefix.replace('image_2', 'label_2')
            lines = mmcv.list_from_file(osp.join(label_prefix, f'{image_id}.txt'))

            content = [line.strip().split(' ') for line in lines]
            bbox_names = [x[0] for x in content]
            bboxes = [[float(info) for info in x[4:8]] for x in content]

            gt_bboxes = []
            gt_labels = []
            gt_bboxes_ignore = []
            gt_labels_ignore = []

            # filter 'DontCare'
            for bbox_name, bbox in zip(bbox_names, bboxes):
                if bbox_name in cat2label:
                    gt_labels.append(cat2label[bbox_name])
                    gt_bboxes.append(bbox)
                else:
                    gt_labels_ignore.append(-1)
                    gt_bboxes_ignore.append(bbox)

            data_anno = dict(
                bboxes=np.array(gt_bboxes, dtype=np.float32).reshape(-1, 4),
                labels=np.array(gt_labels, dtype=np.int64),
                bboxes_ignore=np.array(gt_bboxes_ignore,
                                       dtype=np.float32).reshape(-1, 4),
                labels_ignore=np.array(gt_labels_ignore, dtype=np.int64))

            data_info.update(ann=data_anno)
            data_infos.append(data_info)

        return data_infos

from mmcv import Config
cfg = Config.fromfile('./configs/ssd/ssd300_coco.py')

from mmdet.apis import set_random_seed

# Modify dataset type and path
cfg.dataset_type = 'KittiTinyDataset'
cfg.data_root = '../data/kitti_tiny/'

img_norm_cfg = dict(
    mean=[103.530, 116.280, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False)

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(
        type='Resize',
        img_scale=[(1333, 640), (1333, 672), (1333, 704), (1333, 736),
                   (1333, 768), (1333, 800)],
        multiscale_mode='value',
        keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]

cfg.data = data = dict(
                        train=dict(pipeline=train_pipeline),
                        val=dict(pipeline=test_pipeline),
                        test=dict(pipeline=test_pipeline))

cfg.data.test.type = 'KittiTinyDataset'
cfg.data.test.data_root = '../data/kitti_tiny/'
cfg.data.test.ann_file = 'train.txt'
cfg.data.test.img_prefix = 'training/image_2'

cfg.data.train.type = 'KittiTinyDataset'
cfg.data.train.data_root = '../data/kitti_tiny/'
cfg.data.train.ann_file = 'train.txt'
cfg.data.train.img_prefix = 'training/image_2'

cfg.data.val.type = 'KittiTinyDataset'
cfg.data.val.data_root = '../data/kitti_tiny/'
cfg.data.val.ann_file = 'val.txt'
cfg.data.val.img_prefix = 'training/image_2'

# modify num classes of the model in box head
cfg.model.bbox_head.num_classes = 3
# If we need to finetune a model based on a pre-trained detector, we need to
# use load_from to set the path of checkpoints.

cfg.load_from = 'checkpoints/ssd300_coco_20210803_015428-d231a06e.pth'

# Set up working dir to save files and logs.
cfg.work_dir = './tutorial_exps'

# The original learning rate (LR) is set for 8-GPU training.
# We divide it by 8 since we only use one GPU.
cfg.optimizer = dict(type='SGD', lr=2e-3, momentum=0.9, weight_decay=5e-4)
cfg.optimizer.lr = 2e-3 / 8
cfg.lr_config.warmup = None
cfg.log_config.interval = 10

# Change the evaluation metric since we use customized dataset.
cfg.evaluation.metric = 'mAP'
# We can set the evaluation interval to reduce the evaluation times
cfg.evaluation.interval = 12
# We can set the checkpoint saving interval to reduce the storage cost
cfg.checkpoint_config.interval = 12

# Set seed thus the results are more reproducible
cfg.seed = 0
set_random_seed(0, deterministic=False)
cfg.device = 'cuda'
cfg.gpu_ids = range(6,7) #assign gpu ids

# We can also use tensorboard to log the training process
cfg.log_config.hooks = [
    dict(type='TextLoggerHook'),
    dict(type='TensorboardLoggerHook')]

# We can initialize the logger for training and have a look
# at the final config used for training
print(f'Config:\n{cfg.pretty_text}')

from mmdet.datasets import build_dataset, build_dataloader
from mmdet.models import build_detector
from mmdet.apis import train_detector
from mmdet.apis import single_gpu_test
from mmcv.runner.checkpoint import get_state_dict, weights_to_cpu, load_state_dict

# Build dataset
datasets = [build_dataset(cfg.data.train)]

# Build the detector
model = build_detector(cfg.model)
# Add an attribute for visualization convenience
model.CLASSES = datasets[0].CLASSES
# Create work_dir
mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))

'''
print('='*86)
print('\nEvaluation of Init Model\n')
from mmdet.datasets import replace_ImageToTensor
# Replace 'ImageToTensor' to 'DefaultFormatBundle'
cfg.data.val.pipeline = replace_ImageToTensor(cfg.data.val.pipeline)

# ======== Make validation loader ==========
val_dataloader_default_args = dict(
    samples_per_gpu=1,
    workers_per_gpu=2,
    dist=False, 
    shuffle=False,
    persistent_workers=False)

val_dataloader_args = {
    **val_dataloader_default_args,
    **cfg.data.get('val_dataloader', {})
}

eval_cfg = cfg.get('evaluation', {})
eval_cfg['by_epoch'] = cfg.runner['type'] != 'IterBasedRunner'

val_dataset = build_dataset(cfg.data.val, dict(test_mode=True))
val_dataloader = build_dataloader(val_dataset, **val_dataloader_args)
# ==========================================

from mmdet.utils import build_dp, get_root_logger
model = build_dp(model, device='cuda', device_ids=[6])
logger = get_root_logger(log_level=cfg.log_level)
results = single_gpu_test(model, val_dataloader, show=False)
val_dataloader.dataset.evaluate(results, logger=logger, metric='mAP')

print('\n'+'='*86)
model.train()
'''
train_detector(model, datasets, cfg, distributed=False, validate=True)

The logs of this run are:

2023-02-10 13:21:02,194 - mmdet - INFO - Epoch [1][10/25]       lr: 2.500e-04, eta: 0:04:30, time: 0.459, data_time: 0.227, memory: 2205, loss_cls: 7.4602, loss_bbox: 1.0205, loss: 8.4807
2023-02-10 13:21:03,404 - mmdet - INFO - Epoch [1][20/25]       lr: 2.500e-04, eta: 0:02:48, time: 0.121, data_time: 0.012, memory: 2205, loss_cls: 5.4920, loss_bbox: 1.3897, loss: 6.8817
2023-02-10 13:21:07,543 - mmdet - INFO - Epoch [2][10/25]       lr: 2.500e-04, eta: 0:02:28, time: 0.338, data_time: 0.229, memory: 2205, loss_cls: 4.3596, loss_bbox: 0.8782, loss: 5.2378
2023-02-10 13:21:08,748 - mmdet - INFO - Epoch [2][20/25]       lr: 2.500e-04, eta: 0:02:08, time: 0.121, data_time: 0.011, memory: 2205, loss_cls: 3.7894, loss_bbox: 0.7668, loss: 4.5562
2023-02-10 13:21:12,917 - mmdet - INFO - Epoch [3][10/25]       lr: 2.500e-04, eta: 0:02:03, time: 0.339, data_time: 0.230, memory: 2205, loss_cls: 3.1810, loss_bbox: 0.5807, loss: 3.7617
2023-02-10 13:21:14,109 - mmdet - INFO - Epoch [3][20/25]       lr: 2.500e-04, eta: 0:01:53, time: 0.119, data_time: 0.012, memory: 2205, loss_cls: 2.8313, loss_bbox: 0.8064, loss: 3.6377
2023-02-10 13:21:18,189 - mmdet - INFO - Epoch [4][10/25]       lr: 2.500e-04, eta: 0:01:51, time: 0.337, data_time: 0.229, memory: 2205, loss_cls: 2.5291, loss_bbox: 0.6121, loss: 3.1412
2023-02-10 13:21:19,388 - mmdet - INFO - Epoch [4][20/25]       lr: 2.500e-04, eta: 0:01:43, time: 0.120, data_time: 0.011, memory: 2205, loss_cls: 2.5040, loss_bbox: 0.6917, loss: 3.1956
2023-02-10 13:21:23,516 - mmdet - INFO - Epoch [5][10/25]       lr: 2.500e-04, eta: 0:01:42, time: 0.337, data_time: 0.227, memory: 2205, loss_cls: 2.4289, loss_bbox: 0.5479, loss: 2.9768
2023-02-10 13:21:24,719 - mmdet - INFO - Epoch [5][20/25]       lr: 2.500e-04, eta: 0:01:36, time: 0.120, data_time: 0.011, memory: 2205, loss_cls: 2.3861, loss_bbox: 0.5485, loss: 2.9347
2023-02-10 13:21:28,820 - mmdet - INFO - Epoch [6][10/25]       lr: 2.500e-04, eta: 0:01:34, time: 0.337, data_time: 0.229, memory: 2205, loss_cls: 2.1447, loss_bbox: 0.5362, loss: 2.6809
2023-02-10 13:21:30,031 - mmdet - INFO - Epoch [6][20/25]       lr: 2.500e-04, eta: 0:01:30, time: 0.121, data_time: 0.011, memory: 2205, loss_cls: 2.0763, loss_bbox: 0.5670, loss: 2.6433
2023-02-10 13:21:34,180 - mmdet - INFO - Epoch [7][10/25]       lr: 2.500e-04, eta: 0:01:28, time: 0.339, data_time: 0.229, memory: 2205, loss_cls: 2.0813, loss_bbox: 0.4592, loss: 2.5405
2023-02-10 13:21:35,377 - mmdet - INFO - Epoch [7][20/25]       lr: 2.500e-04, eta: 0:01:24, time: 0.120, data_time: 0.011, memory: 2205, loss_cls: 2.0711, loss_bbox: 0.5060, loss: 2.5771
2023-02-10 13:21:39,478 - mmdet - INFO - Epoch [8][10/25]       lr: 2.500e-04, eta: 0:01:22, time: 0.337, data_time: 0.228, memory: 2205, loss_cls: 2.0096, loss_bbox: 0.3462, loss: 2.3559
2023-02-10 13:21:40,667 - mmdet - INFO - Epoch [8][20/25]       lr: 2.500e-04, eta: 0:01:18, time: 0.119, data_time: 0.011, memory: 2205, loss_cls: 1.8995, loss_bbox: 0.4683, loss: 2.3678
2023-02-10 13:21:44,801 - mmdet - INFO - Epoch [9][10/25]       lr: 2.500e-04, eta: 0:01:16, time: 0.336, data_time: 0.228, memory: 2205, loss_cls: 1.8832, loss_bbox: 0.3990, loss: 2.2821
2023-02-10 13:21:45,997 - mmdet - INFO - Epoch [9][20/25]       lr: 2.500e-04, eta: 0:01:13, time: 0.120, data_time: 0.011, memory: 2205, loss_cls: 1.6880, loss_bbox: 0.2541, loss: 1.9421
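
The second script differs from the first only in that the evaluation block (including the build_dp call) is commented out, so in the first script the model handed to train_detector is already wrapped. One hypothesis worth checking, which is not confirmed anywhere in this thread, is that train_detector wraps the model a second time on its non-distributed path, leaving the detector nested two wrappers deep; a checkpoint loader that strips only one `module.` level could then skip the weights from cfg.load_from. The sketch below only illustrates how nesting depth and parameter-name prefixes relate. `ToyDetector`, `ToyWrapper`, `wrapper_depth` and `prefixed_keys` are hypothetical names, not mmdet or mmcv APIs.

```python
def wrapper_depth(model):
    """Count how many nested `.module` wrappers surround the innermost model."""
    depth = 0
    while hasattr(model, "module"):
        model = model.module
        depth += 1
    return depth


def prefixed_keys(keys, depth):
    """Parameter names as seen from the outermost wrapper.

    Each wrapper level that stores the wrapped model as `.module`
    adds one 'module.' prefix to every parameter name.
    """
    return ["module." * depth + key for key in keys]


# Hypothetical stand-ins for a detector and a data-parallel wrapper.
class ToyDetector:
    pass


class ToyWrapper:
    def __init__(self, module):
        self.module = module


once = ToyWrapper(ToyDetector())   # e.g. wrapped by the user only
twice = ToyWrapper(once)           # e.g. wrapped again by a training helper

print(wrapper_depth(ToyDetector()), wrapper_depth(once), wrapper_depth(twice))
print(prefixed_keys(["backbone.weight"], wrapper_depth(twice)))
```

On the real model, printing `wrapper_depth(model)` just before training starts would show whether the nesting actually occurs. If it does, passing the unwrapped detector (`model.module`) to train_detector would be one way to test the hypothesis.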

Additional information

No response

BIGWangYuDong commented 1 year ago

Hi, sorry for the late reply. I did not get your point; the logs all seem normal?

AArchLichKing commented 1 year ago

@BIGWangYuDong The loss is not decreasing in the first set of logs.
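
To make the difference concrete, the total loss can be pulled out of the two sets of logs and compared. The following is a minimal sketch, assuming the log lines end with `loss: <value>` as they do above; `total_losses` and `relative_drop` are hypothetical helper names, and the sample lines are shortened copies of the first and last logged iterations of each run.

```python
import re

# Matches the total loss at the end of a log line, e.g. ", loss: 8.3439".
# The leading ", " keeps it from matching "loss_cls: ..." or "loss_bbox: ...".
TOTAL_LOSS = re.compile(r", loss: ([0-9.]+)\s*$")


def total_losses(log_text):
    """Return the total loss of every training line in log_text, in order."""
    losses = []
    for line in log_text.splitlines():
        match = TOTAL_LOSS.search(line)
        if match:
            losses.append(float(match.group(1)))
    return losses


def relative_drop(losses):
    """Fraction by which the loss fell between the first and last logged value."""
    return (losses[0] - losses[-1]) / losses[0]


# First and last logged iterations of each run, shortened from the logs above.
with_build_dp = """\
Epoch [1][10/25] lr: 2.500e-04, loss_cls: 5.5951, loss_bbox: 2.7488, loss: 8.3439
Epoch [9][20/25] lr: 2.500e-04, loss_cls: 5.3566, loss_bbox: 2.6674, loss: 8.0240
"""
without_build_dp = """\
Epoch [1][10/25] lr: 2.500e-04, loss_cls: 7.4602, loss_bbox: 1.0205, loss: 8.4807
Epoch [9][20/25] lr: 2.500e-04, loss_cls: 1.6880, loss_bbox: 0.2541, loss: 1.9421
"""

drop_with = relative_drop(total_losses(with_build_dp))
drop_without = relative_drop(total_losses(without_build_dp))
print(f"with build_dp:    {drop_with:.1%}")
print(f"without build_dp: {drop_without:.1%}")
```

Over the same nine epochs, the run with build_dp loses only about 4% of its initial loss, while the run without it loses about 77%.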