`find_unused_parameters` after several epoch training when training YOLOX

igo312 commented 2 years ago

Thanks for your error report and we appreciate it a lot.

Checklist

I have searched related issues but cannot get the expected help.
I have read the FAQ documentation but cannot get the expected help.
The bug has not been fixed in the latest version.

Describe the bug

When training YOLOX, a find_unused_parameters error is raised after several epochs of training. Following link1, I set detect_anomalous_params=True. The log then reports that the weight and bias of multi_level_conv_reg do not take part in the loss computation, which is really weird. The only thing I changed is the dataset, and training works fine when I train a Faster RCNN.
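For reference, a minimal sketch of where this flag goes in an MMDetection 2.x config, assuming the optimizer hook of MMCV 1.4.x as used in this thread (the same line appears in the config dump posted further down). With it enabled, the hook logs each parameter that is not in the computational graph of the loss:

```python
# Config fragment (plain Python, as in MMDetection 2.x config files).
# detect_anomalous_params=True asks the optimizer hook to log parameters
# that did not take part in computing the loss. Intended for debugging only.
optimizer_config = dict(grad_clip=None, detect_anomalous_params=True)
```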

Reproduction

What command or script did you run?

python ./tools/train.py ./configs/alpha_mot0220/yolox_s_8x8_300e_coco_car.py

Did you make any modifications on the code or config? Did you understand what you have modified?
What dataset did you use? My own dataset in COCO format.

Environment

Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.


sys.platform: linux
Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: False
GCC: gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
PyTorch: 1.9.0
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.10.0
OpenCV: 4.5.5
MMCV: 1.4.6
MMCV Compiler: GCC 7.4
MMCV CUDA Compiler: 10.2
MMDetection: 2.21.0+e359d3f


**Error traceback**
If applicable, paste the error traceback here.

```none
A placeholder for traceback.
```

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

igo312 commented 2 years ago

At first I thought the number of positive samples was zero, which would make pos_masks.any() False, but that is not the problem. I have rerun training several times, and every time the log says that multi_level_conv_reg does not take part in the loss after several epochs of training.

hhaAndroid commented 2 years ago

@igo312 Can you upload your log?

taofuyu commented 2 years ago

@igo312 Can you upload your log?

It is weird. When I don't set find_unused_parameters to True in the config, training reports an error. When I do set it, the log shows that all parameters take part in the loss computation, so it looks like I should turn find_unused_parameters off.
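A sketch of the setting being discussed, assuming the stock training API of MMDetection 2.x, which reads a top-level find_unused_parameters key from the config and forwards it to the distributed wrapper:

```python
# Top-level key in the training config (assumption: stock MMDetection 2.x
# train API, which passes it on to the DistributedDataParallel wrapper).
# It lets DDP tolerate parameters that receive no gradient in an iteration,
# at the cost of an extra traversal of the autograd graph on every step.
find_unused_parameters = True
```

Note that this only makes the distributed wrapper tolerate the unused parameters; it does not explain why they are unused.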

igo312 commented 2 years ago

Sorry for the late reply. I found the problem: some images in my own data have empty annotations, which may keep reg_conv from participating in gradient propagation. In other words, SimOTA cannot deal with images that have no gt_label.

YOLOX also seems to have an issue at the testing stage. I will look into what's wrong soon.
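To check a dataset for the empty-annotation case described above, a minimal sketch that lists the images with no annotation entries in a COCO-format file. The helper names `load_coco` and `find_images_without_annotations` and the sample data are hypothetical, and only the standard COCO keys (`images`, `annotations`, `image_id`) are assumed:

```python
import json
from collections import Counter


def load_coco(path):
    """Load a COCO-format annotation file into a plain dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def find_images_without_annotations(coco):
    """Return the file names of images that have no annotation entries."""
    ann_counts = Counter(ann["image_id"] for ann in coco.get("annotations", []))
    return [img["file_name"] for img in coco["images"] if ann_counts[img["id"]] == 0]


# Tiny in-memory stand-in for a real annotation file (hypothetical data).
sample = {
    "images": [
        {"id": 1, "file_name": "a.jpg"},
        {"id": 2, "file_name": "b.jpg"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [0, 0, 5, 5]},
    ],
    "categories": [{"id": 1, "name": "car"}],
}

empty = find_images_without_annotations(sample)
print(empty)
```

For a real dataset, `load_coco` would be called with the path to the annotation file and its result passed to `find_images_without_annotations`.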

Tim-Hung commented 2 years ago

Hi @igo312, I ran into this issue when using YOLOX to train on my own dataset in COCO format. Do you mean that background images (images without any annotation) cause this problem?

I use a subset of my own dataset and made sure there is at least 1 annotation in each image, but the error is still raised.

Hi @hhaAndroid, here is my log for your reference:

(MMD) [tim32338519@s4kufnyolox97road-sr94k ~/SoftTeacher]$ python -m torch.distributed.launch --nproc_per_node=2 --master_port=29500 tools/train.py configs/railway/yolox_road_300e.py --launcher pytorch
/home/tim32338519/.conda/envs/MMD/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
2022-03-26 16:40:55,181 - mmdet.ssod - INFO - [<StreamHandler <stderr> (INFO)>, <FileHandler /home/tim32338519/SoftTeacher/work_dirs/yolox_road_300e/20220326_164055.log (INFO)>]
2022-03-26 16:40:55,181 - mmdet.ssod - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.8.10 (default, Jun  4 2021, 15:09:15) [GCC 7.5.0]
CUDA available: True
GPU 0,1: Tesla V100-SXM2-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.5.r11.5/compiler.30411180_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.10.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.1
OpenCV: 4.5.4
MMCV: 1.4.5
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.21.0+97064fb
------------------------------------------------------------

2022-03-26 16:40:55,495 - mmdet.ssod - INFO - Distributed training: True
2022-03-26 16:40:55,798 - mmdet.ssod - INFO - Config:
optimizer = dict(
    type='SGD',
    lr=0.01,
    momentum=0.9,
    weight_decay=0.0005,
    nesterov=True,
    paramwise_cfg=dict(norm_decay_mult=0.0, bias_decay_mult=0.0))
optimizer_config = dict(grad_clip=None, detect_anomalous_params=True)
lr_config = dict(
    policy='YOLOX',
    warmup='exp',
    by_epoch=False,
    warmup_by_epoch=True,
    warmup_ratio=1,
    warmup_iters=5,
    num_last_epochs=15,
    min_lr_ratio=0.05)
runner = dict(type='EpochBasedRunner', max_epochs=300)
checkpoint_config = dict(interval=10, max_keep_ckpts=1)
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        dict(
            type='WandbLoggerHook',
            init_kwargs=dict(
                entity='tim32338519',
                project='Road',
                name='yolox_road_300e',
                config=dict(
                    work_dirs='./work_dirs/yolox_road_300e', total_step=300)),
            by_epoch=False)
    ])
custom_hooks = [
    dict(type='YOLOXModeSwitchHook', num_last_epochs=15, priority=48),
    dict(type='SyncNormHook', num_last_epochs=15, interval=10, priority=48),
    dict(
        type='ExpMomentumEMAHook',
        resume_from=None,
        momentum=0.0001,
        priority=49)
]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
img_scale = (640, 640)
model = dict(
    type='YOLOX',
    input_size=(640, 640),
    random_size_range=(15, 25),
    random_size_interval=10,
    backbone=dict(type='CSPDarknet', deepen_factor=0.33, widen_factor=0.5),
    neck=dict(
        type='YOLOXPAFPN',
        in_channels=[128, 256, 512],
        out_channels=128,
        num_csp_blocks=1),
    bbox_head=dict(
        type='YOLOXHead', num_classes=4, in_channels=128, feat_channels=128),
    train_cfg=dict(assigner=dict(type='SimOTAAssigner', center_radius=2.5)),
    test_cfg=dict(score_thr=0.01, nms=dict(type='nms', iou_threshold=0.65)))
train_pipeline = [
    dict(type='Mosaic', img_scale=(640, 640), pad_val=114.0),
    dict(
        type='RandomAffine', scaling_ratio_range=(0.1, 2),
        border=(-320, -320)),
    dict(
        type='MixUp',
        img_scale=(640, 640),
        ratio_range=(0.8, 1.6),
        pad_val=114.0),
    dict(type='YOLOXHSVRandomAug'),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Resize', img_scale=(640, 640), keep_ratio=True),
    dict(
        type='Pad',
        pad_to_square=True,
        pad_val=dict(img=(114.0, 114.0, 114.0))),
    dict(type='FilterAnnotations', min_gt_bbox_wh=(1, 1), keep_empty=False),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(640, 640),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Pad',
                pad_to_square=True,
                pad_val=dict(img=(114.0, 114.0, 114.0))),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img'])
        ])
]
classes = ('D00', 'D10', 'D20', 'D40')
dataset_type = 'CocoDataset'
data_root = 'data/road/'
train_dataset = dict(
    type='MultiImageMixDataset',
    dataset=dict(
        type='CocoDataset',
        classes=('D00', 'D10', 'D20', 'D40'),
        ann_file='data/road/train_all.json',
        img_prefix='data/road/train/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True)
        ],
        filter_empty_gt=False),
    pipeline=[
        dict(type='Mosaic', img_scale=(640, 640), pad_val=114.0),
        dict(
            type='RandomAffine',
            scaling_ratio_range=(0.1, 2),
            border=(-320, -320)),
        dict(
            type='MixUp',
            img_scale=(640, 640),
            ratio_range=(0.8, 1.6),
            pad_val=114.0),
        dict(type='YOLOXHSVRandomAug'),
        dict(type='RandomFlip', flip_ratio=0.5),
        dict(type='Resize', img_scale=(640, 640), keep_ratio=True),
        dict(
            type='Pad',
            pad_to_square=True,
            pad_val=dict(img=(114.0, 114.0, 114.0))),
        dict(
            type='FilterAnnotations', min_gt_bbox_wh=(1, 1), keep_empty=False),
        dict(type='DefaultFormatBundle'),
        dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
    ])
data = dict(
    samples_per_gpu=32,
    workers_per_gpu=4,
    persistent_workers=True,
    train=dict(
        type='MultiImageMixDataset',
        dataset=dict(
            type='CocoDataset',
            classes=('D00', 'D10', 'D20', 'D40'),
            ann_file='data/road/train_all.json',
            img_prefix='data/road/train/',
            pipeline=[
                dict(type='LoadImageFromFile'),
                dict(type='LoadAnnotations', with_bbox=True)
            ],
            filter_empty_gt=False),
        pipeline=[
            dict(type='Mosaic', img_scale=(640, 640), pad_val=114.0),
            dict(
                type='RandomAffine',
                scaling_ratio_range=(0.1, 2),
                border=(-320, -320)),
            dict(
                type='MixUp',
                img_scale=(640, 640),
                ratio_range=(0.8, 1.6),
                pad_val=114.0),
            dict(type='YOLOXHSVRandomAug'),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(type='Resize', img_scale=(640, 640), keep_ratio=True),
            dict(
                type='Pad',
                pad_to_square=True,
                pad_val=dict(img=(114.0, 114.0, 114.0))),
            dict(
                type='FilterAnnotations',
                min_gt_bbox_wh=(1, 1),
                keep_empty=False),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ]),
    val=dict(
        type='CocoDataset',
        classes=('D00', 'D10', 'D20', 'D40'),
        ann_file='data/road/test_all.json',
        img_prefix='data/road/test/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(640, 640),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Pad',
                        pad_to_square=True,
                        pad_val=dict(img=(114.0, 114.0, 114.0))),
                    dict(type='DefaultFormatBundle'),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='CocoDataset',
        classes=('D00', 'D10', 'D20', 'D40'),
        ann_file='data/road/test_all.json',
        img_prefix='data/road/test/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(640, 640),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Pad',
                        pad_to_square=True,
                        pad_val=dict(img=(114.0, 114.0, 114.0))),
                    dict(type='DefaultFormatBundle'),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
max_epochs = 300
num_last_epochs = 15
interval = 10
evaluation = dict(
    save_best='auto', interval=10, dynamic_intervals=[(285, 1)], metric='bbox')
work_dir = './work_dirs/yolox_road_300e'
cfg_name = 'yolox_road_300e'
gpu_ids = range(0, 2)

2022-03-26 16:40:55,937 - mmdet.ssod - INFO - initialize CSPDarknet with init_cfg {'type': 'Kaiming', 'layer': 'Conv2d', 'a': 2.23606797749979, 'distribution': 'uniform', 'mode': 'fan_in', 'nonlinearity': 'leaky_relu'}
2022-03-26 16:40:55,965 - mmdet.ssod - INFO - initialize YOLOXPAFPN with init_cfg {'type': 'Kaiming', 'layer': 'Conv2d', 'a': 2.23606797749979, 'distribution': 'uniform', 'mode': 'fan_in', 'nonlinearity': 'leaky_relu'}
2022-03-26 16:40:55,985 - mmdet.ssod - INFO - initialize YOLOXHead with init_cfg {'type': 'Kaiming', 'layer': 'Conv2d', 'a': 2.23606797749979, 'distribution': 'uniform', 'mode': 'fan_in', 'nonlinearity': 'leaky_relu'}
loading annotations into memory...
loading annotations into memory...
Done (t=0.17s)
creating index...
index created!
Done (t=0.19s)
creating index...
index created!
loading annotations into memory...
loading annotations into memory...
Done (t=0.03s)
creating index...
index created!
Done (t=0.03s)
creating index...
index created!
2022-03-26 16:40:59,201 - mmdet.ssod - INFO - Start running, host: tim32338519@s4kufnyolox97road-sr94k, work_dir: /home/tim32338519/SoftTeacher/work_dirs/yolox_road_300e
2022-03-26 16:40:59,202 - mmdet.ssod - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH   ) YOLOXLrUpdaterHook
(49          ) ExpMomentumEMAHook
(NORMAL      ) CheckpointHook
(80          ) DistEvalHook
(VERY_LOW    ) TextLoggerHook
(VERY_LOW    ) WandbLoggerHook
 --------------------
before_train_epoch:
(VERY_HIGH   ) YOLOXLrUpdaterHook
(48          ) YOLOXModeSwitchHook
(48          ) SyncNormHook
(49          ) ExpMomentumEMAHook
(NORMAL      ) DistSamplerSeedHook
(LOW         ) IterTimerHook
(80          ) DistEvalHook
(VERY_LOW    ) TextLoggerHook
(VERY_LOW    ) WandbLoggerHook
 --------------------
before_train_iter:
(VERY_HIGH   ) YOLOXLrUpdaterHook
(LOW         ) IterTimerHook
(80          ) DistEvalHook
 --------------------
after_train_iter:
(ABOVE_NORMAL) OptimizerHook
(49          ) ExpMomentumEMAHook
(NORMAL      ) CheckpointHook
(LOW         ) IterTimerHook
(80          ) DistEvalHook
(VERY_LOW    ) TextLoggerHook
(VERY_LOW    ) WandbLoggerHook
 --------------------
after_train_epoch:
(48          ) SyncNormHook
(49          ) ExpMomentumEMAHook
(NORMAL      ) CheckpointHook
(80          ) DistEvalHook
(VERY_LOW    ) TextLoggerHook
(VERY_LOW    ) WandbLoggerHook
 --------------------
before_val_epoch:
(NORMAL      ) DistSamplerSeedHook
(LOW         ) IterTimerHook
(VERY_LOW    ) TextLoggerHook
(VERY_LOW    ) WandbLoggerHook
 --------------------
before_val_iter:
(LOW         ) IterTimerHook
 --------------------
after_val_iter:
(LOW         ) IterTimerHook
 --------------------
after_val_epoch:
(VERY_LOW    ) TextLoggerHook
(VERY_LOW    ) WandbLoggerHook
 --------------------
after_run:
(VERY_LOW    ) TextLoggerHook
(VERY_LOW    ) WandbLoggerHook
 --------------------
2022-03-26 16:40:59,203 - mmdet.ssod - INFO - workflow: [('train', 1)], max: 300 epochs
2022-03-26 16:40:59,225 - mmdet.ssod - INFO - Checkpoints will be saved to /home/tim32338519/SoftTeacher/work_dirs/yolox_road_300e by HardDiskBackend.
wandb: Currently logged in as: tim32338519 (use `wandb login --relogin` to force relogin)
wandb: wandb version 0.12.11 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.12.7
wandb: Syncing run yolox_road_300e
wandb:  View project at https://wandb.ai/tim32338519/Road
wandb:  View run at https://wandb.ai/tim32338519/Road/runs/zbsp4f9l
wandb: Run data is saved locally in /home/tim32338519/SoftTeacher/wandb/run-20220326_164102-zbsp4f9l
wandb: Run `wandb offline` to turn off syncing.

/home/tim32338519/.conda/envs/MMD/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /opt/conda/conda-bld/pytorch_1634272068694/work/aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/home/tim32338519/.conda/envs/MMD/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /opt/conda/conda-bld/pytorch_1634272068694/work/aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
2022-03-26 16:41:35,794 - mmdet.ssod - ERROR - module.bbox_head.multi_level_conv_reg.0.weight with shape torch.Size([4, 128, 1, 1]) is not in the computational graph

2022-03-26 16:41:35,794 - mmdet.ssod - ERROR - module.bbox_head.multi_level_conv_reg.0.bias with shape torch.Size([4]) is not in the computational graph

2022-03-26 16:41:35,794 - mmdet.ssod - ERROR - module.bbox_head.multi_level_conv_reg.1.weight with shape torch.Size([4, 128, 1, 1]) is not in the computational graph

2022-03-26 16:41:35,794 - mmdet.ssod - ERROR - module.bbox_head.multi_level_conv_reg.1.bias with shape torch.Size([4]) is not in the computational graph

2022-03-26 16:41:35,794 - mmdet.ssod - ERROR - module.bbox_head.multi_level_conv_reg.2.weight with shape torch.Size([4, 128, 1, 1]) is not in the computational graph

2022-03-26 16:41:35,794 - mmdet.ssod - ERROR - module.bbox_head.multi_level_conv_reg.2.bias with shape torch.Size([4]) is not in the computational graph

2022-03-26 16:41:35,794 - mmdet.ssod - ERROR - module.bbox_head.multi_level_conv_reg.0.weight with shape torch.Size([4, 128, 1, 1]) is not in the computational graph

2022-03-26 16:41:35,794 - mmdet.ssod - ERROR - module.bbox_head.multi_level_conv_reg.0.bias with shape torch.Size([4]) is not in the computational graph

2022-03-26 16:41:35,794 - mmdet.ssod - ERROR - module.bbox_head.multi_level_conv_reg.1.weight with shape torch.Size([4, 128, 1, 1]) is not in the computational graph

2022-03-26 16:41:35,794 - mmdet.ssod - ERROR - module.bbox_head.multi_level_conv_reg.1.bias with shape torch.Size([4]) is not in the computational graph

2022-03-26 16:41:35,794 - mmdet.ssod - ERROR - module.bbox_head.multi_level_conv_reg.2.weight with shape torch.Size([4, 128, 1, 1]) is not in the computational graph

2022-03-26 16:41:35,795 - mmdet.ssod - ERROR - module.bbox_head.multi_level_conv_reg.2.bias with shape torch.Size([4]) is not in the computational graph

Traceback (most recent call last):
  File "tools/train.py", line 201, in <module>
    main()
  File "tools/train.py", line 189, in main
    train_detector(
  File "/home/tim32338519/SoftTeacher/ssod/apis/train.py", line 206, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/tim32338519/.local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/tim32338519/.local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/tim32338519/.local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/home/tim32338519/.local/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 42, in train_step
    and self.reducer._rebuild_buckets()):
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 228 229 230 231 232 233
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Traceback (most recent call last):
  File "tools/train.py", line 201, in <module>
    main()
  File "tools/train.py", line 189, in main
    train_detector(
  File "/home/tim32338519/SoftTeacher/ssod/apis/train.py", line 206, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/tim32338519/.local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/tim32338519/.local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/tim32338519/.local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/home/tim32338519/.local/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 42, in train_step
    and self.reducer._rebuild_buckets()):
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 228 229 230 231 232 233
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

wandb: Waiting for W&B process to finish, PID 30374... (failed 1). Press ctrl-c to abort syncing.
wandb:
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Synced yolox_road_300e: https://wandb.ai/tim32338519/Road/runs/zbsp4f9l
wandb: Find logs at: ./wandb/run-20220326_164102-zbsp4f9l/logs/debug.log
wandb:
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 30183 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 30184) of binary: /home/tim32338519/.conda/envs/MMD/bin/python
Traceback (most recent call last):
  File "/home/tim32338519/.conda/envs/MMD/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/tim32338519/.conda/envs/MMD/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/tim32338519/.conda/envs/MMD/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/tim32338519/.conda/envs/MMD/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/tim32338519/.conda/envs/MMD/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/tim32338519/.conda/envs/MMD/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/tim32338519/.conda/envs/MMD/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/tim32338519/.conda/envs/MMD/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tools/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-26_16:41:44
  host      : s4kufnyolox97road-sr94k
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 30184)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Thank you all for helping to solve this possible problem!

igo312 commented 2 years ago

I make sure there is at least 1 annotation in each image

@Tim-Hung Sorry for the late reply, but yes, as I understand it, if you can ensure that every sample has at least 1 annotation, it should not raise the error. However, I see that in your case the error is raised immediately. Can you check whether it is raised at the first iteration?

Tim-Hung commented 2 years ago

Hi @igo312, after I checked my dataset (every sample has at least 1 annotation), the error is no longer raised. Maybe I can try adding some background samples to test whether the error comes back.

However, I see that in your case the error is raised immediately. Can you check whether it is raised at the first iteration?

Yes, the error was always raised at the beginning of training.

Tim-Hung commented 2 years ago

I added some background samples with filter_empty_gt=False and the error was not raised again. The error I hit last time was raised when a category name in the annotation file did not match (a difference in capitalization).

So I think there should be no error when YOLOX is used correctly.
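A category-name mismatch of the kind mentioned above can be caught before training. The sketch below assumes the mismatch was between the classes tuple in the config and the category names in the COCO annotation file; the helper name `find_class_name_mismatches` and the sample data are hypothetical:

```python
def find_class_name_mismatches(classes, coco):
    """Compare config class names with COCO category names.

    Returns (missing, case_only):
      missing   - config classes with no exact match in the annotation file
      case_only - missing classes that match once letter case is ignored,
                  mapped to the spelling used in the annotation file
    """
    ann_names = {cat["name"] for cat in coco["categories"]}
    missing = [c for c in classes if c not in ann_names]
    lowered = {name.lower(): name for name in ann_names}
    case_only = {c: lowered[c.lower()] for c in missing if c.lower() in lowered}
    return missing, case_only


# Class names as in the config posted in this thread; the annotation side is
# a hypothetical example in which one name differs only in capitalization.
classes = ("D00", "D10", "D20", "D40")
coco = {
    "categories": [
        {"id": 1, "name": "d00"},
        {"id": 2, "name": "D10"},
        {"id": 3, "name": "D20"},
        {"id": 4, "name": "D40"},
    ]
}

missing, case_only = find_class_name_mismatches(classes, coco)
print(missing, case_only)
```

If a class name has no exact match, its annotations may effectively be dropped at load time, which would leave those images without ground truth; that reading is an assumption and not something confirmed in this thread.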

igo312 commented 2 years ago

@Tim-Hung Thanks for the hints, but are you sure your model encountered a sample without gt annotations? In my case, the find_unused_parameters error is raised during training, not at the beginning, and after I removed the samples without gt, the error went away.
In other words, if you set filter_empty_gt=False and the model encounters a sample without annotations, it should raise the find_unused_parameters error, right?

Tim-Hung commented 2 years ago

Hi @igo312, I set filter_empty_gt=False and ran the following experiments:

Every sample has at least 1 annotation: after training 1 epoch, no error is raised, and 1 epoch has 171 iterations.
4 samples without annotations added: after training 1 epoch, no error is raised, and 1 epoch has 172 iterations.

I guess the model did encounter samples without gt annotations, because the number of iterations increased.

I will increase the number of training epochs in the second experiment (with the samples without annotations) to see whether the find_unused_parameters error is raised during training.
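The step from 171 to 172 iterations can be sanity-checked with a little arithmetic. The sketch below assumes the values from the config dump earlier in this thread (samples_per_gpu=32 on 2 GPUs), a distributed sampler that pads each rank to the same number of samples, and a data loader that keeps the last partial batch. The function name `iters_per_epoch` and the dataset size 10942 are hypothetical, the latter chosen only to reproduce the reported counts:

```python
import math


def iters_per_epoch(num_samples, samples_per_gpu, num_gpus):
    """Iterations per epoch when each rank gets an equal share of the data
    and the last partial batch is kept (assumptions, see the text above)."""
    per_rank = math.ceil(num_samples / num_gpus)
    return math.ceil(per_rank / samples_per_gpu)


base = 10942  # hypothetical dataset size
before = iters_per_epoch(base, samples_per_gpu=32, num_gpus=2)
after = iters_per_epoch(base + 4, samples_per_gpu=32, num_gpus=2)
print(before, after)
```

Under these assumptions, adding 4 samples changes the iteration count only when the dataset size crosses a multiple of 64, so the extra iteration shows that the added images were loaded into the dataset.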

Tim-Hung commented 2 years ago

I increased the number of training epochs and the error was not raised either.

I think the find_unused_parameters check is performed only once, at the beginning of training.

Here is the training log:

(MMD) [tim32338519@ozk74stwcc99-smrb7 ~/SoftTeacher]$ python -m torch.distributed.launch --nproc_per_node=2 --master_port=29500 tools/train.py configs/railway/yolox_road_300e.py --launcher pytorch
/home/tim32338519/.conda/envs/MMD/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
2022-04-06 20:09:36,423 - mmdet.ssod - INFO - [<StreamHandler <stderr> (INFO)>, <FileHandler /home/tim32338519/SoftTeacher/work_dirs/yolox_road_300e/20220406_200936.log (INFO)>]
2022-04-06 20:09:36,423 - mmdet.ssod - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.8.10 (default, Jun  4 2021, 15:09:15) [GCC 7.5.0]
CUDA available: True
GPU 0,1: Tesla V100-SXM2-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.5.r11.5/compiler.30411180_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.10.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.1
OpenCV: 4.5.4
MMCV: 1.4.5
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.21.0+97064fb
------------------------------------------------------------

2022-04-06 20:09:36,732 - mmdet.ssod - INFO - Distributed training: True
2022-04-06 20:09:37,029 - mmdet.ssod - INFO - Config:
optimizer = dict(
    type='SGD',
    lr=0.01,
    momentum=0.9,
    weight_decay=0.0005,
    nesterov=True,
    paramwise_cfg=dict(norm_decay_mult=0.0, bias_decay_mult=0.0))
optimizer_config = dict(grad_clip=None, detect_anomalous_params=True)
lr_config = dict(
    policy='YOLOX',
    warmup='exp',
    by_epoch=False,
    warmup_by_epoch=True,
    warmup_ratio=1,
    warmup_iters=5,
    num_last_epochs=15,
    min_lr_ratio=0.05)
runner = dict(type='EpochBasedRunner', max_epochs=300)
checkpoint_config = dict(interval=10, max_keep_ckpts=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
custom_hooks = [
    dict(type='YOLOXModeSwitchHook', num_last_epochs=15, priority=48),
    dict(type='SyncNormHook', num_last_epochs=15, interval=10, priority=48),
    dict(
        type='ExpMomentumEMAHook',
        resume_from=None,
        momentum=0.0001,
        priority=49)
]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
img_scale = (640, 640)
model = dict(
    type='YOLOX',
    input_size=(640, 640),
    random_size_range=(15, 25),
    random_size_interval=10,
    backbone=dict(type='CSPDarknet', deepen_factor=0.33, widen_factor=0.5),
    neck=dict(
        type='YOLOXPAFPN',
        in_channels=[128, 256, 512],
        out_channels=128,
        num_csp_blocks=1),
    bbox_head=dict(
        type='YOLOXHead', num_classes=4, in_channels=128, feat_channels=128),
    train_cfg=dict(assigner=dict(type='SimOTAAssigner', center_radius=2.5)),
    test_cfg=dict(score_thr=0.01, nms=dict(type='nms', iou_threshold=0.65)))
train_pipeline = [
    dict(type='Mosaic', img_scale=(640, 640), pad_val=114.0),
    dict(
        type='RandomAffine', scaling_ratio_range=(0.1, 2),
        border=(-320, -320)),
    dict(
        type='MixUp',
        img_scale=(640, 640),
        ratio_range=(0.8, 1.6),
        pad_val=114.0),
    dict(type='YOLOXHSVRandomAug'),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Resize', img_scale=(640, 640), keep_ratio=True),
    dict(
        type='Pad',
        pad_to_square=True,
        pad_val=dict(img=(114.0, 114.0, 114.0))),
    dict(type='FilterAnnotations', min_gt_bbox_wh=(1, 1), keep_empty=False),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(640, 640),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Pad',
                pad_to_square=True,
                pad_val=dict(img=(114.0, 114.0, 114.0))),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img'])
        ])
]
find_unused_parameters = True
classes = ('D00', 'D10', 'D20', 'D40')
dataset_type = 'CocoDataset'
data_root = 'data/road/'
train_dataset = dict(
    type='MultiImageMixDataset',
    dataset=dict(
        type='CocoDataset',
        classes=('D00', 'D10', 'D20', 'D40'),
        ann_file='data/road/train_bg.json',
        img_prefix='data/road/train/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True)
        ],
        filter_empty_gt=False),
    pipeline=[
        dict(type='Mosaic', img_scale=(640, 640), pad_val=114.0),
        dict(
            type='RandomAffine',
            scaling_ratio_range=(0.1, 2),
            border=(-320, -320)),
        dict(
            type='MixUp',
            img_scale=(640, 640),
            ratio_range=(0.8, 1.6),
            pad_val=114.0),
        dict(type='YOLOXHSVRandomAug'),
        dict(type='RandomFlip', flip_ratio=0.5),
        dict(type='Resize', img_scale=(640, 640), keep_ratio=True),
        dict(
            type='Pad',
            pad_to_square=True,
            pad_val=dict(img=(114.0, 114.0, 114.0))),
        dict(
            type='FilterAnnotations', min_gt_bbox_wh=(1, 1), keep_empty=False),
        dict(type='DefaultFormatBundle'),
        dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
    ])
data = dict(
    samples_per_gpu=32,
    workers_per_gpu=4,
    persistent_workers=True,
    train=dict(
        type='MultiImageMixDataset',
        dataset=dict(
            type='CocoDataset',
            classes=('D00', 'D10', 'D20', 'D40'),
            ann_file='data/road/train_bg.json',
            img_prefix='data/road/train/',
            pipeline=[
                dict(type='LoadImageFromFile'),
                dict(type='LoadAnnotations', with_bbox=True)
            ],
            filter_empty_gt=False),
        pipeline=[
            dict(type='Mosaic', img_scale=(640, 640), pad_val=114.0),
            dict(
                type='RandomAffine',
                scaling_ratio_range=(0.1, 2),
                border=(-320, -320)),
            dict(
                type='MixUp',
                img_scale=(640, 640),
                ratio_range=(0.8, 1.6),
                pad_val=114.0),
            dict(type='YOLOXHSVRandomAug'),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(type='Resize', img_scale=(640, 640), keep_ratio=True),
            dict(
                type='Pad',
                pad_to_square=True,
                pad_val=dict(img=(114.0, 114.0, 114.0))),
            dict(
                type='FilterAnnotations',
                min_gt_bbox_wh=(1, 1),
                keep_empty=False),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ]),
    val=dict(
        type='CocoDataset',
        classes=('D00', 'D10', 'D20', 'D40'),
        ann_file='data/road/test_all.json',
        img_prefix='data/road/test/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(640, 640),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Pad',
                        pad_to_square=True,
                        pad_val=dict(img=(114.0, 114.0, 114.0))),
                    dict(type='DefaultFormatBundle'),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='CocoDataset',
        classes=('D00', 'D10', 'D20', 'D40'),
        ann_file='data/road/test_all.json',
        img_prefix='data/road/test/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(640, 640),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Pad',
                        pad_to_square=True,
                        pad_val=dict(img=(114.0, 114.0, 114.0))),
                    dict(type='DefaultFormatBundle'),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
max_epochs = 300
num_last_epochs = 15
interval = 10
evaluation = dict(
    save_best='auto', interval=10, dynamic_intervals=[(285, 1)], metric='bbox')
work_dir = './work_dirs/yolox_road_300e'
cfg_name = 'yolox_road_300e'
gpu_ids = range(0, 2)

2022-04-06 20:09:37,169 - mmdet.ssod - INFO - initialize CSPDarknet with init_cfg {'type': 'Kaiming', 'layer': 'Conv2d', 'a': 2.23606797749979, 'distribution': 'uniform', 'mode': 'fan_in', 'nonlinearity': 'leaky_relu'}
2022-04-06 20:09:37,197 - mmdet.ssod - INFO - initialize YOLOXPAFPN with init_cfg {'type': 'Kaiming', 'layer': 'Conv2d', 'a': 2.23606797749979, 'distribution': 'uniform', 'mode': 'fan_in', 'nonlinearity': 'leaky_relu'}
2022-04-06 20:09:37,217 - mmdet.ssod - INFO - initialize YOLOXHead with init_cfg {'type': 'Kaiming', 'layer': 'Conv2d', 'a': 2.23606797749979, 'distribution': 'uniform', 'mode': 'fan_in', 'nonlinearity': 'leaky_relu'}
loading annotations into memory...
loading annotations into memory...
Done (t=0.22s)
creating index...
Done (t=0.22s)
creating index...
index created!
index created!
loading annotations into memory...
loading annotations into memory...
Done (t=0.05s)
creating index...
Done (t=0.05s)
creating index...
index created!
index created!
2022-04-06 20:09:40,437 - mmdet.ssod - INFO - Start running, host: tim32338519@ozk74stwcc99-smrb7, work_dir: /home/tim32338519/SoftTeacher/work_dirs/yolox_road_300e
2022-04-06 20:09:40,438 - mmdet.ssod - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH   ) YOLOXLrUpdaterHook
(49          ) ExpMomentumEMAHook
(NORMAL      ) CheckpointHook
(80          ) DistEvalHook
(VERY_LOW    ) TextLoggerHook
 --------------------
before_train_epoch:
(VERY_HIGH   ) YOLOXLrUpdaterHook
(48          ) YOLOXModeSwitchHook
(48          ) SyncNormHook
(49          ) ExpMomentumEMAHook
(NORMAL      ) DistSamplerSeedHook
(LOW         ) IterTimerHook
(80          ) DistEvalHook
(VERY_LOW    ) TextLoggerHook
 --------------------
before_train_iter:
(VERY_HIGH   ) YOLOXLrUpdaterHook
(LOW         ) IterTimerHook
(80          ) DistEvalHook
 --------------------
after_train_iter:
(ABOVE_NORMAL) OptimizerHook
(49          ) ExpMomentumEMAHook
(NORMAL      ) CheckpointHook
(LOW         ) IterTimerHook
(80          ) DistEvalHook
(VERY_LOW    ) TextLoggerHook
 --------------------
after_train_epoch:
(48          ) SyncNormHook
(49          ) ExpMomentumEMAHook
(NORMAL      ) CheckpointHook
(80          ) DistEvalHook
(VERY_LOW    ) TextLoggerHook
 --------------------
before_val_epoch:
(NORMAL      ) DistSamplerSeedHook
(LOW         ) IterTimerHook
(VERY_LOW    ) TextLoggerHook
 --------------------
before_val_iter:
(LOW         ) IterTimerHook
 --------------------
after_val_iter:
(LOW         ) IterTimerHook
 --------------------
after_val_epoch:
(VERY_LOW    ) TextLoggerHook
 --------------------
after_run:
(VERY_LOW    ) TextLoggerHook
 --------------------
2022-04-06 20:09:40,438 - mmdet.ssod - INFO - workflow: [('train', 1)], max: 300 epochs
2022-04-06 20:09:40,452 - mmdet.ssod - INFO - Checkpoints will be saved to /home/tim32338519/SoftTeacher/work_dirs/yolox_road_300e by HardDiskBackend.
/home/tim32338519/.conda/envs/MMD/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /opt/conda/conda-bld/pytorch_1634272068694/work/aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/home/tim32338519/.conda/envs/MMD/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /opt/conda/conda-bld/pytorch_1634272068694/work/aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
2022-04-06 20:11:07,767 - mmdet.ssod - INFO - Epoch [1][50/172] lr: 3.380e-05, eta: 1 day, 0:59:43, time: 1.746, data_time: 0.659, memory: 15597, loss_cls: 0.8099, loss_bbox: 4.7899, loss_obj: 17.4360, loss: 23.0358
2022-04-06 20:12:06,679 - mmdet.ssod - INFO - Epoch [1][100/172]        lr: 1.352e-04, eta: 20:54:59, time: 1.179, data_time: 0.102, memory: 24319, loss_cls: 0.8143, loss_bbox: 4.7777, loss_obj: 17.1305, loss: 22.7225
2022-04-06 20:13:06,787 - mmdet.ssod - INFO - Epoch [1][150/172]        lr: 3.042e-04, eta: 19:39:17, time: 1.202, data_time: 0.099, memory: 24319, loss_cls: 1.0847, loss_bbox: 4.6222, loss_obj: 9.4065, loss: 15.1134
2022-04-06 20:14:56,890 - mmdet.ssod - INFO - Epoch [2][50/172] lr: 6.664e-04, eta: 18:57:59, time: 1.775, data_time: 0.817, memory: 24335, loss_cls: 1.5463, loss_bbox: 4.1129, loss_obj: 6.3563, loss: 12.0154
2022-04-06 20:15:57,172 - mmdet.ssod - INFO - Epoch [2][100/172]        lr: 1.000e-03, eta: 18:37:30, time: 1.206, data_time: 0.251, memory: 24335, loss_cls: 1.4486, loss_bbox: 3.9536, loss_obj: 5.8896, loss: 11.2919
2022-04-06 20:16:54,342 - mmdet.ssod - INFO - Epoch [2][150/172]        lr: 1.402e-03, eta: 18:14:42, time: 1.143, data_time: 0.260, memory: 24335, loss_cls: 1.3444, loss_bbox: 3.8933, loss_obj: 5.8570, loss: 11.0947
2022-04-06 20:18:47,451 - mmdet.ssod - INFO - Epoch [3][50/172] lr: 2.099e-03, eta: 18:08:13, time: 1.799, data_time: 0.662, memory: 24335, loss_cls: 1.3472, loss_bbox: 3.7906, loss_obj: 6.0009, loss: 11.1387
2022-04-06 20:19:46,128 - mmdet.ssod - INFO - Epoch [3][100/172]        lr: 2.665e-03, eta: 17:57:24, time: 1.174, data_time: 0.085, memory: 24335, loss_cls: 1.3307, loss_bbox: 3.7111, loss_obj: 6.0522, loss: 11.0940
2022-04-06 20:20:46,864 - mmdet.ssod - INFO - Epoch [3][150/172]        lr: 3.300e-03, eta: 17:52:07, time: 1.215, data_time: 0.214, memory: 24335, loss_cls: 1.3190, loss_bbox: 3.6778, loss_obj: 6.0813, loss: 11.0782
2022-04-06 20:22:36,551 - mmdet.ssod - INFO - Epoch [4][50/172] lr: 4.331e-03, eta: 17:45:30, time: 1.745, data_time: 0.762, memory: 24335, loss_cls: 1.3092, loss_bbox: 3.6350, loss_obj: 6.0488, loss: 10.9929
2022-04-06 20:23:32,788 - mmdet.ssod - INFO - Epoch [4][100/172]        lr: 5.131e-03, eta: 17:35:38, time: 1.125, data_time: 0.150, memory: 24335, loss_cls: 1.2709, loss_bbox: 3.6292, loss_obj: 5.8320, loss: 10.7321
2022-04-06 20:24:34,143 - mmdet.ssod - INFO - Epoch [4][150/172]        lr: 5.997e-03, eta: 17:33:38, time: 1.227, data_time: 0.132, memory: 24335, loss_cls: 1.2670, loss_bbox: 3.5541, loss_obj: 6.0897, loss: 10.9108
2022-04-06 20:26:23,477 - mmdet.ssod - INFO - Epoch [5][50/172] lr: 7.364e-03, eta: 17:29:30, time: 1.742, data_time: 0.680, memory: 24335, loss_cls: 1.2279, loss_bbox: 3.5684, loss_obj: 5.6275, loss: 10.4238
2022-04-06 20:27:24,283 - mmdet.ssod - INFO - Epoch [5][100/172]        lr: 8.396e-03, eta: 17:27:18, time: 1.216, data_time: 0.087, memory: 24335, loss_cls: 1.2263, loss_bbox: 3.4780, loss_obj: 5.8306, loss: 10.5348
2022-04-06 20:28:22,975 - mmdet.ssod - INFO - Epoch [5][150/172]        lr: 9.495e-03, eta: 17:23:05, time: 1.174, data_time: 0.210, memory: 24335, loss_cls: 1.2127, loss_bbox: 3.4827, loss_obj: 5.6518, loss: 10.3473
2022-04-06 20:30:14,378 - mmdet.ssod - INFO - Epoch [6][50/172] lr: 1.000e-02, eta: 17:20:36, time: 1.754, data_time: 0.765, memory: 24335, loss_cls: 1.2007, loss_bbox: 3.4345, loss_obj: 5.6770, loss: 10.3122
2022-04-06 20:31:12,632 - mmdet.ssod - INFO - Epoch [6][100/172]        lr: 1.000e-02, eta: 17:16:37, time: 1.165, data_time: 0.133, memory: 24335, loss_cls: 1.1772, loss_bbox: 3.4311, loss_obj: 5.5279, loss: 10.1362
2022-04-06 20:32:13,661 - mmdet.ssod - INFO - Epoch [6][150/172]        lr: 1.000e-02, eta: 17:15:18, time: 1.221, data_time: 0.235, memory: 24335, loss_cls: 1.1714, loss_bbox: 3.3824, loss_obj: 5.5948, loss: 10.1486
2022-04-06 20:34:04,670 - mmdet.ssod - INFO - Epoch [7][50/172] lr: 1.000e-02, eta: 17:13:09, time: 1.751, data_time: 0.749, memory: 24335, loss_cls: 1.1591, loss_bbox: 3.3774, loss_obj: 5.4092, loss: 9.9457
2022-04-06 20:35:02,846 - mmdet.ssod - INFO - Epoch [7][100/172]        lr: 9.999e-03, eta: 17:09:46, time: 1.164, data_time: 0.172, memory: 24335, loss_cls: 1.1354, loss_bbox: 3.3613, loss_obj: 5.3210, loss: 9.8177
2022-04-06 20:36:04,023 - mmdet.ssod - INFO - Epoch [7][150/172]        lr: 9.999e-03, eta: 17:08:43, time: 1.224, data_time: 0.229, memory: 24335, loss_cls: 1.1411, loss_bbox: 3.3485, loss_obj: 5.2683, loss: 9.7580
2022-04-06 20:37:56,369 - mmdet.ssod - INFO - Epoch [8][50/172] lr: 9.998e-03, eta: 17:08:23, time: 1.797, data_time: 0.907, memory: 24335, loss_cls: 1.1344, loss_bbox: 3.3772, loss_obj: 5.2484, loss: 9.7600
2022-04-06 20:38:55,501 - mmdet.ssod - INFO - Epoch [8][100/172]        lr: 9.998e-03, eta: 17:05:59, time: 1.182, data_time: 0.108, memory: 24335, loss_cls: 1.1344, loss_bbox: 3.3213, loss_obj: 5.2648, loss: 9.7205
2022-04-06 20:39:56,358 - mmdet.ssod - INFO - Epoch [8][150/172]        lr: 9.998e-03, eta: 17:04:45, time: 1.217, data_time: 0.213, memory: 24335, loss_cls: 1.1119, loss_bbox: 3.3048, loss_obj: 5.1870, loss: 9.6037
2022-04-06 20:41:47,118 - mmdet.ssod - INFO - Epoch [9][50/172] lr: 9.997e-03, eta: 17:03:25, time: 1.767, data_time: 0.705, memory: 24335, loss_cls: 1.1225, loss_bbox: 3.3227, loss_obj: 5.0294, loss: 9.4746
2022-04-06 20:42:45,297 - mmdet.ssod - INFO - Epoch [9][100/172]        lr: 9.996e-03, eta: 17:00:41, time: 1.164, data_time: 0.123, memory: 24335, loss_cls: 1.1129, loss_bbox: 3.3050, loss_obj: 5.0691, loss: 9.4870
2022-04-06 20:43:45,865 - mmdet.ssod - INFO - Epoch [9][150/172]        lr: 9.996e-03, eta: 16:59:24, time: 1.212, data_time: 0.172, memory: 24335, loss_cls: 1.1108, loss_bbox: 3.2987, loss_obj: 5.1168, loss: 9.5263
2022-04-06 20:45:37,084 - mmdet.ssod - INFO - Epoch [10][50/172]        lr: 9.994e-03, eta: 16:57:33, time: 1.745, data_time: 0.646, memory: 24335, loss_cls: 1.0978, loss_bbox: 3.2523, loss_obj: 5.0035, loss: 9.3536
2022-04-06 20:46:36,988 - mmdet.ssod - INFO - Epoch [10][100/172]       lr: 9.994e-03, eta: 16:55:57, time: 1.198, data_time: 0.099, memory: 24335, loss_cls: 1.0940, loss_bbox: 3.2491, loss_obj: 4.9950, loss: 9.3381
2022-04-06 20:47:34,286 - mmdet.ssod - INFO - Epoch [10][150/172]       lr: 9.993e-03, eta: 16:53:07, time: 1.146, data_time: 0.103, memory: 24335, loss_cls: 1.0958, loss_bbox: 3.2386, loss_obj: 4.9361, loss: 9.2704
/home/tim32338519/SoftTeacher/thirdparty/mmdetection/mmdet/core/utils/dist_utils.py:118: UserWarning: group` is deprecated. Currently only supports NCCL backend.
  warnings.warn(
/home/tim32338519/SoftTeacher/thirdparty/mmdetection/mmdet/core/utils/dist_utils.py:135: UserWarning: Note: the "to_float" is True, you need to ensure that the behavior is reasonable.
  warnings.warn('Note: the "to_float" is True, you need to '
/home/tim32338519/SoftTeacher/thirdparty/mmdetection/mmdet/core/utils/dist_utils.py:118: UserWarning: group` is deprecated. Currently only supports NCCL backend.
  warnings.warn(
/home/tim32338519/SoftTeacher/thirdparty/mmdetection/mmdet/core/utils/dist_utils.py:135: UserWarning: Note: the "to_float" is True, you need to ensure that the behavior is reasonable.
  warnings.warn('Note: the "to_float" is True, you need to '
2022-04-06 20:47:58,282 - mmdet.ssod - INFO - Saving checkpoint at 10 epochs
[                                                  ] 0/3152, elapsed: 0s, ETA:find_unused_parameters=True/home/tim32338519/SoftTeacher/thirdparty/mmdetection/mmdet/models/dense_heads/yolox_head.py:284: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  /opt/conda/conda-bld/pytorch_1634272068694/work/torch/csrc/utils/tensor_new.cpp:201.)
  flatten_bboxes[..., :4] /= flatten_bboxes.new_tensor(
/home/tim32338519/SoftTeacher/thirdparty/mmdetection/mmdet/models/dense_heads/yolox_head.py:284: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  /opt/conda/conda-bld/pytorch_1634272068694/work/torch/csrc/utils/tensor_new.cpp:201.)
  flatten_bboxes[..., :4] /= flatten_bboxes.new_tensor(
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 3152/3152, 63.2 task/s, elapsed: 50s, ETA:     0s

2022-04-06 20:48:52,334 - mmdet.ssod - INFO - Evaluating bbox...
Loading and preparing results...
DONE (t=0.64s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=8.77s).
Accumulating evaluation results...
DONE (t=5.23s).
2022-04-06 20:49:07,974 - mmdet.ssod - INFO -
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.018
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=1000 ] = 0.063
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=1000 ] = 0.005
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.006
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.011
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.021
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.191
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300 ] = 0.191
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=1000 ] = 0.191
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.021
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.154
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.223

2022-04-06 20:49:08,389 - mmdet.ssod - INFO - Now best checkpoint is saved as best_bbox_mAP_epoch_10.pth.
2022-04-06 20:49:08,389 - mmdet.ssod - INFO - Best bbox_mAP is 0.0180 at 10 epoch.
2022-04-06 20:49:08,393 - mmdet.ssod - INFO - Exp name: yolox_road_300e.py
2022-04-06 20:49:08,393 - mmdet.ssod - INFO - Epoch(val) [10][172]      bbox_mAP: 0.0180, bbox_mAP_50: 0.0630, bbox_mAP_75: 0.0050, bbox_mAP_s: 0.0060, bbox_mAP_m: 0.0110, bbox_mAP_l: 0.0210, bbox_mAP_copypaste: 0.018 0.063 0.005 0.006 0.011 0.021

igo312 commented 2 years ago

@Tim-Hung I think the error should not be raised if you set find_unused_parameters=True.
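For reference, these are the two switches discussed in this thread, shown as a minimal mmdet 2.x config fragment. The values are copied from the config in the log posted above; everything else in the config is omitted, and the comments are my own reading of what each option does.

```python
# Passed to the DDP wrapper so that parameters that receive no gradient
# in an iteration do not cause an error.
find_unused_parameters = True

# Debugging aid: reports the parameters that did not take part in the
# loss computation (this slows training down).
optimizer_config = dict(grad_clip=None, detect_anomalous_params=True)
```

With find_unused_parameters=True the error is suppressed rather than explained, so detect_anomalous_params is still useful for finding out which layer was skipped.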

hhaAndroid commented 2 years ago

@igo312 Sorry, I couldn't find the reason for the error in the log you posted.

igo312 commented 2 years ago

@igo312 Sorry, I couldn't find the reason for the error in the log you posted.

I did not post any error. I think the error occurs simply because some samples in my dataset have no annotations. But @Tim-Hung got a different result.

hhaAndroid commented 2 years ago

Sorry for the late reply. I found that the problem is that some annotations are empty in my own data, which may prevent reg_conv from participating in gradient propagation. In other words, SimOTA cannot deal with images without gt_label.

YOLOX also seems to have an issue at the testing stage. I will find out what's wrong soon.

Thanks a lot, we'll fix it soon.

chentiao commented 2 years ago

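I checked all my images: every image has boxes, and the box positions are normal. I trained with two 2080 Ti cards, and after some iterations in epoch 1 the training hung, although GPU utilization was still fluctuating. I then trained with a single card and got the error reported in this issue. After that I tried the fix suggested by the error message, find_unused_parameters=True (added under `if distributed` in mmdet/apis/train.py). With two cards, training has now run for 2 epochs without errors, and I am still watching it.

Update (2022-06-10 09:55:39): I finally found the cause: I had not changed the number of classes. I don't know why training could start without an error when the number of labels does not match the configured number of classes. Is this a bug?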
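A mismatch between the configured number of classes and the dataset can be checked before training starts. Below is a minimal sketch, assuming a COCO-format annotation dict; `check_num_classes` is a hypothetical helper, and the value 80 stands in for an unchanged COCO default.

```python
def check_num_classes(coco, classes, num_classes):
    """Return a list of human-readable problems; empty if all counts agree."""
    problems = []
    if num_classes != len(classes):
        problems.append(
            f"num_classes={num_classes} but {len(classes)} class names are configured")
    n_cats = len(coco.get("categories", []))
    if num_classes != n_cats:
        problems.append(
            f"num_classes={num_classes} but the annotation file defines {n_cats} categories")
    return problems


# Hypothetical case: a 2-class dataset with num_classes left at 80.
coco = {"categories": [{"id": 1, "name": "D00"}, {"id": 2, "name": "D10"}]}
problems = check_num_classes(coco, ("D00", "D10"), num_classes=80)
for p in problems:
    print(p)
```

Here `num_classes` would be read from `bbox_head` in the model config and `classes` from the dataset config.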

tanghy2016 commented 2 years ago

I tried YOLOX and YOLOv3; setting find_unused_parameters=True didn't work in my case. However, if the learning rate is set to 0.01 or lower, there is no error (50 epochs); if it is set to 0.1, an error is reported after a few epochs.

hhaAndroid commented 1 year ago

This has been fixed in the mmdet 3.x / mmyolo branches. Please use the latest version.

open-mmlab / mmdetection

`find_unused_parameters` after several epoch training when training YOLOX #7298