open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

[2.18.0] Distributed Training DataLoader Deadlock #6486

Closed csvance closed 2 years ago

csvance commented 3 years ago

Describe the bug A custom COCO-format dataset always hangs at the fourth DistEvalHook evaluation when using IterBasedRunner with distributed training. The exact place things get stuck is when the evaluation loop tries to get the next batch from the dataloader (the for loop). This happens with both 1 and 2 GPUs (RTX 2080 Ti) and with 0 <= workers_per_gpu <= 12. I also played around with ulimits, shared memory size, and the OpenCV thread count; nothing makes a difference.

The bug does not happen without distributed training. I think it may be related to this issue, but am not 100% sure: https://github.com/pytorch/pytorch/issues/1355
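
For completeness, the knobs I tried are roughly the following (just a sketch of what I added at the top of my training entrypoint; the specific values are examples, not a known fix):

import resource

import cv2
import torch.multiprocessing as mp

# Disable OpenCV's internal threading, which can interact badly with fork-based dataloader workers
cv2.setNumThreads(0)

# Share tensors through the filesystem instead of file descriptors to avoid fd exhaustion
mp.set_sharing_strategy('file_system')

# Raise the soft open-file limit to the hard limit
_, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))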

I am new to mmcv/mmdetection, so maybe I am missing something obvious, but I read the FAQ and looked through the issues without finding anything definitive.

I was wondering if anyone in the community has a specific Docker container (such as NVIDIA's PyTorch NGC image) they use for experiments and can confirm works reliably with mmdet. It could be that my problem is related to my system configuration, but this is the first time I have seen this type of issue after running all kinds of multi-GPU / multi-node experiments without any dataloader deadlocks.

Reproduction

  1. What command or script did you run?
# Deadlock
./tools/dist_train.sh configs/xray_calibration/calibration_baseline.py 1

# Deadlock
./tools/dist_train.sh configs/xray_calibration/calibration_baseline.py 2

# No Deadlock
PYTHONPATH=`pwd` python tools/train.py configs/xray_calibration/calibration_base.py --launcher=none --gpus=1
  2. Did you make any modifications on the code or config? Did you understand what you have modified?

My config slightly modifies coco_detection.py config:

_base_ = "../_base_/models/faster_rcnn_r50_fpn.py"

# Runtime Settings
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]

# Schedule Settings
TOTAL_STEPS = 1000
LOG_INTERVAL = 10
VAL_STEPS = 100
KEEP_CKPTS = 1

# Optimizer settings
MAX_LR = 0.001
WEIGHT_DECAY = 1e-4

optimizer = dict(type='SGD', lr=MAX_LR, momentum=0.9, weight_decay=WEIGHT_DECAY)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
    policy='OneCycle',
    max_lr=MAX_LR,
    total_steps=TOTAL_STEPS,
    warmup=None
)
runner = dict(type="IterBasedRunner", max_iters=TOTAL_STEPS)
checkpoint_config = dict(by_epoch=False, interval=VAL_STEPS, max_keep_ckpts=KEEP_CKPTS, save_optimizer=False)
evaluation = dict(by_epoch=False, interval=VAL_STEPS)

fp16 = dict(loss_scale="dynamic")

log_config = dict(
    interval=LOG_INTERVAL,
    hooks=[
        dict(type="CometLoggerHook", by_epoch=False,
             workspace="cvance", project="xray-calibration-marker-detector")
    ],
)

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(394, 512), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='NormalizeAdaptive'),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(394, 512),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='NormalizeAdaptive'),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]

data = dict(
    samples_per_gpu=8,
    workers_per_gpu=0,
    train=dict(
        type="CocoDataset",
        classes=('ring',),
        ann_file='json/train.json',
        img_prefix='images/',
        pipeline=train_pipeline,
    ),
    val=dict(
        type="CocoDataset",
        classes=('ring',),
        ann_file='json/val.json',
        img_prefix='images/',
        pipeline=test_pipeline,
    ),
    test=dict(
        type="CocoDataset",
        classes=('ring',),
        ann_file='json/test.json',
        img_prefix='images/',
        pipeline=test_pipeline,
    )
)
  3. What dataset did you use?

Custom COCO format dataset with 1170 images, 0.7/0.2/0.1 train/eval/test split.

Environment

  1. Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.
sys.platform: linux
Python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
CUDA available: True
GPU 0,1: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.3.r11.3/compiler.29920130_0
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.10.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.1
OpenCV: 4.5.4-dev
MMCV: 1.3.17
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.3
MMDetection: 2.18.0+6cf9aa1
  2. You may add additional information that may be helpful for locating the problem:

Installed PyTorch from Anaconda. I also tried PyTorch 1.8.1 but got the same deadlock as in 1.10.0. Tried both CUDA 10.2 and 11.3.

Error traceback There is no traceback; training simply deadlocks.

Bug fix Currently I just train non-distributed; however, this is not ideal.

hhaAndroid commented 3 years ago

@csvance We recently fixed a dataloader deadlock in IterBasedRunner; please refer to https://github.com/open-mmlab/mmcv/pull/1442. Since your code is already up to date, you can try increasing the sleep time first.
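
A rough sketch of how to find and bump that sleep (the module path is my assumption; please verify it against the linked PR and your installed mmcv):

# locate the installed runner module that should contain the sleep
python -c "import mmcv.runner.iter_based_runner as m; print(m.__file__)"
# then, in that file, temporarily change the time.sleep(2) near the epoch
# transition to a larger value, e.g. time.sleep(60), and rerun the test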

csvance commented 3 years ago

Hi @hhaAndroid, I tried increasing the sleep to 60 seconds, but I still get a hang at the same place. I also tried the epoch-based runner and get a hang there after a certain number of epochs.

Unfortunately, I have not been successful in getting a stack trace for any thread/process other than the mmcv IterBasedRunner. The good news is that single-GPU training is sufficient for my problem, so I can continue to move forward with mmdetection.
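
For anyone trying the same thing, this is the kind of approach I have been attempting (py-spy is just a generic sampling profiler, not something mmcv/mmdetection provides):

# install a profiler that can dump stacks of running Python processes
pip install py-spy
# find the PIDs of the training processes and dataloader workers
ps aux | grep tools/train.py
# dump the Python stack of each suspect PID to see where it is blocked
sudo py-spy dump --pid <PID>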

My best guess is there is some sort of bug in PyTorch dataloader/DDP which is causing this problem rather than an issue with the logic of mmcv/mmdetection.

csvance commented 3 years ago

I just realized I had forgotten to configure the number of classes in the ROI head! I changed the number of steps, ran training non-distributed in a debugger, and got an exception in the COCO dataset class! If this is the root cause of my problem, there may be something going wrong with exception handling. I will continue to dig into this and update here.
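
For reference, the override I had forgotten looks roughly like this (a sketch assuming the single 'ring' class in my config above):

model = dict(
    roi_head=dict(
        bbox_head=dict(num_classes=1)))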

EDIT:

I still get a deadlock with distributed training, but I am going to keep digging.

Yuting-Gao commented 3 years ago

@csvance

Hello, when training on my own dataset I also ran into this deadlock. Have you solved it yet?

csvance commented 2 years ago

Hi @Yuting-Gao, the only way I could avoid it was single-GPU training without DDP (single-GPU DDP also deadlocks). Luckily my problem is fine-tuning on a 2000-image dataset, so a single GPU is not a problem. I still have no idea what the root cause is, but I suspect it has something to do with DDP specifically.
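
Concretely, I just launch without the distributed launcher, the same as the non-distributed command in my original report:

PYTHONPATH=`pwd` python tools/train.py <your config> --launcher=none --gpus=1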