open-mmlab / mmpose

OpenMMLab Pose Estimation Toolbox and Benchmark.
https://mmpose.readthedocs.io/en/latest/
Apache License 2.0

a dataset bug causing topdown training very slow, wasting 3 min every epoch #923

Closed WeianMao closed 3 years ago

WeianMao commented 3 years ago

I found a dataset bug. I tested it on several servers (including one with 8 A100s and a 96-core CPU), and it happened on all of them. The bug wastes about 3 minutes at the start of every epoch. I can locate the bug, but I don't know why it happens. It seems to happen only when launching distributed training.

Bug location: when you launch a topdown method, e.g. topdown_heatmap/coco/res50_coco_256x192.py, go to /mmcv/runner/epoch_based_runner.py, around line 48. There is this code:

    self.call_hook('before_train_epoch')
    time.sleep(2)  # Prevent possible deadlock during epoch transition
    for i, data_batch in enumerate(self.data_loader):
        self._inner_iter = i

At the beginning of every epoch, the line `for i, data_batch in enumerate(self.data_loader):` takes about 3 minutes, which makes training very slow.

You can modify the original code as shown below to reproduce the issue; it only happens at the very beginning of each epoch.

    start_time = time.time()  # added so the first print below has a reference point
    self.call_hook('before_train_epoch')
    time.sleep(2)  # Prevent possible deadlock during epoch transition
    print('before_train_epoch, time: {}'.format(time.time() - start_time))
    start_time = time.time()
    for i, data_batch in enumerate(self.data_loader):
        self._inner_iter = i
        # the first iteration of each epoch includes the dataloader worker startup
        print('before_train_iter_load_data, time: {}'.format(time.time() - start_time))
        start_time = time.time()
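
For reference, here is a standalone sketch (toy in-memory dataset, made-up sizes; not the actual mmcv/MMPose code) that measures the time to the first batch of each epoch in the same way. With persistent_workers=False (the default), all worker processes are shut down and re-spawned at every `enumerate(self.data_loader)`, which is where the delay comes from:

    import time
    import torch
    from torch.utils.data import Dataset, DataLoader

    class ToyDataset(Dataset):
        """Stand-in for the real dataset; its setup cost is paid again in every new worker."""

        def __init__(self):
            self.data = torch.randn(128, 3, 256, 192)

        def __len__(self):
            return len(self.data)

        def __getitem__(self, idx):
            return self.data[idx]

    if __name__ == '__main__':
        loader = DataLoader(ToyDataset(), batch_size=32, num_workers=8,
                            persistent_workers=False)  # workers restart every epoch
        for epoch in range(3):
            start_time = time.time()
            for i, data_batch in enumerate(loader):
                if i == 0:
                    # time-to-first-batch includes the worker startup overhead
                    print('epoch {}: first batch after {:.2f}s'.format(
                        epoch, time.time() - start_time))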

Here is my system information:

    Python: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) [GCC 9.3.0]
    CUDA available: True
    GPU 0,1,2,3,4,5,6,7: A100-SXM4-40GB
    CUDA_HOME: /usr/local/cuda-11.1
    NVCC: Build cuda_11.1.TC455_06.29190527_0
    GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
    PyTorch: 1.8.1+cu111
    PyTorch compiling details: PyTorch built with:
    TorchVision: 0.9.1+cu111
    OpenCV: 4.5.3
    MMCV: 1.3.8
    MMCV Compiler: GCC 7.5
    MMCV CUDA Compiler: 11.1
    MMPose: 0.15.0+51b4b45

jin-s13 commented 3 years ago

https://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader

In PyTorch 1.7+, there is a new DataLoader argument, persistent_workers (default: False). If True, the data loader will not shut down the worker processes after the dataset has been consumed once.

We will support this feature, like https://github.com/open-mmlab/mmocr/pull/459
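
A minimal sketch of what this looks like in plain PyTorch (toy TensorDataset and made-up batch size; MMPose would need to forward the flag from its dataloader builder):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    if __name__ == '__main__':
        # TensorDataset stands in for the real topdown dataset.
        train_dataset = TensorDataset(torch.randn(64, 3, 256, 192))

        loader = DataLoader(
            train_dataset,
            batch_size=32,
            shuffle=True,
            num_workers=8,
            persistent_workers=True,  # PyTorch >= 1.7: keep workers alive across epochs
        )
        for epoch in range(2):
            for (data_batch,) in loader:
                pass  # from the 2nd epoch on, there is no worker re-spawn / dataset re-init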

WeianMao commented 3 years ago

> https://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader
>
> In PyTorch 1.7+, there is a new DataLoader argument, persistent_workers (default: False). If True, the data loader will not shut down the worker processes after the dataset has been consumed once.
>
> We will support this feature, like open-mmlab/mmocr#459

So if I set persistent_workers to True, will this issue be fixed? Thanks.

jin-s13 commented 3 years ago

Please have a try.

WeianMao commented 3 years ago

> Please have a try.

Yes, it works, thank you very much! The time went down from 180 s to 2 s. I think you should add this to mmpose; it will speed up training significantly on fast servers.

WeianMao commented 3 years ago

> Please have a try.

BTW, I think the flag (find_unused_parameters = cfg.get('find_unused_parameters', True)) should be set to False.

jin-s13 commented 3 years ago

Setting find_unused_parameters=True prevents DDP from waiting forever for absent gradients during the backward pass. In most cases, I think it is worth introducing some extra overhead (traversing the autograd graph) to avoid getting stuck.

https://pytorch.org/docs/stable/notes/ddp.html

WeianMao commented 3 years ago

> Setting find_unused_parameters=True prevents DDP from waiting forever for absent gradients during the backward pass. In most cases, I think it is worth introducing some extra overhead (traversing the autograd graph) to avoid getting stuck.
>
> https://pytorch.org/docs/stable/notes/ddp.html

If I'm right, there will be an error when some model parameters are not in the autograd graph and find_unused_parameters=False. If we set find_unused_parameters=True, the error disappears. However, having model parameters that are not in the autograd graph is itself a model bug, and in that case I think we should be told about it. Also, previously, find_unused_parameters=True meant higher memory consumption and slower training (though in mmpose the memory and speed seem to be the same).
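
A toy illustration of the trade-off (not MMPose code; single CPU process with the gloo backend, hypothetical layer sizes). Here `unused` never contributes to the loss, so with find_unused_parameters=False DDP complains that some parameters did not receive gradients, while True silences the complaint at the cost of traversing the autograd graph every iteration:

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    class PartlyUsedModel(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.used = torch.nn.Linear(8, 8)
            self.unused = torch.nn.Linear(8, 8)  # never used in forward -> the "model bug"

        def forward(self, x):
            return self.used(x)

    if __name__ == '__main__':
        os.environ['MASTER_ADDR'] = '127.0.0.1'
        os.environ['MASTER_PORT'] = '29500'
        dist.init_process_group('gloo', rank=0, world_size=1)

        # With find_unused_parameters=False, DDP errors out once it notices that
        # self.unused never received a gradient, which surfaces the model bug.
        # With find_unused_parameters=True, training runs, but DDP walks the
        # autograd graph every iteration to find and skip the unused parameters.
        model = DDP(PartlyUsedModel(), find_unused_parameters=False)
        for step in range(2):
            loss = model(torch.randn(4, 8)).sum()
            loss.backward()

        dist.destroy_process_group()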