thomas-ames opened this issue 3 years ago
Hi @thomas-ames, thanks for reporting this problem. I think the issue you ran into is the same as #10, which seems to have been solved by @oym050922021.
Hi @oym050922021, could you please share your solution to this problem to help @thomas-ames fix it? Thank you.
Hi, sorry, I haven't solved the problem yet.
You may need to upgrade the Nvidia driver according to this answer.
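In case it helps, here is a minimal sketch (my addition, not from the linked answer) for checking the CUDA version PyTorch was built against alongside the installed NVIDIA driver version, which is the compatibility pair to verify before upgrading the driver:

```python
# Hedged sketch: print the CUDA version PyTorch was built with and the installed
# NVIDIA driver version, so they can be compared against NVIDIA's compatibility table.
import subprocess
import torch

print("PyTorch built with CUDA:", torch.version.cuda)
print("CUDA available to PyTorch:", torch.cuda.is_available())

# nvidia-smi reports the installed driver; --query-gpu keeps the output terse.
driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip()
print("Installed NVIDIA driver:", driver)
```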
Describe the bug
While running distributed training, the script works fine for 3-5 epochs and then stops. The GPUs remain active and no error or stack trace is printed, but there is no further output. I cannot tell why it happens: I have rerun the same configuration in the same environment repeatedly, and the script stops at irregular intervals. It always seems to be early on; the latest it has hung is epoch 5.
Reproduction
./tools/dist_train.sh /home/ec2-user/vfnetx_config.py 8
(The config file is the same as the one in the repo, I just renamed it.)
Environment
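To gather more information about the silent hang, one option (my suggestion, not something shipped with this repo) is to have each rank periodically dump its Python stacks with the standard-library faulthandler module; adding it near the top of the training entry point and the log path shown are assumptions for illustration. If every rank is blocked in the same collective call, the hang is more likely a communication problem than a bug in the model code.

```python
# Hedged debugging sketch: dump all thread stacks from each rank every 10 minutes
# so a silent hang leaves evidence of where each process is blocked.
import faulthandler
import os

rank = os.environ.get("RANK", "0")  # torch.distributed.launch sets RANK per process
trace_file = open(f"/tmp/rank{rank}_stacks.log", "w")  # example path, one file per rank

# Re-dump every 600 seconds until the process exits; the file must stay open.
faulthandler.dump_traceback_later(600, repeat=True, file=trace_file)
```

Setting NCCL_DEBUG=INFO in the environment before launching dist_train.sh may also surface communication errors that otherwise fail silently.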