open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

The training process may get stuck #166

Closed miracle-fmh closed 5 years ago

miracle-fmh commented 5 years ago

After training for some iterations, GPU utilization may increase from about 50% to 100%, and then training gets completely stuck and cannot run any more iterations. The code does not throw any error.

miracle-fmh commented 5 years ago

My system: CUDA 9, Python 3.5, and PyTorch 0.4.1.
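A quick way to double-check which PyTorch, CUDA, and cuDNN versions the runtime actually sees (a sketch using standard torch attributes available in 0.4.x and later):

```bash
# Print the PyTorch version, the CUDA version it was built with, and the cuDNN version.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())"
```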

hellock commented 5 years ago

#99

miracle-fmh commented 5 years ago

Thanks, I will try building PyTorch from source.

miracle-fmh commented 5 years ago

@hellock I built PyTorch with the following steps:

git clone -b v0.4.1 --recursive https://github.com/pytorch/pytorch
cd pytorch
python setup.py install

My system info: Python 3.5 + CUDA 9 + cuDNN 7.0.5. With the PyTorch built this way, training still gets stuck and cannot run any iterations.

Can you show me how you built PyTorch? Thank you.

hellock commented 5 years ago

Though it is a known issue that PyTorch sometimes gets stuck on V100, your case looks weird. I will take a look tomorrow.

thangvubk commented 5 years ago

In my opinion, it is sometimes related to the CUDA installation. I have some suggestions for you.

  1. If you are using CUDA 9, you should upgrade the NVIDIA driver to 396.51, since many people reported that this resolves random hangs in multi-GPU training.
  2. If option 1 does not work, I recommend installing in a clean Docker container; nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 is a good image to start from (a minimal launch sketch follows below). See #159
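A minimal sketch of starting from that image (this assumes the nvidia-docker2 runtime is installed; the shared-memory size and the /path/to/mmdetection mount are placeholders):

```bash
# Pull the CUDA 9.0 + cuDNN 7 development image suggested above.
docker pull nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04

# Start an interactive container with GPU access and a larger /dev/shm
# (PyTorch DataLoader workers use shared memory), mounting the code tree.
docker run --runtime=nvidia -it --shm-size=8g \
    -v /path/to/mmdetection:/mmdetection \
    nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 /bin/bash
```
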
miracle-fmh commented 5 years ago

In my opinion, it is sometimes related to the CUDA installation. I have some suggestions for you.

  1. If you are using CUDA 9, you should upgrade the NVIDIA driver to 396.51, since many people reported that this resolves random hangs in multi-GPU training.
  2. If option 1 does not work, I recommend installing in a clean Docker container; nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 is a good image to start from. See #159

Thanks, I will try it.

miracle-fmh commented 5 years ago

@hellock @thangvubk Thanks, the training process works well after upgrading the NVIDIA driver to 410.xx.
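For anyone comparing setups, a quick way to check the installed driver version (a sketch; nvidia-smi ships with the driver):

```bash
# Print just the driver version reported for each GPU.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```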

KimSoybean commented 5 years ago

@miracle-fmh I met the same problem. For Faster-r50-fpn, it throws "UserWarning: semaphore_tracker: There appear to be 8 leaked semaphores to clean up at shutdown." For Mask-r50-fpn, it stops without any error. For Retinanet-r50, it runs properly. I use 4 Tesla P40 GPUs, CUDA 9.0, and PyTorch 0.4.1 (installed from "pip install pytorch_0.4.1_xxxx.whl"), with driver 410 (I have also tried 390, with the same result). Can you suggest any other methods?

XudongWang12Sigma commented 5 years ago

Hi, I am also running into a similar problem. Have you solved it yet? @KimSoybean

yhcao6 commented 5 years ago

After I compiled PyTorch from source, it no longer gets stuck.

ujsyehao commented 5 years ago

Hi, my environment: Titan X + driver 410 + CUDA 10 + PyTorch 1.1. I encounter the problem too.

ujsyehao commented 5 years ago

@thangvubk Yeah, you are right. I tried the new Docker image you recommended and it works!

visenzeadam commented 5 years ago

The problem is still there: the training gets stuck, with no progress after loading the dataset.

After I installed the driver and toolkit for CUDA 10.1, I installed PyTorch 1.2 from the latest source, followed the guide from nvcc and installed nvcc2, and installed gcc 7.4.0 via apt-get. Conda installed some other dependencies such as mmcv and Cython. OS: Ubuntu 18.04. I tried both ./tools/dist_train.sh and non-distributed training, as shown below.

$ python ./tools/train.py --work_dir work_dirs/ --validate --gpus 4 configs/retinanet_r101_fpn_1x.py
2019-07-19 07:50:51,719 - INFO - Distributed training: False
2019-07-19 07:50:52,319 - INFO - load model from: modelzoo://resnet101
2019-07-19 07:50:53,156 - WARNING - unexpected key in source state_dict: fc.weight, fc.bias

missing keys in source state_dict: layer3.19.bn2.num_batches_tracked, layer1.2.bn1.num_batches_tracked, layer2.1.bn1.num_batches_tracked, layer4.0.downsample.1.num_batches_tracked, layer4.1.bn3.num_batches_tracked, layer3.7.bn3.num_batches_tracked, layer3.11.bn1.num_batches_tracked, layer2.3.bn1.num_batches_tracked, layer4.2.bn2.num_batches_tracked, layer3.16.bn2.num_batches_tracked, layer3.5.bn3.num_batches_tracked, layer3.15.bn2.num_batches_tracked, layer3.14.bn2.num_batches_tracked, layer3.4.bn2.num_batches_tracked, layer3.14.bn1.num_batches_tracked, layer3.12.bn2.num_batches_tracked, layer4.1.bn1.num_batches_tracked, layer1.1.bn3.num_batches_tracked, layer3.0.bn1.num_batches_tracked, layer3.6.bn2.num_batches_tracked, layer3.3.bn3.num_batches_tracked, layer3.1.bn1.num_batches_tracked, layer3.19.bn1.num_batches_tracked, layer2.2.bn2.num_batches_tracked, layer2.2.bn1.num_batches_tracked, layer3.8.bn2.num_batches_tracked, layer3.1.bn3.num_batches_tracked, layer3.10.bn1.num_batches_tracked, layer2.1.bn2.num_batches_tracked, layer3.13.bn2.num_batches_tracked, layer3.2.bn2.num_batches_tracked, layer3.5.bn2.num_batches_tracked, layer3.15.bn3.num_batches_tracked, layer3.18.bn1.num_batches_tracked, layer3.5.bn1.num_batches_tracked, layer4.0.bn2.num_batches_tracked, layer3.0.bn3.num_batches_tracked, layer3.16.bn1.num_batches_tracked, layer3.21.bn3.num_batches_tracked, layer3.4.bn3.num_batches_tracked, layer3.7.bn1.num_batches_tracked, layer2.1.bn3.num_batches_tracked, layer3.7.bn2.num_batches_tracked, layer1.2.bn2.num_batches_tracked, bn1.num_batches_tracked, layer2.3.bn2.num_batches_tracked, layer3.0.downsample.1.num_batches_tracked, layer3.18.bn2.num_batches_tracked, layer3.22.bn2.num_batches_tracked, layer4.2.bn1.num_batches_tracked, layer3.2.bn3.num_batches_tracked, layer3.22.bn3.num_batches_tracked, layer1.2.bn3.num_batches_tracked, layer3.4.bn1.num_batches_tracked, layer2.0.bn1.num_batches_tracked, layer3.20.bn3.num_batches_tracked, layer3.13.bn1.num_batches_tracked, layer3.22.bn1.num_batches_tracked, layer3.12.bn3.num_batches_tracked, layer1.1.bn1.num_batches_tracked, layer1.0.bn3.num_batches_tracked, layer3.11.bn3.num_batches_tracked, layer3.12.bn1.num_batches_tracked, layer4.0.bn3.num_batches_tracked, layer3.20.bn2.num_batches_tracked, layer2.2.bn3.num_batches_tracked, layer3.2.bn1.num_batches_tracked, layer3.3.bn1.num_batches_tracked, layer3.6.bn1.num_batches_tracked, layer3.0.bn2.num_batches_tracked, layer3.9.bn2.num_batches_tracked, layer4.0.bn1.num_batches_tracked, layer2.0.bn3.num_batches_tracked, layer3.8.bn3.num_batches_tracked, layer3.3.bn2.num_batches_tracked, layer3.21.bn1.num_batches_tracked, layer2.0.bn2.num_batches_tracked, layer3.17.bn3.num_batches_tracked, layer2.3.bn3.num_batches_tracked, layer3.17.bn2.num_batches_tracked, layer3.8.bn1.num_batches_tracked, layer3.20.bn1.num_batches_tracked, layer3.17.bn1.num_batches_tracked, layer3.14.bn3.num_batches_tracked, layer3.6.bn3.num_batches_tracked, layer1.0.downsample.1.num_batches_tracked, layer3.13.bn3.num_batches_tracked, layer1.0.bn2.num_batches_tracked, layer3.16.bn3.num_batches_tracked, layer4.1.bn2.num_batches_tracked, layer3.11.bn2.num_batches_tracked, layer3.10.bn2.num_batches_tracked, layer3.21.bn2.num_batches_tracked, layer3.1.bn2.num_batches_tracked, layer3.19.bn3.num_batches_tracked, layer3.18.bn3.num_batches_tracked, layer3.9.bn1.num_batches_tracked, layer3.10.bn3.num_batches_tracked, layer3.15.bn1.num_batches_tracked, layer2.0.downsample.1.num_batches_tracked, layer4.2.bn3.num_batches_tracked, 
layer3.9.bn3.num_batches_tracked, layer1.1.bn2.num_batches_tracked, layer1.0.bn1.num_batches_tracked

loading annotations into memory...
Done (t=0.64s)
creating index...
index created!
2019-07-19 07:50:56,395 - INFO - Start running, host: adam@adam-train-1, work_dir: /home/adam/code/mmdetection/work_dirs
2019-07-19 07:50:56,397 - INFO - workflow: [('train', 1)], max: 60 epochs
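For reference, mmdetection's distributed launcher takes the config file and the GPU count as positional arguments, with remaining flags forwarded to tools/train.py; a sketch using the same config and 4 GPUs:

```bash
# Launch distributed training on 4 GPUs with the same RetinaNet config.
./tools/dist_train.sh configs/retinanet_r101_fpn_1x.py 4 --validate
```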

yanxp commented 5 years ago

The problem is still there: the training gets stuck, with no progress after loading the dataset.

Have you solved the problem?

ZlodeiBaal commented 3 years ago

Same issue here, but I found a solution for my case: pytorch/pytorch#33296. At the beginning of the program (in train.py), import cv2 first, before importing torch. It seems to be a PyTorch-internal problem.
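A minimal sketch of that workaround at the top of tools/train.py (the import order is the fix reported in the linked issue; everything else in the file stays the same):

```python
# Workaround from pytorch/pytorch#33296: import cv2 before torch.
# On some setups this avoids the training hang described in this thread.
import cv2  # noqa: F401  (imported first on purpose)
import torch
```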

viet98lx commented 1 year ago

Same issue here, but I found a solution for my case: pytorch/pytorch#33296. At the beginning of the program (in train.py), import cv2 first, before importing torch. It seems to be a PyTorch-internal problem.

It works for me too.