microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

Distributed cntk stuck #3194

Open csurbhi opened 6 years ago

csurbhi commented 6 years ago

I have compiled CNTK from source and installed the prerequisites.

I am using the commands from the ResNet Python example (https://github.com/Microsoft/CNTK/tree/master/Examples/Image/Classification/ResNet/Python) and from the parallel-training documentation (https://docs.microsoft.com/en-us/cognitive-toolkit/Multiple-GPUs-and-machines#42-running-parallel-training-with-python).

I am facing two problems:
1) The distributed training command gets stuck.
2) Although I configured CNTK with the OpenCV location /usr/local/opencv-3.1.0/, it does not seem to find OpenCV at run time.

Here is the exact command that gets stuck:

/usr/local/mpi/bin/mpiexec -d -n 2 -hostfile hostfile python /home/user/cntk/cntk/Examples/Image/Classification/ResNet/Python/TrainResNet_CIFAR10_Distributed.py -n resnet20 -q 1 -s True -e 1

Here is the output:

[ml-factory:26758] procdir: /tmp/ompi.ml-factory.1000/pid.26758/0/0
[ml-factory:26758] jobdir: /tmp/ompi.ml-factory.1000/pid.26758/0
[ml-factory:26758] top: /tmp/ompi.ml-factory.1000/pid.26758
[ml-factory:26758] top: /tmp/ompi.ml-factory.1000
[ml-factory:26758] tmp: /tmp
[ml-factory:26758] sess_dir_cleanup: job session dir does not exist
[ml-factory:26758] sess_dir_cleanup: top session dir not empty - leaving
[ml-factory:26758] procdir: /tmp/ompi.ml-factory.1000/pid.26758/0/0
[ml-factory:26758] jobdir: /tmp/ompi.ml-factory.1000/pid.26758/0
[ml-factory:26758] top: /tmp/ompi.ml-factory.1000/pid.26758
[ml-factory:26758] top: /tmp/ompi.ml-factory.1000
[ml-factory:26758] tmp: /tmp
[ml-factory:26758] mpiexec: reset PATH: /usr/local/mpi/bin:/home/user/anaconda2/envs/cntk-py27/bin:/home/user/cntk/cntk/bin/:/home/user/bin:/home/user/.local/bin:/home/user/anaconda2/bin:/usr/local/mpi/bin:/usr/local/cuda-9.1/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
[ml-factory:26758] mpiexec: reset LD_LIBRARY_PATH: /usr/local/mpi/lib:/usr/local/lib:/usr/local/cuda-9.1/lib64
[anandj:27699] procdir: /tmp/ompi.anandj.1001/jf.27559/0/1
[anandj:27699] jobdir: /tmp/ompi.anandj.1001/jf.27559/0
[anandj:27699] top: /tmp/ompi.anandj.1001/jf.27559
[anandj:27699] top: /tmp/ompi.anandj.1001
[anandj:27699] tmp: /tmp
[anandj:27699] sess_dir_cleanup: job session dir does not exist
[anandj:27699] sess_dir_cleanup: top session dir not empty - leaving
[anandj:27699] procdir: /tmp/ompi.anandj.1001/jf.27559/0/1
[anandj:27699] jobdir: /tmp/ompi.anandj.1001/jf.27559/0
[anandj:27699] top: /tmp/ompi.anandj.1001/jf.27559
[anandj:27699] top: /tmp/ompi.anandj.1001
[anandj:27699] tmp: /tmp
[ml-factory:26758] [[27559,0],0] Releasing job data for [INVALID]
/home/user/anaconda2/envs/cntk-py27/lib/python2.7/site-packages/cntk/cntk_py_init.py:102: UserWarning:

################################################ Missing optional dependency ( OpenCV ) ################################################
CNTK may crash if the component that depends on those dependencies is loaded.
Visit https://docs.microsoft.com/en-us/cognitive-toolkit/Setup-Linux-Python#optional-opencv for more information.
############################################################################################################################################

warnings.warn(WARNING_MSG % (' OpenCV ', 'https://docs.microsoft.com/en-us/cognitive-toolkit/Setup-Linux-Python#optional-opencv'))
Start training: quantize_bit = 1, epochs = 1, distributed_after = 0
[ml-factory:26764] procdir: /tmp/openmpi-sessions-user@ml-factory_0/27559/1/0
[ml-factory:26764] jobdir: /tmp/openmpi-sessions-user@ml-factory_0/27559/1
[ml-factory:26764] top: openmpi-sessions-user@ml-factory_0
[ml-factory:26764] tmp: /tmp


What am I missing here?

Also, this command works fine:

/usr/local/mpi/bin/mpiexec -hostfile hostfile python /home/user/cntk/cntk/Examples/Image/Classification/ResNet/Python/TrainResNet_CIFAR10.py -n resnet20 -e 1 -o /tmp/output/
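
A note on the OpenCV warning: the `reset LD_LIBRARY_PATH` line in the output lists /usr/local/mpi/lib, /usr/local/lib and /usr/local/cuda-9.1/lib64, but no directory under /usr/local/opencv-3.1.0/. One thing that may be worth checking is whether any directory on that path actually contains the OpenCV shared libraries. The sketch below is only a rough diagnostic. The helper name `find_shared_libs` and the file prefix `libopencv_core` are assumptions made for illustration, and this is not how CNTK itself performs its dependency check. The dynamic loader also consults other locations (for example the ldconfig cache), so an empty result is a hint rather than proof.

```python
import os


def find_shared_libs(search_path, prefix="libopencv_core"):
    """Return files whose names start with `prefix` in the directories
    of a search path string (entries separated by os.pathsep)."""
    matches = []
    for directory in search_path.split(os.pathsep):
        # Skip empty entries and directories that do not exist.
        if not directory or not os.path.isdir(directory):
            continue
        for name in sorted(os.listdir(directory)):
            if name.startswith(prefix):
                matches.append(os.path.join(directory, name))
    return matches


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = find_shared_libs(path)
    print("LD_LIBRARY_PATH = %s" % path)
    print("OpenCV core libraries found: %s" % (hits if hits else "none"))
```

Running this script under the same mpiexec invocation, in place of the training script, would show what each worker process sees, which can differ from the interactive shell.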

ke1337 commented 6 years ago

Please use CNTK 2.5.1 to avoid the missing-dependency issue. For distributed training, are you running on multiple machines? On a single machine with multiple GPUs, you don't need the -hostfile option to mpiexec.
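
To make the single-machine versus multi-machine distinction concrete, here is a minimal sketch of a helper that assembles the mpiexec argument list and adds `-hostfile` only when a hostfile is given. The function name `build_mpiexec_argv` and its parameters are hypothetical and chosen for illustration. The flags `-n` and `-hostfile` and the script arguments are the ones that appear in the commands in this thread.

```python
def build_mpiexec_argv(script, script_args, num_workers,
                       hostfile=None, mpiexec="mpiexec"):
    """Build an argument list for launching `script` under mpiexec.

    `hostfile` is passed only for multi-machine runs. On a single
    machine with several GPUs it is left out.
    """
    argv = [mpiexec, "-n", str(num_workers)]
    if hostfile:
        argv += ["-hostfile", hostfile]
    argv += ["python", script] + list(script_args)
    return argv


if __name__ == "__main__":
    args = ["-n", "resnet20", "-q", "1", "-s", "True", "-e", "1"]
    # Single machine, two workers: no hostfile.
    print(" ".join(build_mpiexec_argv(
        "TrainResNet_CIFAR10_Distributed.py", args, 2)))
    # Multiple machines: pass the hostfile explicitly.
    print(" ".join(build_mpiexec_argv(
        "TrainResNet_CIFAR10_Distributed.py", args, 2, hostfile="hostfile")))
```

The returned list can be handed to `subprocess.call` as-is, which avoids shell quoting problems.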