microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

Distributed Training Error #3419

Open chung1204 opened 5 years ago

chung1204 commented 5 years ago

I want to train VGG16_ImageNet_Distributed.py on multiple nodes using mpiexec (two GPUs on one node), so I followed the instructions at https://docs.microsoft.com/en-us/cognitive-toolkit/Multiple-GPUs-and-machines. Training on a single node worked well, but training on multiple nodes did not: the other node fails with an error saying it cannot import the numpy or cntk module. Since I train through Anaconda, I changed the default Python path to Anaconda, but that did not solve the problem. Is there a solution to this?

The errors are as follows:

    Traceback (most recent call last):
      File "/home/cslee/cntk/Examples/Image/Classification/VGG/Python/VGG16_modify_2.py", line 11, in <module>
        import numpy as np
    ImportError: No module named numpy
    Traceback (most recent call last):
      File "/home/cslee/cntk/Examples/Image/Classification/VGG/Python/VGG16_modify_2.py", line 11, in <module>
        import numpy as np
    ImportError: No module named numpy

    Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.

    mpiexec detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

    Process name: [[62073,1],2]
    Exit code: 1

jaliyae commented 5 years ago

Please check two things:

  1. All nodes have numpy installed.
  2. numpy is available without activating a special environment. When mpid invokes the Python process, it runs in the default conda environment, not in a specific cntk environment.
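
One way to carry out both checks is a small diagnostic script that each process runs in place of the training script. This is a minimal sketch, not part of CNTK: the function name `check_environment` and the report keys are hypothetical, and it assumes conda sets the `CONDA_DEFAULT_ENV` variable when an environment is active. It sticks to constructs that work on both Python 2 and Python 3, so it should still run if a remote node falls back to an older system Python.

```python
import os
import socket
import sys


def check_environment(modules):
    """Describe the interpreter in use and list the modules it cannot import."""
    report = {
        "host": socket.gethostname(),
        "python": sys.executable,
        # Assumption: conda exports CONDA_DEFAULT_ENV when an environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV", "(not set)"),
        "missing": [],
    }
    for name in modules:
        try:
            __import__(name)
        except ImportError:
            report["missing"].append(name)
    return report


if __name__ == "__main__":
    result = check_environment(["numpy", "cntk"])
    print(result)
    # A non-zero exit code makes mpiexec report the failing process.
    sys.exit(1 if result["missing"] else 0)
```

Launching this file with the same mpiexec command used for training prints one report per process. If a remote process shows a different `python` path than the local one, or lists numpy or cntk under `missing`, that node is not picking up the Anaconda environment.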