uqfoundation / pyina

MPI parallel map and cluster scheduling
http://pyina.rtfd.io

mpirun jobs only run on node 0 with scatter_gather.py #11

Closed mmckerns closed 10 years ago

mmckerns commented 10 years ago

scatter_gather.py should produce the following output (and does on a Linux cluster with Python 2.6.4):

dude@shc-a>$ mpirun -np 2 python scatter_gather.py
Input:
 [node 0] [ 0.  1.  2.]
 [node 1] [ 3.  4.  5.]
Running on 2 cores...
Output:
 [ 0.          0.70807342  0.82682181  0.01991486  0.57275002  0.91953576]

However, on Mac OS X with Python 2.7.8, each process runs as if it were the only node, and the output is:

dude@hilbert>$ mpirun -np 2 python2.7 scatter_gather.py
Input:
 [node 0] [ 0.  1.  2.]
Running on 1 cores...
Output:
 [ 0.          0.70807342  0.82682181]
Input:
 [node 0] [ 0.  1.  2.]
Running on 1 cores...
Output:
 [ 0.          0.70807342  0.82682181]

To do: check which mpirun is actually being invoked and which MPI version is installed. Also check for PATH issues, and whether an MPI daemon needs to be started first.
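
The checks above could be scripted roughly as follows. This is only a sketch, assuming Python 3.7+ and that mpirun accepts a `--version` flag (Open MPI and MPICH both appear to); the function name `describe_mpi_environment` is made up for illustration and is not part of pyina. It reports which mpirun is first on PATH and, if mpi4py is importable, which MPI library mpi4py reports and how it was configured.

```python
import shutil
import subprocess


def describe_mpi_environment():
    """Collect hints about the MPI launcher on PATH and the MPI behind mpi4py.

    Hypothetical helper for illustration; not part of pyina or mpi4py.
    """
    info = {"mpirun_path": shutil.which("mpirun"), "mpirun_version": None}

    # Ask the launcher to identify itself (assumes it understands --version).
    if info["mpirun_path"]:
        try:
            result = subprocess.run(
                [info["mpirun_path"], "--version"],
                capture_output=True, text=True, timeout=10,
            )
            info["mpirun_version"] = (result.stdout or result.stderr).strip()
        except (OSError, subprocess.SubprocessError):
            pass  # leave the version as None if the call fails

    # Ask mpi4py which MPI implementation it reports and how it was built.
    try:
        import mpi4py
        from mpi4py import MPI
    except ImportError:
        info["mpi4py_vendor"] = None
        info["mpi4py_config"] = None
    else:
        info["mpi4py_vendor"] = MPI.get_vendor()     # e.g. (name, version tuple)
        info["mpi4py_config"] = mpi4py.get_config()  # build-time settings
    return info


if __name__ == "__main__":
    for key, value in describe_mpi_environment().items():
        print(key, "=", value)
```

If the launcher's version string names one MPI implementation and `mpi4py_vendor` names another, that would be worth investigating first.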

mmckerns commented 10 years ago

A more succinct example is pyina/examples_other/nodes.py:

dude@shc-b>$ mpirun -np 2 python nodes.py 
Node (1) of 2 
Node (0) of 2 

versus

dude@hilbert>$ mpirun -np 2 python2.7 nodes.py 
Node (0) of 1 
Node (0) of 1 

mmckerns commented 10 years ago

As noted here: http://www.open-mpi.org/community/lists/users/2012/09/20328.php

Apparently, if all your outputs say "I am process 0 of 1", this typically means there is a mismatch between the Open MPI version that mpi4py was compiled against and the mpirun that was used to launch it. You may even have compiled mpi4py against MPICH, but used the mpirun from Open MPI to launch it. That can easily lead to side effects, such as reports that your program exited incorrectly.

Conclusion: you need to be very careful to use the exact same version of Open MPI (or MPICH) both to compile mpi4py and to mpirun whatever Python program you are running.
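
One way to catch this kind of mismatch at runtime is to compare the process count the launcher advertises with the size the MPI library reports. The sketch below is a heuristic, not a definitive test: it assumes Open MPI's mpirun exports `OMPI_COMM_WORLD_SIZE` and MPICH's launcher exports `PMI_SIZE`, which holds for the versions I know of but may not hold everywhere. The function names are made up for illustration.

```python
import os


def launcher_world_size(environ=None):
    """Return (launcher_name, size) as advertised by the launcher's environment.

    Returns (None, None) if neither assumed variable is set.
    """
    environ = os.environ if environ is None else environ
    candidates = (("Open MPI", "OMPI_COMM_WORLD_SIZE"), ("MPICH", "PMI_SIZE"))
    for launcher, variable in candidates:
        value = environ.get(variable)
        if value is not None and value.isdigit():
            return launcher, int(value)
    return None, None


def looks_mismatched(comm_size, environ=None):
    """True if the launcher advertised a different size than the library sees."""
    _, launched_size = launcher_world_size(environ)
    return launched_size is not None and launched_size != comm_size


if __name__ == "__main__":
    from mpi4py import MPI  # requires mpi4py; only needed for the live check

    size = MPI.COMM_WORLD.Get_size()
    if looks_mismatched(size):
        launcher, launched = launcher_world_size()
        print("Launched %d processes with %s, but MPI reports size %d; "
              "mpi4py may be built against a different MPI." % (launched, launcher, size))
```

With the failing setup described above (two processes launched, each reporting "of 1"), `looks_mismatched(1)` would return True as long as the launcher exported one of the assumed variables.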

Upon rebuilding mpi4py with MPICH, and making sure the links were going to the correct executable… it now works on the Mac. No changes were needed to the code, so this is not a bug.