Closed Helveg closed 4 years ago
Actually, it's the other way around: `h.nrnmpi_init()` was introduced in NEURON 7.7 to eliminate the need for the `mpi4py` module in parallel NEURON simulations.
Ok! Will it be possible to have it play nice when MPI is already initialized?
@ramcdougal The issue becomes more worrisome: when importing `mpi4py` together with `h.nrnmpi_init()` there are MPI init errors, and without `h.nrnmpi_init()`, `ParallelContext().nhost()` returns 1 while `MPI.COMM_WORLD.size` correctly returns 48.
@pramodk Maybe you know more about the specifics of Piz Daint? But as it stands, it seems that NEURON can't reliably function if `mpi4py` is imported.
When `mpi4py` is imported after `h.nrnmpi_init()`, the same stalling behavior is observed, so I can't use `mpi4py` in any way in combination with NEURON on Piz Daint.
I agree that it should be robust to this, but...
Using `mpi4py` to initialize has exactly the same effect on NEURON as calling `h.nrnmpi_init()`. These are two separate ways of doing the same thing. (The second is supported in part because we can't control whether or not a system has `mpi4py` installed.) There is never a reason to do both on purpose.
So how then can you be safe even if you don't know what your users have done?
We know from #428 that calling `h.nrnmpi_init()` before importing `mpi4py` is safe. Importing `mpi4py` repeatedly is also safe. Therefore, if you need to do parallel simulations, you can try importing `mpi4py` first for the initialization. This should always be safe, but the import could fail if `mpi4py` is not installed. In the case that the import fails, you know that the user did not separately import `mpi4py` (because it doesn't exist), and thus you can safely enable parallel simulation with `h.nrnmpi_init()`.
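A minimal sketch of that ordering rule, with the import mechanism injected as a parameter so the decision logic can be seen (and tested) on its own. The function and parameter names here are illustrative, not NEURON API:

```python
import importlib


def choose_mpi_initializer(import_module=importlib.import_module):
    """Pick an MPI initialization path following the order described
    above: try mpi4py first; fall back to NEURON's h.nrnmpi_init()
    only when mpi4py is not installed (in which case no user code
    could have imported it either)."""
    try:
        import_module("mpi4py.MPI")  # importing mpi4py calls MPI_Init
        return "mpi4py"
    except ImportError:
        # mpi4py does not exist: the caller should now run
        #   from neuron import h; h.nrnmpi_init()
        return "nrnmpi_init"
```

For this order to help, the check has to run before anything else imports `neuron`.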
See, that's what I thought @ramcdougal, but when I do this:

```python
from mpi4py import MPI
print(MPI.COMM_WORLD.size)
from neuron import h
print(h.ParallelContext().nhost())
```

it prints `48` and `1`, 48 times. While when I do this:

```python
from neuron import h
h.nrnmpi_init()
print(h.ParallelContext().nhost())
```

it prints `48`, 48 times. But then I can't import `mpi4py` anymore, and I need access to its functions such as `Barrier` (and I'm writing code for multiple simulator backends, so using NEURON's replacements for it is not an option).
Does it work if you use `h.nrnmpi_init()` instead?
Otherwise, it's possible that you have a version of NEURON that has been compiled without MPI support. (By default the autotools installation doesn't enable parallel simulation; I'm not sure about the default cmake installation.)
Weakening my previous statement: if the user, for some reason, imported `neuron` first, then it is too late to use `mpi4py` to initialize MPI for NEURON... so in that case, you'd need to use `h.nrnmpi_init()`, because it actually does two things: it initializes MPI, and it lets NEURON know that MPI has been initialized.
@ramcdougal I edited my previous post. NEURON was compiled with MPI support as described in https://github.com/neuronsimulator/nrn/issues/577#issuecomment-637742025
> We know from #428 that doing the `h.nrnmpi_init()` before importing `mpi4py` is safe

Actually, it isn't: importing `mpi4py` after `h.nrnmpi_init()` leads to MPI init errors such as in #485 on Ubuntu with OpenMPI, and to either indefinite stalling or errors like the ones below on Piz Daint:
```
bp000347@daint107:/scratch/snx3000/bp000347> srun python3 -c "from neuron import h; h.nrnmpi_init(); from mpi4py import MPI; print(h.ParallelContext().nhost()); print(MPI.COMM_WORLD.size)"
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
Rank 1 [Thu Jun 4 15:19:55 2020] [c2-0c1s0n1] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(537)..:
MPID_Init(296).........: channel initialization failed
MPIDI_CH3_Init(102)....:
MPID_nem_init(367).....:
MPID_nem_gni_init(1586): GNI_CdmAttach (GNI_RC_INVALID_STATE)
Rank 0 [Thu Jun 4 15:19:55 2020] [c2-0c1s0n0] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(537)..:
MPID_Init(296).........: channel initialization failed
MPIDI_CH3_Init(102)....:
MPID_nem_init(367).....:
MPID_nem_gni_init(1586): GNI_CdmAttach (GNI_RC_INVALID_STATE)
Rank 3 [Thu Jun 4 15:19:55 2020] [c2-0c1s0n3] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(537)..:
MPID_Init(296).........: channel initialization failed
MPIDI_CH3_Init(102)....:
MPID_nem_init(367).....:
MPID_nem_gni_init(1586): GNI_CdmAttach (GNI_RC_INVALID_STATE)
Rank 2 [Thu Jun 4 15:19:55 2020] [c2-0c1s0n2] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(537)..:
MPID_Init(296).........: channel initialization failed
MPIDI_CH3_Init(102)....:
MPID_nem_init(367).....:
MPID_nem_gni_init(1586): GNI_CdmAttach (GNI_RC_INVALID_STATE)
srun: error: nid00449: task 1: Aborted (core dumped)
srun: Terminating job step 23046994.0
srun: error: nid00448: task 0: Aborted (core dumped)
srun: error: nid00451: task 3: Aborted (core dumped)
srun: error: nid00450: task 2: Aborted (core dumped)
```
Whether it errors or stalls seems to depend on the number of nodes I use.
To reiterate and expand on one of my comments in #428: it is our desire to make MPI work correctly under all of the following conditions:
- any machine supporting MPI;
- launching via python or nrniv;
- MPI does or does not exist on the machine;
- one or both of `import mpi4py` and `h.nrnmpi_init()`, in either order;
- MPI statically or dynamically linked to nrniv, or dynamically loaded after launch;
- the NEURON_INIT_MPI environment variable does not exist, or exists with a value of 0 or 1.
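A hedged sketch of the three-way reading of NEURON_INIT_MPI mentioned above (unset / "0" / "1"); the exact semantics inside NEURON may differ, and the function name is illustrative:

```python
import os


def neuron_init_mpi_flag(environ=os.environ):
    """Return None when NEURON_INIT_MPI is unset; otherwise return
    whether it requests MPI initialization ("1") or forbids it ("0")."""
    value = environ.get("NEURON_INIT_MPI")
    if value is None:
        return None  # unset: the decision falls to the other conditions
    return value == "1"
```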
I gather that these desires are not all satisfied on Piz Daint. I notice that on my machine, with test1.py

```
$ cat test1.py
from mpi4py import MPI
print(MPI.COMM_WORLD.size)
from neuron import h
print(h.ParallelContext().nhost())
```

I get

```
$ mpiexec -n 2 python test1.py
2
2
numprocs=2
2
2
```
and
```
$ cat test2.py
from neuron import h
h.nrnmpi_init()
print(h.ParallelContext().nhost())
from mpi4py import MPI
print(MPI.COMM_WORLD.size)
pc = h.ParallelContext()
pc.barrier()
h.quit()
```

I get

```
hines@hines-T7500:~/neuron/nrncmake/build$ mpiexec -n 2 python test2.py
numprocs=2
2
2
2
2
hines@hines-T7500:~/neuron/nrncmake/build$
```
I added the last three lines in test2.py to avoid the following:
```
hines@hines-T7500:~/neuron/nrncmake/build$ mpiexec -n 2 python test3.py
numprocs=2
2
2
2
2
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 0 on
node hines-T7500 exiting improperly. There are three reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.
This may have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
You can avoid this message by specifying -quiet on the mpiexec command line.
--------------------------------------------------------------------------
```
I think the reason was number 2 above
@nrnhines Correct, there are certain machines (or MPI implementations) where none of this is an issue. For example, on my Windows machine I can also execute both test1.py and test2.py correctly, but on Linux, with its many MPI implementations, it seems to me that the MPI machinery of NEURON sometimes struggles.
A good indication of "health" in the MPI machinery might be this:

- `nrnmpi_init` can detect whether MPI has already been initialized from any source (I don't know if this is technically possible, but `mpi4py` seems to manage it)
- `nrnmpi_init` and `mpi4py` can be used together in any order

And perhaps unit tests should be included in the CI pipelines that test NEURON's coexistence with the most common MPI implementations (OpenMPI, MPICH, Cray MPI, ...).
Do these seem like reasonable benchmarks for the MPI features?
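One shape such a CI test could take: the core invariant in this thread is that both initialization paths agree on the world size. The function below is only the dependency-free kernel of that check (names are illustrative); a real CI job would obtain `nhost` from `h.ParallelContext().nhost()` and `comm_size` from `MPI.COMM_WORLD.size` while running under mpiexec:

```python
def world_sizes_agree(nhost, comm_size):
    """True when NEURON and mpi4py see the same number of ranks.
    nhost would come from h.ParallelContext().nhost(); comm_size
    from MPI.COMM_WORLD.size."""
    return nhost == comm_size and nhost >= 1


# The failure mode reported in this thread: NEURON sees 1 rank
# while mpi4py correctly sees 48.
assert not world_sizes_agree(1, 48)
assert world_sizes_agree(48, 48)
```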
I'm sorry that I'm always complaining, opening issues and never contributing but I haven't written a word of C code in my life :)
Ok, the next step is to debug on Piz Daint. From review of the code in nrnmpi.c, it is currently a puzzle how MPI_Init could be called if MPI has already been initialized, or vice versa:

```c
MPI_Initialized(&flag);
/* only call MPI_Init if not already initialized */
if (!flag) {
#if (USE_PTHREAD)
    ...
    asrt(MPI_Init_thread(pargc, pargv, required, &provided));
#else
    asrt(MPI_Init(pargc, pargv));
#endif
    nrnmpi_under_nrncontrol_ = 1;
} else {
    nrnmpi_under_nrncontrol_ = 0;
}
```
@pramodk Is it a problem that (edit: I originally copy/pasted the wrong ldd command)

```
hines@daint106:~/nrn/build> ldd ../install/lib/libnrniv.so
...
libmpich_gnu_82.so.3 => /opt/cray/pe/mpt/7.7.10/gni/mpich-gnu/8.2/lib/libmpich_gnu_82.so.3 (0x00002b034dc94000)
...
```

but

```
hines@daint106:~/nrn/build> ldd /opt/python/3.6.5.7/lib/python3.6/site-packages/mpi4py/MPI.cpython-36m-x86_64-linux-gnu.so
...
libmpich_gnu_71.so.3 => /opt/cray/pe/lib64/libmpich_gnu_71.so.3 (0x00002ba461a88000)
...
```
My symptom is that MPI_Initialized(&flag) sets flag to 0 even after `from mpi4py import MPI`.
Note that I built with `source nrnenv`, where

```
hines@daint106:~> cat nrnenv
module swap PrgEnv-cray PrgEnv-gnu
module load daint-mc
module load cray-python/3.6.5.7 PyExtensions/3.6.5.7-CrayGNU-19.10
export CRAYPE_LINK_TYPE=dynamic
export PYTHONPATH=$HOME/nrn/install/lib/python:$PYTHONPATH
export PATH=$HOME/nrn/install/bin:$PATH
export PYTHONHOME="/opt/python/3.6.5.7"
export NRN_PYLIB="/opt/python/3.6.5.7/lib/libpython3.6m.so.1.0"
```

and a cmake invocation of

```
cmake .. -DCMAKE_INSTALL_PREFIX=../install -DNRN_ENABLE_INTERVIEWS=OFF
```
It doesn't only stall on `mpi4py`; it also stalls on `import h5py`. This is great debugging information! Both of these packages come from modules compiled by the admins of Piz Daint, so @nrnhines you're probably on to something here (https://github.com/neuronsimulator/nrn/issues/581#issuecomment-639030866).
I load the following relevant modules that provide `h5py` and `mpi4py`:
```
28) cray-python/3.6.5.7(default)
29) PyExtensions/3.6.5.7-CrayGNU-19.10
30) cray-hdf5-parallel/1.10.5.1(default)
31) h5py/2.8.0-CrayGNU-19.10-python3-parallel
```
Maybe inspecting how they were built versus how we both built NEURON can lead to at least a temporary solution.
Here is the full list of loaded modules:
```
 1) modules/3.2.11.3(default)
 2) cray-mpich/7.7.10(default)
 3) slurm/20.02.2-1
 4) xalt/2.7.24
 5) daint-mc
 6) cray-python/2.7.15.7
 7) pip/20.0.2-py3
 8) gcc/8.3.0(default)
 9) craype-broadwell
10) craype-network-aries
11) craype/2.6.1(default)
12) cray-libsci/19.06.1(default)
13) udreg/2.3.2-7.0.1.1_3.9__g8175d3d.ari
14) ugni/6.0.14.0-7.0.1.1_7.10__ge78e5b0.ari
15) pmi/5.0.14(default)
16) dmapp/7.1.1-7.0.1.1_4.8__g38cf134.ari
17) gni-headers/5.0.12.0-7.0.1.1_6.7__g3b1768f.ari
18) xpmem/2.2.19-7.0.1.1_3.7__gdcf436c.ari
19) job/2.2.4-7.0.1.1_3.8__g36b56f4.ari
20) dvs/2.12_2.2.151-7.0.1.1_5.6__g7eb5e703
21) alps/6.6.56-7.0.1.1_4.10__g2e60a7e4.ari
22) rca/2.2.20-7.0.1.1_4.9__g8e3fb5b.ari
23) atp/2.1.3(default)
24) perftools-base/7.1.1(default)
25) PrgEnv-gnu/6.0.5
26) cdt/19.10
27) CrayGNU/.19.10
28) cray-python/3.6.5.7(default)
29) PyExtensions/3.6.5.7-CrayGNU-19.10
30) cray-hdf5-parallel/1.10.5.1(default)
31) h5py/2.8.0-CrayGNU-19.10-python3-parallel
```
I haven't gone through all comments in this issue discussion (yet), but I would like to mention that MPI initialisation is a bit of a complex topic in the NEURON workflow because of the various use cases we have: python/nrniv launching, external/internal MPI initialisation, dynamic/non-dynamic MPI builds, etc. Going back to the issue itself on Piz Daint:
As Michael pointed out, the MPI libraries linked by NEURON and mpi4py are different. It's actually the same cray-mpich, but different GNU toolchains were used to compile them:
```
bp000174@daint107:~> ll /opt/cray/pe/lib64/libmpich_gnu_71.so.3
lrwxrwxrwx 1 root root 66 Oct 25 2019 /opt/cray/pe/lib64/libmpich_gnu_71.so.3 -> /opt/cray/pe/mpt/7.7.10/gni/mpich-gnu/7.1/lib/libmpich_gnu_71.so.3
bp000174@daint107:~> ll /opt/cray/pe/lib64/libmpich_gnu_82.so.3
lrwxrwxrwx 1 root root 66 Oct 25 2019 /opt/cray/pe/lib64/libmpich_gnu_82.so.3 -> /opt/cray/pe/mpt/7.7.10/gni/mpich-gnu/8.2/lib/libmpich_gnu_82.so.3
```
My NEURON installation from yesterday (#577) links against `gnu_82` (because of GCC 8.3), while mpi4py links against `gnu_71` (GCC 7):
```
bp000174@daint105:~> ldd ~/install/lib/libnrniv.so
linux-vdso.so.1 (0x00007ffc63a7d000)
libreadline.so.7 => /lib64/libreadline.so.7 (0x00002ba831fbf000)
libpython3.6m.so.1.0 => /opt/python/3.6.5.7/lib/libpython3.6m.so.1.0 (0x00002ba83220e000)
librca.so.0 => /opt/cray/rca/2.2.20-7.0.1.1_4.9__g8e3fb5b.ari/lib64/librca.so.0 (0x00002ba832749000)
libmpich_gnu_82.so.3 => /opt/cray/pe/mpt/7.7.10/gni/mpich-gnu/8.2/lib/libmpich_gnu_82.so.3 (0x00002ba83294d000)
libm.so.6 => /lib64/libm.so.6 (0x00002ba832f0f000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ba833247000)
libstdc++.so.6 => /opt/cray/pe/gcc-libs/libstdc++.so.6 (0x00002ba833465000)
libgcc_s.so.1 => /opt/cray/pe/gcc-libs/libgcc_s.so.1 (0x00002ba8337ee000)
libc.so.6 => /lib64/libc.so.6 (0x00002ba833a06000)
libtinfo.so.6 => /lib64/libtinfo.so.6 (0x00002ba833dc0000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002ba833fee000)
libutil.so.1 => /lib64/libutil.so.1 (0x00002ba8341f2000)
libxpmem.so.0 => /opt/cray/xpmem/2.2.19-7.0.1.1_3.7__gdcf436c.ari/lib64/libxpmem.so.0 (0x00002ba8343f5000)
librt.so.1 => /lib64/librt.so.1 (0x00002ba8345f8000)
libugni.so.0 => /opt/cray/ugni/6.0.14.0-7.0.1.1_7.10__ge78e5b0.ari/lib64/libugni.so.0 (0x00002ba834800000)
libudreg.so.0 => /opt/cray/udreg/2.3.2-7.0.1.1_3.9__g8175d3d.ari/lib64/libudreg.so.0 (0x00002ba834a84000)
libpmi.so.0 => /opt/cray/pe/pmi/5.0.14/lib64/libpmi.so.0 (0x00002ba834c8e000)
libgfortran.so.5 => /opt/cray/pe/gcc-libs/libgfortran.so.5 (0x00002ba834ed7000)
libquadmath.so.0 => /opt/cray/pe/gcc-libs/libquadmath.so.0 (0x00002ba835346000)
/lib64/ld-linux-x86-64.so.2 (0x00002ba831884000)
libz.so.1 => /lib64/libz.so.1 (0x00002ba835586000)
bp000174@daint105:~> ldd /opt/python/3.6.5.7/lib/python3.6/site-packages/mpi4py/MPI.cpython-36m-x86_64-linux-gnu.so
linux-vdso.so.1 (0x00007ffe663f9000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002b17ec8f1000)
libpython3.6m.so.1.0 => /opt/python/3.6.5.7/lib/libpython3.6m.so.1.0 (0x00002b17ecaf5000)
librca.so.0 => /opt/cray/rca/2.2.20-7.0.1.1_4.9__g8e3fb5b.ari/lib64/librca.so.0 (0x00002b17ed030000)
libmpich_gnu_71.so.3 => /opt/cray/pe/lib64/libmpich_gnu_71.so.3 (0x00002b17ed234000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b17ed7f6000)
libc.so.6 => /lib64/libc.so.6 (0x00002b17eda14000)
/lib64/ld-linux-x86-64.so.2 (0x00002b17ec390000)
libutil.so.1 => /lib64/libutil.so.1 (0x00002b17eddce000)
libm.so.6 => /lib64/libm.so.6 (0x00002b17edfd1000)
libxpmem.so.0 => /opt/cray/xpmem/2.2.19-7.0.1.1_3.7__gdcf436c.ari/lib64/libxpmem.so.0 (0x00002b17ee309000)
librt.so.1 => /lib64/librt.so.1 (0x00002b17ee50c000)
libugni.so.0 => /opt/cray/ugni/6.0.14.0-7.0.1.1_7.10__ge78e5b0.ari/lib64/libugni.so.0 (0x00002b17ee714000)
libudreg.so.0 => /opt/cray/udreg/2.3.2-7.0.1.1_3.9__g8175d3d.ari/lib64/libudreg.so.0 (0x00002b17ee998000)
libpmi.so.0 => /opt/cray/pe/pmi/5.0.14/lib64/libpmi.so.0 (0x00002b17eeba2000)
libgfortran.so.4 => /opt/cray/pe/gcc-libs/libgfortran.so.4 (0x00002b17eedeb000)
libgcc_s.so.1 => /opt/cray/pe/gcc-libs/libgcc_s.so.1 (0x00002b17ef1bf000)
libquadmath.so.0 => /opt/cray/pe/gcc-libs/libquadmath.so.0 (0x00002b17ef3d7000)
```
So when I run the provided test, I can reproduce the issue:

```
bp000174@daint105:~> srun python test.py
2
1
2
1
```
Ok. So let's now force the loading of `libmpich_gnu_82.so.3` for both NEURON and mpi4py:

```
bp000174@daint105:~> export LD_PRELOAD=/opt/cray/pe/mpt/7.7.10/gni/mpich-gnu/8.1/lib/libmpich_gnu_82.so.3
bp000174@daint105:~> srun python test.py
2
2
2
2
numprocs=2
```

Looks good!
So how can this problem be solved? Let's install mpi4py from source with the current default GNU toolchain 8.3 and the cray-mpich module loaded:

```
wget https://bitbucket.org/mpi4py/mpi4py/downloads/mpi4py-3.0.3.tar.gz
tar -xvzf mpi4py-3.0.3.tar.gz
cd mpi4py-3.0.3/
python setup.py build --mpicc=`which cc`
python setup.py install --user
```
Check the library links:

```
bp000174@daint105:~> ldd /users/bp000174/.local/lib/python3.6/site-packages/mpi4py/MPI.cpython-36m-x86_64-linux-gnu.so
linux-vdso.so.1 (0x00007ffd533ee000)
/opt/cray/pe/mpt/7.7.10/gni/mpich-gnu/8.1/lib/libmpich_gnu_82.so.3 (0x00002b94cdef7000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002b94ce4b9000)
libpython3.6m.so.1.0 => /opt/python/3.6.5.7/lib/libpython3.6m.so.1.0 (0x00002b94ce6bd000)
librca.so.0 => /opt/cray/rca/2.2.20-7.0.1.1_4.9__g8e3fb5b.ari/lib64/librca.so.0 (0x00002b94cebf8000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b94cedfc000)
libc.so.6 => /lib64/libc.so.6 (0x00002b94cf01a000)
libxpmem.so.0 => /opt/cray/xpmem/2.2.19-7.0.1.1_3.7__gdcf436c.ari/lib64/libxpmem.so.0 (0x00002b94cf3d4000)
librt.so.1 => /lib64/librt.so.1 (0x00002b94cf5d7000)
libugni.so.0 => /opt/cray/ugni/6.0.14.0-7.0.1.1_7.10__ge78e5b0.ari/lib64/libugni.so.0 (0x00002b94cf7df000)
libudreg.so.0 => /opt/cray/udreg/2.3.2-7.0.1.1_3.9__g8175d3d.ari/lib64/libudreg.so.0 (0x00002b94cfa63000)
libpmi.so.0 => /opt/cray/pe/pmi/5.0.14/lib64/libpmi.so.0 (0x00002b94cfc6d000)
libgfortran.so.5 => /opt/cray/pe/gcc-libs/libgfortran.so.5 (0x00002b94cfeb6000)
libm.so.6 => /lib64/libm.so.6 (0x00002b94d0325000)
libgcc_s.so.1 => /opt/cray/pe/gcc-libs/libgcc_s.so.1 (0x00002b94d065d000)
libquadmath.so.0 => /opt/cray/pe/gcc-libs/libquadmath.so.0 (0x00002b94d0875000)
/lib64/ld-linux-x86-64.so.2 (0x00002b94cd954000)
libutil.so.1 => /lib64/libutil.so.1 (0x00002b94d0ab5000)
libz.so.1 => /lib64/libz.so.1 (0x00002b94d0cb8000)
```

Good!
Run the test again:

```
bp000174@daint105:~> unset LD_PRELOAD
bp000174@daint105:~> srun python test.py
2
2
2
2
numprocs=2
```

All good!
Let's also do it the other way around: build NEURON with GNU toolchain 7.3 and make sure everything links with `libmpich_gnu_71.so.3`:

```
module swap gcc/8.3.0 gcc/7.3.0
# from a new build dir
cmake .. -DCMAKE_INSTALL_PREFIX=`pwd`/install -DNRN_ENABLE_INTERVIEWS=OFF
make -j
make install
```
Check that the library now links with `libmpich_gnu_71.so.3`:

```
bp000174@daint105:~/nrn/build_gcc7> ldd /users/bp000174/nrn/build_gcc7/install/lib/libnrniv.so
linux-vdso.so.1 (0x00007fffe9f9c000)
libreadline.so.7 => /lib64/libreadline.so.7 (0x00002b11614af000)
libpython3.6m.so.1.0 => /opt/python/3.6.5.7/lib/libpython3.6m.so.1.0 (0x00002b11616fe000)
librca.so.0 => /opt/cray/rca/2.2.20-7.0.1.1_4.9__g8e3fb5b.ari/lib64/librca.so.0 (0x00002b1161c39000)
libmpich_gnu_71.so.3 => /opt/cray/pe/lib64/libmpich_gnu_71.so.3 (0x00002b1161e3d000)
libm.so.6 => /lib64/libm.so.6 (0x00002b11623ff000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b1162737000)
libstdc++.so.6 => /opt/cray/pe/gcc-libs/libstdc++.so.6 (0x00002b1162955000)
libgcc_s.so.1 => /opt/cray/pe/gcc-libs/libgcc_s.so.1 (0x00002b1162cde000)
libc.so.6 => /lib64/libc.so.6 (0x00002b1162ef6000)
libtinfo.so.6 => /lib64/libtinfo.so.6 (0x00002b11632b0000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002b11634de000)
libutil.so.1 => /lib64/libutil.so.1 (0x00002b11636e2000)
libxpmem.so.0 => /opt/cray/xpmem/2.2.19-7.0.1.1_3.7__gdcf436c.ari/lib64/libxpmem.so.0 (0x00002b11638e5000)
librt.so.1 => /lib64/librt.so.1 (0x00002b1163ae8000)
libugni.so.0 => /opt/cray/ugni/6.0.14.0-7.0.1.1_7.10__ge78e5b0.ari/lib64/libugni.so.0 (0x00002b1163cf0000)
libudreg.so.0 => /opt/cray/udreg/2.3.2-7.0.1.1_3.9__g8175d3d.ari/lib64/libudreg.so.0 (0x00002b1163f74000)
libpmi.so.0 => /opt/cray/pe/pmi/5.0.14/lib64/libpmi.so.0 (0x00002b116417e000)
libgfortran.so.4 => /opt/cray/pe/gcc-libs/libgfortran.so.4 (0x00002b11643c7000)
libquadmath.so.0 => /opt/cray/pe/gcc-libs/libquadmath.so.0 (0x00002b116479b000)
/lib64/ld-linux-x86-64.so.2 (0x00002b1160d77000)
```
And re-run the test to see if it now works with the mpi4py present on the system, and whether that is compatible with NEURON:

```
# make sure to remove your new mpi4py
bp000174@daint105:~> rm -rf $HOME/.local/lib/python3.6/site-packages/mpi4py*
bp000174@daint105:~> export PYTHONPATH=/users/bp000174/nrn/build_gcc7/install/lib/python:$PYTHONPATH
bp000174@daint105:~> srun python test.py
2
2
2
2
numprocs=2
```
In summary, to avoid surprises, we should link to the same MPI library for all software components being used together.
Hope this helps!
I'd like to propose the hypothesis that if there are two conceptually identical shared libraries that differ only in library name, they can both be loaded but will not share static data. That can explain, e.g., why mpi4py can call MPI_Init but nrnmpi can then call MPI_Initialized(&flag) and get flag == 0. In our case, I believe the different MPI libraries were linked against at build time. I doubt it is worth it, but this could be avoided by using -DNRN_ENABLE_MPI_DYNAMIC=ON. Then it would be possible to check whether an MPI library is already loaded (assuming mpi4py was imported first) and dlopen that one.
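A toy model of that hypothesis (pure illustration, not real MPI): two independently loaded copies of "the same" library each hold a private copy of the initialized flag, so calling Init through one copy is invisible to Initialized() in the other:

```python
class MpiLibCopy:
    """Stand-in for one loaded copy of an MPI shared library; the
    flag models the library's private static data."""

    def __init__(self):
        self.flag = 0

    def Init(self):
        self.flag = 1

    def Initialized(self):
        return self.flag


lib_gnu_71 = MpiLibCopy()  # the copy mpi4py linked against
lib_gnu_82 = MpiLibCopy()  # the copy libnrniv.so linked against

lib_gnu_71.Init()                     # mpi4py imports and initializes MPI
assert lib_gnu_71.Initialized() == 1
assert lib_gnu_82.Initialized() == 0  # NEURON's copy still sees flag == 0
```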
This reminds me of an otherwise unrelated issue currently on the table: it seems that the current Linux wheel for NEURON has an old readline linked into libnrniv.so, which gets mixed in with the system readline loaded by python, and this is perhaps the cause of a problem where, when pdb is used, typing to the console (when python is not run interactively) gives a segmentation error at `__GI__IO_putc (c=13, fp=0x0) at putc.c:28`.
> In our case I believe the different MPI libraries were linked against at build time. I doubt it is worth it, but this could be avoided by using -DNRN_ENABLE_MPI_DYNAMIC=ON. Then it would be possible to check to see if an MPI library is already loaded (assuming mpi4py was imported first) and dlopen that one.
Due to the complexity and failure possibilities, I also think it's not worth it. It would be good to keep things simple (though today it's already complex 😃).
> This reminds me of an otherwise unrelated issue currently on the table where it seems that the current linux wheel for NEURON has an old readline linked into the libnrniv.so and that seems to get mixed in with the system readline loaded by python and perhaps is the cause of a problem when pdb is used in that typing to the console (when python is not run interactively) gives a segmentation error with a `__GI__IO_putc (c=13, fp=0x0) at putc.c:28`
The readline issue is quite specific: when we create a wheel, we use the `auditwheel` program, and it complains if unwanted libraries like readline are linked. As this is a non-MPI issue and @Helveg is installing from source, it won't affect them.
@Helveg: As the issue has been explained (and is specific to the Piz Daint software setup), I will close this. Feel free to reopen if you have more questions!
Ok, is there any open issue tracking the robustness of `h.nrnmpi_init`?
I'm not aware of one.
I've tested NEURON and MPI across several machines, and it has always felt very fragile: if you include `h.nrnmpi_init`, there seems to be no surefire way of deploying NEURON & MPI across all target systems that avoids dreadful MPI init errors or strange behaviors. The latest is Piz Daint, where a few simple lines either stall the script indefinitely or produce MPI init errors.
@ramcdougal @nrnhines I have understood from https://github.com/neuronsimulator/nrn/issues/428#issuecomment-587184064 that `h.nrnmpi_init()` is obsolete if `mpi4py` is imported first. But it's not just obsolete: it (sometimes) bulldozes any MPI initialisation that might've already occurred, leading to, for example, #485, #428, and now this stalling behavior. It's not that I can't work around it in my libraries; for example, I can try importing `mpi4py`, and if it isn't installed, I'll have NEURON do the MPI init. But that's not a very watertight solution: on Piz Daint, and probably on other HPC systems, MPI is already initialized in job contexts. So my question: could it be possible to fix this "init overriding" behavior of `h.nrnmpi_init()` so that it is safe to call in all contexts?