neuronsimulator / nrn

NEURON Simulator
http://nrn.readthedocs.io

h.nrnmpi_init() causes various MPI init errors #581

Closed Helveg closed 4 years ago

Helveg commented 4 years ago

I've tested NEURON and MPI across several machines, and the combination has always felt fragile once h.nrnmpi_init is involved: there seems to be no surefire way of deploying NEURON and MPI across all target systems that avoids dreadful MPI init errors or strange behavior.

The latest is Piz Daint, where a few simple lines stall the script indefinitely:

from mpi4py import MPI
from neuron import h
h.nrnmpi_init()

@ramcdougal @nrnhines I understood from https://github.com/neuronsimulator/nrn/issues/428#issuecomment-587184064 that h.nrnmpi_init() is obsolete if mpi4py is imported first. But it's not just obsolete: it (sometimes) bulldozes any MPI initialisation that might already have occurred, leading for example to #485, #428 and now this stalling behavior.

It's something I can work around in my libraries: for example, I can try importing mpi4py and, if it isn't installed, have NEURON do the MPI init. But that's not a very watertight solution; on Piz Daint, and probably on other HPC systems, MPI is already initialized in job contexts. So my question: could this "init overriding" behavior of h.nrnmpi_init() be fixed so that it is safe to call in all contexts?

ramcdougal commented 4 years ago

Actually, it's the other way.

h.nrnmpi_init() was introduced in NEURON 7.7 to eliminate the need for the mpi4py module in parallel NEURON simulations.

Helveg commented 4 years ago

Ok! Will it be possible to have it play nice when MPI is already initialized?

Helveg commented 4 years ago

@ramcdougal The issue becomes more worrisome: when importing mpi4py together with h.nrnmpi_init() there are MPI init errors, and without h.nrnmpi_init(), ParallelContext().nhost() returns 1 while MPI.COMM_WORLD.size correctly returns 48.

@pramodk Maybe you know more about the specifics on Piz Daint?

But as it stands it seems that NEURON can't reliably function if mpi4py is imported.

When mpi4py is imported after h.nrnmpi_init the same stalling behavior is observed, so I can't use mpi4py in any way in combination with NEURON on Piz Daint.

ramcdougal commented 4 years ago

I agree that it should be robust to this, but...

Using mpi4py to initialize has exactly the same effect on NEURON as doing h.nrnmpi_init(). These are two separate ways of doing the same thing. (The second is supported in part because we can't control whether or not a system has mpi4py installed.) There is never a reason to do both on purpose.

So how then can you be safe even if you don't know what your users have done?

We know from #428 that doing the h.nrnmpi_init() before importing mpi4py is safe. Importing mpi4py repeatedly is also safe. Therefore, if you need to do parallel simulations, you can try importing mpi4py first for the initialization. This should always be safe, but the import could fail if mpi4py is not installed. In the case that the import fails, you know that the user did not separately import mpi4py (because it doesn't exist), and thus you can safely enable parallel simulation with h.nrnmpi_init().
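
A minimal sketch of that recipe (assuming the only goal is to make sure MPI is initialized exactly once, by whichever mechanism is available, before NEURON is used in parallel):

try:
    from mpi4py import MPI          # preferred: initializes MPI; importing it again later is harmless
    have_mpi4py = True
except ImportError:
    have_mpi4py = False             # nobody else can have initialized MPI via mpi4py either

from neuron import h
if not have_mpi4py:
    h.nrnmpi_init()                 # let NEURON initialize (and own) MPI instead

pc = h.ParallelContext()
print(int(pc.nhost()))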

Helveg commented 4 years ago

See that's what I thought @ramcdougal but when I do this:

from mpi4py import MPI; 
print(MPI.COMM_WORLD.size); 
from neuron import h; 
print(h.ParallelContext().nhost());

It prints 48 followed by 1, 48 times (once per rank). When I do this instead:

from neuron import h;
h.nrnmpi_init();
print(h.ParallelContext().nhost());

It prints 48, 48 times, but then I can't import mpi4py anymore, and I need access to its functions such as Barrier (I'm writing code for multiple simulator backends, so using NEURON's replacements for it is not an option).

ramcdougal commented 4 years ago

Does it work if you use h.nrnmpi_init() instead?

Otherwise, it's possible that you have a version of NEURON that has been compiled without MPI support. (By default the autotools installation doesn't enable parallel simulation; I'm not sure about the default cmake installation.)

Weakening my previous statement:

If the user, for some reason, imported neuron first, then it is too late to use mpi4py to initialize MPI for NEURON... so in that case, you'd need to use h.nrnmpi_init() because it actually does two things: it initializes MPI, and it lets NEURON know that MPI has been initialized.

Helveg commented 4 years ago

@ramcdougal I edited my previous post. NEURON was compiled with MPI support as described in https://github.com/neuronsimulator/nrn/issues/577#issuecomment-637742025

Helveg commented 4 years ago

We know from #428 that doing the h.nrnmpi_init() before importing mpi4py is safe

Actually it isn't: importing mpi4py after h.nrnmpi_init() leads to MPI init errors such as in #485 on Ubuntu with OpenMPI, and to either indefinite stalling or errors like the ones below on Piz Daint:

bp000347@daint107:/scratch/snx3000/bp000347> srun python3 -c "from neuron import h; h.nrnmpi_init(); from mpi4py import MPI; print(h.ParallelContext().nhost()); print(MPI.COMM_WORLD.size)"
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
Rank 1 [Thu Jun  4 15:19:55 2020] [c2-0c1s0n1] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(537)..:
MPID_Init(296).........: channel initialization failed
MPIDI_CH3_Init(102)....:
MPID_nem_init(367).....:
MPID_nem_gni_init(1586): GNI_CdmAttach (GNI_RC_INVALID_STATE)
Rank 0 [Thu Jun  4 15:19:55 2020] [c2-0c1s0n0] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(537)..:
MPID_Init(296).........: channel initialization failed
MPIDI_CH3_Init(102)....:
MPID_nem_init(367).....:
MPID_nem_gni_init(1586): GNI_CdmAttach (GNI_RC_INVALID_STATE)
Rank 3 [Thu Jun  4 15:19:55 2020] [c2-0c1s0n3] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(537)..:
MPID_Init(296).........: channel initialization failed
MPIDI_CH3_Init(102)....:
MPID_nem_init(367).....:
MPID_nem_gni_init(1586): GNI_CdmAttach (GNI_RC_INVALID_STATE)
Rank 2 [Thu Jun  4 15:19:55 2020] [c2-0c1s0n2] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(537)..:
MPID_Init(296).........: channel initialization failed
MPIDI_CH3_Init(102)....:
MPID_nem_init(367).....:
MPID_nem_gni_init(1586): GNI_CdmAttach (GNI_RC_INVALID_STATE)
srun: error: nid00449: task 1: Aborted (core dumped)
srun: Terminating job step 23046994.0
srun: error: nid00448: task 0: Aborted (core dumped)
srun: error: nid00451: task 3: Aborted (core dumped)
srun: error: nid00450: task 2: Aborted (core dumped)

Whether it errors or stalls seems to depend on the number of nodes I use.

nrnhines commented 4 years ago

To reiterate and expand on one of my comments in #428, it is our desire to make MPI work correctly under all of the following conditions:

- any machine supporting MPI;
- launched via python or nrniv;
- MPI does or does not exist on the machine;
- one or both of mpi4py imported and h.nrnmpi_init() called, in either order;
- MPI statically or dynamically linked to nrniv, or dynamically loaded after launch;
- NEURON_INIT_MPI environment variable absent, or present with a value of 0 or 1.

I gather that these desires are not all satisfied on Piz Daint. I notice that on my machine, with test1.py

$ cat test1.py
from mpi4py import MPI; 
print(MPI.COMM_WORLD.size); 
from neuron import h; 
print(h.ParallelContext().nhost());

that

$ mpiexec -n 2 python test1.py
2
2
numprocs=2
2
2

and

$ cat test2.py
from neuron import h;
h.nrnmpi_init();
print(h.ParallelContext().nhost());
from mpi4py import MPI
print(MPI.COMM_WORLD.size);
pc = h.ParallelContext()
pc.barrier()
h.quit()

hines@hines-T7500:~/neuron/nrncmake/build$ mpiexec -n 2 python test2.py
numprocs=2
2
2
2
2
hines@hines-T7500:~/neuron/nrncmake/build$ 

I added the last three lines in test2.py to avoid a

hines@hines-T7500:~/neuron/nrncmake/build$ mpiexec -n 2 python test3.py
numprocs=2
2
2
2
2
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 0 on
node hines-T7500 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).

You can avoid this message by specifying -quiet on the mpiexec command line.
--------------------------------------------------------------------------

I think the reason was number 2 above.

Helveg commented 4 years ago

@nrnhines Correct, there are certain machines (or MPI implementations) where none of this is an issue. For example on my Windows machine I can also execute both test1.py and test2.py correctly, but on Linux and its many MPI implementations it seems to me that the MPI machinery of NEURON sometimes struggles.

A good indication of "health" in the MPI machinery might be whether the conditions you list above all hold, e.g. that ParallelContext().nhost() matches MPI.COMM_WORLD.size regardless of import order.

And perhaps unit tests should be included in the CI pipelines that test NEURON's coexistence with the most common MPI implementations (OpenMPI, MPICH, Cray MPICH, ...).
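
For example, a rough sketch of such a check as a pytest-style test (an assumption on my part: it would be launched under mpiexec/srun, e.g. mpiexec -n 2 python -m pytest, with mpi4py available in the CI image):

def test_neuron_sees_all_ranks():
    # mpi4py imported first, as in test1.py above
    from mpi4py import MPI
    from neuron import h

    pc = h.ParallelContext()
    # NEURON should report the same world size as mpi4py on every rank
    assert int(pc.nhost()) == MPI.COMM_WORLD.size
    pc.barrier()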

Do these seem like reasonable benchmarks for the MPI features?

I'm sorry that I'm always complaining, opening issues and never contributing but I haven't written a word of C code in my life :)

nrnhines commented 4 years ago

Ok, the next step is to debug on Piz Daint. From a review of the code in nrnmpi.c it is currently a puzzle how MPI_Init can end up being called when MPI has already been initialized, or vice versa:

        MPI_Initialized(&flag);

        /* only call MPI_Init if not already initialized */
        if (!flag) {  
#if (USE_PTHREAD)
...
            asrt(MPI_Init_thread(pargc, pargv, required, &provided));
#else
            asrt(MPI_Init(pargc, pargv));
#endif   
            nrnmpi_under_nrncontrol_ = 1;
        } else {  
            nrnmpi_under_nrncontrol_ = 0;
        }
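
A minimal way to probe the same thing from the Python side (just a sketch; it only compares what mpi4py and NEURON each believe, without calling h.nrnmpi_init()):

from mpi4py import MPI
print("mpi4py sees MPI initialized:", MPI.Is_initialized())

from neuron import h
pc = h.ParallelContext()
# if NEURON's MPI_Initialized() check saw flag == 0, nhost() stays at 1
print("COMM_WORLD.size:", MPI.COMM_WORLD.size, "nhost:", int(pc.nhost()))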
nrnhines commented 4 years ago

@pramodk Is it a problem that (edit: originally copy/pasted the wrong ldd command)

hines@daint106:~/nrn/build> ldd ../install/lib/libnrniv.so
...
    libmpich_gnu_82.so.3 => /opt/cray/pe/mpt/7.7.10/gni/mpich-gnu/8.2/lib/libmpich_gnu_82.so.3 (0x00002b034dc94000)
...

but

hines@daint106:~/nrn/build> ldd /opt/python/3.6.5.7/lib/python3.6/site-packages/mpi4py/MPI.cpython-36m-x86_64-linux-gnu.so
...
    libmpich_gnu_71.so.3 => /opt/cray/pe/lib64/libmpich_gnu_71.so.3 (0x00002ba461a88000)
...

My symptom is that MPI_Initialized(&flag) is leaving flag at 0 even after from mpi4py import MPI.
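
For what it's worth, a quick way to see which MPI shared objects actually end up mapped into the process (a Linux-only sketch, matching on the library names from the ldd output above):

from mpi4py import MPI   # imported only so that its MPI library gets loaded
from neuron import h     # loads libnrniv.so and whatever MPI it links against

mapped = set()
with open("/proc/self/maps") as maps:
    for line in maps:
        path = line.split()[-1]
        if "mpich" in path or "libmpi" in path:
            mapped.add(path)

for path in sorted(mapped):
    print(path)          # two different libmpich copies listed here would confirm the mismatch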

Note that I built with source nrnenv where

hines@daint106:~> cat nrnenv
module swap PrgEnv-cray PrgEnv-gnu
module load daint-mc
module load cray-python/3.6.5.7 PyExtensions/3.6.5.7-CrayGNU-19.10
export CRAYPE_LINK_TYPE=dynamic

export PYTHONPATH=$HOME/nrn/install/lib/python:$PYTHONPATH
export PATH=$HOME/nrn/install/bin:$PATH

export PYTHONHOME="/opt/python/3.6.5.7"
export NRN_PYLIB="/opt/python/3.6.5.7/lib/libpython3.6m.so.1.0"

and a cmake of

cmake .. -DCMAKE_INSTALL_PREFIX=../install -DNRN_ENABLE_INTERVIEWS=OFF
Helveg commented 4 years ago

It doesn't stall only on mpi4py; it also stalls on import h5py. This is great debugging information!

Both of these packages come from modules, compiled by the admins of Piz Daint: so @nrnhines you're probably on to something here (https://github.com/neuronsimulator/nrn/issues/581#issuecomment-639030866)

I load the following relevant modules that provide h5py and mpi4py:

 28) cray-python/3.6.5.7(default)
 29) PyExtensions/3.6.5.7-CrayGNU-19.10
 30) cray-hdf5-parallel/1.10.5.1(default)
 31) h5py/2.8.0-CrayGNU-19.10-python3-parallel

Maybe inspecting how they were built versus how we both built NEURON can lead to at least a temporary solution.

Here is the full list of loaded modules:

  1) modules/3.2.11.3(default)
  2) cray-mpich/7.7.10(default)
  3) slurm/20.02.2-1
  4) xalt/2.7.24
  5) daint-mc
  6) cray-python/2.7.15.7
  7) pip/20.0.2-py3
  8) gcc/8.3.0(default)
  9) craype-broadwell
 10) craype-network-aries
 11) craype/2.6.1(default)
 12) cray-libsci/19.06.1(default)
 13) udreg/2.3.2-7.0.1.1_3.9__g8175d3d.ari
 14) ugni/6.0.14.0-7.0.1.1_7.10__ge78e5b0.ari
 15) pmi/5.0.14(default)
 16) dmapp/7.1.1-7.0.1.1_4.8__g38cf134.ari
 17) gni-headers/5.0.12.0-7.0.1.1_6.7__g3b1768f.ari
 18) xpmem/2.2.19-7.0.1.1_3.7__gdcf436c.ari
 19) job/2.2.4-7.0.1.1_3.8__g36b56f4.ari
 20) dvs/2.12_2.2.151-7.0.1.1_5.6__g7eb5e703
 21) alps/6.6.56-7.0.1.1_4.10__g2e60a7e4.ari
 22) rca/2.2.20-7.0.1.1_4.9__g8e3fb5b.ari
 23) atp/2.1.3(default)
 24) perftools-base/7.1.1(default)
 25) PrgEnv-gnu/6.0.5
 26) cdt/19.10
 27) CrayGNU/.19.10
 28) cray-python/3.6.5.7(default)
 29) PyExtensions/3.6.5.7-CrayGNU-19.10
 30) cray-hdf5-parallel/1.10.5.1(default)
 31) h5py/2.8.0-CrayGNU-19.10-python3-parallel
pramodk commented 4 years ago

I haven't gone through all the comments in this issue (yet), but I would like to mention that MPI initialisation is a somewhat complex topic in the NEURON workflow because of the various use cases we have: python/nrniv launching, external/internal MPI initialisation, dynamic/non-dynamic MPI builds, etc. Going back to the issue itself on Piz Daint:

As Michael pointed out, the MPI libraries linked by NEURON and mpi4py are different. It is actually the same cray-mpich, but different GNU toolchains were used to compile them:

bp000174@daint107:~> ll /opt/cray/pe/lib64/libmpich_gnu_71.so.3
lrwxrwxrwx 1 root root 66 Oct 25  2019 /opt/cray/pe/lib64/libmpich_gnu_71.so.3 -> /opt/cray/pe/mpt/7.7.10/gni/mpich-gnu/7.1/lib/libmpich_gnu_71.so.3
bp000174@daint107:~> ll /opt/cray/pe/lib64/libmpich_gnu_82.so.3
lrwxrwxrwx 1 root root 66 Oct 25  2019 /opt/cray/pe/lib64/libmpich_gnu_82.so.3 -> /opt/cray/pe/mpt/7.7.10/gni/mpich-gnu/8.2/lib/libmpich_gnu_82.so.3

Yesterday's NEURON installation from #577 links against gnu_82 (because of GCC 8.3), while mpi4py links against gnu_71 (GCC 7):

bp000174@daint105:~> ldd ~/install/lib/libnrniv.so
    linux-vdso.so.1 (0x00007ffc63a7d000)
    libreadline.so.7 => /lib64/libreadline.so.7 (0x00002ba831fbf000)
    libpython3.6m.so.1.0 => /opt/python/3.6.5.7/lib/libpython3.6m.so.1.0 (0x00002ba83220e000)
    librca.so.0 => /opt/cray/rca/2.2.20-7.0.1.1_4.9__g8e3fb5b.ari/lib64/librca.so.0 (0x00002ba832749000)
    libmpich_gnu_82.so.3 => /opt/cray/pe/mpt/7.7.10/gni/mpich-gnu/8.2/lib/libmpich_gnu_82.so.3 (0x00002ba83294d000)
    libm.so.6 => /lib64/libm.so.6 (0x00002ba832f0f000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ba833247000)
    libstdc++.so.6 => /opt/cray/pe/gcc-libs/libstdc++.so.6 (0x00002ba833465000)
    libgcc_s.so.1 => /opt/cray/pe/gcc-libs/libgcc_s.so.1 (0x00002ba8337ee000)
    libc.so.6 => /lib64/libc.so.6 (0x00002ba833a06000)
    libtinfo.so.6 => /lib64/libtinfo.so.6 (0x00002ba833dc0000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00002ba833fee000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00002ba8341f2000)
    libxpmem.so.0 => /opt/cray/xpmem/2.2.19-7.0.1.1_3.7__gdcf436c.ari/lib64/libxpmem.so.0 (0x00002ba8343f5000)
    librt.so.1 => /lib64/librt.so.1 (0x00002ba8345f8000)
    libugni.so.0 => /opt/cray/ugni/6.0.14.0-7.0.1.1_7.10__ge78e5b0.ari/lib64/libugni.so.0 (0x00002ba834800000)
    libudreg.so.0 => /opt/cray/udreg/2.3.2-7.0.1.1_3.9__g8175d3d.ari/lib64/libudreg.so.0 (0x00002ba834a84000)
    libpmi.so.0 => /opt/cray/pe/pmi/5.0.14/lib64/libpmi.so.0 (0x00002ba834c8e000)
    libgfortran.so.5 => /opt/cray/pe/gcc-libs/libgfortran.so.5 (0x00002ba834ed7000)
    libquadmath.so.0 => /opt/cray/pe/gcc-libs/libquadmath.so.0 (0x00002ba835346000)
    /lib64/ld-linux-x86-64.so.2 (0x00002ba831884000)
    libz.so.1 => /lib64/libz.so.1 (0x00002ba835586000)

bp000174@daint105:~> ldd /opt/python/3.6.5.7/lib/python3.6/site-packages/mpi4py/MPI.cpython-36m-x86_64-linux-gnu.so
    linux-vdso.so.1 (0x00007ffe663f9000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00002b17ec8f1000)
    libpython3.6m.so.1.0 => /opt/python/3.6.5.7/lib/libpython3.6m.so.1.0 (0x00002b17ecaf5000)
    librca.so.0 => /opt/cray/rca/2.2.20-7.0.1.1_4.9__g8e3fb5b.ari/lib64/librca.so.0 (0x00002b17ed030000)
    libmpich_gnu_71.so.3 => /opt/cray/pe/lib64/libmpich_gnu_71.so.3 (0x00002b17ed234000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b17ed7f6000)
    libc.so.6 => /lib64/libc.so.6 (0x00002b17eda14000)
    /lib64/ld-linux-x86-64.so.2 (0x00002b17ec390000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00002b17eddce000)
    libm.so.6 => /lib64/libm.so.6 (0x00002b17edfd1000)
    libxpmem.so.0 => /opt/cray/xpmem/2.2.19-7.0.1.1_3.7__gdcf436c.ari/lib64/libxpmem.so.0 (0x00002b17ee309000)
    librt.so.1 => /lib64/librt.so.1 (0x00002b17ee50c000)
    libugni.so.0 => /opt/cray/ugni/6.0.14.0-7.0.1.1_7.10__ge78e5b0.ari/lib64/libugni.so.0 (0x00002b17ee714000)
    libudreg.so.0 => /opt/cray/udreg/2.3.2-7.0.1.1_3.9__g8175d3d.ari/lib64/libudreg.so.0 (0x00002b17ee998000)
    libpmi.so.0 => /opt/cray/pe/pmi/5.0.14/lib64/libpmi.so.0 (0x00002b17eeba2000)
    libgfortran.so.4 => /opt/cray/pe/gcc-libs/libgfortran.so.4 (0x00002b17eedeb000)
    libgcc_s.so.1 => /opt/cray/pe/gcc-libs/libgcc_s.so.1 (0x00002b17ef1bf000)
    libquadmath.so.0 => /opt/cray/pe/gcc-libs/libquadmath.so.0 (0x00002b17ef3d7000)

So when I run the test provided, I can reproduce the issue:

bp000174@daint105:~> srun python test.py
2
1
2
1

Ok. So let's now force the loading of libmpich_gnu_82.so.3 for both NEURON and MPI4PY:

bp000174@daint105:~> export LD_PRELOAD=/opt/cray/pe/mpt/7.7.10/gni/mpich-gnu/8.1/lib/libmpich_gnu_82.so.3
bp000174@daint105:~> srun python test.py
2
2
2
2
numprocs=2

Looks good!

So, how can this problem be solved? Let's install mpi4py from source with the current default GNU toolchain (8.3) and the cray-mpich module loaded:

wget  https://bitbucket.org/mpi4py/mpi4py/downloads/mpi4py-3.0.3.tar.gz
tar -xvzf mpi4py-3.0.3.tar.gz
cd mpi4py-3.0.3/
python setup.py build --mpicc=`which cc`
python setup.py install --user
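
As a quick sanity check (a sketch; mpi4py.get_config() just reports the build-time settings recorded when the module was compiled), the rebuilt module should now point at the Cray compiler wrapper:

import mpi4py
print(mpi4py.get_config())   # e.g. the 'mpicc' entry should be the `cc` wrapper used above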

Check library links:

bp000174@daint105:~> ldd /users/bp000174/.local/lib/python3.6/site-packages/mpi4py/MPI.cpython-36m-x86_64-linux-gnu.so
    linux-vdso.so.1 (0x00007ffd533ee000)
    /opt/cray/pe/mpt/7.7.10/gni/mpich-gnu/8.1/lib/libmpich_gnu_82.so.3 (0x00002b94cdef7000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00002b94ce4b9000)
    libpython3.6m.so.1.0 => /opt/python/3.6.5.7/lib/libpython3.6m.so.1.0 (0x00002b94ce6bd000)
    librca.so.0 => /opt/cray/rca/2.2.20-7.0.1.1_4.9__g8e3fb5b.ari/lib64/librca.so.0 (0x00002b94cebf8000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b94cedfc000)
    libc.so.6 => /lib64/libc.so.6 (0x00002b94cf01a000)
    libxpmem.so.0 => /opt/cray/xpmem/2.2.19-7.0.1.1_3.7__gdcf436c.ari/lib64/libxpmem.so.0 (0x00002b94cf3d4000)
    librt.so.1 => /lib64/librt.so.1 (0x00002b94cf5d7000)
    libugni.so.0 => /opt/cray/ugni/6.0.14.0-7.0.1.1_7.10__ge78e5b0.ari/lib64/libugni.so.0 (0x00002b94cf7df000)
    libudreg.so.0 => /opt/cray/udreg/2.3.2-7.0.1.1_3.9__g8175d3d.ari/lib64/libudreg.so.0 (0x00002b94cfa63000)
    libpmi.so.0 => /opt/cray/pe/pmi/5.0.14/lib64/libpmi.so.0 (0x00002b94cfc6d000)
    libgfortran.so.5 => /opt/cray/pe/gcc-libs/libgfortran.so.5 (0x00002b94cfeb6000)
    libm.so.6 => /lib64/libm.so.6 (0x00002b94d0325000)
    libgcc_s.so.1 => /opt/cray/pe/gcc-libs/libgcc_s.so.1 (0x00002b94d065d000)
    libquadmath.so.0 => /opt/cray/pe/gcc-libs/libquadmath.so.0 (0x00002b94d0875000)
    /lib64/ld-linux-x86-64.so.2 (0x00002b94cd954000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00002b94d0ab5000)
    libz.so.1 => /lib64/libz.so.1 (0x00002b94d0cb8000)

good!

Try the test again:

bp000174@daint105:~> unset LD_PRELOAD
bp000174@daint105:~> srun python test.py
2
2
2
2
numprocs=2

all good!

Let's do it the other way around: build NEURON with GNU toolchain 7.3 and make sure everything links against libmpich_gnu_71.so.3.

 module swap gcc/8.3.0 gcc/7.3.0

# from new build dir
cmake .. -DCMAKE_INSTALL_PREFIX=`pwd`/install -DNRN_ENABLE_INTERVIEWS=OFF
make -j
make install

Check that the library now links against libmpich_gnu_71.so.3:

bp000174@daint105:~/nrn/build_gcc7> ldd /users/bp000174/nrn/build_gcc7/install/lib/libnrniv.so
    linux-vdso.so.1 (0x00007fffe9f9c000)
    libreadline.so.7 => /lib64/libreadline.so.7 (0x00002b11614af000)
    libpython3.6m.so.1.0 => /opt/python/3.6.5.7/lib/libpython3.6m.so.1.0 (0x00002b11616fe000)
    librca.so.0 => /opt/cray/rca/2.2.20-7.0.1.1_4.9__g8e3fb5b.ari/lib64/librca.so.0 (0x00002b1161c39000)
    libmpich_gnu_71.so.3 => /opt/cray/pe/lib64/libmpich_gnu_71.so.3 (0x00002b1161e3d000)
    libm.so.6 => /lib64/libm.so.6 (0x00002b11623ff000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b1162737000)
    libstdc++.so.6 => /opt/cray/pe/gcc-libs/libstdc++.so.6 (0x00002b1162955000)
    libgcc_s.so.1 => /opt/cray/pe/gcc-libs/libgcc_s.so.1 (0x00002b1162cde000)
    libc.so.6 => /lib64/libc.so.6 (0x00002b1162ef6000)
    libtinfo.so.6 => /lib64/libtinfo.so.6 (0x00002b11632b0000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00002b11634de000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00002b11636e2000)
    libxpmem.so.0 => /opt/cray/xpmem/2.2.19-7.0.1.1_3.7__gdcf436c.ari/lib64/libxpmem.so.0 (0x00002b11638e5000)
    librt.so.1 => /lib64/librt.so.1 (0x00002b1163ae8000)
    libugni.so.0 => /opt/cray/ugni/6.0.14.0-7.0.1.1_7.10__ge78e5b0.ari/lib64/libugni.so.0 (0x00002b1163cf0000)
    libudreg.so.0 => /opt/cray/udreg/2.3.2-7.0.1.1_3.9__g8175d3d.ari/lib64/libudreg.so.0 (0x00002b1163f74000)
    libpmi.so.0 => /opt/cray/pe/pmi/5.0.14/lib64/libpmi.so.0 (0x00002b116417e000)
    libgfortran.so.4 => /opt/cray/pe/gcc-libs/libgfortran.so.4 (0x00002b11643c7000)
    libquadmath.so.0 => /opt/cray/pe/gcc-libs/libquadmath.so.0 (0x00002b116479b000)
    /lib64/ld-linux-x86-64.so.2 (0x00002b1160d77000)

And re-run the test to see if it now works with the mpi4py already present on the system and is compatible with NEURON:

# make sure to remove your new mpi4py
bp000174@daint105:~>  rm -rf $HOME/.local/lib/python3.6/site-packages/mpi4py*

bp000174@daint105:~> export PYTHONPATH=/users/bp000174/nrn/build_gcc7/install/lib/python:$PYTHONPATH

bp000174@daint105:~> srun python test.py
2
2
2
2
numprocs=2

In summary, to avoid surprises, all software components used together should link against the same MPI library.
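
A small script along these lines can be used to compare the MPI dependencies of the two extension libraries (a sketch; the libnrniv.so path is illustrative and must be adapted to your install prefix, and ldd must be on PATH):

import subprocess
from mpi4py import MPI

def mpi_deps(path):
    # list the MPI-related lines in `ldd <path>`
    out = subprocess.check_output(["ldd", path]).decode()
    return sorted(l.strip() for l in out.splitlines() if "mpich" in l or "libmpi" in l)

print(mpi_deps(MPI.__file__))                        # MPI library used by mpi4py
print(mpi_deps("/path/to/install/lib/libnrniv.so"))  # adjust: MPI library used by NEURON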

Hope this helps!

nrnhines commented 4 years ago

I'd like to propose the hypothesis that if there are two conceptually identical shared libraries that differ only in library name, they can both be loaded but will not share static data. E.g. that can explain why mpi4py can call MPI_Init, but nrnmpi can then call MPI_Initialized(&flag) and flag can be 0. In our case I believe the different MPI libraries were linked against at build time. I doubt it is worth it, but this could be avoided by using -DNRN_ENABLE_MPI_DYNAMIC=ON. Then it would be possible to check whether an MPI library is already loaded (assuming mpi4py was imported first) and dlopen that one.
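
A very rough way to probe that hypothesis from Python (a sketch only, using the two Cray library names from the ldd output above; deliberately loading two MPI copies into one process is exactly the failure mode under discussion, so treat this purely as a diagnostic):

import ctypes
from mpi4py import MPI   # initializes MPI through whichever libmpich mpi4py was linked against

for name in ("libmpich_gnu_71.so.3", "libmpich_gnu_82.so.3"):
    lib = ctypes.CDLL(name)          # each distinctly named copy keeps its own static state
    flag = ctypes.c_int(0)
    lib.MPI_Initialized(ctypes.byref(flag))
    print(name, "reports MPI_Initialized flag =", flag.value)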

This reminds me of an otherwise unrelated issue currently on the table: the current Linux wheel for NEURON seems to have an old readline linked into libnrniv.so, which gets mixed in with the system readline loaded by Python, and that is perhaps the cause of a problem when pdb is used, where typing to the console (when Python is not run interactively) gives a segmentation error at __GI__IO_putc (c=13, fp=0x0) at putc.c:28.

pramodk commented 4 years ago

In our case I believe the different MPI libraries were linked against at build time. I doubt it is worth it, but this could be avoided by using -DNRN_ENABLE_MPI_DYNAMIC=ON. Then it would be possible to check whether an MPI library is already loaded (assuming mpi4py was imported first) and dlopen that one.

Due to the complexity and the possibilities for failure, I also think it's not worth it. It would be good to keep things simple (actually, today it's already complex 😃).

This reminds me of an otherwise unrelated issue currently on the table where it seems that the current linux wheel for NEURON has an old readline linked into the libnrniv.so and that seems to get mixed in with the system readline loaded by python and perhaps is the cause of a problem when pdb is used in that typing to the console (when python is not run interactively) gives a segmentation error with a __GI__IO_putc (c=13, fp=0x0) at putc.c:28

The readline issue is quite specific: when we create a wheel we use the auditwheel program, and it complains if unwanted libraries like readline are linked. As this is a non-MPI issue and @Helveg is installing from source, it won't affect them.

pramodk commented 4 years ago

@Helveg: As the issue is explained (and specific to the Piz Daint software setup), I will close this. Feel free to reopen if you have more questions!

Helveg commented 4 years ago

Ok, is there any open issue tracking the robustness of h.nrnmpi_init?

nrnhines commented 4 years ago

I'm not aware of one.