suny-downstate-medical-center / netpyne

A Python package to facilitate the development, parallel simulation, optimization and analysis of multiscale biological neuronal networks in NEURON.
http://www.netpyne.org
MIT License

random nrn_timeout error for large simulations (128 cores on GCP) #461

Closed (salvadord closed this issue 10 months ago)

salvadord commented 4 years ago

I'm getting quite frequent and random nrn_timeout errors in my simulations. This is for an M1 model with 10k multi-compartment, multi-channel cells, 7k VecStims, 30M synapses, and 5-s simulations, running on 128 cores (4 nodes x 32 cores; 120 GB RAM per node) on Google Cloud CentOS 7 virtual machines, with NEURON 7.7.1-37-gd9605cb master (d9605cb), NetPyNE v0.9.4 and Python 3.6.8.

Here's an example error:

nrn_timeout t=4380.62
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

A couple more details which might be relevant:

Below is the output log of a successful simulation, which includes info on the distribution of cells and conns, as well as load balance at the end:


numprocs=128

Start time:  2019-12-22 22:55:08.962242

Reading command line arguments using syntax: python file.py [simConfig=filepath] [netParams=filepath]
Loading file ../data/v56_batch2b/v56_batch2b_0_0_cfg.json ... 
Loading simConfig...
Importing netParams from ../data/v56_batch2b/v56_batch2b_netParams.py

Creating network of 22 cell populations on 128 hosts...
  Number of cells on node 9: 134 
  Number of cells on node 2: 134 

[...]

  Number of connections on node 5: 68670 
  Number of synaptic contacts on node 5: 278168 

[...]

Running simulation for 5000.0 ms...

  Done; run time = 24927.95 s; real-time ratio: 0.00.

Gathering data...
  Done; gather time = 5.70 s.

Analyzing...
  Cells: 17073
  Connections: [removed during gathering to save memory]
  Spikes: 373727 (4.38 Hz)
   IT2 : 5.818 Hz
   SOM2 : 11.850 Hz
   PV2 : 2.940 Hz
   IT4 : 0.754 Hz
   IT5A : 38.843 Hz
   SOM5A : 1.442 Hz
   PV5A : 32.018 Hz
   IT5B : 1.072 Hz
   PT5B : 5.916 Hz
   SOM5B : 0.729 Hz
   PV5B : 26.749 Hz
   IT6 : 1.877 Hz
   CT6 : 0.178 Hz
   SOM6 : 10.056 Hz
   PV6 : 0.567 Hz
   TPO : 2.498 Hz
   TVL : 0.000 Hz
   S1 : 2.563 Hz
   S2 : 2.470 Hz
   cM1 : 1.254 Hz
   M2 : 1.322 Hz
   OC : 2.495 Hz
  Simulated time: 5.0 s; 128 workers
  Run time: 24927.95 s
Saving output as ../data/v56_batch2b/v56_batch2b_0_0.json  ... 
Finished saving!
  Done; saving time = 4.43 s.
Plotting LFP ...
Plotting raster...
Plotting recorded cell traces ... trace
  Done; plotting time = 150.12 s

Total time = 25606.27 s

End time:  2019-12-23 06:01:55.235592
max_comp_time: 24692.92587293384
min_comp_time: 20009.316901733127
avg_comp_time: 21740.200874486483
load_balance: 0.8804222304946103

spike exchange time (run_time-comp_time):  235.026302206541
nrnhines commented 4 years ago

What one would like for diagnosis is the call stack traces at the timeout for each rank. That would allow one to see why not all ranks have called the same collective and diagnose further with that info in mind. On some machines the simulation may be run with ddt to directly view the stack traces, but I have no familiarity with that.

If stack traces are not available, then I was thinking that one could place a function at the beginning of each collective wrapper in src/nrnmpi/nrnspike.c. The function would take an argument which would be a number associated with the collective. The function would broadcast that arg from rank 0 to all other ranks, and each rank would compare the rank 0 number with the number associated with the collective wrapper it was called from and print a message if different. This isn't a full stack trace but might give enough info to further diagnose the problem.

It is unusual for a timeout to happen so late in a simulation (t=4380.62); I wonder if that indicates running out of some resource, such as a memory leak or some excessively long Vector... but there was no error message except the timeout. Anyway, I acknowledge the difficulty of diagnosis when it takes 25000 s to get to a nondeterministic error.

Does this simulation use callbacks to the interpreter? Does the timeout ever occur when you are NOT calculating LFP? Does the timeout ever occur when you are NOT recording the two voltage traces?
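
For illustration only, the same diagnostic idea can be sketched with mpi4py (an assumption about tooling; the change NEURON actually used is the C diff to src/nrnmpi/mpispike.c shown further down this thread): every rank passes the ID of the collective it is about to enter, rank 0's ID is broadcast, and any rank that disagrees prints a message and aborts.

# Sketch of the collective-mismatch check using mpi4py (illustration only;
# the real fix is a C change inside NEURON's MPI wrappers).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def check(collective_id):
    # rank 0 broadcasts the ID of the collective it is about to call
    rank0_id = comm.bcast(collective_id, root=0)
    if rank0_id != collective_id:
        print("%d calls %d but 0 calls %d" % (rank, collective_id, rank0_id))
    comm.Barrier()
    if rank0_id != collective_id:
        comm.Abort(collective_id)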

nrnhines commented 4 years ago

Presumptively we will find a bug. But I do wonder about a lack of MPI fault tolerance at that time scale on that machine.

salvadord commented 4 years ago

Thanks @nrnhines! See inline responses below:

"On some machines the simulation may be run with ddt to directly view the stack traces but I have no familiarity with that."

"if stack traces are not available then I was thinking that one could place a function at the beginning of each collective wrapper in src/nrnmpi/nrnspike.c . The function would take an argument which would be a number associated with the collective. The function would broadcast that arg from rank 0 to all other ranks and each rank would compare the rank 0 number with the number associated with the collective wrapper it was called from and print a message if different. This isn't a full stack trace but might give enough info to further diagnose the problem."

"It is unusual for a timeout to happen so late in a simulation (t=4380.62) I wonder if that indicates running out of some resource such as a memory leak or some excessively long Vector...but there was no error message except the timeout."

"Anyway, I acknowledge the difficulty of diagnosis when it takes 25000s to get to a nondeterministic error. Does this simulation use callbacks to the interpreter?"

"Does the timeout ever occur when you are NOT calculating LFP?"

"Does the timeout ever occur when you are NOT recording the two voltage traces."

"Presumptively we will find a bug. But I do wonder about a lack of MPI fault tolerance at that time scale on that machine."

nrnhines commented 4 years ago

git fetch
git pull collective-timeout-debug

Build and run as normal and see if it prints anything from the printf:

printf("%d calls %d but 0 calls %d\n", nrnmpi_myid, i, buf);
salvadord commented 4 years ago

I installed the new version and reran 11 of the simulations that had previously failed; 3 of them failed again (at 2.4, 2.7 and 2.7 s), but weirdly only 1 of them showed the nrn_timeout message; the other 2 didn't (they just stopped). Note these jobs are run using the SLURM job scheduler, so their output is stored in 2 files (.run and .err). Below are the tail lines of each output file for the 3 failed sims:

[ext_salvadordura_gmail_com@g1-login1 v56_batch2]$ tail v56_batch2_0_1.run
1.5 s
1.6 s
1.7 s
1.8 s
1.9 s
2 s
2.1 s
2.2 s
2.3 s
2.4 s

[ext_salvadordura_gmail_com@g1-login1 v56_batch2]$ tail v56_batch2_4_4.run
1.8 s
1.9 s
2 s
2.1 s
2.2 s
2.3 s
2.4 s
2.5 s
2.6 s
2.7 s

[ext_salvadordura_gmail_com@g1-login1 v56_batch2]$ tail v56_batch2_4_1.run
2.7 s
nrn_timeout t=2770.92
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

[ext_salvadordura_gmail_com@g1-login1 v56_batch2]$ tail v56_batch2_0_1.err
/usr/bin/id: cannot find name for group ID 2645697571
NEURON -- VERSION 7.8.0-59-gd40331e collective-timeout-debug (d40331e) 2019-12-27
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2019
See http://neuron.yale.edu/neuron/credits

Additional mechanisms from files
 mod/ar_traub.mod mod/cadad.mod mod/cadyn.mod mod/cagk.mod mod/cal_mh.mod mod/cal_mig.mod mod/cancr.mod mod/canin.mod mod/can_mig.mod mod/catcb.mod mod/cat_mig.mod mod/cat_traub.mod mod/ch_CavL.mod mod/ch_CavN.mod mod/ch_KCaS.mod mod/ch_Kdrfastngf.mod mod/ch_KvAngf.mod mod/ch_KvCaB.mod mod/ch_leak.mod mod/ch_Navngf.mod mod/gabab.mod mod/h_BS.mod mod/HCN1.mod mod/h_harnett.mod mod/hin.mod mod/h_kole.mod mod/h_migliore.mod mod/ican_sidi.mod mod/iccr.mod mod/IC.mod mod/iconc_Ca.mod mod/ikscr.mod mod/IKsin.mod mod/kap_BS.mod mod/kapcb.mod mod/kapin.mod mod/kBK.mod mod/kca.mod mod/kctin.mod mod/kdmc_BS.mod mod/kdr_BS.mod mod/kdrcr.mod mod/kdrin.mod mod/km.mod mod/kv.mod mod/MyExp2SynBB.mod mod/my_exp2syn.mod mod/MyExp2SynNMDABB.mod mod/nafcr.mod mod/nafx.mod mod/nap_sidi.mod mod/nax_BS.mod mod/naz.mod mod/Nca.mod mod/savedist.mod mod/vecstim.mod

[ext_salvadordura_gmail_com@g1-login1 v56_batch2]$ tail v56_batch2_4_4.err
/usr/bin/id: cannot find name for group ID 2645697571
NEURON -- VERSION 7.8.0-59-gd40331e collective-timeout-debug (d40331e) 2019-12-27
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2019
See http://neuron.yale.edu/neuron/credits

Additional mechanisms from files
 mod/ar_traub.mod mod/cadad.mod mod/cadyn.mod mod/cagk.mod mod/cal_mh.mod mod/cal_mig.mod mod/cancr.mod mod/canin.mod mod/can_mig.mod mod/catcb.mod mod/cat_mig.mod mod/cat_traub.mod mod/ch_CavL.mod mod/ch_CavN.mod mod/ch_KCaS.mod mod/ch_Kdrfastngf.mod mod/ch_KvAngf.mod mod/ch_KvCaB.mod mod/ch_leak.mod mod/ch_Navngf.mod mod/gabab.mod mod/h_BS.mod mod/HCN1.mod mod/h_harnett.mod mod/hin.mod mod/h_kole.mod mod/h_migliore.mod mod/ican_sidi.mod mod/iccr.mod mod/IC.mod mod/iconc_Ca.mod mod/ikscr.mod mod/IKsin.mod mod/kap_BS.mod mod/kapcb.mod mod/kapin.mod mod/kBK.mod mod/kca.mod mod/kctin.mod mod/kdmc_BS.mod mod/kdr_BS.mod mod/kdrcr.mod mod/kdrin.mod mod/km.mod mod/kv.mod mod/MyExp2SynBB.mod mod/my_exp2syn.mod mod/MyExp2SynNMDABB.mod mod/nafcr.mod mod/nafx.mod mod/nap_sidi.mod mod/nax_BS.mod mod/naz.mod mod/Nca.mod mod/savedist.mod mod/vecstim.mod

[ext_salvadordura_gmail_com@g1-login1 v56_batch2]$ tail v56_batch2_4_1.err
/usr/bin/id: cannot find name for group ID 2645697571
NEURON -- VERSION 7.8.0-59-gd40331e collective-timeout-debug (d40331e) 2019-12-27
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2019
See http://neuron.yale.edu/neuron/credits

Additional mechanisms from files
 mod/ar_traub.mod mod/cadad.mod mod/cadyn.mod mod/cagk.mod mod/cal_mh.mod mod/cal_mig.mod mod/cancr.mod mod/canin.mod mod/can_mig.mod mod/catcb.mod mod/cat_mig.mod mod/cat_traub.mod mod/ch_CavL.mod mod/ch_CavN.mod mod/ch_KCaS.mod mod/ch_Kdrfastngf.mod mod/ch_KvAngf.mod mod/ch_KvCaB.mod mod/ch_leak.mod mod/ch_Navngf.mod mod/gabab.mod mod/h_BS.mod mod/HCN1.mod mod/h_harnett.mod mod/hin.mod mod/h_kole.mod mod/h_migliore.mod mod/ican_sidi.mod mod/iccr.mod mod/IC.mod mod/iconc_Ca.mod mod/ikscr.mod mod/IKsin.mod mod/kap_BS.mod mod/kapcb.mod mod/kapin.mod mod/kBK.mod mod/kca.mod mod/kctin.mod mod/kdmc_BS.mod mod/kdr_BS.mod mod/kdrcr.mod mod/kdrin.mod mod/km.mod mod/kv.mod mod/MyExp2SynBB.mod mod/my_exp2syn.mod mod/MyExp2SynNMDABB.mod mod/nafcr.mod mod/nafx.mod mod/nap_sidi.mod mod/nax_BS.mod mod/naz.mod mod/Nca.mod mod/savedist.mod mod/vecstim.mod
salvadord commented 4 years ago

I'm rerunning the same 11 sims but without recording LFP and voltages.

salvadord commented 4 years ago

4/11 jobs with no recording of LFP/voltage crashed with nrn_timeout (I cancelled the rest to save resources). They all showed the nrn_timeout error message but no additional info:

nrn_timeout t=1109.17
nrn_timeout t=1847.95
nrn_timeout t=1313.39
nrn_timeout t=2574.72

nrnhines commented 4 years ago

I wonder if in addition to turning off the recordLFP flag you can also turn off the printRunTime flag. The idea is to avoid putting any kind of callback event onto the event queue.
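
For reference, switching those flags off in a NetPyNE simConfig might look roughly like the sketch below (recordLFP, recordTraces and printRunTime are standard NetPyNE cfg options; how this particular model builds its cfg is an assumption):

# Minimal sketch, assuming the model's cfg is a NetPyNE SimConfig object.
from netpyne import specs

cfg = specs.SimConfig()
cfg.recordLFP = []            # no LFP electrodes, so no LFP recording callback
cfg.recordTraces = {}         # no voltage-trace recording
cfg.printRunTime = False      # no periodic progress-print callback on the event queue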

nrnhines commented 4 years ago

If the test above also fails, please try the following change, which forces output so we can check whether ranks 0 and 1 are entering the check_ function but not leaving it. I would prefer not to restrict it to rank < 2, but otherwise I fear the output would be too voluminous.

hines@hines-T7500:~/neuron/collective-timeout-debug/src/nrnmpi$ git diff
diff --git a/src/nrnmpi/mpispike.c b/src/nrnmpi/mpispike.c
index 07b1603a..195141e9 100644
--- a/src/nrnmpi/mpispike.c
+++ b/src/nrnmpi/mpispike.c
@@ -99,14 +99,21 @@ static void make_spikebuf_type() {

 static void check_(int i) {
        int buf = i;
+  if (nrnmpi_myid < 2) {
+    extern double t;
+    printf("%d enter check %d  t=%g\n", nrnmpi_myid, i, t);
+  }
        guard(MPI_Bcast(&buf, 1, MPI_INT, 0, nrnmpi_comm));
        if (buf != i) {
                printf("%d calls %d but 0 calls %d\n", nrnmpi_myid, i, buf);
        }
        guard(MPI_Barrier(nrnmpi_comm));
        if (buf != i) {
-               nrnmpi_abort(0);
+               nrnmpi_abort(i);
        }
+  if (nrnmpi_myid < 2) {
+    printf("%d leave check %d\n", nrnmpi_myid, i);
+  }
 }

 int nrnmpi_spike_exchange() {
hines@hines-T7500:~/neuron/collective-timeout-debug/src/nrnmpi$ 
salvadord commented 4 years ago

Ok thanks, will try removing all callbacks, and adding the code above.

nrnhines commented 4 years ago

It occurred to me that in the call to nrnmpi_abort(i); above, one could encode i, buf, and nrnmpi_myid into the error message, and that could be quite helpful. How about changing the i to nrnmpi_myid*10000 + i*100 + buf?

salvadord commented 4 years ago

ok, so I will change it to: nrnmpi_abort(nrnmpi_myid*10000 + i*100 + buf); thx

nrnhines commented 4 years ago

That should be good for up to 200k ranks :)
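
To make the decoding concrete: a hypothetical abort code of 451203 under that scheme would mean rank 45 entered collective 12 while rank 0 entered collective 3 (the encoded value stays within a 32-bit int up to roughly 214k ranks, hence the 200k remark). A tiny decoder:

# Decode a hypothetical abort code produced by
# nrnmpi_abort(nrnmpi_myid*10000 + i*100 + buf)
code = 451203                      # hypothetical value reported by MPI_ABORT
rank, rest = divmod(code, 10000)   # rank = 45
i, buf = divmod(rest, 100)         # i = 12 (this rank's collective), buf = 3 (rank 0's)
print(rank, i, buf)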

salvadord commented 4 years ago

Tried removing all callbacks (LFP, traces and printRunTime) and 2/11 jobs crashed after ~2.5 s; one showed the nrn_timeout error message, but the other one didn't. I'm cancelling the rest of the jobs now.

Will try the new check_ code tomorrow.

salvadord commented 4 years ago

@nrnhines I'm running some sims with the new check_ code -- just wanted to check that the error printf will only show up if there is an nrn_timeout, right?

salvadord commented 4 years ago

Ok, so one of the 12 sims now crashed at 1.2 s and there were no error messages at all, i.e. not even the nrn_timeout error message:

[ext_salvadordura_gmail_com@g1-login1 v56_batch2]$ tail v56_batch2_2_2.run
0.3 s
0.4 s
0.5 s
0.6 s
0.7 s
0.8 s
0.9 s
1 s
1.1 s
1.2 s

[ext_salvadordura_gmail_com@g1-login1 v56_batch2]$ tail v56_batch2_2_2.err
/usr/bin/id: cannot find name for group ID 2645697571
NEURON -- VERSION 7.8.0-62-g1d75444 collective-timeout-debug (1d75444) 2019-12-30
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2019
See http://neuron.yale.edu/neuron/credits

Additional mechanisms from files
 mod/ar_traub.mod mod/cadad.mod mod/cadyn.mod mod/cagk.mod mod/cal_mh.mod mod/cal_mig.mod mod/cancr.mod mod/canin.mod mod/can_mig.mod mod/catcb.mod mod/cat_mig.mod mod/cat_traub.mod mod/ch_CavL.mod mod/ch_CavN.mod mod/ch_KCaS.mod mod/ch_Kdrfastngf.mod mod/ch_KvAngf.mod mod/ch_KvCaB.mod mod/ch_leak.mod mod/ch_Navngf.mod mod/gabab.mod mod/h_BS.mod mod/HCN1.mod mod/h_harnett.mod mod/hin.mod mod/h_kole.mod mod/h_migliore.mod mod/ican_sidi.mod mod/iccr.mod mod/IC.mod mod/iconc_Ca.mod mod/ikscr.mod mod/IKsin.mod mod/kap_BS.mod mod/kapcb.mod mod/kapin.mod mod/kBK.mod mod/kca.mod mod/kctin.mod mod/kdmc_BS.mod mod/kdr_BS.mod mod/kdrcr.mod mod/kdrin.mod mod/km.mod mod/kv.mod mod/MyExp2SynBB.mod mod/my_exp2syn.mod mod/MyExp2SynNMDABB.mod mod/nafcr.mod mod/nafx.mod mod/nap_sidi.mod mod/nax_BS.mod mod/naz.mod mod/Nca.mod mod/savedist.mod mod/vecstim.mod
nrnhines commented 4 years ago

This makes no sense. Please send me your diff of the collective-timeout-debug branch. I would have expected a + at the end of the Version hash because of the added printf.

salvadord commented 4 years ago

The date (2019-12-30) and commit (1d75444) shown in the version match the last commit in the branch: https://github.com/neuronsimulator/nrn/commit/1d75444e86312bd9faf7f132f22c41e3902261e8

Here's the diff:

[ext_salvadordura_gmail_com@g1-login1 nrn_debug2]$ git diff HEAD^
diff --git a/src/nrnmpi/mpispike.c b/src/nrnmpi/mpispike.c
index 07b1603..d2abfe9 100644
--- a/src/nrnmpi/mpispike.c
+++ b/src/nrnmpi/mpispike.c
@@ -105,7 +105,7 @@ static void check_(int i) {
        }
        guard(MPI_Barrier(nrnmpi_comm));
        if (buf != i) {
-               nrnmpi_abort(0);
+               nrnmpi_abort(nrnmpi_myid*10000 + i*100 + buf);
        }
 }

[ext_salvadordura_gmail_com@g1-login1 nrn_debug2]$ git log -2
commit 1d75444e86312bd9faf7f132f22c41e3902261e8
Author: Michael Hines <michael.hines@yale.edu>
Date:   Mon Dec 30 09:27:00 2019 -0500

    encode rank, etc. into mpi_abort integer.

commit c4c564ba279284841d817a11480032eea7a74cbe
Merge: d40331e d119e45
Author: Michael Hines <michael.hines@yale.edu>
Date:   Sat Dec 28 17:51:14 2019 -0500

    Merge branch 'master' into collective-timeout-debug
nrnhines commented 4 years ago

I pushed a change to the collective-timeout-debug branch that has the printfs mentioned earlier as well as the nrnmpi_abort informational arg. If this doesn't give a clue, then we have to figure out how to get a stack trace on an abort.

salvadord commented 4 years ago

I get the following "undefined reference to t" error during make (doesn't happen if I try with the previous commit):

make[3]: Entering directory `/home/ext_salvadordura_gmail_com/nrn_debug3/src/nrnmpi'
/bin/sh ../../libtool  --tag=CC   --mode=compile mpicc -DHAVE_CONFIG_H -I. -I../.. -I../../src/nrnpython -I../../src/nrnoc -I../../src/oc -I../../src/parallel -I../../src/nrnjava -I../../src/nrncvode -I../../src/sundials    -I../../src/oc  -I. -g -O2 -MT mpispike.lo -MD -MP -MF .deps/mpispike.Tpo -c -o mpispike.lo mpispike.c
libtool: compile:  mpicc -DHAVE_CONFIG_H -I. -I../.. -I../../src/nrnpython -I../../src/nrnoc -I../../src/oc -I../../src/parallel -I../../src/nrnjava -I../../src/nrncvode -I../../src/sundials -I../../src/oc -I. -g -O2 -MT mpispike.lo -MD -MP -MF .deps/mpispike.Tpo -c mpispike.c  -fPIC -DPIC -o .libs/mpispike.o
mv -f .deps/mpispike.Tpo .deps/mpispike.Plo
/bin/sh ../../libtool  --tag=CC   --mode=link mpicc -I../../src/oc  -I. -g -O2   -o libnrnmpi.la -rpath /usr/local/nrn/x86_64/lib nrnmpi.lo bbsmpipack.lo mpispike.lo nrnrt.lo  -lm -ldl
libtool: link: rm -fr  .libs/libnrnmpi.la .libs/libnrnmpi.lai .libs/libnrnmpi.so .libs/libnrnmpi.so.0 .libs/libnrnmpi.so.0.0.0
libtool: link: mpicc -shared  -fPIC -DPIC  .libs/nrnmpi.o .libs/bbsmpipack.o .libs/mpispike.o .libs/nrnrt.o   -lm -ldl  -O2   -Wl,-soname -Wl,libnrnmpi.so.0 -o .libs/libnrnmpi.so.0.0.0
libtool: link: (cd ".libs" && rm -f "libnrnmpi.so.0" && ln -s "libnrnmpi.so.0.0.0" "libnrnmpi.so.0")
libtool: link: (cd ".libs" && rm -f "libnrnmpi.so" && ln -s "libnrnmpi.so.0.0.0" "libnrnmpi.so")
libtool: link: ( cd ".libs" && rm -f "libnrnmpi.la" && ln -s "../libnrnmpi.la" "libnrnmpi.la" )
make[3]: Leaving directory `/home/ext_salvadordura_gmail_com/nrn_debug3/src/nrnmpi'
Making all in oc
make[3]: Entering directory `/home/ext_salvadordura_gmail_com/nrn_debug3/src/oc'
make  all-am
make[4]: Entering directory `/home/ext_salvadordura_gmail_com/nrn_debug3/src/oc'
/bin/sh ../../libtool  --tag=CC   --mode=link mpicc  -g -O2   -o oc ocmain.o ocnoiv.o nocable.o modlreg.o ockludge.o libocxt.la liboc.la ../nrnmpi/libnrnmpi.la ../memacs/libmemacs.la ../readline/libreadline.la -lncurses   -lm -ldl
libtool: link: mpicc -g -O2 -o .libs/oc ocmain.o ocnoiv.o nocable.o modlreg.o ockludge.o  ./.libs/libocxt.so ./.libs/liboc.so -lX11 ../nrnmpi/.libs/libnrnmpi.so ../memacs/.libs/libmemacs.so ../readline/.libs/libreadline.so -lncurses -lm -ldl -Wl,-rpath -Wl,/usr/local/nrn/x86_64/lib
**../nrnmpi/.libs/libnrnmpi.so: undefined reference to `t'**
collect2: error: ld returned 1 exit status
make[4]: *** [oc] Error 1
make[4]: Leaving directory `/home/ext_salvadordura_gmail_com/nrn_debug3/src/oc'
make[3]: *** [all] Error 2
make[3]: Leaving directory `/home/ext_salvadordura_gmail_com/nrn_debug3/src/oc'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/ext_salvadordura_gmail_com/nrn_debug3/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/ext_salvadordura_gmail_com/nrn_debug3'
make: *** [all] Error 2
salvadord commented 4 years ago

Btw, I found this post suggesting using the backtrace() functions to print the stack trace on abort (not really sure if useful): https://stackoverflow.com/questions/77005/how-to-automatically-generate-a-stacktrace-when-my-program-crashes

nrnhines commented 4 years ago

I didn't try building with autotools. I used cmake, which puts everything into libnrniv.so, so the extern t was not a problem. If you are restricted to running configure, I can work around the issue. This definitely would not work at present with dynamically loaded MPI.

backtrace sounds like a definite possibility. One puzzle is the variability of aborting via timeout vs. just quietly quitting. I'll experiment with it a bit and may be able to put it into the branch. There ought also to be a way of autogenerating the backtrace files, or hopefully tiny core dumps, as a consequence of calling abort (called, I presume, by MPI_Abort).

nrnhines commented 4 years ago

I read the documentation for MPI_Abort again and for the first time noticed the statement "This routine should not be used from within a signal handler." Which is exactly what is happening in static void timed_out(int sig) in src/nrnoc/nrntimeout.c. I've never experienced a problem with this before, though, and never the symptom of printf not getting to stdout and the system exiting without an error message. If there is a problem with MPI_Abort here, I don't see how returning would help, as everyone is likely stuck in an MPI collective. And calling abort(...) would make me worry that not all the MPI processes would exit. Any suggestions @pramodk?

pramodk commented 4 years ago

General comments: whenever I see such timeout issues in a cloud environment, I first think about where the nodes are allocated. For example, will they be in a single rack or spread across data centres? I mention this because on many cloud environments this is not guaranteed, and it can cause significant communication issues (we saw this with some BBP tests on the cloud).

We should check how the cloud nodes are allocated.

"On some machines the simulation may be run with ddt to directly view the stack traces but I have no familiarity with that."

DDT requires a license and we won't have that for Google Cloud. If needed, we can install other open-source tools for diagnosis.

"It is unusual for a timeout to happen so late in a simulation (t=4380.62) I wonder if that indicates running out of some resource such as a memory leak or some excessively long Vector...but there was no error message except the timeout."

"Anyway, I acknowledge the difficulty of diagnosis when it takes 25000s to get to a nondeterministic error."

If we reduce the model size considerably and keep the number of cores and nodes the same, does this error still appear?

"I'm trying to run on a single node with 96 cores to see if that reduces the timeouts (currently on 4 nodes x 32 cores). Any other suggestions to test this?"

An answer to this would be very helpful! Did you see this issue on a single node?

"I read the documentation for MPI_Abort again and for the first time noticed the statement: This routine should not be used from within a signal handler."

I also think this won’t be an issue.

"Any suggestions @pramodk?"

I wonder if we should run a small MPI benchmark at the beginning to see if MPI communication is stable on the allocated nodes.
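
As a rough example, a pre-flight check in that spirit could be a tiny mpi4py script run on the same allocation just before the simulation (mpi4py availability is an assumption; an established suite such as the OSU micro-benchmarks would serve the same purpose):

# Minimal pre-flight latency probe: time a batch of small allreduces
# across all ranks before launching the real simulation.
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
comm.Barrier()                     # start everyone together
n = 1000
t0 = time.time()
for _ in range(n):
    comm.allreduce(1, op=MPI.SUM)  # latency-bound small collective
mean_ms = (time.time() - t0) / n * 1e3
if comm.Get_rank() == 0:
    print("mean allreduce latency over %d ranks: %.3f ms" % (comm.Get_size(), mean_ms))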

salvadord commented 4 years ago

Thanks @pramodk, those are very good suggestions.

I did try running on 96-core nodes (both regular and with high memory), but in both cases got the following error, which I interpreted as memory-related due to too many processes on the same node:

OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 2514728 max
>>> Traceback (most recent call last):
  File "/usr/local/lib64/python3.6/site-packages/numpy/core/__init__.py", line 17, in <module>
    from . import multiarray
  File "/usr/local/lib64/python3.6/site-packag>>> Traceback (most recent call last):
  File "/usr/local/lib64/python3.6/site-packages/numpy/core/__init__.py", line 17, in <module>
    from . import multiarray
  File "/usr/local/lib64/python3.6/site-packages/numpy/core/multiarray.py", line 14, in <module>
    from . import overrides
  File "/usr/local/lib64/python3.6/site-packages/numpy/core/overrides.py", line 7, in <module>
    from numpy.core._multiarray_umath import (
ImportError: PyCapsule_Import could not import module "datetime"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "init.py", line 15, in <module>
    import matplotlib; matplotlib.use('Agg')  # to avoid graphics error in servers
  File "/usr/local/lib64/python3.6/site-packages/matplotlib/__init__.py", line 138, in <module>
    from . import cbook, rcsetup
  File "/usr/local/lib64/python3.6/site-packages/matplotlib/cbook/__init__.py", line 31, in <module>
    import numpy as np
  File "/usr/local/lib64/python3.6/site-packages/numpy/__init__.py", line 142, in <module>
    from . import core
  File "/usr/local/lib64/python3.6/site-packages/numpy/core/__init__.py", line 47, in <module>
    raise ImportError(msg)
ImportError: 

IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

Importing the numpy c-extensions failed.
- Try uninstalling and reinstalling numpy.
- If you have already done that, then:
  1. Check that you expected to use Python3.6 from "/usr/bin/python3",
     and that you have no directories in your PATH or PYTHONPATH that can
     interfere with the Python and numpy version "1.17.2" you're trying to use.
  2. If (1) looks fine, you can open a new issue at
     https://github.com/numpy/numpy/issues.  Please include details on:
     - how you installed Python
     - how you installed numpy
     - your operating system
     - whether or not you have multiple versions of Python installed
     - if you built from source, your compiler versions and ideally a build log

- If you're working with a numpy git repository, try `git clean -xdf`
  (removes all files not under version control) and rebuild numpy.

Note: this error has many possible causes, so please don't comment on
an existing issue about this - open a new one instead.

Original error was: PyCapsule_Import could not import module "datetime"
salvadord commented 4 years ago

I will try 1) running on a single 64-core node, and 2) running smaller models on 4 nodes * 32 cores to see if get the same issues.

salvadord commented 4 years ago

Fixed the 96-core issue by running:

export OMP_NUM_THREADS=1
export USE_SIMPLE_THREADED_LEVEL3=1

So I now have 10 sims each running on a single 96-core node ... let's see if I get timeouts here or not...

salvadord commented 4 years ago

No timeouts!! So it seems the cause was slow MPI communication between the GCP 32-core nodes. I'm running some more sims to confirm. The only issue is that the largest GCP nodes have 96 cores, so that would be the limit for individual simulations. I will bring this up with the Google guys to see if there's a way to guarantee faster communication between nodes.

Thanks so much @nrnhines and @pramodk for your help!

pramodk commented 4 years ago

"I will bring this up with the Google guys to see if there's a way to guarantee faster communication between nodes."

Considering the timeouts happen sporadically at different times during the simulation, it seems the network communication performance (specifically latency) is not stable. If they have suggestions to alleviate this, that would be great (on AWS we often use the Placement Groups strategy for the same purpose).

Otherwise we can look into disabling timeouts in NEURON. By the way, have you tried increasing the timeout using ParallelContext.timeout? Not sure if it applies here, but in some of our sims (specific use cases) we have increased it by ~25x, i.e. to 500 s.
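
For reference, raising the timeout from a NEURON/NetPyNE script might look like the minimal sketch below (pc.timeout is the ParallelContext method mentioned above; getting the ParallelContext via h.ParallelContext() before the run is an assumption about how this model is driven):

from neuron import h

pc = h.ParallelContext()
pc.timeout(500)   # raise the nrn_timeout watchdog to 500 s
# pc.timeout(0) disables the watchdog entirely (as tried earlier in this thread)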

salvadord commented 4 years ago

I had already tried increasing ParallelContext.timeout and even disabling it (setting it to 0), but then some jobs just took forever (e.g. a job hadn't finished after 28 hrs, even though similar ones took ~10 hrs). So it seems these unstable latencies can be quite persistent.

I ran 20 other jobs on the 96-core node and 2 of them actually timed out (no error msg) after ~2 s of simulated time (~3 hours of run time). The rest completed ok. So timeouts still occur, but at a lower rate than when using multiple nodes (~6% vs >25% chance). This small percentage on the single node might be solved by increasing ParallelContext.timeout -- I will try 500 s as you suggested.

I will also ask Google about the Placement Groups strategy. Thanks again!

nrnhines commented 4 years ago

Given that kind of latency, I don't think we can rule out a bug in NEURON. The puzzle for me has become how it can fail without a timeout error (on the jobs with fewer than 96 cores per node). Is it just that printing the message does not make it to stdout? It does not seem like that is the case, as I gather that there is no message from MPI_Abort either.

If there was some consistent indication that there really was a timeout, it would make sense to get some more info with the backtrace. Or perhaps, instead of MPI_Abort, print the backtrace for every rank, set an error flag, longjmp back to the beginning, check the error flag and abort() (note everyone would abort, as we would change things so that everyone gets a timeout instead of just rank 0). Given that the 96-core-node job did give a timeout whenever it failed, I will put a backtrace print into the collective-timeout-debug branch.

Could be I'm barking up the wrong tree :) I'm semi-disappointed that the guard checks around every MPI collective call, and the check function with its broadcast to see if everyone is participating in the same collective, ended up not being useful. Although it does strengthen the possibility that it is a cloud issue.

salvadord commented 4 years ago

I ran 7 more sims (including the 2 that failed) with timeout set to 500sec, and they completed ok.

@nrnhines I agree it is still worth investigating if there are any NEURON issues. I will try running with the latest changes in timeout-debug branch (I will build with cmake). Thanks.

salvadord commented 4 years ago

We found this on the GCP website, which sort of supports the latency hypothesis:

"Inter-node latency. For tightly-coupled HPC apps, performance might be sensitive to the inter-node latency between nodes in the cluster. Because Google Cloud provides nodes with sizes up to 64 cores, you can run 64-way parallel jobs without traversing nodes. In most cases, jobs of around 1000 cores or smaller perform reasonably well on non-specialized network hardware."

https://cloud.google.com/solutions/using-clusters-for-large-scale-technical-computing