nest / nest-simulator

The NEST simulator
http://www.nest-simulator.org
GNU General Public License v2.0

No speedup from mpi parallelization (SLI and PyNEST) #238

Closed. antolikjan closed this issue 7 years ago.

antolikjan commented 8 years ago

NEST 2.10.0, Intel compiler suite, Red Hat 6.7

Compiled with the following options: configure CC=mpiicc CFLAGS="-openmp -mt_mpi -O3" CXX=mpiicpc CXXFLAGS="-openmp -mt_mpi -O3" LDFLAGS="-L/.../mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -L/.../compiler/lib/intel64 -liomp5"

To test parallelization we use the hpc_benchmark.sli script and a slightly modified version of the brunel_alpha_nest.py example script.

The parallelization with threads works perfectly, achieving nearly linear speedup with up to 32 threads (the maximum number of cores on a single node).

However, if I schedule a job with multiple MPI processes (a single thread per MPI process), I observe no speedup; if anything, the simulation runs slightly slower than in the single-thread/single-process condition, for both the SLI and PyNEST scripts.
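For reference, a minimal sketch of how the two modes are typically selected in PyNEST; the kernel parameter names are standard NEST 2.x settings, not taken from the attached scripts:

    import nest

    # Threaded run: request 32 threads inside a single process.
    nest.SetKernelStatus({'local_num_threads': 32})

    # MPI run: launch the script with e.g. "mpirun -np 32 python script.py"
    # and set the total number of virtual processes to match the number of
    # MPI ranks (times threads per rank, if any):
    # nest.SetKernelStatus({'total_num_virtual_procs': 32})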

The outputs of hpc_benchmark.sli when run with 1 and 16 MPI processes: hpc_benchmark_mpiproc=16_out.tar.gz hpc_benchmark_mpiproc=1_out.tar.gz

The LoadLeveler submit script with which I submit the jobs: sub.ll.tar.gz

antolikjan commented 8 years ago

I have found another problem which might be related to the above, so I am posting it here in the hope that it will help diagnose what is going on.

If I increase the network size above ~50000 neurons (645 mil. synapses), the simulation crashes with the following allocation errors, regardless of how many MPI processes I select (i.e. even when a sufficient amount of memory should have been allocated):

mem_error.tar.gz

This is using the brunel_alpha_nest.py script.

heplesser commented 8 years ago

@antolikjan The hpc_benchmark.sli script sets the total number of virtual processes to 1. How did you adjust this when running in parallel? I am surprised because in both your mpiproc=16 and your mpiproc=1 logs, NEST reports 2253 local nodes being simulated. This number should decrease in proportion to the number of MPI processes.
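A quick way to verify this from PyNEST is to count how many of the created nodes are local on each rank. The snippet below is a hypothetical check using standard NEST 2.x calls, not part of the benchmark scripts:

    import nest

    neurons = nest.Create('iaf_psc_alpha', 10000)

    # With a working MPI build, each rank should own roughly
    # 10000 / nest.NumProcesses() of these nodes.
    n_local = sum(nest.GetStatus(neurons, 'local'))
    print('rank %d of %d: %d local nodes'
          % (nest.Rank(), nest.NumProcesses(), n_local))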

antolikjan commented 8 years ago

@heplesser Yes, that is strange, but that is the behavior I observe. To be clear, I am attaching the exact combination of files that I use to run hpc_benchmark.sli with 32 processes:

hpc_benchmark.sli.tar.gz, hpc_b_output.tar.gz (the output of the simulation), sub.ll.tar.gz (the LoadLeveler submit script)

As you can see, I am setting the number of virtual processes in the SLI script (and accordingly in the LoadLeveler submit script), and the output shows that 32 instances of the NEST simulator are indeed executed, but, as you point out, the number of local nodes remains 2253. Any idea what can cause this specific combination of errors? To me it seems as if some MPI variables are not correctly propagated.
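One way to test whether the 32 instances actually know about each other is to print what NEST itself reports about the MPI setup; this is a hypothetical two-line check, not part of the attached scripts. With a working MPI build each instance prints a different rank out of 32, whereas unconnected processes all report rank 0 of 1:

    import nest
    print('rank %d of %d MPI processes' % (nest.Rank(), nest.NumProcesses()))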

heplesser commented 8 years ago

@antolikjan I had a look at your output, and as I mentioned in #237, the --with-mpi switch is missing from your configure line. If you run the configure line from the first entry in this issue again, I expect the configuration report printed at the end to show "With MPI: No". In that case you get a NEST executable that is linked against the MPI libraries (because it was built with the MPI compiler wrappers), so mpirun will launch it as 32 parallel processes, but these never call MPI_Init() and thus do not know about each other. Each process assumes it is alone and simulates the entire network. I have made progress compiling with MPI properly and will report shortly on that in #237.
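For completeness, a sketch of the configure call from the first comment with only the missing switch added (all other flags unchanged, library paths still elided as in the original):

    configure --with-mpi CC=mpiicc CFLAGS="-openmp -mt_mpi -O3" CXX=mpiicpc CXXFLAGS="-openmp -mt_mpi -O3" LDFLAGS="-L/.../mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -L/.../compiler/lib/intel64 -liomp5"

After rebuilding, the configuration report at the end of configure should show "With MPI: Yes".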

jougs commented 8 years ago

I've added information about threading and MPI support to the output of Simulate in #269. Hopefully that will prevent problems like this.

@antolikjan: Can you please re-check and report back? Thanks!

heplesser commented 7 years ago

Closing due to inactivity.