paboyle / Grid

Data parallel C++ mathematical object library
GNU General Public License v2.0

large heap memory consumption in mpi mode #66

Closed azrael417 closed 7 years ago

azrael417 commented 7 years ago

I have the following problem:

[tkurth@gert01 GRID]$ tail -f slurm-3046972.out

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

Grid : Message        : Requesting 134217728 byte stencil comms buffers 
Grid : Message        : Grid is setup to use 32 threads
Grid : Message        : Making s innermost grids
Benchmark_dwf: ../../src/lib/communicator/Communicator_base.cc:49: void *Grid::CartesianCommunicator::ShmBufferMalloc(unsigned long): Assertion `heap_bytes<MAX_MPI_SHM_BYTES' failed.
 ShmBufferMalloc exceeded shared heap size -- try increasing with --shm <MB> flag
 Parameter specified in units of MB (megabytes) 
 Current value is 128
Benchmark_dwf: ../../src/lib/communicator/Communicator_base.cc:49: void *Grid::CartesianCommunicator::ShmBufferMalloc(unsigned long): Assertion `heap_bytes<MAX_MPI_SHM_BYTES' failed.
srun: error: nid02439: tasks 0-1: Aborted
srun: Terminating job step 3046972.0

The code hangs when it tries to make the innermost grids and then fails after 10 minutes. This is my run script:

[tkurth@gert01 GRID]$ cat benchmark_dwf.sh
#!/bin/bash
#SBATCH --ntasks-per-core=4
#SBATCH -N 1
#SBATCH -A mpccc
#SBATCH -p regular
#SBATCH -t 2:00:00
#SBATCH -C knl,quad,cache

export OMP_NUM_THREADS=32
export OMP_PLACES=threads
export OMP_PROC_BIND=spread

#MPI stuff
export MPICH_NEMESIS_ASYNC_PROGRESS=MC
export MPICH_MAX_THREAD_SAFETY=multiple
export MPICH_USE_DMAPP_COLL=1

srun -n 2 -c 136 --cpu_bind=cores ./install/grid_sp_mpi/bin/Benchmark_dwf --threads 32 --grid 32.32.32.32 --mpi 1.1.1.2 --dslash-asm --cacheblocking=4.2.2.1
[config.log.txt](https://github.com/paboyle/Grid/files/572824/config.log.txt)
[config.summary.txt](https://github.com/paboyle/Grid/files/572823/config.summary.txt)
  1. commit version

    commit c067051d5ff1a3f4c4dea0e72cc9b1b0ad092c7a
    Merge: bc248b6 afdeb2b
    Author: paboyle <paboyle@ph.ed.ac.uk>
    Date:   Wed Nov 2 13:59:18 2016 +0000
    
    Merge branch 'develop' into release/v0.6.0
  2. KNL bin1, Cray XC-40, Intel 16.0.3.210

  3. build script and configure

    
    #!/bin/bash -l

    # module loads
    module unload craype-haswell
    module load craype-mic-knl
    module load cray-memkind

    precision=single
    comms=mpi

    if [ "${precision}" == "single" ]; then
        installpath=$(pwd)/install/grid_sp_${comms}
    else
        installpath=$(pwd)/install/grid_dp_${comms}
    fi

    mkdir -p build
    cd build

    ../src/configure --prefix=${installpath} \
        --enable-simd=KNL \
        --enable-precision=${precision} \
        --enable-comms=${comms} \
        --host=x86_64-unknown-linux \
        --enable-mkl \
        CXX="CC" \
        CC="cc"

    #CXXFLAGS="-mkl -xMIC-AVX512 -std=c++11" \
    #CFLAGS="-mkl -xMIC-AVX512 -std=c99" \
    #LDFLAGS="-mkl -lmemkind"

    make -j12
    make install

    cd ..



  4. attached config.log

  5. attached config.summary

  6. no make.log, but hopefully that should not be necessary

azrael417 commented 7 years ago

That problem goes away with smaller local volumes, but I think the default size for this buffer is a bit too small.
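For reference, a smaller local volume just means a smaller --grid argument relative to the --mpi decomposition in my launch line above; a sketch with illustrative numbers (whether this particular volume fits the 128 MB default is an assumption on my part):

# same launch as above but with a 16.32.32.32 global volume; the smaller
# communicated faces should shrink the stencil buffer demand
srun -n 2 -c 136 --cpu_bind=cores ./install/grid_sp_mpi/bin/Benchmark_dwf \
    --threads 32 --grid 16.32.32.32 --mpi 1.1.1.2 --dslash-asm --cacheblocking=4.2.2.1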

paboyle commented 7 years ago

You should be able to increase it with the --shm 512 flag, as indicated in the message:

 ShmBufferMalloc exceeded shared heap size -- try increasing with --shm <MB> flag
 Parameter specified in units of MB (megabytes)
 Current value is 128

Agreed, the default of 128 is a bit small; a back-of-the-envelope calculation gives about 400 MB for 32^4, though that kind of estimate is prone to error.
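For the run script above, that just means appending the flag to the existing launch line; a sketch (512 is illustrative, not a tuned value). As a rough cross-check of the estimate, assuming Ls = 16 and 96-byte single-precision 4-spinors (both assumptions here), one 32^3 face is about 32^3 x 16 x 96 B ~ 48 MB, so a handful of send/receive faces quickly overruns the 128 MB default.

# as benchmark_dwf.sh above, but with the shared heap raised from 128 MB to 512 MB
srun -n 2 -c 136 --cpu_bind=cores ./install/grid_sp_mpi/bin/Benchmark_dwf \
    --threads 32 --grid 32.32.32.32 --mpi 1.1.1.2 \
    --dslash-asm --cacheblocking=4.2.2.1 --shm 512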

I'm afraid this ugliness is forced on us by the discovery that both Cray and OPA interconnects give more bandwidth when using two ranks per node, but run intra-node MPI fairly poorly. That is exactly the layout in the run above: with --mpi 1.1.1.2 and srun -n 2 on a single node, all halo exchange in the split direction is intra-node and goes through this shared-memory heap.

azrael417 commented 7 years ago

What definitely helps to cure both issues to some extent is to use thread-level comms to saturate bandwidth. However, it seems that Aries does that for you automatically when you leave free physical cores or even hyperthreads. We did some tests yesterday with QPhiX, and it seems to have automatic message progression. I also ran large-scale benchmarks with UMT (a radiation transport code) yesterday, and even though I asked for only 64 threads per node, the system thread utilization for large runs was at about 260 threads the whole time. So maybe that works quite well. Using core specialization should further help, but I have not tried that yet (it is a SLURM-specific feature); a sketch follows below.
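A minimal sketch of what core specialization could look like in the batch header above, assuming the site permits SLURM's --core-spec option (the value 4 is illustrative):

#SBATCH --ntasks-per-core=4
#SBATCH -N 1
#SBATCH -C knl,quad,cache
#SBATCH --core-spec=4   # core specialization: reserve 4 cores per node for
                        # OS/MPI progress threads instead of application work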
