qsimulate-open / bagel

Brilliantly Advanced General Electronic-structure Library
GNU General Public License v3.0

OpenMPI runtime tuning (rankfile) #184

Open VishalKJ opened 4 years ago

VishalKJ commented 4 years ago

Dear developers, I am observing quite different timings for a sample 'casscf+xmspt2' input when running BAGEL in parallel, depending on whether it is launched as just BAGEL or as mpirun -np 1 BAGEL. The node I am running on has two sockets with 14 cores per socket and hyperthreading enabled (56 logical cores reported by lscpu). With both methods of running, the output reports:

But when run without mpirun (i.e. just BAGEL), the times for {MOLECULE, CASSCF, SMITH} are {0.29, 9.88, 41.77}, while if the program is run as mpirun -np 1 BAGEL the times are {1.65, 35.14, 38.81}. This increase/variability in the MOLECULE and especially the CASSCF timings is consistent across multiple runs. Is this expected behaviour? In addition, what is the correct way to run BAGEL for maximum parallel performance?

BAGEL compiled with GCC 8.3.1 / MKL / OpenMPI 4.0.1, CFLAGS=-DNDEBUG -O3 -mavx2, with Boost 1.71.0.
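
For reference, the two launch modes being compared are of the following form (inputfile.json stands for the actual casscf+xmspt2 input and is a placeholder):

    # launched directly; BAGEL spawns its own threads
    BAGEL inputfile.json > output.log

    # launched through OpenMPI with a single rank
    mpirun -np 1 BAGEL inputfile.json > output.log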

VishalKJ commented 4 years ago

In addition, when I monitor the usage with htop, many more cores show activity when running just BAGEL, while with mpirun -np 1 BAGEL only one core seems active.
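
As a diagnostic (a suggestion, not part of the original report), OpenMPI can be asked to print its process-to-core bindings, which makes this difference visible directly:

    mpirun --report-bindings -np 1 BAGEL inputfile.json

If the reported binding covers only a single core, every thread spawned by BAGEL ends up pinned to that one core.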

shiozaki commented 4 years ago

We note in the manual that we strongly discourage the use of OpenMPI; at least in the past, OpenMPI has had bugs or issues related to threading. Please use Intel MPI instead (it is free), or MVAPICH, though the latter sometimes requires careful settings for MKL's threading. I have not observed such behavior.


VishalKJ commented 4 years ago

Thanks for your reply, Dr. Shiozaki. However, we managed to resolve the issue; I am documenting it here so that future readers benefit.

If the program is run with 'mpirun -np 1 BAGEL', OpenMPI only reserves one core for the MPI process. This subsequently leads to overbooking of that core with BAGEL_NUM_THREADS threads. The problem can be alleviated by using rankfiles, which specify how to book slots for MPI processes. For example, suppose I want to run just one MPI process and use the hyperthreading functionality to fully occupy all 56 hardware threads.

numactl -H gives me the layout:

    available: 2 nodes (0-1)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 28 29 30 31 32 33 34 35 36 37 38 39 40 41
    node 0 size: 65436 MB
    node 0 free: 42567 MB
    node 1 cpus: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 42 43 44 45 46 47 48 49 50 51 52 53 54 55

We can read this output as follows: on socket 0, CPUs 0-13 are the separate physical cores and 28-41 are their hyperthreaded siblings. This means threads (0,28) are on the same core, (1,29) are on the same core, and so on.
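
One way to double-check this sibling pairing (a suggestion, not part of the original comment) is to query the kernel topology files directly:

    cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

On this layout it should print 0,28, confirming that CPUs 0 and 28 share a physical core.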

So we build our rankfile as follows:

    cat rankfile_mpi1
    rank 0=hostname slot=0-27

In this rankfile we have booked all the physical cores on both sockets, so our mpirun command now has access to all of them. If we additionally specify BAGEL_NUM_THREADS/MKL_NUM_THREADS=56, 56 threads are launched for this MPI process, thus taking full advantage of all hyperthreaded cores. We can run this with:

    mpirun -np 1 -rf rankfile_mpi1 BAGEL inputfile.json
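
Since BAGEL_NUM_THREADS and MKL_NUM_THREADS are environment variables, a complete single-rank launch would then look roughly as follows (a sketch based on the settings above, with inputfile.json as a placeholder):

    export BAGEL_NUM_THREADS=56
    export MKL_NUM_THREADS=56
    mpirun -np 1 -rf rankfile_mpi1 BAGEL inputfile.json > output.log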

VishalKJ commented 4 years ago

To run two MPI processes, one on each socket/NUMA node, the corresponding rankfile is:

    cat rankfile_mpi2
    rank 0=argo2 slot=0-13
    rank 1=argo2 slot=14-27

Run it with:

    mpirun -np 2 -rf rankfile_mpi2 BAGEL inputfile.json
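
By analogy with the single-rank case above, the per-rank thread count would then be reduced to match the slots booked for each rank; the value of 28 below (14 physical cores per socket plus their hyperthread siblings) is my assumption rather than something stated in the thread:

    export BAGEL_NUM_THREADS=28
    export MKL_NUM_THREADS=28
    mpirun -np 2 -rf rankfile_mpi2 BAGEL inputfile.json > output.log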

shiozaki commented 4 years ago

Thanks - good to know that worked out for you. Will leave this open so others may see it.