[Open] VishalKJ opened this issue 4 years ago
In addition, when I monitor usage with htop: in the case of plain `BAGEL` many more cores show activity, while with `mpirun -np 1 BAGEL` only one core seems active.
We wrote in the manual that we strongly discourage use of Open MPI; at least in the past, Open MPI has had bugs or issues related to threading. Please use Intel MPI instead (it's free), or MVAPICH, though the latter sometimes requires careful settings for MKL's threading. I have not observed such behavior.
Thanks for your reply, Dr. Shiozaki. We managed to resolve the issue in the meantime; I am documenting it here so that future readers benefit.
If the program is run as `mpirun -np 1 BAGEL`, Open MPI by default binds the process to a single core (its default binding policy for one or two ranks is bind-to-core), so all of BAGEL's threads pile onto that one core. The fix is to control the binding explicitly with a rankfile.
`numactl -H` gives me the layout:

```
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 28 29 30 31 32 33 34 35 36 37 38 39 40 41
node 0 size: 65436 MB
node 0 free: 42567 MB
node 1 cpus: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 42 43 44 45 46 47 48 49 50 51 52 53 54 55
```

We can read this output as: on socket 0, logical CPUs 0-13 are the physical cores and 28-41 are their hyperthread siblings. This means (0,28) are on the same core, (1,29) are on the same core, and so on.
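The pairing described above can be derived mechanically from the `node 0 cpus:` line; a minimal sketch, assuming the 14-cores-per-socket layout reported by `numactl` in this thread:

```shell
# Socket-0 CPU list copied from the numactl output above: the first 14
# entries are physical cores, the next 14 are their hyperthread siblings.
node0_cpus="0 1 2 3 4 5 6 7 8 9 10 11 12 13 28 29 30 31 32 33 34 35 36 37 38 39 40 41"
n=14   # physical cores per socket (assumption matching this machine)
i=1
while [ "$i" -le "$n" ]; do
  phys=$(echo "$node0_cpus" | cut -d' ' -f"$i")          # i-th physical core
  sib=$(echo "$node0_cpus" | cut -d' ' -f"$((i + n))")   # its HT sibling
  echo "($phys,$sib)"
  i=$((i + 1))
done
```

This prints the pairs `(0,28)` through `(13,41)`, confirming the reading of the `numactl` output.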
So we build our rankfile as follows:

```
$ cat rankfile_mpi1
rank 0=hostname slot=0-27
```
In this rankfile we have booked all the physical cores on both sockets, so the MPI process now has access to every physical core. If we additionally set `BAGEL_NUM_THREADS=56` and `MKL_NUM_THREADS=56`, 56 threads are launched for this MPI process, fully exploiting the hyperthreaded cores as well. We run this with: `mpirun -np 1 -rf rankfile_mpi1 BAGEL inputfile.json`
To run two MPI processes, one on each socket/NUMA node, the corresponding rankfile is:

```
$ cat rankfile_mpi2
rank 0=argo2 slot=0-13
rank 1=argo2 slot=14-27
```

run with `mpirun -np 2 -rf rankfile_mpi2 BAGEL inputfile.json`
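The per-socket slot ranges above generalize to other core counts; a small sketch that computes them (14 physical cores per socket and the hostname `argo2` are taken from this thread, and are assumptions about your machine otherwise):

```shell
# Emit one rankfile line per socket: rank r gets physical cores
# [r*cps, (r+1)*cps - 1], where cps = physical cores per socket.
cps=14          # physical cores per socket on this node (assumption)
host="argo2"    # hostname from the thread; replace with your own
for rank in 0 1; do
  lo=$((rank * cps))
  hi=$((lo + cps - 1))
  echo "rank ${rank}=${host} slot=${lo}-${hi}"
done
```

Redirecting this loop into `rankfile_mpi2` reproduces the file shown above exactly.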
Thanks - good to know that worked out for you. Will leave this open so others may see it.
Dear developers, I am observing quite different timings for a sample 'casscf+xmspt2' input when running BAGEL in parallel as plain `BAGEL` versus `mpirun -np 1 BAGEL`. The node I am running on has two sockets with 14 cores on each socket and hyperthreading enabled (56 logical CPUs in total reported by lscpu). With both of the aforementioned methods of running, the output reports:
However, when run without mpirun (i.e. just `BAGEL`), the times for {MOLECULE, CASSCF, SMITH} are {0.29, 9.88, 41.77}, while when run as `mpirun -np 1 BAGEL` they are {1.65, 35.14, 38.81}. These increases in the MOLECULE and especially the CASSCF section are consistent across multiple runs. Is this expected behaviour? In addition, what is the correct way to run BAGEL for maximum parallel performance?
BAGEL compiled with GCC 8.3.1 / MKL / Open MPI 4.0.1, `CFLAGS=-DNDEBUG -O3 -mavx2`, with Boost 1.71.0.