shankar1729 / jdftx

JDFTx: software for joint density functional theory
http://jdftx.org

BIG trouble with thread placement #302

Closed — ximik69 closed this 6 months ago

ximik69 commented 9 months ago

Hello dear Shankar.

Despite some good experience with JDFTx, I recently ran into trouble with thread placement. I asked the local sysadmins for help, but after a long time playing with different options we did not manage to run the program properly. I run JDFTx on a cluster with Intel processors, managed by Slurm. There are 2 sockets per node and 2 NUMA nodes per socket, i.e. 4 NUMA nodes x 12 cores each. JDFTx was compiled with gcc 11.2, MKL, OpenMPI, GSL, libxc and ScaLAPACK.

Symptoms: JDFTx starts the proper number of threads, but only a few cores are loaded; the others work for a brief amount of time and then idle. Example: starting 4 MPI processes with 12 threads each leads to 4 loaded cores (# 0, 4, 24, 28) with the others idling. I tried non-threaded MKL, recompiling OpenMPI, and using --bind-to none in the mpiexec/mpirun arguments, but nothing changed the situation.

Short summary:

- MPI processes x 1 thread each: works
- OpenMP threads only, without MPI: works
- MPI + OpenMP: thread placement problem
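One way to see this symptom directly (a diagnostic sketch, not from the original report: the `show_affinity` helper is hypothetical, and looking up the jdftx PID with `pgrep` is the caller's job) is to print the allowed-CPU list of every thread of a process via /proc:

```shell
#!/bin/sh
# show_affinity: print the allowed-CPU list for every thread of a process.
# If all threads report the same narrow list (e.g. "0,4,24,28"), the
# placement problem described above is confirmed.
show_affinity() {
    pid="${1:-$$}"   # default: this shell, for a self-test
    for t in /proc/"$pid"/task/*; do
        printf 'tid %s allowed cpus: %s\n' "${t##*/}" \
            "$(awk '/Cpus_allowed_list/{print $2}' "$t"/status)"
    done
}
show_affinity   # on the compute node: show_affinity "$(pgrep -n jdftx)"
```

This only needs a plain shell on the compute node, so it works even when htop is unavailable.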

Below is the list of libraries to which JDFTx is linked.

ldd `which jdftx`:

    linux-vdso.so.1 (0x00007ffd88fed000)
    libjdftx.so => /net/software/testing/software/jdftx/1.7.0-foss-2021b-mkl/lib/libjdftx.so (0x00001522929c3000)
    libmpi_cxx.so.40 => /net/software/testing/software/OpenMPI/4.1.1-GCC-11.2.0/lib/libmpi_cxx.so.40 (0x0000152293332000)
    libmpi.so.40 => /net/software/testing/software/OpenMPI/4.1.1-GCC-11.2.0/lib/libmpi.so.40 (0x0000152293206000)
    libgsl.so.25 => /net/software/testing/software/GSL/2.7-GCC-11.2.0/lib/libgsl.so.25 (0x000015229253b000)
    libmkl_scalapack_lp64.so.1 => /net/software/testing/software/imkl/2021.4.0/mkl/2021.4.0/lib/intel64/libmkl_scalapack_lp64.so.1 (0x0000152291e0e000)
    libmkl_gf_lp64.so.1 => /net/software/testing/software/imkl/2021.4.0/mkl/2021.4.0/lib/intel64/libmkl_gf_lp64.so.1 (0x0000152291270000)
    libmkl_blacs_openmpi_lp64.so.1 => /net/software/testing/software/imkl/2021.4.0/mkl/2021.4.0/lib/intel64/libmkl_blacs_openmpi_lp64.so.1 (0x00001522931bc000)
    libmkl_intel_lp64.so.1 => /net/software/testing/software/imkl/2021.4.0/mkl/2021.4.0/lib/intel64/libmkl_intel_lp64.so.1 (0x00001522906d1000)
    libmkl_gnu_thread.so.1 => /net/software/testing/software/imkl/2021.4.0/mkl/2021.4.0/lib/intel64/libmkl_gnu_thread.so.1 (0x000015228eb46000)
    libmkl_core.so.1 => /net/software/testing/software/imkl/2021.4.0/mkl/2021.4.0/lib/intel64/libmkl_core.so.1 (0x000015228a6d8000)
    libgomp.so.1 => /net/software/testing/software/GCCcore/11.2.0/lib64/libgomp.so.1 (0x0000152293175000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x000015228a4b8000)
    libxc.so.9 => /net/software/testing/software/libxc/5.1.6-GCC-11.2.0/lib/libxc.so.9 (0x0000152289b94000)
    libstdc++.so.6 => /net/software/testing/software/GCCcore/11.2.0/lib64/libstdc++.so.6 (0x0000152289968000)
    libm.so.6 => /lib64/libm.so.6 (0x00001522895e6000)
    libgcc_s.so.1 => /net/software/testing/software/GCCcore/11.2.0/lib64/libgcc_s.so.1 (0x00001522895cc000)
    libc.so.6 => /lib64/libc.so.6 (0x0000152289207000)
    libopen-rte.so.40 => /net/software/testing/software/OpenMPI/4.1.1-GCC-11.2.0/lib/libopen-rte.so.40 (0x000015228914f000)
    libopen-orted-mpir.so => /net/software/testing/software/OpenMPI/4.1.1-GCC-11.2.0/lib/libopen-orted-mpir.so (0x0000152293168000)
    libopen-pal.so.40 => /net/software/testing/software/OpenMPI/4.1.1-GCC-11.2.0/lib/libopen-pal.so.40 (0x000015228909f000)
    librt.so.1 => /lib64/librt.so.1 (0x0000152288e97000)
    libutil.so.1 => /lib64/libutil.so.1 (0x0000152288c93000)
    libhwloc.so.15 => /net/software/testing/software/hwloc/2.5.0-GCCcore-11.2.0/lib/libhwloc.so.15 (0x0000152288c38000)
    libpciaccess.so.0 => /net/software/testing/software/libpciaccess/0.16-GCCcore-11.2.0/lib/libpciaccess.so.0 (0x000015229315c000)
    libxml2.so.2 => /net/software/testing/software/libxml2/2.9.10-GCCcore-11.2.0/lib/libxml2.so.2 (0x0000152288aca000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00001522888c6000)
    libz.so.1 => /net/software/testing/software/zlib/1.2.11-GCCcore-11.2.0/lib/libz.so.1 (0x00001522888ad000)
    liblzma.so.5 => /net/software/testing/software/XZ/5.2.5-GCCcore-11.2.0/lib/liblzma.so.5 (0x0000152288885000)
    libevent_core-2.1.so.7 => /net/software/testing/software/libevent/2.1.12-GCCcore-11.2.0/lib/libevent_core-2.1.so.7 (0x000015228884e000)
    libevent_pthreads-2.1.so.7 => /net/software/testing/software/libevent/2.1.12-GCCcore-11.2.0/lib/libevent_pthreads-2.1.so.7 (0x000015228884a000)
    /lib64/ld-linux-x86-64.so.2 (0x000015229312c000)

ximik69 commented 9 months ago

Below is the startup script:

#!/bin/bash
#
# job time, change for what your job requires
#SBATCH -t 00:5:00
#SBATCH -A plgzl3a-cpu
# requesting the number of nodes needed
#SBATCH -N 1
#####SBATCH --exclusive
#SBATCH -p plgrid-testing
#SBATCH --tasks-per-node=4
#SBATCH --cpus-per-task=12
# job name
#SBATCH -J JDFTx_test
#
# filenames stdout and stderr - customise, include %j
#SBATCH -o process_%j.out
#SBATCH -e process_%j.err

# write this script to stdout-file - useful for scripting errors
cat $0

# load the modules required for you program - customise for your program

cd $SLURM_SUBMIT_DIR

. /net/people/plgrid/plgigoro/.bashrc

module load jdftx/1.7.0-foss-2021b-mkl
export OMP_DISPLAY_ENV=TRUE
export OMP_DISPLAY_AFFINITY=TRUE
export OMP_NUM_THREADS=12
export OMP_PROC_BIND=TRUE
dbg=""
runit="mpiexec --report-bindings jdftx $dbg -m -i X.in -o X.out"
$runit
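Since attaching to the compute node can be difficult, one option would be to record thread placement from inside the job script itself. A sketch (an assumption, not part of the original script: it replaces the final `$runit` line above, and the 30 s delay is an arbitrary choice):

```shell
$runit &                                # run JDFTx in the background
sleep 30                                # let the worker threads spin up
for p in $(pgrep jdftx); do
    echo "=== jdftx pid $p ==="
    ps -o tid,psr,pcpu,comm -Lp "$p"    # psr: CPU each thread last ran on
done
wait                                    # keep the job alive until JDFTx finishes
```

The per-thread `psr` column then lands in the job's stdout file, so the placement can be checked after the fact.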

shankar1729 commented 9 months ago

Can you try OpenMPI with threads using mpirun --bind-to none directly on the compute node, without SLURM? Could be a clash between the affinity settings in SLURM and OpenMPI. If we can confirm that, we can see what to change.
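For reference, the two things to compare might look like this (a sketch under assumptions: flag spellings are standard OpenMPI/Slurm options, but the task counts are taken from the job file above and may need adjusting):

```shell
# (a) OpenMPI binding only, bypassing SLURM's launcher (run on the node):
mpirun --bind-to none --report-bindings -n 4 jdftx -c 12 -i X.in -o X.out

# (b) SLURM binding only, letting srun place the tasks instead of OpenMPI:
srun --cpu-bind=verbose,cores --ntasks=4 --cpus-per-task=12 jdftx -c 12 -i X.in -o X.out
# or disable SLURM's binding entirely:
srun --cpu-bind=none --ntasks=4 --cpus-per-task=12 jdftx -c 12 -i X.in -o X.out
```

If (a) behaves correctly but (b) does not, that would point to the clash between the two affinity layers.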

Also, did you try tweaking the cpu binding flags in sbatch/srun instead of the openMPI ones?

Best, Shankar

shankar1729 commented 9 months ago

Also, in response to the job file above: the OMP flags have no effect on most of JDFTx, which is parallelized using pthreads, not OpenMP. They may only affect threads within MKL. I would recommend linking JDFTx with ThreadedBlas=no for MKL, so that JDFTx takes care of all threading, at least for the initial debugging.
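Concretely, that would mean reconfiguring the build along these lines (a sketch: the option name follows the suggestion above, and the exact capitalization and any MKL path variables should be checked against the JDFTx CMake configuration for this cluster):

```shell
# Rebuild JDFTx with MKL's internal threading disabled, so JDFTx's own
# pthreads do all the work:
cmake -D EnableMKL=yes -D ThreadedBLAS=no ../jdftx-src
make -j12
```

With sequential MKL linked in, any remaining misplacement would have to come from the MPI/Slurm binding layer rather than from competing thread pools.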

Best, Shankar

ximik69 commented 9 months ago

Dear Shankar, thanks a lot for such a swift answer!!!

I started mpiexec --report-bindings --bind-to none jdftx $dbg -m -i X.in -o X.out

Unfortunately I cannot ssh to the running node or attach to it using srun -N1 -n1 --jobid=$1 --pty /bin/bash.

I'll inspect what the program writes.

ximik69 commented 9 months ago

I tried to run this: srun -p plgrid-testing -t 50 -A plgzl3a-cpu -N 1 -n 24 --pty /bin/bash

then: mpiexec --report-bindings --bind-to none -n 2 jdftx $dbg -m -c 12 -i X.in | tee X.out

I can see that the iteration times are not bad, but I cannot watch the run simultaneously with htop.

GitHub also screws up the attachments, so I have to post them as-is.

X.out:

*************** JDFTx 1.7.0  ***************

Start date and time: Thu Oct  5 15:32:41 2023
Executable jdftx with command-line: -m -c 12 -i X.in
Running on hosts (process indices):  ac0619 (0-1)
Divided in process groups (process indices):  0 (0)  1 (1)
Resource initialization completed at t[s]:      0.00
Run totals: 2 processes, 24 threads, 0 GPUs

Input parsed successfully to the following command list (including defaults):

basis kpoint-dependent
cache-projectors no
coords-type Lattice
core-overlap-check vector
coulomb-interaction Periodic
davidson-band-ratio 1.1
density-of-states Etol 1.000000e-06 Esigma 0.000000e+00 \
        Complete \
    Total \
        Occupied \
    Total
dump End State IonicPositions Lattice EigStats DOS
dump Electronic State Forces
dump Ionic State IonicPositions Forces Lattice
dump-interval Electronic 8
dump-name X.$var
elec-cutoff 10
elec-eigen-algo Davidson
elec-ex-corr gga-PBE
elec-smearing Fermi 0.0095
electronic-minimize  \
    dirUpdateScheme      FletcherReeves \
    linminMethod         DirUpdateRecommended \
    nIterations          2000 \
    history              15 \
    knormThreshold       0 \
    energyDiffThreshold  1e-06 \
    nEnergyDiff          2 \
    alphaTstart          1 \
    alphaTmin            1e-10 \
    updateTestStepSize   yes \
    alphaTreduceFactor   0.1 \
    alphaTincreaseFactor 3 \
    nAlphaAdjustMax      3 \
    wolfeEnergy          0.0001 \
    wolfeGradient        0.9 \
    fdTest               no
exchange-regularization WignerSeitzTruncated
fluid None
fluid-ex-corr (null) lda-PZ
fluid-gummel-loop 10 1.000000e-05
fluid-minimize  \
    dirUpdateScheme      PolakRibiere \
    linminMethod         DirUpdateRecommended \
    nIterations          100 \
    history              15 \
    knormThreshold       0 \
    energyDiffThreshold  0 \
    nEnergyDiff          2 \
    alphaTstart          1 \
    alphaTmin            1e-10 \
    updateTestStepSize   yes \
    alphaTreduceFactor   0.1 \
    alphaTincreaseFactor 3 \
    nAlphaAdjustMax      3 \
    wolfeEnergy          0.0001 \
    wolfeGradient        0.9 \
    fdTest               no
fluid-solvent H2O 55.338 ScalarEOS \
    epsBulk 78.4 \
    pMol 0.92466 \
    epsInf 1.77 \
    Pvap 1.06736e-10 \
    sigmaBulk 4.62e-05 \
    Rvdw 2.61727 \
    Res 1.42 \
    tauNuc 343133 \
    poleEl 15 7 1
forces-output-coords Positions
initial-state X.$var
ion Si   0.061602711441747   0.292062989005408   0.186816515940134 1
ion Si   0.938397257183252   0.707937011744592   0.813183447809866 1
ion Si   0.938397257183252   0.292062989005408   0.313183447809866 1
ion Si   0.061602711441747   0.707937011744592   0.686816515940134 1
ion Si   0.561602711441747   0.792062989005408   0.186816515940134 1
ion Si   0.438397257183252   0.207937011744592   0.813183447809866 1
ion Si   0.438397257183252   0.792062989005408   0.313183447809866 1
ion Si   0.561602711441747   0.207937011744592   0.686816515940134 1
ion Si   0.098290478501462   0.037391893097691   0.357765039787784 1
ion Si   0.901709490123537   0.962608107652309   0.642234923962216 1
ion Si   0.901709490123537   0.037391893097691   0.142234923962216 1
ion Si   0.098290478501462   0.962608107652309   0.857765039787784 1
ion Si   0.598290478501462   0.537391893097691   0.357765039787784 1
ion Si   0.401709490123537   0.462608107652309   0.642234923962216 1
ion Si   0.401709490123537   0.537391893097691   0.142234923962216 1
ion Si   0.598290478501462   0.462608107652309   0.857765039787784 1
ion Na   0.351190096562829   0.338260596131203   0.358380793977603 1
ion Na   0.648809872062171   0.661739404618797   0.641619169772397 1
ion Na   0.648809872062171   0.338260596131203   0.141619169772397 1
ion Na   0.351190096562829   0.661739404618797   0.858380793977603 1
ion Na   0.851190096562829   0.838260596131203   0.358380793977603 1
ion Na   0.148809872062171   0.161739404618797   0.641619169772397 1
ion Na   0.148809872062171   0.838260596131203   0.141619169772397 1
ion Na   0.851190096562829   0.161739404618797   0.858380793977603 1
ion Na   0.369136137512290   0.093767271292924   0.048717562509565 1
ion Na   0.630863831112710   0.906232729457076   0.951282401240435 1
ion Na   0.630863831112710   0.093767271292924   0.451282401240435 1
ion Na   0.369136137512290   0.906232729457076   0.548717562509565 1
ion Na   0.869136137512290   0.593767271292924   0.048717562509565 1
ion Na   0.130863831112710   0.406232729457076   0.951282401240435 1
ion Na   0.130863831112710   0.593767271292924   0.451282401240435 1
ion Na   0.869136137512290   0.406232729457076   0.548717562509565 1
ion-species SG15/$ID_ONCV_PBE.upf
ion-width 0
ionic-minimize  \
    dirUpdateScheme      L-BFGS \
    linminMethod         DirUpdateRecommended \
    nIterations          0 \
    history              15 \
    knormThreshold       0.0001 \
    energyDiffThreshold  1e-06 \
    nEnergyDiff          2 \
    alphaTstart          1 \
    alphaTmin            1e-10 \
    updateTestStepSize   yes \
    alphaTreduceFactor   0.1 \
    alphaTincreaseFactor 3 \
    nAlphaAdjustMax      3 \
    wolfeEnergy          0.0001 \
    wolfeGradient        0.9 \
    fdTest               no
kpoint   0.000000000000   0.000000000000   0.000000000000  1.00000000000000
kpoint-folding 2 3 2 
latt-move-scale 1 1 1
latt-scale 1 1 1 
lattice  \
      22.928577882949284    0.000000000000000  -10.094230904711511  \
       0.000000000000000   12.457297591368569    0.000000000000000  \
       0.069314861190520    0.000000000000000   18.327192900226532 
lattice-minimize  \
    dirUpdateScheme      L-BFGS \
    linminMethod         DirUpdateRecommended \
    nIterations          0 \
    history              15 \
    knormThreshold       0 \
    energyDiffThreshold  1e-06 \
    nEnergyDiff          2 \
    alphaTstart          1 \
    alphaTmin            1e-10 \
    updateTestStepSize   yes \
    alphaTreduceFactor   0.1 \
    alphaTincreaseFactor 3 \
    nAlphaAdjustMax      3 \
    wolfeEnergy          0.0001 \
    wolfeGradient        0.9 \
    fdTest               no
lcao-params -1 1e-06 0.0095
pcm-variant GLSSA13
spintype no-spin
subspace-rotation-factor 1 yes
symmetries automatic
symmetry-threshold 0.001

---------- Setting up symmetries ----------

Non-trivial transmission matrix:
[   1   0   0  ]
[   0   1   0  ]
[   1   0   1  ]
with reduced lattice vectors:
[     12.834347      0.000000    -10.094231  ]
[      0.000000     12.457298      0.000000  ]
[     18.396508      0.000000     18.327193  ]

Found 4 point-group symmetries of the bravais lattice
Found 8 space-group symmetries with basis
Applied RMS atom displacement 5.57399e-15 bohrs to make symmetries exact.

---------- Initializing the Grid ----------
R = 
[      22.9286            0     -10.0942  ]
[            0      12.4573            0  ]
[    0.0693149            0      18.3272  ]
unit cell volume = 5243.48
G =
[   0.273577         -0   0.150681  ]
[          0   0.504378         -0  ]
[ -0.00103469          0   0.342264  ]
Minimum fftbox size, Smin = [  68  36  60  ]
Chosen fftbox size, S = [  70  36  60  ]

---------- Exchange Correlation functional ----------
Initalized PBE GGA exchange.
Initalized PBE GGA correlation.

---------- Setting up pseudopotentials ----------
Width of ionic core gaussian charges (only for fluid interactions / plotting) set to 0

Reading pseudopotential file '/net/software/testing/software/jdftx/1.7.0-foss-2021b-mkl/share/jdftx/pseudopotentials/SG15/Si_ONCV_PBE.upf':
  'Si' pseudopotential, 'PBE' functional
  Generated using ONCVPSP code by D. R. Hamann
  Author: Martin Schlipf and Francois Gygi  Date: 150915.
  4 valence electrons, 2 orbitals, 4 projectors, 1510 radial grid points, with lMax = 1
  Transforming local potential to a uniform radial grid of dG=0.02 with 1024 points.
  Transforming nonlocal projectors to a uniform radial grid of dG=0.02 with 307 points.
    3S    l: 0   occupation:  2.0   eigenvalue: -0.397365
    3P    l: 1   occupation:  2.0   eigenvalue: -0.149981
  Transforming atomic orbitals to a uniform radial grid of dG=0.02 with 307 points.
  Core radius for overlap checks: 2.98 bohrs.
  Reading pulay file /net/software/testing/software/jdftx/1.7.0-foss-2021b-mkl/share/jdftx/pseudopotentials/SG15/Si_ONCV_PBE.pulay ... using dE_dnG = -1.274872e-03 computed for Ecut = 10.

Reading pseudopotential file '/net/software/testing/software/jdftx/1.7.0-foss-2021b-mkl/share/jdftx/pseudopotentials/SG15/Na_ONCV_PBE.upf':
  'Na' pseudopotential, 'PBE' functional
  Generated using ONCVPSP code by D. R. Hamann
  Author: Martin Schlipf and Francois Gygi  Date: 150915.
  9 valence electrons, 3 orbitals, 4 projectors, 1992 radial grid points, with lMax = 1
  Transforming local potential to a uniform radial grid of dG=0.02 with 1024 points.
  Transforming nonlocal projectors to a uniform radial grid of dG=0.02 with 307 points.
    2S    l: 0   occupation:  2.0   eigenvalue: -2.085640
    2P    l: 1   occupation:  6.0   eigenvalue: -1.053708
    3S    l: 0   occupation:  1.0   eigenvalue: -0.100838
  Transforming atomic orbitals to a uniform radial grid of dG=0.02 with 307 points.
  Core radius for overlap checks: 2.01 bohrs.
  Reading pulay file /net/software/testing/software/jdftx/1.7.0-foss-2021b-mkl/share/jdftx/pseudopotentials/SG15/Na_ONCV_PBE.pulay ... using dE_dnG = -1.681960e+00 computed for Ecut = 10.

Initialized 2 species with 32 total atoms.

Folded 1 k-points by 2x3x2 to 12 k-points.

---------- Setting up k-points, bands, fillings ----------
Reduced to 8 k-points under symmetry. 
Computing the number of bands and number of electrons
Calculating initial fillings.
nElectrons: 208.000000   nBands: 144   nStates: 8

----- Setting up reduced wavefunction bases (one per k-point) -----
average nbasis = 7919.083 , ideal nbasis = 7919.786

----- Initializing Supercell corresponding to k-point mesh -----
Lattice vector linear combinations in supercell:
[   2   0   0  ]
[   0   3   0  ]
[   0   0   2  ]
Supercell lattice vectors:
[  45.8572  0  -20.1885  ]
[  0  37.3719  0  ]
[  0.13863  0  36.6544  ]

---------- Setting up ewald sum ----------
Optimum gaussian width for ewald sums = 3.889779 bohr.
Real space sum over 567 unit cells with max indices [  3  4  4  ]
Reciprocal space sum over 5733 terms with max indices [  10  6  10  ]

---------- Allocating electronic variables ----------
Initializing wave functions:  linear combination of atomic orbitals
Si pseudo-atom occupations:   s ( 2 )  p ( 2 )
Na pseudo-atom occupations:   s ( 2 1 )  p ( 6 )
    FillingsUpdate:  mu: +0.201822279  nElectrons: 208.000000
LCAOMinimize: Iter:   0  F: -729.2090650262059626  |grad|_K:  1.689e-03  alpha:  1.000e+00
    FillingsUpdate:  mu: +0.181357224  nElectrons: 208.000000
LCAOMinimize: Iter:   1  F: -729.3693863155830286  |grad|_K:  3.575e-04  alpha:  6.002e-01  linmin: -2.079e-02  cgtest:  1.523e-01  t[s]:      2.90
    FillingsUpdate:  mu: +0.181985758  nElectrons: 208.000000
LCAOMinimize: Iter:   2  F: -729.3766674728333328  |grad|_K:  1.384e-04  alpha:  6.722e-01  linmin:  1.997e-03  cgtest: -2.733e-02  t[s]:      3.96
    FillingsUpdate:  mu: +0.181879253  nElectrons: 208.000000
LCAOMinimize: Iter:   3  F: -729.3773034865737372  |grad|_K:  2.609e-05  alpha:  3.977e-01  linmin: -1.111e-04  cgtest:  4.285e-02  t[s]:      4.90
    FillingsUpdate:  mu: +0.181657222  nElectrons: 208.000000
LCAOMinimize: Iter:   4  F: -729.3773419649966172  |grad|_K:  6.791e-06  alpha:  6.718e-01  linmin: -4.698e-04  cgtest:  1.067e-03  t[s]:      5.88
    FillingsUpdate:  mu: +0.181726236  nElectrons: 208.000000
LCAOMinimize: Iter:   5  F: -729.3773446574464288  |grad|_K:  9.288e-07  alpha:  6.934e-01  linmin: -1.074e-04  cgtest: -1.541e-02  t[s]:      6.94
    FillingsUpdate:  mu: +0.181726671  nElectrons: 208.000000
LCAOMinimize: Iter:   6  F: -729.3773446824303619  |grad|_K:  1.622e-07  alpha:  3.443e-01  linmin: -7.979e-06  cgtest: -6.935e-04  t[s]:      7.95
    FillingsUpdate:  mu: +0.181725395  nElectrons: 208.000000
LCAOMinimize: Iter:   7  F: -729.3773446840687029  |grad|_K:  3.444e-08  alpha:  7.441e-01  linmin:  5.051e-03  cgtest: -2.493e-02  t[s]:      8.94
LCAOMinimize: Converged (|Delta F|<1.000000e-06 for 2 iters).

---- Citations for features of the code used in this run ----

   Software package:
      R. Sundararaman, K. Letchworth-Weaver, K.A. Schwarz, D. Gunceler, Y. Ozhabes and T.A. Arias, 'JDFTx: software for joint density-functional theory', SoftwareX 6, 278 (2017)

   gga-PBE exchange-correlation functional:
      J.P. Perdew, K. Burke and M. Ernzerhof, Phys. Rev. Lett. 77, 3865 (1996)

   Pseudopotentials:
      M Schlipf and F Gygi, Comput. Phys. Commun. 196, 36 (2015)

   Total energy minimization with Auxiliary Hamiltonian:
      C. Freysoldt, S. Boeck, and J. Neugebauer, Phys. Rev. B 79, 241103(R) (2009)

   Linear-tetrahedron sampling for density of states:
      G. Lehmann and M. Taut, Phys. status solidi (b) 54, 469 (1972)

This list may not be complete. Please suggest additional citations or
report any other bugs at https://github.com/shankar1729/jdftx/issues

Initialization completed successfully at t[s]:      9.37

-------- Electronic minimization -----------
    FillingsUpdate:  mu: +0.181725395  nElectrons: 208.000000
ElecMinimize: Iter:   0  F: -729.377344684068930  |grad|_K:  5.833e-04  alpha:  1.000e+00
    FillingsUpdate:  mu: +0.174884513  nElectrons: 208.000000
    SubspaceRotationAdjust: set factor to 1
ElecMinimize: Iter:   1  F: -730.620281943531040  |grad|_K:  1.901e-04  alpha:  3.972e-01  linmin:  4.841e-04  t[s]:     12.41
    FillingsUpdate:  mu: +0.169771004  nElectrons: 208.000000
    SubspaceRotationAdjust: set factor to 1.03
ElecMinimize: Iter:   2  F: -730.837248274446893  |grad|_K:  1.239e-04  alpha:  6.529e-01  linmin: -7.589e-05  t[s]:     14.54
    FillingsUpdate:  mu: +0.164481604  nElectrons: 208.000000
    SubspaceRotationAdjust: set factor to 1.06
ElecMinimize: Iter:   3  F: -730.931821771194222  |grad|_K:  7.164e-05  alpha:  6.679e-01  linmin: -7.699e-06  t[s]:     16.39
    FillingsUpdate:  mu: +0.162754975  nElectrons: 208.000000
    SubspaceRotationAdjust: set factor to 1.1
ElecMinimize: Iter:   4  F: -730.958002708480421  |grad|_K:  4.985e-05  alpha:  5.536e-01  linmin:  3.191e-05  t[s]:     18.70
    FillingsUpdate:  mu: +0.162285216  nElectrons: 208.000000
    SubspaceRotationAdjust: set factor to 1.13
ElecMinimize: Iter:   5  F: -730.970350330141514  |grad|_K:  3.494e-05  alpha:  5.406e-01  linmin:  5.257e-05  t[s]:     20.82
    FillingsUpdate:  mu: +0.161943814  nElectrons: 208.000000
    SubspaceRotationAdjust: set factor to 1.16
ElecMinimize: Iter:   6  F: -730.976393429656810  |grad|_K:  2.271e-05  alpha:  5.401e-01  linmin:  4.563e-05  t[s]:     22.79
    FillingsUpdate:  mu: +0.161302037  nElectrons: 208.000000

Dumping 'X.fillings' ... done
Dumping 'X.wfns' ... done
Dumping 'X.force' ... done
Dumping 'X.eigenvals' ... done
    SubspaceRotationAdjust: set factor to 1.23
ElecMinimize: Iter:   7  F: -730.978978326052356  |grad|_K:  1.639e-05  alpha:  5.468e-01  linmin:  2.597e-05  t[s]:     24.81
    FillingsUpdate:  mu: +0.160892251  nElectrons: 208.000000
    SubspaceRotationAdjust: set factor to 1.26
ElecMinimize: Iter:   8  F: -730.980207120993555  |grad|_K:  1.050e-05  alpha:  4.976e-01  linmin: -8.406e-08  t[s]:     26.56
    FillingsUpdate:  mu: +0.160863060  nElectrons: 208.000000
    SubspaceRotationAdjust: set factor to 1.29
ElecMinimize: Iter:   9  F: -730.980785433417964  |grad|_K:  6.906e-06  alpha:  5.700e-01  linmin: -2.891e-06  t[s]:     28.63
    FillingsUpdate:  mu: +0.160857671  nElectrons: 208.000000
    SubspaceRotationAdjust: set factor to 1.35
ElecMinimize: Iter:  10  F: -730.981049912807066  |grad|_K:  4.890e-06  alpha:  6.024e-01  linmin:  8.936e-06  t[s]:     30.82
    FillingsUpdate:  mu: +0.160793025  nElectrons: 208.000000
    SubspaceRotationAdjust: set factor to 1.39
ElecMinimize: Iter:  11  F: -730.981171827026969  |grad|_K:  3.339e-06  alpha:  5.543e-01  linmin:  9.365e-06  t[s]:     32.90
    FillingsUpdate:  mu: +0.160772038  nElectrons: 208.000000
    SubspaceRotationAdjust: set factor to 1.43
ElecMinimize: Iter:  12  F: -730.981231201059586  |grad|_K:  2.236e-06  alpha:  5.791e-01  linmin:  8.257e-06  t[s]:     34.81
    FillingsUpdate:  mu: +0.160768562  nElectrons: 208.000000
    SubspaceRotationAdjust: set factor to 1.46
ElecMinimize: Iter:  13  F: -730.981258513158764  |grad|_K:  1.651e-06  alpha:  5.942e-01  linmin:  3.419e-06  t[s]:     36.63
    FillingsUpdate:  mu: +0.160745566  nElectrons: 208.000000
    SubspaceRotationAdjust: set factor to 1.5
ElecMinimize: Iter:  14  F: -730.981272397531598  |grad|_K:  1.193e-06  alpha:  5.536e-01  linmin:  1.708e-06  t[s]:     38.65
    FillingsUpdate:  mu: +0.160741943  nElectrons: 208.000000

Dumping 'X.fillings' ... done
Dumping 'X.wfns' ... done
Dumping 'X.force' ... done
Dumping 'X.eigenvals' ... done
    SubspaceRotationAdjust: set factor to 1.54
ElecMinimize: Iter:  15  F: -730.981279634896055  |grad|_K:  8.396e-07  alpha:  5.527e-01  linmin:  1.588e-06  t[s]:     41.20
    FillingsUpdate:  mu: +0.160753862  nElectrons: 208.000000
    SubspaceRotationAdjust: set factor to 1.55
ElecMinimize: Iter:  16  F: -730.981283228877373  |grad|_K:  6.225e-07  alpha:  5.542e-01  linmin:  1.301e-06  t[s]:     43.76
    FillingsUpdate:  mu: +0.160749440  nElectrons: 208.000000
    SubspaceRotationAdjust: set factor to 1.61
ElecMinimize: Iter:  17  F: -730.981285325200702  |grad|_K:  4.516e-07  alpha:  5.880e-01  linmin:  1.748e-06  t[s]:     45.86
    FillingsUpdate:  mu: +0.160741128  nElectrons: 208.000000
    SubspaceRotationAdjust: set factor to 1.61
ElecMinimize: Iter:  18  F: -730.981286478297307  |grad|_K:  3.254e-07  alpha:  6.149e-01  linmin:  1.433e-06  t[s]:     48.50
    FillingsUpdate:  mu: +0.160744203  nElectrons: 208.000000
    SubspaceRotationAdjust: set factor to 1.63
ElecMinimize: Iter:  19  F: -730.981287080700554  |grad|_K:  2.511e-07  alpha:  6.186e-01  linmin:  8.838e-07  t[s]:     50.52
    FillingsUpdate:  mu: +0.160743209  nElectrons: 208.000000
    SubspaceRotationAdjust: set factor to 1.65
ElecMinimize: Iter:  20  F: -730.981287437568767  |grad|_K:  1.882e-07  alpha:  6.152e-01  linmin:  6.871e-07  t[s]:     52.45
ElecMinimize: Converged (|Delta F|<1.000000e-06 for 2 iters).
Setting wave functions to eigenvectors of Hamiltonian

# Ionic positions in lattice coordinates:
ion Si   0.061602711441747   0.292062989005408   0.186816515940134 1
ion Si   0.938397257183252   0.707937011744592   0.813183447809866 1
ion Si   0.938397257183252   0.292062989005408   0.313183447809866 1
ion Si   0.061602711441747   0.707937011744592   0.686816515940134 1
ion Si   0.561602711441747   0.792062989005408   0.186816515940134 1
ion Si   0.438397257183252   0.207937011744592   0.813183447809866 1
ion Si   0.438397257183252   0.792062989005408   0.313183447809866 1
ion Si   0.561602711441747   0.207937011744592   0.686816515940134 1
ion Si   0.098290478501462   0.037391893097691   0.357765039787784 1
ion Si   0.901709490123537   0.962608107652309   0.642234923962216 1
ion Si   0.901709490123537   0.037391893097691   0.142234923962216 1
ion Si   0.098290478501462   0.962608107652309   0.857765039787784 1
ion Si   0.598290478501462   0.537391893097691   0.357765039787784 1
ion Si   0.401709490123537   0.462608107652309   0.642234923962216 1
ion Si   0.401709490123537   0.537391893097691   0.142234923962216 1
ion Si   0.598290478501462   0.462608107652309   0.857765039787784 1
ion Na   0.351190096562829   0.338260596131203   0.358380793977603 1
ion Na   0.648809872062171   0.661739404618797   0.641619169772397 1
ion Na   0.648809872062171   0.338260596131203   0.141619169772397 1
ion Na   0.351190096562829   0.661739404618797   0.858380793977603 1
ion Na   0.851190096562829   0.838260596131203   0.358380793977603 1
ion Na   0.148809872062171   0.161739404618797   0.641619169772397 1
ion Na   0.148809872062171   0.838260596131203   0.141619169772397 1
ion Na   0.851190096562829   0.161739404618797   0.858380793977603 1
ion Na   0.369136137512290   0.093767271292924   0.048717562509565 1
ion Na   0.630863831112710   0.906232729457076   0.951282401240435 1
ion Na   0.630863831112710   0.093767271292924   0.451282401240435 1
ion Na   0.369136137512290   0.906232729457076   0.548717562509565 1
ion Na   0.869136137512290   0.593767271292924   0.048717562509565 1
ion Na   0.130863831112710   0.406232729457076   0.951282401240435 1
ion Na   0.130863831112710   0.593767271292924   0.451282401240435 1
ion Na   0.869136137512290   0.406232729457076   0.548717562509565 1

# Forces in Lattice coordinates:
force Si   0.055038867847070   0.018758366364779  -0.040478332871196 1
force Si  -0.055038867847070  -0.018758366364779   0.040478332871196 1
force Si  -0.055038867847070   0.018758366364779   0.040478332871196 1
force Si   0.055038867847070  -0.018758366364779  -0.040478332871196 1
force Si   0.055038867847070   0.018758366364780  -0.040478332871195 1
force Si  -0.055038867847074  -0.018758366364779   0.040478332871197 1
force Si  -0.055038867847070   0.018758366364779   0.040478332871196 1
force Si   0.055038867847070  -0.018758366364779  -0.040478332871196 1
force Si   0.032246678913983  -0.017937318402096   0.022064496004864 1
force Si  -0.032246678913982   0.017937318402096  -0.022064496004864 1
force Si  -0.032246678913985  -0.017937318402096  -0.022064496004864 1
force Si   0.032246678913984   0.017937318402096   0.022064496004864 1
force Si   0.032246678913983  -0.017937318402096   0.022064496004864 1
force Si  -0.032246678913982   0.017937318402096  -0.022064496004864 1
force Si  -0.032246678913984  -0.017937318402096  -0.022064496004864 1
force Si   0.032246678913985   0.017937318402096   0.022064496004864 1
force Na  -0.003293935111161  -0.008850391847747   0.001948435288175 1
force Na   0.003293935111160   0.008850391847747  -0.001948435288174 1
force Na   0.003293935111160  -0.008850391847747  -0.001948435288175 1
force Na  -0.003293935111159   0.008850391847747   0.001948435288174 1
force Na  -0.003293935111162  -0.008850391847747   0.001948435288174 1
force Na   0.003293935111160   0.008850391847747  -0.001948435288174 1
force Na   0.003293935111160  -0.008850391847747  -0.001948435288174 1
force Na  -0.003293935111161   0.008850391847747   0.001948435288174 1
force Na   0.001163315121082  -0.002183149719639   0.005426023666742 1
force Na  -0.001163315121082   0.002183149719639  -0.005426023666743 1
force Na  -0.001163315121082  -0.002183149719639  -0.005426023666742 1
force Na   0.001163315121082   0.002183149719639   0.005426023666742 1
force Na   0.001163315121082  -0.002183149719639   0.005426023666742 1
force Na  -0.001163315121082   0.002183149719639  -0.005426023666742 1
force Na  -0.001163315121082  -0.002183149719639  -0.005426023666742 1
force Na   0.001163315121082   0.002183149719639   0.005426023666743 1

# Energy components:
   Eewald =     -382.9718366837728354
       EH =      243.5527654332317127
     Eloc =     -648.9546805863952841
      Enl =     -170.3465207339560834
   Epulay =       -0.0036071722955318
      Exc =     -114.9676103850727742
       KE =      342.7346635440668479
-------------------------------------
     Etot =     -730.9568265841941184
       TS =        0.0244608533746059
-------------------------------------
        F =     -730.9812874375687670

Dumping 'X.fillings' ... done
Dumping 'X.wfns' ... done
Dumping 'X.ionpos' ... done
Dumping 'X.force' ... done
Dumping 'X.lattice' ... done
Dumping 'X.eigenvals' ... done
IonicMinimize: Iter:   0  F: -730.981287437568767  |grad|_K:  1.205e-03  t[s]:     53.05
IonicMinimize: None of the convergence criteria satisfied after 0 iterations.

#--- Lowdin population analysis ---
# oxidation-state Si -0.527 -0.527 -0.527 -0.527 -0.527 -0.527 -0.527 -0.527 -0.502 -0.502 -0.502 -0.502 -0.502 -0.502 -0.502 -0.502
# oxidation-state Na +0.609 +0.609 +0.609 +0.609 +0.609 +0.609 +0.609 +0.609 +0.579 +0.579 +0.579 +0.579 +0.579 +0.579 +0.579 +0.579

Dumping 'X.fillings' ... done
Dumping 'X.wfns' ... done
Dumping 'X.ionpos' ... done
Dumping 'X.lattice' ... done
Dumping 'X.eigenvals' ... done
Dumping 'X.eigStats' ... 
    eMin: -2.011224 at state 3 ( [ +0.000000 +0.333333 +0.500000 ] spin 0 )
    HOMO: +0.135751 at state 0 ( [ +0.000000 +0.000000 +0.000000 ] spin 0 )
    mu  : +0.160743
    LUMO: +0.173235 at state 0 ( [ +0.000000 +0.000000 +0.000000 ] spin 0 )
    eMax: +0.410730 at state 2 ( [ +0.000000 +0.333333 +0.000000 ] spin 0 )
    HOMO-LUMO gap: +0.037484
    Optical gap  : +0.037484 at state 0 ( [ +0.000000 +0.000000 +0.000000 ] spin 0 )
Dumping 'X.dos' ... done.
End date and time: Thu Oct  5 15:33:34 2023  (Duration: 0-0:00:53.37)
Done!
ximik69 commented 9 months ago

The overall trouble with OpenMPI run under Slurm is that it reports sensible binding maps, yet the thread placement still goes wrong:

[ac0116:3754220] MCW rank 1 bound to socket 0[core 12[hwt 0]], socket 0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 0]], socket 0[core 16[hwt 0]], socket 0[core 17[hwt 0]], socket 0[core 18[hwt 0]], socket 0[core 19[hwt 0]], socket 0[core 20[hwt 0]], socket 0[core 21[hwt 0]], socket 0[core 22[hwt 0]], socket 0[core 23[hwt 0]]: [././././././././././././B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././././././././././././.]
[ac0116:3754220] MCW rank 2 bound to socket 1[core 24[hwt 0]], socket 1[core 25[hwt 0]], socket 1[core 26[hwt 0]], socket 1[core 27[hwt 0]], socket 1[core 28[hwt 0]], socket 1[core 29[hwt 0]], socket 1[core 30[hwt 0]], socket 1[core 31[hwt 0]], socket 1[core 32[hwt 0]], socket 1[core 33[hwt 0]], socket 1[core 34[hwt 0]], socket 1[core 35[hwt 0]]: [./././././././././././././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././.]
[ac0116:3754220] MCW rank 3 bound to socket 1[core 36[hwt 0]], socket 1[core 37[hwt 0]], socket 1[core 38[hwt 0]], socket 1[core 39[hwt 0]], socket 1[core 40[hwt 0]], socket 1[core 41[hwt 0]], socket 1[core 42[hwt 0]], socket 1[core 43[hwt 0]], socket 1[core 44[hwt 0]], socket 1[core 45[hwt 0]], socket 1[core 46[hwt 0]], socket 1[core 47[hwt 0]]: [./././././././././././././././././././././././.][././././././././././././B/B/B/B/B/B/B/B/B/B/B/B]
[ac0116:3754220] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]], socket 0[core 8[hwt 0]], socket 0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]: [B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././.][./././././././././././././././././././././././.]
shankar1729 commented 9 months ago

How are you determining that the thread binding is broken, given this report? Also, try a larger supercell calculation to make sure there is enough work to keep the threads busy.

ximik69 commented 9 months ago

When I start JDFTx using sbatch, I can sometimes attach to the working node and run htop, and I see the problem there. I tested this with a bigger task (50 Ha cutoff and a 3x5x3 k-point mesh) and it is clearly visible.

Unfortunately, I cannot attach to a node running an interactive session.

ximik69 commented 9 months ago

I tried several times to attach to a node running an interactive session with mpirun ... jdftx ..., but it seems impossible. I tried srun -N1 -n1 --jobid=$1 --pty /bin/bash and got: srun: Job 5295157 step creation temporarily disabled, retrying (Requested nodes are busy)

I cannot ssh to the running nodes either, because access is not permitted.

Trying "at now" and then mpirun ... jdftx ... does not work either.

Therefore I cannot test your recipe due to restrictions on the cluster. But can I somehow run mpirun+jdftx completely independently of SLURM via sbatch and watch the CPU load with htop?

I heard in a seminar about LUMI (AMD CPUs) that the OpenMPI+SLURM combination is known to have thread-placement problems.

ximik69 commented 9 months ago

> Also, did you try tweaking the CPU binding flags in sbatch/srun instead of the OpenMPI ones?

No, I did not know about that.

shankar1729 commented 9 months ago

I'd suggest trying a different MPI build then, e.g., MPICH or MVAPICH2.

ximik69 commented 9 months ago

I managed to run jdftx + htop this way:

mpiexec --report-bindings --bind-to none -n 2 jdftx $dbg -m -c 12 -i X.in -o X.out & htop

I am not sure whether this is correct, but it seems the binding issue does not occur when running independently of SLURM.

ximik69 commented 9 months ago

Dear Shankar, thank you very much for your hints. I estimate that the cores are loaded approximately 70% of the time when running a big task independently of SLURM.

I still cannot attach screenshots.

Best wishes, Igor.

ximik69 commented 9 months ago

Hello dear Shankar. A small conclusion: if I run JDFTx through OpenMPI as an sbatch task, thread binding fails, but if I run it in an interactive session it works fine.

I'll try other MPI implementations later and see whether the problem persists.

Best wishes, Igor.

ximik69 commented 8 months ago

A short update. I have built JDFTx with MVAPICH. It gives the same situation as with OpenMPI under the default settings.

But there are interesting parameters in the documentation (http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-userguide.pdf): MV2_CPU_BINDING_POLICY=hybrid, MV2_THREADS_PER_PROCESS= and MV2_HYBRID_BINDING_POLICY=

Let's see what happens.

shankar1729 commented 8 months ago

The issue might once again be SLURM overriding the settings when you go through it. Check whether these variables can override the SLURM settings in that case.

ximik69 commented 8 months ago

MV2_HYBRID_BINDING_POLICY=linear gives 12 running cores out of the 12 desired. However, thread locality is bad and the calculations are slow (10 s per iteration on 12 cores total).

#!/bin/bash
#SBATCH -t 00:10:00
#SBATCH -N 1
#SBATCH -p plgrid-testing
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12

# job name
#SBATCH -J Na_conv
#
# filenames stdout and stderr - customise, include %j
#SBATCH -o process_%j.out
#SBATCH -e process_%j.err

# write this script to stdout-file - useful for scripting errors
#cat $0

cd $SLURM_SUBMIT_DIR

. /net/people/plgrid/plgigoro/.bashrc

module purge

module load jdftx_mvapich

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_CPU_BINDING_POLICY=hybrid
export MV2_THREADS_PER_PROCESS=$SLURM_CPUS_PER_TASK
export MV2_HYBRID_BINDING_POLICY=linear

dbg=""
#dbg="-n"

runit="mpiexec -np $SLURM_NTASKS_PER_NODE jdftx $dbg -m -c $SLURM_CPUS_PER_TASK -i X.in -o X.out"

$runit

Next.

ximik69 commented 8 months ago

MV2_HYBRID_BINDING_POLICY=compact behaves the same (no hyperthreading).

ximik69 commented 8 months ago

MV2_HYBRID_BINDING_POLICY=numa is similarly slow (10 s per iteration with 1 MPI process and 12 cores total), but the CPU load oscillates.

ximik69 commented 8 months ago

MV2_HYBRID_BINDING_POLICY=bunch is an epic fail (18 s per LCAOMinimize iteration with 1 MPI process and 12 cores total, instead of the 2-3 s seen with the previous settings).

ximik69 commented 8 months ago

MV2_HYBRID_BINDING_POLICY=scatter is also slow.

ximik69 commented 8 months ago

The next thing I tried was to print the CPU bindings.

With the script below

#!/bin/bash
#SBATCH -t 00:10:00
#SBATCH -N 1
#SBATCH -p plgrid-testing
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=12

# job name
#SBATCH -J Na_conv
#
# filenames stdout and stderr - customise, include %j
#SBATCH -o process_%j.out
#SBATCH -e process_%j.err

# write this script to stdout-file - useful for scripting errors
cat $0

cd $SLURM_SUBMIT_DIR

. /net/people/plgrid/plgigoro/.bashrc

module purge

module load jdftx_mvapich

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_CPU_BINDING_POLICY=hybrid
export MV2_THREADS_PER_PROCESS=$SLURM_CPUS_PER_TASK
export MV2_HYBRID_BINDING_POLICY=numa
export MV2_SHOW_CPU_BINDING=1
#export MPICH_DBG_OUTPUT=VERBOSE
#export MPICH_DBG_CLASS=ALL
#export MPICH_DBG_FILENAME="dbg-%w-%d.log"
env

dbg=""
#dbg="-n"

runit="mpiexec -np $SLURM_NTASKS_PER_NODE -env MPICH_DBG_OUTPUT=VERBOSE -env MPICH_DBG_CLASS=ALL -env MPICH_DBG_FILENAME=dbg-%w-%d.log jdftx $dbg -m -c $SLURM_CPUS_PER_TASK -i X.in -o X.out"

$runit

I obtained

-------------CPU AFFINITY-------------
OMP_NUM_THREADS           : 12
MV2_THREADS_PER_PROCESS   : 12
MV2_CPU_BINDING_POLICY    : Hybrid
MV2_HYBRID_BINDING_POLICY : Linear
--------------------------------------
RANK: 0  CPU_SET:    0   1   2   3   7   8   9  13  14  15  19  20; NUMA: 0  Socket: 0
RANK: 1  CPU_SET:    4   5   6  10  11  12  16  17  18  21  22  23; NUMA: 1  Socket: 0
-------------------------------------

Note the difference: when I try to start with MV2_HYBRID_BINDING_POLICY=numa, the affinity report still shows MV2_HYBRID_BINDING_POLICY : Linear.

These strange CPU bindings clearly explain why the performance is so low.

ximik69 commented 8 months ago

> The issue might once again be SLURM overriding the settings when you go through it. Check whether these variables can override the SLURM settings, if that's the case.

Dear Shankar, thank you for the explanation.

I'll try to print slurm binding params and see what it shows.

Best wishes, Igor.

ximik69 commented 8 months ago

Upd. I've read here (https://slurm.schedmd.com/cpu_management.html) about the srun option --cpu-bind=verbose, but I don't know how to print the CPU binding settings for SLURM when using sbatch. `grep SLURM process_5394096.out` does not show anything informative:

cd $SLURM_SUBMIT_DIR
#export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
#export MV2_THREADS_PER_PROCESS=$SLURM_CPUS_PER_TASK
runit="mpiexec -np $SLURM_NTASKS_PER_NODE \
       -env OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK \
       -env MV2_THREADS_PER_PROCESS=$SLURM_CPUS_PER_TASK \
       -env MPICH_DBG_FILENAME="dbg-%w-%d.log" jdftx $dbg -m -c $SLURM_CPUS_PER_TASK -i X.in -o X.out"
SLURM_MEM_PER_CPU=3850
SLURM_NODEID=0
SLURM_TASK_PID=3363709
SLURM_PRIO_PROCESS=0
SLURM_SUBMIT_DIR=/net/ascratch/people/plgigoro/JDFTx_test10mvapich
SLURM_CPUS_PER_TASK=12
SLURM_PROCID=0
SLURM_JOB_GID=100000
SLURMD_NODENAME=ac0543
SLURM_JOB_END_TIME=1697106836
SLURM_TASKS_PER_NODE=2
SLURM_NNODES=1
SLURM_JOB_START_TIME=1697106236
SLURM_NTASKS_PER_NODE=2
SLURM_JOB_NODELIST=ac0543
SLURM_CLUSTER_NAME=ares
SLURM_NODELIST=ac0543
SLURM_NTASKS=2
SLURM_JOB_CPUS_PER_NODE=24
SLURM_TOPOLOGY_ADDR=core.island2.p0h03c02.ac0543
SLURM_WORKING_CLUSTER=ares:slurm01:6817:9984:109
SLURM_JOB_NAME=Na_conv
SLURM_JOBID=5394096
SLURM_CONF=/etc/slurm/slurm.conf
SLURM_NODE_ALIASES=(null)
SLURM_JOB_QOS=normal
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.switch.node
SLURM_CPUS_ON_NODE=24
SLURM_JOB_NUM_NODES=1
SLURM_JOB_UID=114522
SLURM_JOB_PARTITION=plgrid-testing
SLURM_SCRIPT_CONTEXT=prolog_task
SLURM_JOB_USER=plgigoro
SLURM_NPROCS=2
SLURM_SUBMIT_HOST=login01.ares.cyfronet.pl
SLURM_JOB_ACCOUNT=plgzl3a-cpu
SLURM_GTIDS=0
SLURM_JOB_ID=5394096
SLURM_LOCALID=0
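Independently of the grep above, SLURM can be asked to report the binding it applies by launching the step through srun with verbose binding. This is only a sketch (not tested on this cluster); the jdftx arguments are copied from the scripts earlier in the thread:

```shell
# Inside the sbatch script: let SLURM itself report the CPU binding it applies.
# The verbose binding report goes to stderr, i.e. the process_%j.err file.
srun --cpu-bind=verbose --ntasks=$SLURM_NTASKS_PER_NODE \
     --cpus-per-task=$SLURM_CPUS_PER_TASK \
     jdftx -m -c $SLURM_CPUS_PER_TASK -i X.in -o X.out
```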

What can I do next?

shankar1729 commented 8 months ago

I think you'll need to consult your HPC sysadmins again, since this seems to be specific to your cluster's setup. I don't think this is specific to JDFTx in any way; rather, it looks like broken support for MPI + threads in general. This is quite likely, as many programs tend to be pure MPI rather than hybrid MPI+thread parallelized, leading to a lack of support for this case.

Additionally, you may want to create a simple program that just runs a dummy loop in several threads, as a minimal example to help your HPC people debug this issue. See these pages for a starting point:

https://enccs.github.io/intermediate-mpi/mpi-and-threads-pt2/

https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=a00114008en_us&page=Run_an_OpenMP_Application.html
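In the same spirit, even a shell-level stand-in can reveal where workers land, without compiling anything. This is a sketch rather than the linked examples; `spawn_workers` is a made-up helper, and each background job is a separate process standing in for a thread:

```shell
#!/bin/bash
# Spawn n concurrent workers; each prints the CPU set it is allowed to run on.
# Reading /proc/self/status from each worker shows its own affinity mask.
spawn_workers() {
  local n=$1 i
  for ((i = 0; i < n; i++)); do
    grep Cpus_allowed_list /proc/self/status &
  done
  wait
}

spawn_workers 4   # prints one Cpus_allowed_list line per worker
```

Running this under srun/sbatch and comparing the printed CPU sets against the requested binding gives a quick sanity check without involving MPI at all.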

Best, Shankar

ximik69 commented 8 months ago

Hi dear Shankar,

thanks a lot for your help!!! I am actually in contact with the cluster sysadmins, but I got stuck and possibly they did too; I sent them a link to this thread. I was pretty sure the problem was not with JDFTx but with broken thread binding/affinity on the cluster. I contacted you because, as the author of the program, you know it and its dependencies best. For instance, I previously thought that JDFTx used OpenMP rather than pthreads; thank you for correcting me.

You gave me a good simple program to try to debug issues.

I figured out previously that the CPU numbers for one MPI rank are often very strange, sometimes even starting from 1 rather than 0. Possibly this is due to allocation of cores without much concern for their locality.

I compiled the program you pointed me to and saw that it uses OpenMP threads. I'll play with the MPI/sbatch parameters to try to improve the situation. But could there be any difference in affinity handling between libpthread threads and OpenMP threads?
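One concrete way to compare what the two threading models actually end up with is to read the per-thread affinity of a running process straight from /proc. A sketch assuming a Linux node; `show_thread_affinity` is a made-up helper, not part of JDFTx:

```shell
#!/bin/bash
# List the CPU affinity of every thread (task) of a process.
# Pass the PID reported by ps/htop; defaults to the current shell.
show_thread_affinity() {
  local pid=${1:-$$} task
  for task in /proc/"$pid"/task/*; do
    printf 'tid %s -> cpus %s\n' "${task##*/}" \
      "$(awk '/Cpus_allowed_list/ {print $2}' "$task/status")"
  done
}

# Example: inspect the current shell (single-threaded, so one line).
show_thread_affinity $$
```

Since /proc reports affinity per task regardless of whether the thread came from libpthread or OpenMP, this shows the kernel's view directly, independent of which threading library set it.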

Thank you once again for your kind help.

Best wishes, Igor.

shankar1729 commented 8 months ago

Indeed, but I have reached the limit of how much I know about this :).

It may be that OpenMP uses some environment variables that pthreads does not, but under the hood OpenMP ultimately reduces to pthreads (or equivalent) in many implementations. In your case, the issue is most likely coming in at the SLURM level, since things work fine outside of job scripts.

Best, Shankar

ximik69 commented 8 months ago

Ok, thank you very much.

Best wishes, Igor.

ximik69 commented 7 months ago

Hi dear Shankar, I have found a working solution for correct binding, at least for the test program xthi. Thank you once again for mentioning xthi.

With the help of the admins, and some recipes for the LUMI cluster, I finally arrived at the following script:

#!/bin/bash
#SBATCH -t 00:00:10
#SBATCH -N 1
#SBATCH -p plgrid-testing
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12

#
# filenames stdout and stderr - customise, include %j
#SBATCH -o process_%j.out
#SBATCH -e process_%j.err

# write this script to stdout-file - useful for scripting errors
cat $0

module purge

module load jdftx/1.7.0-foss-2021b-mkl

#1st mask 12 cores
mask0="FFF"
#2nd mask 12 cores
mask1=$mask0"000"
#3rd mask 12 cores
mask2=$mask1"000"
#4th mask 12 cores
mask3=$mask2"000"
echo $mask0
echo $mask1
echo $mask2
echo $mask3
CPU_BIND="mask_cpu:${mask0},${mask1},${mask2},${mask3}"

export OMP_NUM_THREADS=12
#export OMP_PLACES=cores
srun --mpi=pmix --cpu-bind=${CPU_BIND}  xthi | sort -n -k 4 -k 6
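The hand-written hex masks above can also be generated programmatically, which scales to other rank/core counts. A sketch: `gen_masks` is a hypothetical helper, and bash's 64-bit signed arithmetic limits it to masks of at most 63 bits:

```shell
#!/bin/bash
# Build a SLURM mask_cpu string for `ranks` consecutive blocks of `width` cores.
gen_masks() {
  local ranks=$1 width=$2 i masks=()
  for ((i = 0; i < ranks; i++)); do
    # Rank i gets `width` set bits, shifted past the previous ranks' cores.
    masks+=( "0x$(printf '%x' $(( ((1 << width) - 1) << (i * width) )))" )
  done
  (IFS=','; echo "mask_cpu:${masks[*]}")
}

gen_masks 4 12   # prints mask_cpu:0xfff,0xfff000,0xfff000000,0xfff000000000
```

The output can then be passed straight to `srun --cpu-bind=`, matching the CPU_BIND string built by hand in the script above (SLURM accepts hex masks with or without the 0x prefix).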

I'll test this with jdftx to see whether I can obtain full CPU load using MPI + pthreads.

The most interesting part of the output is:

Hello from rank 0, thread 0, on ac0082. (core affinity = 0)
Hello from rank 0, thread 1, on ac0082. (core affinity = 1)
Hello from rank 0, thread 2, on ac0082. (core affinity = 2)
Hello from rank 0, thread 3, on ac0082. (core affinity = 3)
Hello from rank 0, thread 4, on ac0082. (core affinity = 4)
Hello from rank 0, thread 5, on ac0082. (core affinity = 5)
Hello from rank 0, thread 6, on ac0082. (core affinity = 6)
Hello from rank 0, thread 7, on ac0082. (core affinity = 7)
Hello from rank 0, thread 8, on ac0082. (core affinity = 8)
Hello from rank 0, thread 9, on ac0082. (core affinity = 9)
Hello from rank 0, thread 10, on ac0082. (core affinity = 10)
Hello from rank 0, thread 11, on ac0082. (core affinity = 11)
Hello from rank 1, thread 0, on ac0082. (core affinity = 12)
Hello from rank 1, thread 1, on ac0082. (core affinity = 13)
Hello from rank 1, thread 2, on ac0082. (core affinity = 14)
Hello from rank 1, thread 3, on ac0082. (core affinity = 15)
Hello from rank 1, thread 4, on ac0082. (core affinity = 16)
Hello from rank 1, thread 5, on ac0082. (core affinity = 17)
Hello from rank 1, thread 6, on ac0082. (core affinity = 18)
Hello from rank 1, thread 7, on ac0082. (core affinity = 19)
Hello from rank 1, thread 8, on ac0082. (core affinity = 20)
Hello from rank 1, thread 9, on ac0082. (core affinity = 21)
Hello from rank 1, thread 10, on ac0082. (core affinity = 22)
Hello from rank 1, thread 11, on ac0082. (core affinity = 23)
Hello from rank 2, thread 0, on ac0082. (core affinity = 24)
Hello from rank 2, thread 1, on ac0082. (core affinity = 25)
Hello from rank 2, thread 2, on ac0082. (core affinity = 26)
Hello from rank 2, thread 3, on ac0082. (core affinity = 27)
Hello from rank 2, thread 4, on ac0082. (core affinity = 28)
Hello from rank 2, thread 5, on ac0082. (core affinity = 29)
Hello from rank 2, thread 6, on ac0082. (core affinity = 30)
Hello from rank 2, thread 7, on ac0082. (core affinity = 31)
Hello from rank 2, thread 8, on ac0082. (core affinity = 32)
Hello from rank 2, thread 9, on ac0082. (core affinity = 33)
Hello from rank 2, thread 10, on ac0082. (core affinity = 34)
Hello from rank 2, thread 11, on ac0082. (core affinity = 35)
Hello from rank 3, thread 0, on ac0082. (core affinity = 36)
Hello from rank 3, thread 1, on ac0082. (core affinity = 37)
Hello from rank 3, thread 2, on ac0082. (core affinity = 38)
Hello from rank 3, thread 3, on ac0082. (core affinity = 39)
Hello from rank 3, thread 4, on ac0082. (core affinity = 40)
Hello from rank 3, thread 5, on ac0082. (core affinity = 41)
Hello from rank 3, thread 6, on ac0082. (core affinity = 42)
Hello from rank 3, thread 7, on ac0082. (core affinity = 43)
Hello from rank 3, thread 8, on ac0082. (core affinity = 44)
Hello from rank 3, thread 9, on ac0082. (core affinity = 45)
Hello from rank 3, thread 10, on ac0082. (core affinity = 46)
Hello from rank 3, thread 11, on ac0082. (core affinity = 47)

Best wishes, Igor.

shankar1729 commented 7 months ago

Great to hear that, hope it works for JDFTx next too!

ximik69 commented 7 months ago

Thank you. Still waiting in the queue.

ximik69 commented 7 months ago

Hello dear Shankar.

Small update: after examining the output of hpc-jobs-history, the total CPU usage was about 15%, which was not nice. I then installed mpiP to profile MPI communication and rule out MPI problems without recompiling. Despite a difficult install, it showed that roughly 9-21% of the time was spent in MPI communication for my now-typical usage pattern: [number of k-points] processes x 2 threads each. If mpiP is correct (despite its inability to compile the Fortran tests), that is at least acceptable. With the preinstalled gcc+mkl+fftw+gsl+libxc+openmpi stack, the main thread of each MPI process always uses 100% CPU, but the additional threads run only intermittently for short bursts and then idle. The preinstalled FFTW has MPI support.

To try to mitigate this, I built my own OpenBLAS 3.21 without threads, FFTW without MPI, GSL and libxc (i.e., jdftx with the preinstalled gcc + my OpenBLAS + my FFTW + my GSL + my libxc + the preinstalled OpenMPI). This already gives a constant 100% CPU load on the threads, which is nice.

ElecMinimize time for 16 MPI processes with the preinstalled gcc+mkl+fftw+gsl+libxc+openmpi stack was 92 s, often crashing, possibly due to insufficient RAM. Increasing the number of threads usually did not help much, because with that stack only the main thread of each MPI process did any work, with the other threads mostly idle.

ElecMinimize time for 16 MPI processes x 2 threads = 32 CPUs (jdftx with the preinstalled gcc + my OpenBLAS + my FFTW + my GSL + my libxc + the preinstalled OpenMPI) was 60 s, which is ~30% slower than the expected 46 s. All of this was without any affinity masks, because allocating a whole node has a prohibitively long waiting time.

This is quite a good result, if it holds up for other process+thread arrangements. Between 1-9% and 10-38% of the time is spent in MPI communication depending on the run, possibly due to other people's programs running on the same compute nodes.

My conclusion: if your program handles MPI/threading itself, rebuild its libraries without their own threading, as library-level threading can wildly decrease your performance. The difference could, however, also be due to the change from MKL to OpenBLAS.

I have also a question about profiling. Which profiling software do you use? How much does it slow down JDFTx?

Best wishes, Igor.

shankar1729 commented 7 months ago

Thanks for the updates, Igor! In your comparison of the two cases with 16 MPI processes, with 1 and 2 threads per process, are those all physical cores? It is quite difficult to draw performance conclusions when comparing partial-node jobs shared with others, so the expected time of 46 s may not be appropriate for this scenario.

Regardless, using BLAS without threads and letting jdftx handle the threads is definitely the safer option to avoid overcommitting the cores.

Finally, as for profiling: I built a basic, lightweight profiler into JDFTx that can be enabled with EnableProfiling=yes during the cmake build. It only profiles a few top-level functions (hard-coded) to see where time is being spent, and so does not add any noticeable overhead. I use the CUDA profiler to tune the CUDA kernels, but since most of the time in jdftx is spent in the libraries, I don't do very fine-grained CPU profiling (occasionally I use Linux's perf tool).
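For reference, enabling that profiler would look something like this at configure time. A sketch: the EnableProfiling flag name is taken from the comment above, and the source path is a placeholder:

```shell
# Reconfigure and rebuild JDFTx with the built-in profiler enabled.
cmake -D EnableProfiling=yes /path/to/jdftx-source
make -j 12
```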

Best, Shankar

ximik69 commented 7 months ago

Dear Shankar, thank you very much for the comment.

Actually, the cores mentioned are physical ones. I understand that other processes interfere quite noticeably within a node; at least I hope the performance is not bad. In practice, the waiting time in the queue often dominates the total time to result. The problem is that the cluster I use was set up not for MPI+threads but for pure MPI, so the queue priorities favor splitting a task into many processes spread over, say, 8-10 nodes with only a few processes per node. It is like gathering leftovers from others. A better strategy for performance is to allocate whole nodes, or at least NUMA nodes or sockets, but the queue would have to be configured for that from the beginning, which is not the case here. While allocating whole nodes is possible, the waiting time is so long that the game is not worth the candle.

I'll try profiling in JDFTx, it is interesting to see the results.

Best wishes, Igor.