Closed ximik69 closed 6 months ago
Below is the startup script:
#!/bin/bash
#
# job time, change for what your job requires
#SBATCH -t 00:05:00
#SBATCH -A plgzl3a-cpu
# requesting the number of nodes needed
#SBATCH -N 1
#####SBATCH --exclusive
#SBATCH -p plgrid-testing
#SBATCH --tasks-per-node=4
#SBATCH --cpus-per-task=12
# job name
#SBATCH -J JDFTx_test
#
# filenames stdout and stderr - customise, include %j
#SBATCH -o process_%j.out
#SBATCH -e process_%j.err
# write this script to stdout-file - useful for scripting errors
cat $0
# load the modules required for your program - customise for your program
cd $SLURM_SUBMIT_DIR
. /net/people/plgrid/plgigoro/.bashrc
module load jdftx/1.7.0-foss-2021b-mkl
export OMP_DISPLAY_ENV=TRUE
export OMP_DISPLAY_AFFINITY=TRUE
export OMP_NUM_THREADS=12
export OMP_PROC_BIND=TRUE
dbg=""
runit="mpiexec --report-bindings jdftx $dbg -m -i X.in -o X.out"
$runit
Can you try OpenMPI with threads using mpirun --bind-to none directly on the compute node, without SLURM? It could be a clash between the affinity settings in SLURM and OpenMPI. If we can confirm that, we can see what to change.
Also, did you try tweaking the CPU binding flags in sbatch/srun instead of the OpenMPI ones?
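For concreteness, the two experiments suggested above might look like this (a sketch only; the flag values are illustrative and the jdftx arguments are taken from the job script above):

```shell
# 1) Bypass SLURM's affinity handling entirely: launch on the compute node
#    without SLURM, letting JDFTx's pthreads float freely across cores.
mpirun --bind-to none -n 2 jdftx -m -c 12 -i X.in -o X.out

# 2) Or control binding from the SLURM side instead of OpenMPI:
#    bind at core granularity and print the applied mask to stderr.
srun --ntasks=2 --cpus-per-task=12 --cpu-bind=verbose,cores \
     jdftx -m -c 12 -i X.in -o X.out
```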
Best, Shankar
Also, in response to the job file above: the OMP_* flags have no effect on most of JDFTx, which is parallelized using pthreads, not OpenMP. They may only affect threads within MKL. I would recommend linking JDFTx with ThreadedBLAS=no for MKL, so that JDFTx takes care of all threading, at least for the initial debugging.
Best, Shankar
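As a sketch of the rebuild suggested here (assuming a CMake build of JDFTx, where ThreadedBLAS is the corresponding configure option; the source path is a placeholder):

```shell
# Link against MKL but disable MKL's internal threading, so that JDFTx's
# own pthread pool handles all threading (source path is hypothetical).
cmake -D EnableMKL=yes -D ThreadedBLAS=no /path/to/jdftx-source/jdftx
make -j 12
```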
Dear Shankar, thanks a lot for such a swift answer!
I started mpiexec --report-bindings --bind-to none jdftx $dbg -m -i X.in -o X.out
Unfortunately I cannot ssh to the running node or attach to it using srun -N1 -n1 --jobid=$1 --pty /bin/bash.
I'll inspect what the program writes.
I tried running this: srun -p plgrid-testing -t 50 -A plgzl3a-cpu -N 1 -n 24 --pty /bin/bash
then: mpiexec --report-bindings --bind-to none -n 2 jdftx $dbg -m -c 12 -i X.in | tee X.out
I can see that the iteration times are not bad, but I cannot watch the output and htop simultaneously.
GitHub also mangles the attachments, so I have to post them as-is.
X.out:
*************** JDFTx 1.7.0 ***************
Start date and time: Thu Oct 5 15:32:41 2023
Executable jdftx with command-line: -m -c 12 -i X.in
Running on hosts (process indices): ac0619 (0-1)
Divided in process groups (process indices): 0 (0) 1 (1)
Resource initialization completed at t[s]: 0.00
Run totals: 2 processes, 24 threads, 0 GPUs
Input parsed successfully to the following command list (including defaults):
basis kpoint-dependent
cache-projectors no
coords-type Lattice
core-overlap-check vector
coulomb-interaction Periodic
davidson-band-ratio 1.1
density-of-states Etol 1.000000e-06 Esigma 0.000000e+00 \
Complete \
Total \
Occupied \
Total
dump End State IonicPositions Lattice EigStats DOS
dump Electronic State Forces
dump Ionic State IonicPositions Forces Lattice
dump-interval Electronic 8
dump-name X.$var
elec-cutoff 10
elec-eigen-algo Davidson
elec-ex-corr gga-PBE
elec-smearing Fermi 0.0095
electronic-minimize \
dirUpdateScheme FletcherReeves \
linminMethod DirUpdateRecommended \
nIterations 2000 \
history 15 \
knormThreshold 0 \
energyDiffThreshold 1e-06 \
nEnergyDiff 2 \
alphaTstart 1 \
alphaTmin 1e-10 \
updateTestStepSize yes \
alphaTreduceFactor 0.1 \
alphaTincreaseFactor 3 \
nAlphaAdjustMax 3 \
wolfeEnergy 0.0001 \
wolfeGradient 0.9 \
fdTest no
exchange-regularization WignerSeitzTruncated
fluid None
fluid-ex-corr (null) lda-PZ
fluid-gummel-loop 10 1.000000e-05
fluid-minimize \
dirUpdateScheme PolakRibiere \
linminMethod DirUpdateRecommended \
nIterations 100 \
history 15 \
knormThreshold 0 \
energyDiffThreshold 0 \
nEnergyDiff 2 \
alphaTstart 1 \
alphaTmin 1e-10 \
updateTestStepSize yes \
alphaTreduceFactor 0.1 \
alphaTincreaseFactor 3 \
nAlphaAdjustMax 3 \
wolfeEnergy 0.0001 \
wolfeGradient 0.9 \
fdTest no
fluid-solvent H2O 55.338 ScalarEOS \
epsBulk 78.4 \
pMol 0.92466 \
epsInf 1.77 \
Pvap 1.06736e-10 \
sigmaBulk 4.62e-05 \
Rvdw 2.61727 \
Res 1.42 \
tauNuc 343133 \
poleEl 15 7 1
forces-output-coords Positions
initial-state X.$var
ion Si 0.061602711441747 0.292062989005408 0.186816515940134 1
ion Si 0.938397257183252 0.707937011744592 0.813183447809866 1
ion Si 0.938397257183252 0.292062989005408 0.313183447809866 1
ion Si 0.061602711441747 0.707937011744592 0.686816515940134 1
ion Si 0.561602711441747 0.792062989005408 0.186816515940134 1
ion Si 0.438397257183252 0.207937011744592 0.813183447809866 1
ion Si 0.438397257183252 0.792062989005408 0.313183447809866 1
ion Si 0.561602711441747 0.207937011744592 0.686816515940134 1
ion Si 0.098290478501462 0.037391893097691 0.357765039787784 1
ion Si 0.901709490123537 0.962608107652309 0.642234923962216 1
ion Si 0.901709490123537 0.037391893097691 0.142234923962216 1
ion Si 0.098290478501462 0.962608107652309 0.857765039787784 1
ion Si 0.598290478501462 0.537391893097691 0.357765039787784 1
ion Si 0.401709490123537 0.462608107652309 0.642234923962216 1
ion Si 0.401709490123537 0.537391893097691 0.142234923962216 1
ion Si 0.598290478501462 0.462608107652309 0.857765039787784 1
ion Na 0.351190096562829 0.338260596131203 0.358380793977603 1
ion Na 0.648809872062171 0.661739404618797 0.641619169772397 1
ion Na 0.648809872062171 0.338260596131203 0.141619169772397 1
ion Na 0.351190096562829 0.661739404618797 0.858380793977603 1
ion Na 0.851190096562829 0.838260596131203 0.358380793977603 1
ion Na 0.148809872062171 0.161739404618797 0.641619169772397 1
ion Na 0.148809872062171 0.838260596131203 0.141619169772397 1
ion Na 0.851190096562829 0.161739404618797 0.858380793977603 1
ion Na 0.369136137512290 0.093767271292924 0.048717562509565 1
ion Na 0.630863831112710 0.906232729457076 0.951282401240435 1
ion Na 0.630863831112710 0.093767271292924 0.451282401240435 1
ion Na 0.369136137512290 0.906232729457076 0.548717562509565 1
ion Na 0.869136137512290 0.593767271292924 0.048717562509565 1
ion Na 0.130863831112710 0.406232729457076 0.951282401240435 1
ion Na 0.130863831112710 0.593767271292924 0.451282401240435 1
ion Na 0.869136137512290 0.406232729457076 0.548717562509565 1
ion-species SG15/$ID_ONCV_PBE.upf
ion-width 0
ionic-minimize \
dirUpdateScheme L-BFGS \
linminMethod DirUpdateRecommended \
nIterations 0 \
history 15 \
knormThreshold 0.0001 \
energyDiffThreshold 1e-06 \
nEnergyDiff 2 \
alphaTstart 1 \
alphaTmin 1e-10 \
updateTestStepSize yes \
alphaTreduceFactor 0.1 \
alphaTincreaseFactor 3 \
nAlphaAdjustMax 3 \
wolfeEnergy 0.0001 \
wolfeGradient 0.9 \
fdTest no
kpoint 0.000000000000 0.000000000000 0.000000000000 1.00000000000000
kpoint-folding 2 3 2
latt-move-scale 1 1 1
latt-scale 1 1 1
lattice \
22.928577882949284 0.000000000000000 -10.094230904711511 \
0.000000000000000 12.457297591368569 0.000000000000000 \
0.069314861190520 0.000000000000000 18.327192900226532
lattice-minimize \
dirUpdateScheme L-BFGS \
linminMethod DirUpdateRecommended \
nIterations 0 \
history 15 \
knormThreshold 0 \
energyDiffThreshold 1e-06 \
nEnergyDiff 2 \
alphaTstart 1 \
alphaTmin 1e-10 \
updateTestStepSize yes \
alphaTreduceFactor 0.1 \
alphaTincreaseFactor 3 \
nAlphaAdjustMax 3 \
wolfeEnergy 0.0001 \
wolfeGradient 0.9 \
fdTest no
lcao-params -1 1e-06 0.0095
pcm-variant GLSSA13
spintype no-spin
subspace-rotation-factor 1 yes
symmetries automatic
symmetry-threshold 0.001
---------- Setting up symmetries ----------
Non-trivial transmission matrix:
[ 1 0 0 ]
[ 0 1 0 ]
[ 1 0 1 ]
with reduced lattice vectors:
[ 12.834347 0.000000 -10.094231 ]
[ 0.000000 12.457298 0.000000 ]
[ 18.396508 0.000000 18.327193 ]
Found 4 point-group symmetries of the bravais lattice
Found 8 space-group symmetries with basis
Applied RMS atom displacement 5.57399e-15 bohrs to make symmetries exact.
---------- Initializing the Grid ----------
R =
[ 22.9286 0 -10.0942 ]
[ 0 12.4573 0 ]
[ 0.0693149 0 18.3272 ]
unit cell volume = 5243.48
G =
[ 0.273577 -0 0.150681 ]
[ 0 0.504378 -0 ]
[ -0.00103469 0 0.342264 ]
Minimum fftbox size, Smin = [ 68 36 60 ]
Chosen fftbox size, S = [ 70 36 60 ]
---------- Exchange Correlation functional ----------
Initalized PBE GGA exchange.
Initalized PBE GGA correlation.
---------- Setting up pseudopotentials ----------
Width of ionic core gaussian charges (only for fluid interactions / plotting) set to 0
Reading pseudopotential file '/net/software/testing/software/jdftx/1.7.0-foss-2021b-mkl/share/jdftx/pseudopotentials/SG15/Si_ONCV_PBE.upf':
'Si' pseudopotential, 'PBE' functional
Generated using ONCVPSP code by D. R. Hamann
Author: Martin Schlipf and Francois Gygi Date: 150915.
4 valence electrons, 2 orbitals, 4 projectors, 1510 radial grid points, with lMax = 1
Transforming local potential to a uniform radial grid of dG=0.02 with 1024 points.
Transforming nonlocal projectors to a uniform radial grid of dG=0.02 with 307 points.
3S l: 0 occupation: 2.0 eigenvalue: -0.397365
3P l: 1 occupation: 2.0 eigenvalue: -0.149981
Transforming atomic orbitals to a uniform radial grid of dG=0.02 with 307 points.
Core radius for overlap checks: 2.98 bohrs.
Reading pulay file /net/software/testing/software/jdftx/1.7.0-foss-2021b-mkl/share/jdftx/pseudopotentials/SG15/Si_ONCV_PBE.pulay ... using dE_dnG = -1.274872e-03 computed for Ecut = 10.
Reading pseudopotential file '/net/software/testing/software/jdftx/1.7.0-foss-2021b-mkl/share/jdftx/pseudopotentials/SG15/Na_ONCV_PBE.upf':
'Na' pseudopotential, 'PBE' functional
Generated using ONCVPSP code by D. R. Hamann
Author: Martin Schlipf and Francois Gygi Date: 150915.
9 valence electrons, 3 orbitals, 4 projectors, 1992 radial grid points, with lMax = 1
Transforming local potential to a uniform radial grid of dG=0.02 with 1024 points.
Transforming nonlocal projectors to a uniform radial grid of dG=0.02 with 307 points.
2S l: 0 occupation: 2.0 eigenvalue: -2.085640
2P l: 1 occupation: 6.0 eigenvalue: -1.053708
3S l: 0 occupation: 1.0 eigenvalue: -0.100838
Transforming atomic orbitals to a uniform radial grid of dG=0.02 with 307 points.
Core radius for overlap checks: 2.01 bohrs.
Reading pulay file /net/software/testing/software/jdftx/1.7.0-foss-2021b-mkl/share/jdftx/pseudopotentials/SG15/Na_ONCV_PBE.pulay ... using dE_dnG = -1.681960e+00 computed for Ecut = 10.
Initialized 2 species with 32 total atoms.
Folded 1 k-points by 2x3x2 to 12 k-points.
---------- Setting up k-points, bands, fillings ----------
Reduced to 8 k-points under symmetry.
Computing the number of bands and number of electrons
Calculating initial fillings.
nElectrons: 208.000000 nBands: 144 nStates: 8
----- Setting up reduced wavefunction bases (one per k-point) -----
average nbasis = 7919.083 , ideal nbasis = 7919.786
----- Initializing Supercell corresponding to k-point mesh -----
Lattice vector linear combinations in supercell:
[ 2 0 0 ]
[ 0 3 0 ]
[ 0 0 2 ]
Supercell lattice vectors:
[ 45.8572 0 -20.1885 ]
[ 0 37.3719 0 ]
[ 0.13863 0 36.6544 ]
---------- Setting up ewald sum ----------
Optimum gaussian width for ewald sums = 3.889779 bohr.
Real space sum over 567 unit cells with max indices [ 3 4 4 ]
Reciprocal space sum over 5733 terms with max indices [ 10 6 10 ]
---------- Allocating electronic variables ----------
Initializing wave functions: linear combination of atomic orbitals
Si pseudo-atom occupations: s ( 2 ) p ( 2 )
Na pseudo-atom occupations: s ( 2 1 ) p ( 6 )
FillingsUpdate: mu: +0.201822279 nElectrons: 208.000000
LCAOMinimize: Iter: 0 F: -729.2090650262059626 |grad|_K: 1.689e-03 alpha: 1.000e+00
FillingsUpdate: mu: +0.181357224 nElectrons: 208.000000
LCAOMinimize: Iter: 1 F: -729.3693863155830286 |grad|_K: 3.575e-04 alpha: 6.002e-01 linmin: -2.079e-02 cgtest: 1.523e-01 t[s]: 2.90
FillingsUpdate: mu: +0.181985758 nElectrons: 208.000000
LCAOMinimize: Iter: 2 F: -729.3766674728333328 |grad|_K: 1.384e-04 alpha: 6.722e-01 linmin: 1.997e-03 cgtest: -2.733e-02 t[s]: 3.96
FillingsUpdate: mu: +0.181879253 nElectrons: 208.000000
LCAOMinimize: Iter: 3 F: -729.3773034865737372 |grad|_K: 2.609e-05 alpha: 3.977e-01 linmin: -1.111e-04 cgtest: 4.285e-02 t[s]: 4.90
FillingsUpdate: mu: +0.181657222 nElectrons: 208.000000
LCAOMinimize: Iter: 4 F: -729.3773419649966172 |grad|_K: 6.791e-06 alpha: 6.718e-01 linmin: -4.698e-04 cgtest: 1.067e-03 t[s]: 5.88
FillingsUpdate: mu: +0.181726236 nElectrons: 208.000000
LCAOMinimize: Iter: 5 F: -729.3773446574464288 |grad|_K: 9.288e-07 alpha: 6.934e-01 linmin: -1.074e-04 cgtest: -1.541e-02 t[s]: 6.94
FillingsUpdate: mu: +0.181726671 nElectrons: 208.000000
LCAOMinimize: Iter: 6 F: -729.3773446824303619 |grad|_K: 1.622e-07 alpha: 3.443e-01 linmin: -7.979e-06 cgtest: -6.935e-04 t[s]: 7.95
FillingsUpdate: mu: +0.181725395 nElectrons: 208.000000
LCAOMinimize: Iter: 7 F: -729.3773446840687029 |grad|_K: 3.444e-08 alpha: 7.441e-01 linmin: 5.051e-03 cgtest: -2.493e-02 t[s]: 8.94
LCAOMinimize: Converged (|Delta F|<1.000000e-06 for 2 iters).
---- Citations for features of the code used in this run ----
Software package:
R. Sundararaman, K. Letchworth-Weaver, K.A. Schwarz, D. Gunceler, Y. Ozhabes and T.A. Arias, 'JDFTx: software for joint density-functional theory', SoftwareX 6, 278 (2017)
gga-PBE exchange-correlation functional:
J.P. Perdew, K. Burke and M. Ernzerhof, Phys. Rev. Lett. 77, 3865 (1996)
Pseudopotentials:
M Schlipf and F Gygi, Comput. Phys. Commun. 196, 36 (2015)
Total energy minimization with Auxiliary Hamiltonian:
C. Freysoldt, S. Boeck, and J. Neugebauer, Phys. Rev. B 79, 241103(R) (2009)
Linear-tetrahedron sampling for density of states:
G. Lehmann and M. Taut, Phys. status solidi (b) 54, 469 (1972)
This list may not be complete. Please suggest additional citations or
report any other bugs at https://github.com/shankar1729/jdftx/issues
Initialization completed successfully at t[s]: 9.37
-------- Electronic minimization -----------
FillingsUpdate: mu: +0.181725395 nElectrons: 208.000000
ElecMinimize: Iter: 0 F: -729.377344684068930 |grad|_K: 5.833e-04 alpha: 1.000e+00
FillingsUpdate: mu: +0.174884513 nElectrons: 208.000000
SubspaceRotationAdjust: set factor to 1
ElecMinimize: Iter: 1 F: -730.620281943531040 |grad|_K: 1.901e-04 alpha: 3.972e-01 linmin: 4.841e-04 t[s]: 12.41
FillingsUpdate: mu: +0.169771004 nElectrons: 208.000000
SubspaceRotationAdjust: set factor to 1.03
ElecMinimize: Iter: 2 F: -730.837248274446893 |grad|_K: 1.239e-04 alpha: 6.529e-01 linmin: -7.589e-05 t[s]: 14.54
FillingsUpdate: mu: +0.164481604 nElectrons: 208.000000
SubspaceRotationAdjust: set factor to 1.06
ElecMinimize: Iter: 3 F: -730.931821771194222 |grad|_K: 7.164e-05 alpha: 6.679e-01 linmin: -7.699e-06 t[s]: 16.39
FillingsUpdate: mu: +0.162754975 nElectrons: 208.000000
SubspaceRotationAdjust: set factor to 1.1
ElecMinimize: Iter: 4 F: -730.958002708480421 |grad|_K: 4.985e-05 alpha: 5.536e-01 linmin: 3.191e-05 t[s]: 18.70
FillingsUpdate: mu: +0.162285216 nElectrons: 208.000000
SubspaceRotationAdjust: set factor to 1.13
ElecMinimize: Iter: 5 F: -730.970350330141514 |grad|_K: 3.494e-05 alpha: 5.406e-01 linmin: 5.257e-05 t[s]: 20.82
FillingsUpdate: mu: +0.161943814 nElectrons: 208.000000
SubspaceRotationAdjust: set factor to 1.16
ElecMinimize: Iter: 6 F: -730.976393429656810 |grad|_K: 2.271e-05 alpha: 5.401e-01 linmin: 4.563e-05 t[s]: 22.79
FillingsUpdate: mu: +0.161302037 nElectrons: 208.000000
Dumping 'X.fillings' ... done
Dumping 'X.wfns' ... done
Dumping 'X.force' ... done
Dumping 'X.eigenvals' ... done
SubspaceRotationAdjust: set factor to 1.23
ElecMinimize: Iter: 7 F: -730.978978326052356 |grad|_K: 1.639e-05 alpha: 5.468e-01 linmin: 2.597e-05 t[s]: 24.81
FillingsUpdate: mu: +0.160892251 nElectrons: 208.000000
SubspaceRotationAdjust: set factor to 1.26
ElecMinimize: Iter: 8 F: -730.980207120993555 |grad|_K: 1.050e-05 alpha: 4.976e-01 linmin: -8.406e-08 t[s]: 26.56
FillingsUpdate: mu: +0.160863060 nElectrons: 208.000000
SubspaceRotationAdjust: set factor to 1.29
ElecMinimize: Iter: 9 F: -730.980785433417964 |grad|_K: 6.906e-06 alpha: 5.700e-01 linmin: -2.891e-06 t[s]: 28.63
FillingsUpdate: mu: +0.160857671 nElectrons: 208.000000
SubspaceRotationAdjust: set factor to 1.35
ElecMinimize: Iter: 10 F: -730.981049912807066 |grad|_K: 4.890e-06 alpha: 6.024e-01 linmin: 8.936e-06 t[s]: 30.82
FillingsUpdate: mu: +0.160793025 nElectrons: 208.000000
SubspaceRotationAdjust: set factor to 1.39
ElecMinimize: Iter: 11 F: -730.981171827026969 |grad|_K: 3.339e-06 alpha: 5.543e-01 linmin: 9.365e-06 t[s]: 32.90
FillingsUpdate: mu: +0.160772038 nElectrons: 208.000000
SubspaceRotationAdjust: set factor to 1.43
ElecMinimize: Iter: 12 F: -730.981231201059586 |grad|_K: 2.236e-06 alpha: 5.791e-01 linmin: 8.257e-06 t[s]: 34.81
FillingsUpdate: mu: +0.160768562 nElectrons: 208.000000
SubspaceRotationAdjust: set factor to 1.46
ElecMinimize: Iter: 13 F: -730.981258513158764 |grad|_K: 1.651e-06 alpha: 5.942e-01 linmin: 3.419e-06 t[s]: 36.63
FillingsUpdate: mu: +0.160745566 nElectrons: 208.000000
SubspaceRotationAdjust: set factor to 1.5
ElecMinimize: Iter: 14 F: -730.981272397531598 |grad|_K: 1.193e-06 alpha: 5.536e-01 linmin: 1.708e-06 t[s]: 38.65
FillingsUpdate: mu: +0.160741943 nElectrons: 208.000000
Dumping 'X.fillings' ... done
Dumping 'X.wfns' ... done
Dumping 'X.force' ... done
Dumping 'X.eigenvals' ... done
SubspaceRotationAdjust: set factor to 1.54
ElecMinimize: Iter: 15 F: -730.981279634896055 |grad|_K: 8.396e-07 alpha: 5.527e-01 linmin: 1.588e-06 t[s]: 41.20
FillingsUpdate: mu: +0.160753862 nElectrons: 208.000000
SubspaceRotationAdjust: set factor to 1.55
ElecMinimize: Iter: 16 F: -730.981283228877373 |grad|_K: 6.225e-07 alpha: 5.542e-01 linmin: 1.301e-06 t[s]: 43.76
FillingsUpdate: mu: +0.160749440 nElectrons: 208.000000
SubspaceRotationAdjust: set factor to 1.61
ElecMinimize: Iter: 17 F: -730.981285325200702 |grad|_K: 4.516e-07 alpha: 5.880e-01 linmin: 1.748e-06 t[s]: 45.86
FillingsUpdate: mu: +0.160741128 nElectrons: 208.000000
SubspaceRotationAdjust: set factor to 1.61
ElecMinimize: Iter: 18 F: -730.981286478297307 |grad|_K: 3.254e-07 alpha: 6.149e-01 linmin: 1.433e-06 t[s]: 48.50
FillingsUpdate: mu: +0.160744203 nElectrons: 208.000000
SubspaceRotationAdjust: set factor to 1.63
ElecMinimize: Iter: 19 F: -730.981287080700554 |grad|_K: 2.511e-07 alpha: 6.186e-01 linmin: 8.838e-07 t[s]: 50.52
FillingsUpdate: mu: +0.160743209 nElectrons: 208.000000
SubspaceRotationAdjust: set factor to 1.65
ElecMinimize: Iter: 20 F: -730.981287437568767 |grad|_K: 1.882e-07 alpha: 6.152e-01 linmin: 6.871e-07 t[s]: 52.45
ElecMinimize: Converged (|Delta F|<1.000000e-06 for 2 iters).
Setting wave functions to eigenvectors of Hamiltonian
# Ionic positions in lattice coordinates:
ion Si 0.061602711441747 0.292062989005408 0.186816515940134 1
ion Si 0.938397257183252 0.707937011744592 0.813183447809866 1
ion Si 0.938397257183252 0.292062989005408 0.313183447809866 1
ion Si 0.061602711441747 0.707937011744592 0.686816515940134 1
ion Si 0.561602711441747 0.792062989005408 0.186816515940134 1
ion Si 0.438397257183252 0.207937011744592 0.813183447809866 1
ion Si 0.438397257183252 0.792062989005408 0.313183447809866 1
ion Si 0.561602711441747 0.207937011744592 0.686816515940134 1
ion Si 0.098290478501462 0.037391893097691 0.357765039787784 1
ion Si 0.901709490123537 0.962608107652309 0.642234923962216 1
ion Si 0.901709490123537 0.037391893097691 0.142234923962216 1
ion Si 0.098290478501462 0.962608107652309 0.857765039787784 1
ion Si 0.598290478501462 0.537391893097691 0.357765039787784 1
ion Si 0.401709490123537 0.462608107652309 0.642234923962216 1
ion Si 0.401709490123537 0.537391893097691 0.142234923962216 1
ion Si 0.598290478501462 0.462608107652309 0.857765039787784 1
ion Na 0.351190096562829 0.338260596131203 0.358380793977603 1
ion Na 0.648809872062171 0.661739404618797 0.641619169772397 1
ion Na 0.648809872062171 0.338260596131203 0.141619169772397 1
ion Na 0.351190096562829 0.661739404618797 0.858380793977603 1
ion Na 0.851190096562829 0.838260596131203 0.358380793977603 1
ion Na 0.148809872062171 0.161739404618797 0.641619169772397 1
ion Na 0.148809872062171 0.838260596131203 0.141619169772397 1
ion Na 0.851190096562829 0.161739404618797 0.858380793977603 1
ion Na 0.369136137512290 0.093767271292924 0.048717562509565 1
ion Na 0.630863831112710 0.906232729457076 0.951282401240435 1
ion Na 0.630863831112710 0.093767271292924 0.451282401240435 1
ion Na 0.369136137512290 0.906232729457076 0.548717562509565 1
ion Na 0.869136137512290 0.593767271292924 0.048717562509565 1
ion Na 0.130863831112710 0.406232729457076 0.951282401240435 1
ion Na 0.130863831112710 0.593767271292924 0.451282401240435 1
ion Na 0.869136137512290 0.406232729457076 0.548717562509565 1
# Forces in Lattice coordinates:
force Si 0.055038867847070 0.018758366364779 -0.040478332871196 1
force Si -0.055038867847070 -0.018758366364779 0.040478332871196 1
force Si -0.055038867847070 0.018758366364779 0.040478332871196 1
force Si 0.055038867847070 -0.018758366364779 -0.040478332871196 1
force Si 0.055038867847070 0.018758366364780 -0.040478332871195 1
force Si -0.055038867847074 -0.018758366364779 0.040478332871197 1
force Si -0.055038867847070 0.018758366364779 0.040478332871196 1
force Si 0.055038867847070 -0.018758366364779 -0.040478332871196 1
force Si 0.032246678913983 -0.017937318402096 0.022064496004864 1
force Si -0.032246678913982 0.017937318402096 -0.022064496004864 1
force Si -0.032246678913985 -0.017937318402096 -0.022064496004864 1
force Si 0.032246678913984 0.017937318402096 0.022064496004864 1
force Si 0.032246678913983 -0.017937318402096 0.022064496004864 1
force Si -0.032246678913982 0.017937318402096 -0.022064496004864 1
force Si -0.032246678913984 -0.017937318402096 -0.022064496004864 1
force Si 0.032246678913985 0.017937318402096 0.022064496004864 1
force Na -0.003293935111161 -0.008850391847747 0.001948435288175 1
force Na 0.003293935111160 0.008850391847747 -0.001948435288174 1
force Na 0.003293935111160 -0.008850391847747 -0.001948435288175 1
force Na -0.003293935111159 0.008850391847747 0.001948435288174 1
force Na -0.003293935111162 -0.008850391847747 0.001948435288174 1
force Na 0.003293935111160 0.008850391847747 -0.001948435288174 1
force Na 0.003293935111160 -0.008850391847747 -0.001948435288174 1
force Na -0.003293935111161 0.008850391847747 0.001948435288174 1
force Na 0.001163315121082 -0.002183149719639 0.005426023666742 1
force Na -0.001163315121082 0.002183149719639 -0.005426023666743 1
force Na -0.001163315121082 -0.002183149719639 -0.005426023666742 1
force Na 0.001163315121082 0.002183149719639 0.005426023666742 1
force Na 0.001163315121082 -0.002183149719639 0.005426023666742 1
force Na -0.001163315121082 0.002183149719639 -0.005426023666742 1
force Na -0.001163315121082 -0.002183149719639 -0.005426023666742 1
force Na 0.001163315121082 0.002183149719639 0.005426023666743 1
# Energy components:
Eewald = -382.9718366837728354
EH = 243.5527654332317127
Eloc = -648.9546805863952841
Enl = -170.3465207339560834
Epulay = -0.0036071722955318
Exc = -114.9676103850727742
KE = 342.7346635440668479
-------------------------------------
Etot = -730.9568265841941184
TS = 0.0244608533746059
-------------------------------------
F = -730.9812874375687670
Dumping 'X.fillings' ... done
Dumping 'X.wfns' ... done
Dumping 'X.ionpos' ... done
Dumping 'X.force' ... done
Dumping 'X.lattice' ... done
Dumping 'X.eigenvals' ... done
IonicMinimize: Iter: 0 F: -730.981287437568767 |grad|_K: 1.205e-03 t[s]: 53.05
IonicMinimize: None of the convergence criteria satisfied after 0 iterations.
#--- Lowdin population analysis ---
# oxidation-state Si -0.527 -0.527 -0.527 -0.527 -0.527 -0.527 -0.527 -0.527 -0.502 -0.502 -0.502 -0.502 -0.502 -0.502 -0.502 -0.502
# oxidation-state Na +0.609 +0.609 +0.609 +0.609 +0.609 +0.609 +0.609 +0.609 +0.579 +0.579 +0.579 +0.579 +0.579 +0.579 +0.579 +0.579
Dumping 'X.fillings' ... done
Dumping 'X.wfns' ... done
Dumping 'X.ionpos' ... done
Dumping 'X.lattice' ... done
Dumping 'X.eigenvals' ... done
Dumping 'X.eigStats' ...
eMin: -2.011224 at state 3 ( [ +0.000000 +0.333333 +0.500000 ] spin 0 )
HOMO: +0.135751 at state 0 ( [ +0.000000 +0.000000 +0.000000 ] spin 0 )
mu : +0.160743
LUMO: +0.173235 at state 0 ( [ +0.000000 +0.000000 +0.000000 ] spin 0 )
eMax: +0.410730 at state 2 ( [ +0.000000 +0.333333 +0.000000 ] spin 0 )
HOMO-LUMO gap: +0.037484
Optical gap : +0.037484 at state 0 ( [ +0.000000 +0.000000 +0.000000 ] spin 0 )
Dumping 'X.dos' ... done.
End date and time: Thu Oct 5 15:33:34 2023 (Duration: 0-0:00:53.37)
Done!
The overall trouble with OpenMPI run under SLURM is that it reports sensible binding maps, but the actual thread binding goes wrong:
[ac0116:3754220] MCW rank 1 bound to socket 0[core 12[hwt 0]], socket 0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 0]], socket 0[core 16[hwt 0]], socket 0[core 17[hwt 0]], socket 0[core 18[hwt 0]], socket 0[core 19[hwt 0]], socket 0[core 20[hwt 0]], socket 0[core 21[hwt 0]], socket 0[core 22[hwt 0]], socket 0[core 23[hwt 0]]: [././././././././././././B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././././././././././././.]
[ac0116:3754220] MCW rank 2 bound to socket 1[core 24[hwt 0]], socket 1[core 25[hwt 0]], socket 1[core 26[hwt 0]], socket 1[core 27[hwt 0]], socket 1[core 28[hwt 0]], socket 1[core 29[hwt 0]], socket 1[core 30[hwt 0]], socket 1[core 31[hwt 0]], socket 1[core 32[hwt 0]], socket 1[core 33[hwt 0]], socket 1[core 34[hwt 0]], socket 1[core 35[hwt 0]]: [./././././././././././././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././.]
[ac0116:3754220] MCW rank 3 bound to socket 1[core 36[hwt 0]], socket 1[core 37[hwt 0]], socket 1[core 38[hwt 0]], socket 1[core 39[hwt 0]], socket 1[core 40[hwt 0]], socket 1[core 41[hwt 0]], socket 1[core 42[hwt 0]], socket 1[core 43[hwt 0]], socket 1[core 44[hwt 0]], socket 1[core 45[hwt 0]], socket 1[core 46[hwt 0]], socket 1[core 47[hwt 0]]: [./././././././././././././././././././././././.][././././././././././././B/B/B/B/B/B/B/B/B/B/B/B]
[ac0116:3754220] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]], socket 0[core 8[hwt 0]], socket 0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]: [B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././.][./././././././././././././././././././././././.]
How are you determining that the thread binding is broken from this report? Also, try with a larger supercell calculation to make sure you have enough work to keep the threads busy.
When I start JDFTx using sbatch I can sometimes attach to the working node and run htop, and I do see problems. I tested this with a bigger task, namely a 50 Ha cutoff and a 3x5x3 k-point mesh, and the issue is clearly visible.
Unfortunately I cannot attach to the node when running an interactive session.
I tried several times to attach to a node running an interactive session with mpirun ... jdftx ..., but it seems impossible. I tried srun -N1 -n1 --jobid=$1 --pty /bin/bash and got: srun: Job 5295157 step creation temporarily disabled, retrying (Requested nodes are busy)
I cannot ssh to the running nodes either, because access is not permitted.
Trying "at now", then mpirun ... jdftx ..., does not work either.
Therefore I cannot test your recipe due to restrictions on the cluster. But can I somehow run mpirun+jdftx totally independently of SLURM via sbatch and watch the CPU load with htop?
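One htop-free way to inspect placement from inside a batch job (a generic Linux sketch, not JDFTx-specific) is to log the allowed-CPU mask and per-thread placement directly:

```shell
# CPU set this shell (and any child it launches) is allowed to run on.
grep Cpus_allowed_list /proc/self/status

# Per-thread placement of a running jdftx process: the PSR column shows
# which CPU each thread last ran on (<pid> is a placeholder).
# ps -L -o tid,psr,pcpu,comm -p <pid>
```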
I heard at a seminar about LUMI (AMD CPUs) that the combination of OpenMPI and SLURM is problematic because of placement issues.
Also, did you try tweaking the CPU binding flags in sbatch/srun instead of the OpenMPI ones?
No, I did not know about that.
I'd suggest trying with a different MPI build then, eg. MPICH or MVAPICH2.
I succeeded in running jdftx + htop this way:
mpiexec --report-bindings --bind-to none -n 2 jdftx $dbg -m -c 12 -i X.in -o X.out & htop
I am not sure whether this is correct, but it seems that the binding issue does not occur when running independently of SLURM.
Dear Shankar, thank you very much for your hints. I estimate that the cores are busy approx. 70% of the time when running a big task independently of SLURM.
I still cannot attach screenshots.
Best wishes, Igor.
Hello dear Shankar. A small conclusion: if I run JDFTx via OpenMPI as an sbatch task, thread binding fails, but if I run it in an interactive session it works fine.
I'll try other MPI implementations later and see whether the problem persists.
Best wishes, Igor.
A short update. I have built JDFTx with MVAPICH. It shows the same behaviour as OpenMPI under the default settings.
But there are interesting parameters in the documentation (http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-userguide.pdf): MV2_CPU_BINDING_POLICY=hybrid, MV2_THREADS_PER_PROCESS=..., MV2_HYBRID_BINDING_POLICY=...
Let's see what happens.
The issue might once again be SLURM overriding the settings when you go through it. Check whether these variables can override the SLURM settings, if that's the case.
MV2_HYBRID_BINDING_POLICY=linear gives 12 running cores out of the 12 desired. However, thread locality is bad and the calculation is slow (10 s per iteration on 12 cores in total).
#!/bin/bash
#SBATCH -t 00:10:00
#SBATCH -N 1
#SBATCH -p plgrid-testing
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12
# job name
#SBATCH -J Na_conv
#
# filenames stdout and stderr - customise, include %j
#SBATCH -o process_%j.out
#SBATCH -e process_%j.err
# write this script to stdout-file - useful for scripting errors
#cat $0
cd $SLURM_SUBMIT_DIR
. /net/people/plgrid/plgigoro/.bashrc
module purge
module load jdftx_mvapich
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_CPU_BINDING_POLICY=hybrid
export MV2_THREADS_PER_PROCESS=$SLURM_CPUS_PER_TASK
export MV2_HYBRID_BINDING_POLICY=linear
dbg=""
#dbg="-n"
dbg=""
runit="mpiexec -np $SLURM_NTASKS_PER_NODE jdftx $dbg -m -c $SLURM_CPUS_PER_TASK -i X.in -o X.out"
$runit
Next:
MV2_HYBRID_BINDING_POLICY=compact behaves the same (no hyperthreading).
MV2_HYBRID_BINDING_POLICY=numa is similarly slow (10 s per iteration, 1 MPI process, 12 cores in total), but the CPU load oscillates.
MV2_HYBRID_BINDING_POLICY=bunch is an epic fail (18 s per LCAOMinimize iteration with 1 MPI process and 12 cores, instead of 2-3 s as with the previous settings).
MV2_HYBRID_BINDING_POLICY=scatter is also slow.
The next thing I tried was to print the CPU bindings, using the script below:
#!/bin/bash
#SBATCH -t 00:10:00
#SBATCH -N 1
#SBATCH -p plgrid-testing
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=12
# job name
#SBATCH -J Na_conv
#
# filenames stdout and stderr - customise, include %j
#SBATCH -o process_%j.out
#SBATCH -e process_%j.err
# write this script to stdout-file - useful for scripting errors
cat $0
cd $SLURM_SUBMIT_DIR
. /net/people/plgrid/plgigoro/.bashrc
module purge
module load jdftx_mvapich
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_CPU_BINDING_POLICY=hybrid
export MV2_THREADS_PER_PROCESS=$SLURM_CPUS_PER_TASK
export MV2_HYBRID_BINDING_POLICY=numa
export MV2_SHOW_CPU_BINDING=1
#export MPICH_DBG_OUTPUT=VERBOSE
#export MPICH_DBG_CLASS=ALL
#export MPICH_DBG_FILENAME="dbg-%w-%d.log"
env
dbg=""
#dbg="-n"
dbg=""
runit="mpiexec -np $SLURM_NTASKS_PER_NODE -env MPICH_DBG_OUTPUT=VERBOSE -env MPICH_DBG_CLASS=ALL -env MPICH_DBG_FILENAME="dbg-%w-%d.log" jdftx $dbg -m -c $SLURM_CPUS_PER_TASK -i X.in -o X.out"
$runit
I obtained
-------------CPU AFFINITY-------------
OMP_NUM_THREADS : 12
MV2_THREADS_PER_PROCESS : 12
MV2_CPU_BINDING_POLICY : Hybrid
MV2_HYBRID_BINDING_POLICY : Linear
--------------------------------------
RANK: 0 CPU_SET: 0 1 2 3 7 8 9 13 14 15 19 20; NUMA: 0 Socket: 0
RANK: 1 CPU_SET: 4 5 6 10 11 12 16 17 18 21 22 23; NUMA: 1 Socket: 0
-------------------------------------
Note the difference: when I try to start with MV2_HYBRID_BINDING_POLICY=numa,
I get MV2_HYBRID_BINDING_POLICY : Linear.
These strange CPU bindings clearly explain why the performance is so poor.
The issue might once again be SLURM overriding the settings when you go through it. Check whether these variables can override the SLURM settings, if that's the case.
Dear Shankar, thank you for the explanation.
I'll try to print the SLURM binding parameters and see what they show.
Best wishes, Igor.
Upd. I've read here (https://slurm.schedmd.com/cpu_management.html) about the srun option --cpu-bind=verbose, but I don't know how to print the CPU binding settings for SLURM when using sbatch. `grep SLURM process_5394096.out` does not show anything informative:
cd $SLURM_SUBMIT_DIR
#export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
#export MV2_THREADS_PER_PROCESS=$SLURM_CPUS_PER_TASK
runit="mpiexec -np $SLURM_NTASKS_PER_NODE \
-env OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK \
-env MV2_THREADS_PER_PROCESS=$SLURM_CPUS_PER_TASK \
-env MPICH_DBG_FILENAME="dbg-%w-%d.log" jdftx $dbg -m -c $SLURM_CPUS_PER_TASK -i X.in -o X.out"
SLURM_MEM_PER_CPU=3850
SLURM_NODEID=0
SLURM_TASK_PID=3363709
SLURM_PRIO_PROCESS=0
SLURM_SUBMIT_DIR=/net/ascratch/people/plgigoro/JDFTx_test10mvapich
SLURM_CPUS_PER_TASK=12
SLURM_PROCID=0
SLURM_JOB_GID=100000
SLURMD_NODENAME=ac0543
SLURM_JOB_END_TIME=1697106836
SLURM_TASKS_PER_NODE=2
SLURM_NNODES=1
SLURM_JOB_START_TIME=1697106236
SLURM_NTASKS_PER_NODE=2
SLURM_JOB_NODELIST=ac0543
SLURM_CLUSTER_NAME=ares
SLURM_NODELIST=ac0543
SLURM_NTASKS=2
SLURM_JOB_CPUS_PER_NODE=24
SLURM_TOPOLOGY_ADDR=core.island2.p0h03c02.ac0543
SLURM_WORKING_CLUSTER=ares:slurm01:6817:9984:109
SLURM_JOB_NAME=Na_conv
SLURM_JOBID=5394096
SLURM_CONF=/etc/slurm/slurm.conf
SLURM_NODE_ALIASES=(null)
SLURM_JOB_QOS=normal
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.switch.node
SLURM_CPUS_ON_NODE=24
SLURM_JOB_NUM_NODES=1
SLURM_JOB_UID=114522
SLURM_JOB_PARTITION=plgrid-testing
SLURM_SCRIPT_CONTEXT=prolog_task
SLURM_JOB_USER=plgigoro
SLURM_NPROCS=2
SLURM_SUBMIT_HOST=login01.ares.cyfronet.pl
SLURM_JOB_ACCOUNT=plgzl3a-cpu
SLURM_GTIDS=0
SLURM_JOB_ID=5394096
SLURM_LOCALID=0
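One way to get SLURM itself to report its binding from inside an sbatch job (a sketch based on the --cpu-bind=verbose option from the cpu_management page above; the exact output format is SLURM-version dependent, and this fragment only makes sense inside a job allocation):

```shell
# Add inside the sbatch script, before the mpiexec line:
# a no-op job step that makes SLURM print the binding applied to each task (on stderr).
srun --cpu-bind=verbose -n "$SLURM_NTASKS" true
# srun also exports its binding decision into the task environment:
srun -n 1 sh -c 'env | grep -i CPU_BIND'
```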
What can I do next?
I think you'll need to consult with your HPC sysadmins again since this seems to be specific to your cluster's setup. I don't think this is specific to JDFTx in any way, but rather broken support for MPI + threads in general. This is quite likely as a lot of programs tend to be pure MPI rather than hybrid MPI+thread parallelized, leading to lack of support for this case.
Additionally, you may want to create a simple program that just runs a dummy loop in several threads, as a minimal example to help your HPC people debug this issue. See this page for a starting point:
https://enccs.github.io/intermediate-mpi/mpi-and-threads-pt2/
Best, Shankar
Hi dear Shankar,
thanks a lot for your help!!! I am actually in contact with the cluster sysadmins, but I got stuck and possibly they did too. I sent them a link to this thread. I was fairly sure that the problem is not with JDFTx but with broken thread binding/affinity on the cluster; I contacted you because, as the author of the program, you know it and its dependencies best. For instance, I previously thought that JDFTx used OpenMP rather than pthreads; thank you for correcting me.
You gave me a good simple program to try to debug issues.
I noticed earlier that the CPU numbers for one MPI rank are often very strange, even starting from 1 rather than 0. Possibly this is due to cores being allocated without much concern for their locality.
I compiled the program you pointed me to and saw that it uses OpenMP threads. I'll try to play with the MPI/sbatch parameters to improve the situation. But could there be any difference in affinity between threads from libpthread and OpenMP?
Thank you once again for your kind help.
Best wishes, Igor.
Indeed, but I have reached the limit of how much I know about this :).
It may be that OpenMP uses some of those environment variables that pthreads does not, but under the hood OpenMP ultimately reduces to pthreads or an equivalent in many implementations. In your case, the issue is most likely coming in at the SLURM level, since things work fine outside of job scripts.
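One practical consequence of this: both pthreads and OpenMP threads inherit the process-wide affinity mask at creation time, so restricting the launching process restricts every thread regardless of the threading library. A quick local check of that inheritance (assuming util-linux `taskset` and a Linux /proc are available):

```shell
# Restrict a child shell to core 0 and read back the mask its threads would inherit.
taskset -c 0 bash -c 'awk "/Cpus_allowed_list/ {print \$2}" /proc/self/status'
# -> 0
```

This is why `--bind-to none` (or an equivalently wide SLURM mask) matters: if the launcher hands each rank a one-core mask, all of its pthreads pile onto that single core.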
Best, Shankar
Ok, thank you very much.
Best wishes, Igor.
Hi dear Shankar, I have found a working solution for correct binding, at least for the test program xthi. Thank you once again Shankar for mentioning xthi.
With the help of the admins, and some recipes for the LUMI cluster, I finally arrived at the following script:
#!/bin/bash
#SBATCH -t 00:00:10
#SBATCH -N 1
#SBATCH -p plgrid-testing
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
#
# filenames stdout and stderr - customise, include %j
#SBATCH -o process_%j.out
#SBATCH -e process_%j.err
# write this script to stdout-file - useful for scripting errors
cat $0
module purge
module load jdftx/1.7.0-foss-2021b-mkl
#1st mask 12 cores
mask0="FFF"
#2nd mask 12 cores
mask1=$mask0"000"
#3rd mask 12 cores
mask2=$mask1"000"
#4th mask 12 cores
mask3=$mask2"000"
echo $mask0
echo $mask1
echo $mask2
echo $mask3
CPU_BIND="mask_cpu:${mask0},${mask1},${mask2},${mask3}"
export OMP_NUM_THREADS=12
#export OMP_PLACES=cores
srun --mpi=pmix --cpu-bind=${CPU_BIND} xthi | sort -n -k 4 -k 6
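The hand-written hex masks above can also be generated programmatically, which avoids mistakes when the task count or core count changes. A sketch (assuming, as in the script above, a contiguous block of cpus-per-task cores per rank starting at core 0):

```shell
#!/bin/bash
# Generate the srun --cpu-bind mask list instead of writing hex masks by hand.
ntasks=4
cpus_per_task=12
masks=()
for ((i = 0; i < ntasks; i++)); do
    # each mask is cpus_per_task set bits, shifted up by i blocks of cpus_per_task
    masks+=( "$(printf '0x%x' $(( ((1 << cpus_per_task) - 1) << (i * cpus_per_task) )))" )
done
list=$(printf '%s,' "${masks[@]}")
echo "mask_cpu:${list%,}"
# -> mask_cpu:0xfff,0xfff000,0xfff000000,0xfff000000000
```

The printed string can be passed directly as `srun --cpu-bind=$(...)`, matching the CPU_BIND variable built by hand in the script above.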
I'll test this with jdftx, to see whether I can obtain full cpu load using MPI+pthreads.
The most interesting part of the output is:
Hello from rank 0, thread 0, on ac0082. (core affinity = 0)
Hello from rank 0, thread 1, on ac0082. (core affinity = 1)
Hello from rank 0, thread 2, on ac0082. (core affinity = 2)
Hello from rank 0, thread 3, on ac0082. (core affinity = 3)
Hello from rank 0, thread 4, on ac0082. (core affinity = 4)
Hello from rank 0, thread 5, on ac0082. (core affinity = 5)
Hello from rank 0, thread 6, on ac0082. (core affinity = 6)
Hello from rank 0, thread 7, on ac0082. (core affinity = 7)
Hello from rank 0, thread 8, on ac0082. (core affinity = 8)
Hello from rank 0, thread 9, on ac0082. (core affinity = 9)
Hello from rank 0, thread 10, on ac0082. (core affinity = 10)
Hello from rank 0, thread 11, on ac0082. (core affinity = 11)
Hello from rank 1, thread 0, on ac0082. (core affinity = 12)
Hello from rank 1, thread 1, on ac0082. (core affinity = 13)
Hello from rank 1, thread 2, on ac0082. (core affinity = 14)
Hello from rank 1, thread 3, on ac0082. (core affinity = 15)
Hello from rank 1, thread 4, on ac0082. (core affinity = 16)
Hello from rank 1, thread 5, on ac0082. (core affinity = 17)
Hello from rank 1, thread 6, on ac0082. (core affinity = 18)
Hello from rank 1, thread 7, on ac0082. (core affinity = 19)
Hello from rank 1, thread 8, on ac0082. (core affinity = 20)
Hello from rank 1, thread 9, on ac0082. (core affinity = 21)
Hello from rank 1, thread 10, on ac0082. (core affinity = 22)
Hello from rank 1, thread 11, on ac0082. (core affinity = 23)
Hello from rank 2, thread 0, on ac0082. (core affinity = 24)
Hello from rank 2, thread 1, on ac0082. (core affinity = 25)
Hello from rank 2, thread 2, on ac0082. (core affinity = 26)
Hello from rank 2, thread 3, on ac0082. (core affinity = 27)
Hello from rank 2, thread 4, on ac0082. (core affinity = 28)
Hello from rank 2, thread 5, on ac0082. (core affinity = 29)
Hello from rank 2, thread 6, on ac0082. (core affinity = 30)
Hello from rank 2, thread 7, on ac0082. (core affinity = 31)
Hello from rank 2, thread 8, on ac0082. (core affinity = 32)
Hello from rank 2, thread 9, on ac0082. (core affinity = 33)
Hello from rank 2, thread 10, on ac0082. (core affinity = 34)
Hello from rank 2, thread 11, on ac0082. (core affinity = 35)
Hello from rank 3, thread 0, on ac0082. (core affinity = 36)
Hello from rank 3, thread 1, on ac0082. (core affinity = 37)
Hello from rank 3, thread 2, on ac0082. (core affinity = 38)
Hello from rank 3, thread 3, on ac0082. (core affinity = 39)
Hello from rank 3, thread 4, on ac0082. (core affinity = 40)
Hello from rank 3, thread 5, on ac0082. (core affinity = 41)
Hello from rank 3, thread 6, on ac0082. (core affinity = 42)
Hello from rank 3, thread 7, on ac0082. (core affinity = 43)
Hello from rank 3, thread 8, on ac0082. (core affinity = 44)
Hello from rank 3, thread 9, on ac0082. (core affinity = 45)
Hello from rank 3, thread 10, on ac0082. (core affinity = 46)
Hello from rank 3, thread 11, on ac0082. (core affinity = 47)
Best wishes, Igor.
Great to hear that, hope it works for JDFTx next too!
Thank you. Still waiting in the queue.
Hello dear Shankar.
Small update: after examining the output of hpc-jobs-history, the total CPU usage was about ~15%, which was not nice. I then installed mpiP to profile MPI communication and rule out MPI problems without recompiling. Despite a difficult install, it showed roughly 9-21% of time spent in MPI communication for my now-typical usage pattern: [number of k-points] processes x 2 threads each. If mpiP is correct (despite its inability to compile the Fortran tests), that is at least acceptable. With the preinstalled gcc+mkl+fftw+gsl+libxc+openmpi stack, the main thread of each MPI process always uses 100% CPU, but the additional threads work only intermittently for short periods and then often idle. The preinstalled FFTW has MPI enabled.
To mitigate this, I built my own OpenBLAS 3.21 without threads, FFTW without MPI, GSL and libxc (JDFTx with the preinstalled gcc + my OpenBLAS + my FFTW + my GSL + my libxc + the preinstalled OpenMPI). That already gives a constant 100% CPU load for all threads, which is nice.
The ElecMinimize time for 16 MPI processes (preinstalled gcc+mkl+fftw+gsl+libxc+openmpi) was 92 s, often crashing, possibly due to insufficient RAM. Increasing the number of threads usually did not help much, because with the preinstalled stack only the main threads of the MPI processes worked, with the other threads mostly idle.
The ElecMinimize time for 16 MPI processes x 2 threads = 32 CPUs (JDFTx with preinstalled gcc + my OpenBLAS + my FFTW + my GSL + my libxc + preinstalled OpenMPI) was 60 s, which is ~30% slower than the expected 46 s. All of this without any affinity masks, because allocating a whole node has prohibitively long waiting times.
This is quite a good result, if it holds up for other processes+threads arrangements. 1-9% or 10-38% of time is spent in MPI communication depending on the run, possibly due to other people's programs on the compute nodes.
My conclusion: if you use MPI/threading in your program, rebuild the libraries without their own MPI/threading, as it can wildly decrease your performance. It could, however, also be due to the change from MKL to OpenBLAS.
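For reference, a sketch of how such a non-threaded library stack might be configured (flag names are assumptions to verify against the build documentation of the respective versions; `ThreadedBLAS=no` is the JDFTx cmake setting Shankar mentioned earlier in this thread):

```shell
# OpenBLAS: single-threaded build, but safe to call from a threaded program.
make -C OpenBLAS USE_THREAD=0 USE_LOCKING=1

# FFTW: pthread support for JDFTx, simply without passing --enable-mpi.
(cd fftw && ./configure --enable-threads && make)

# JDFTx: let it manage all threading itself.
cmake -D ThreadedBLAS=no ../jdftx-git/jdftx
```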
I also have a question about profiling. Which profiling software do you use? How much does it slow down JDFTx?
Best wishes, Igor.
Thanks for the updates, Igor! In your comparison of the two cases with 16 MPI processes, with 1 and 2 threads/process, are these all physical cores? It is quite difficult to draw performance conclusions when comparing partial-node jobs shared with others, so that expected time of 46 s may not be appropriate for this scenario.
Regardless, using BLAS without threads and letting jdftx handle the threads is definitely the safer option to avoid overcommitting the cores.
Finally, as for profiling: I built a basic, lightweight profiler into JDFTx that can be enabled with EnableProfiling=yes during the cmake build. It only profiles a few top-level functions (hard-coded) to show where time is being spent, and so adds no noticeable overhead. I use the CUDA profiler to tune the CUDA kernels, but most of the time in jdftx is spent in the libraries, so I don't do very fine-grained CPU profiling (occasionally I use Linux's perf tool).
Best, Shankar
Dear Shankar, thank you very much for the comment.
Actually the cores mentioned are physical ones. I understand that other processes interfere quite noticeably within a node. At least I hope the performance is not bad. Often the waiting time in the queue is what dominates the total time to result. The problem is that the cluster I use was designed for pure MPI, not MPI+threads. The queue priorities therefore favour splitting a task into processes spread over, say, 8-10 nodes with only a few processes per node; it is like gathering leftovers from others. The better strategy for performance is to allocate whole nodes, or at least NUMA nodes or sockets, but the queue would have to be configured for that from the beginning, which is not the case. While allocating whole nodes is possible, the waiting time is so long that the game is not worth the candle.
I'll try profiling in JDFTx, it is interesting to see the results.
Best wishes, Igor.
Hello dear Shankar.
Despite some good recent experience with JDFTx, I ran into trouble with thread placement. I asked the local sysadmins for help, but after a long time playing with different options we did not manage to run the program properly. I run JDFTx on a cluster with Intel processors, managed by SLURM. There are 2 sockets per node and 2 NUMA nodes per socket = 4 NUMA nodes x 12 cores. JDFTx was compiled with gcc 11.2, MKL, OpenMPI, GSL, libxc and ScaLAPACK.
Symptoms: JDFTx starts the proper number of threads, but only a few cores are loaded; the others work for brief periods and then idle. Example: starting 4 MPI processes with 12 threads each leads to 4 loaded cores (# 0, 4, 24, 28) with the others idling. I tried non-threaded MKL, recompiling OpenMPI, and using --bind-to none in the mpiexec/mpirun arguments, but nothing changed the situation.
Short summary: MPI processes x 1 thread each - works. OpenMP threads only, without MPI - works. MPI + OpenMP - thread placement problem.
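The idle-thread symptom can be observed directly by listing the per-thread CPU usage and last-run processor of a running rank with procps (a sketch; substitute the actual jdftx PID, e.g. from `pgrep jdftx`):

```shell
# TID = thread id, PSR = processor the thread last ran on, %CPU per thread.
pid=$$        # hypothetical placeholder: replace with the PID of a jdftx rank
ps -L -p "$pid" -o tid,psr,pcpu,comm
```

If all threads of a rank show the same PSR, they are pinned to one core; if only the first thread accumulates %CPU, the workers are starved rather than mis-pinned.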
Below is list of libraries, to which JDFTx is linked.
ldd $(which jdftx)
linux-vdso.so.1 (0x00007ffd88fed000)
libjdftx.so => /net/software/testing/software/jdftx/1.7.0-foss-2021b-mkl/lib/libjdftx.so (0x00001522929c3000)
libmpi_cxx.so.40 => /net/software/testing/software/OpenMPI/4.1.1-GCC-11.2.0/lib/libmpi_cxx.so.40 (0x0000152293332000)
libmpi.so.40 => /net/software/testing/software/OpenMPI/4.1.1-GCC-11.2.0/lib/libmpi.so.40 (0x0000152293206000)
libgsl.so.25 => /net/software/testing/software/GSL/2.7-GCC-11.2.0/lib/libgsl.so.25 (0x000015229253b000)
libmkl_scalapack_lp64.so.1 => /net/software/testing/software/imkl/2021.4.0/mkl/2021.4.0/lib/intel64/libmkl_scalapack_lp64.so.1 (0x0000152291e0e000)
libmkl_gf_lp64.so.1 => /net/software/testing/software/imkl/2021.4.0/mkl/2021.4.0/lib/intel64/libmkl_gf_lp64.so.1 (0x0000152291270000)
libmkl_blacs_openmpi_lp64.so.1 => /net/software/testing/software/imkl/2021.4.0/mkl/2021.4.0/lib/intel64/libmkl_blacs_openmpi_lp64.so.1 (0x00001522931bc000)
libmkl_intel_lp64.so.1 => /net/software/testing/software/imkl/2021.4.0/mkl/2021.4.0/lib/intel64/libmkl_intel_lp64.so.1 (0x00001522906d1000)
libmkl_gnu_thread.so.1 => /net/software/testing/software/imkl/2021.4.0/mkl/2021.4.0/lib/intel64/libmkl_gnu_thread.so.1 (0x000015228eb46000)
libmkl_core.so.1 => /net/software/testing/software/imkl/2021.4.0/mkl/2021.4.0/lib/intel64/libmkl_core.so.1 (0x000015228a6d8000)
libgomp.so.1 => /net/software/testing/software/GCCcore/11.2.0/lib64/libgomp.so.1 (0x0000152293175000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x000015228a4b8000)
libxc.so.9 => /net/software/testing/software/libxc/5.1.6-GCC-11.2.0/lib/libxc.so.9 (0x0000152289b94000)
libstdc++.so.6 => /net/software/testing/software/GCCcore/11.2.0/lib64/libstdc++.so.6 (0x0000152289968000)
libm.so.6 => /lib64/libm.so.6 (0x00001522895e6000)
libgcc_s.so.1 => /net/software/testing/software/GCCcore/11.2.0/lib64/libgcc_s.so.1 (0x00001522895cc000)
libc.so.6 => /lib64/libc.so.6 (0x0000152289207000)
libopen-rte.so.40 => /net/software/testing/software/OpenMPI/4.1.1-GCC-11.2.0/lib/libopen-rte.so.40 (0x000015228914f000)
libopen-orted-mpir.so => /net/software/testing/software/OpenMPI/4.1.1-GCC-11.2.0/lib/libopen-orted-mpir.so (0x0000152293168000)
libopen-pal.so.40 => /net/software/testing/software/OpenMPI/4.1.1-GCC-11.2.0/lib/libopen-pal.so.40 (0x000015228909f000)
librt.so.1 => /lib64/librt.so.1 (0x0000152288e97000)
libutil.so.1 => /lib64/libutil.so.1 (0x0000152288c93000)
libhwloc.so.15 => /net/software/testing/software/hwloc/2.5.0-GCCcore-11.2.0/lib/libhwloc.so.15 (0x0000152288c38000)
libpciaccess.so.0 => /net/software/testing/software/libpciaccess/0.16-GCCcore-11.2.0/lib/libpciaccess.so.0 (0x000015229315c000)
libxml2.so.2 => /net/software/testing/software/libxml2/2.9.10-GCCcore-11.2.0/lib/libxml2.so.2 (0x0000152288aca000)
libdl.so.2 => /lib64/libdl.so.2 (0x00001522888c6000)
libz.so.1 => /net/software/testing/software/zlib/1.2.11-GCCcore-11.2.0/lib/libz.so.1 (0x00001522888ad000)
liblzma.so.5 => /net/software/testing/software/XZ/5.2.5-GCCcore-11.2.0/lib/liblzma.so.5 (0x0000152288885000)
libevent_core-2.1.so.7 => /net/software/testing/software/libevent/2.1.12-GCCcore-11.2.0/lib/libevent_core-2.1.so.7 (0x000015228884e000)
libevent_pthreads-2.1.so.7 => /net/software/testing/software/libevent/2.1.12-GCCcore-11.2.0/lib/libevent_pthreads-2.1.so.7 (0x000015228884a000)
/lib64/ld-linux-x86-64.so.2 (0x000015229312c000)