shankar1729 / jdftx

JDFTx: software for joint density functional theory
http://jdftx.org

Runtime error when using cuSolver #231

Closed: abagusetty closed this issue 1 year ago

abagusetty commented 2 years ago

Hi!

I was trying to run this tutorial with the CUDA backend on a node with 4 V100s, and I run into some strange errors from the cuSolver Ztrtri inversion routine. I am not sure whether the way I launched mpirun lets jdftx bind each MPI rank to its own GPU internally. Just wondering if someone has come across the following error:

---------- Setting up coulomb interaction ----------
Fluid mode embedding: using embedded box, but periodic Coulomb kernel.
(Fluid response is responsible for (approximate) separation between periodic images.)
Setting up double-sized grid for truncated Coulomb potentials:
R = 
[      5.23966     -2.61983            0  ]
[            0      4.53768            0  ]
[            0            0           72  ]
unit cell volume = 1711.86
G =
[    1.19916   0.692335         -0  ]
[          0    1.38467          0  ]
[          0         -0  0.0872665  ]
Chosen fftbox size, S = [  24  24  336  ]
Integer grid location selected as the embedding center:
   Grid: [  0  0  0  ]
   Lattice: [  0  0  0  ]
   Cartesian: [  0  0  0  ]
Constructing Wigner-Seitz cell: 8 faces (6 quadrilaterals, 2 hexagons)
Range-separation parameter for embedded mesh potentials due to point charges: 0.589462 bohrs.

Initializing DFT-D2 calculator for fluid / solvation:
    Pt:  C6:  815.23 Eh-a0^6  R0: 3.326 a0 (WARNING: beyond Grimme's data set)

---------- Setting up ewald sum ----------
Optimum gaussian width for ewald sums = 3.649540 bohr.
Real space sum over 1805 unit cells with max indices [  9  9  2  ]
Reciprocal space sum over 5103 terms with max indices [  4  4  31  ]

---------- Allocating electronic variables ----------
Initializing wave functions:  linear combination of atomic orbitals
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[gpu01:52967] PMIX ERROR: UNREACHABLE in file ../../../../../../../opal/mca/pmix/pmix2x/pmix/src/server/pmix_server.c at line 1741
[gpu01:52967] PMIX ERROR: UNREACHABLE in file ../../../../../../../opal/mca/pmix/pmix2x/pmix/src/server/pmix_server.c at line 1741
Diagonal entry 17186608 is zero in CuSolver inversion routine Ztrtri.

Ran with: `mpirun -np 4 ../build_v100_nvhpc22.5/jdftx_gpu -i Neutral.in`

Stack trace is below:

printStack(bool) at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/core/Util.cpp:486
stackTraceExit(int) at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/core/Util.cpp:591
ManagedMemory<int>::memFree() at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/core/ManagedMemory.h:166
 (inlined by) ManagedArray<int>::free() at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/core/ManagedMemory.h:305
 (inlined by) ManagedArray<int>::~ManagedArray() at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/core/ManagedMemory.h:334
 (inlined by) orthoMatrix(matrix const&) at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/core/matrixLinalg.cpp:495
ManagedMemory<complex>::memFree() at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/core/ManagedMemory.h:166
 (inlined by) ManagedMemory<complex>::~ManagedMemory() at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/core/ManagedMemory.h:61
 (inlined by) matrix::~matrix() at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/core/matrix.h:54
 (inlined by) ElecVars::orthonormalize(int, matrix*) at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/core/matrix.h:54
std::vector<ColumnBundle, std::allocator<ColumnBundle> >::operator[](unsigned long) at /soft/packaging/spack-builds/linux-opensuse_leap15-x86_64/gcc-10.2.0/gcc-10.2.0-yudlyezca7twgd5o3wkkraur7wdbngdn/include/c++/10.2.0/bits/stl_vector.h:1046
 (inlined by) ElecVars::LCAO() at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/electronic/ElecVars_LCAO.cpp:192
ElecVars::setup(Everything const&) at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/electronic/ElecVars.cpp:205
Everything::setup() at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/electronic/Everything.cpp:147
main at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/jdftx.cpp:44
__libc_start_main at ??:?
_start at /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:122
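
On the rank-to-GPU binding question above: independent of whatever device selection jdftx does internally, one way to make the mapping explicit from the launcher side is a small wrapper that pins one GPU per local rank. This is only a sketch; it assumes Open MPI (which exports OMPI_COMM_WORLD_LOCAL_RANK to each process) and uses a hypothetical helper script name:

```bash
#!/bin/bash
# gpu_bind.sh (hypothetical helper): expose exactly one GPU to each local MPI rank.
# Assumes Open MPI, which sets OMPI_COMM_WORLD_LOCAL_RANK before launching the binary.
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
exec "$@"
```

Used as `mpirun -np 4 ./gpu_bind.sh ../build_v100_nvhpc22.5/jdftx_gpu -i Neutral.in`, each rank then sees a single V100 as device 0.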

Build, dependency versions, etc.:

# module purge
# module use /soft/modulefiles/
# module load spack cmake/3.22.1-gcc-10.2.0-3srlojv gcc/10.2.0-gcc-10.2.0-yudlyez gsl/2.7-gcc-10.2.0-pzzpuit fftw/3.3.9-gcc-10.2.0-am5ig34
# module load public_mkl/2019 nvhpc/nvhpc/22.5

rm -rfv build_v100_nvhpc22.5
mkdir -p build_v100_nvhpc22.5
cd build_v100_nvhpc22.5

CC=`which gcc` CXX=`which g++` cmake \
  -DCUDA_cublas_LIBRARY=/soft/compilers/nvhpc/Linux_x86_64/22.5/math_libs/11.7/lib64/libcublas.so \
  -DCUDA_cufft_LIBRARY=/soft/compilers/nvhpc/Linux_x86_64/22.5/math_libs/11.7/lib64/libcufft.so \
  -D EnableCUDA=yes \
  -D EnableCuSolver=yes \
  -D CudaAwareMPI=yes \
  -D EnableProfiling=no \
  -D CUDA_ARCH=compute_70 \
  -D CUDA_CODE=sm_70 \
  -D GSL_PATH=${GSL_ROOT} \
  -D EnableMKL=yes \
  -D MKL_PATH=${MKLROOT} \
  -D ForceFFTW=yes \
  -D FFTW3_PATH=${FFTW_ROOT}/lib \
  ../jdftx/jdftx
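
As a quick sanity check on the resulting binary (a sketch; it assumes the jdftx_gpu executable sits at the top of this build directory, as in the mpirun commands above), one can list which CUDA math libraries and which MKL/FFTW it actually resolves at runtime:

```bash
# Run from inside build_v100_nvhpc22.5/: show the resolved math libraries.
ldd ./jdftx_gpu | grep -Ei 'cusolver|cublas|cufft|mkl|fftw'
```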
abagusetty commented 2 years ago

Update: I tried using just 1 MPI rank and 1 V100 by setting CUDA_VISIBLE_DEVICES=0.
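
For reference, a sketch of that single-rank, single-GPU reproduction (assuming the variable is exported in the shell before the run):

```bash
# single-rank, single-GPU reproduction of the run shown below
export CUDA_VISIBLE_DEVICES=0
mpirun -np 1 ../build_v100_nvhpc22.5/jdftx_gpu -i Neutral.in
```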

Output:

```
abagusetty@gpu02 /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/target_mu_tutorial $ mpirun -np 1 ../build_v100_nvhpc22.5/jdftx_gpu -i Neutral.in

*************** JDFTx 1.7.0 (git hash 1afe8502) ***************

Start date and time: Sat Jun 25 18:10:52 2022
Executable ../build_v100_nvhpc22.5/jdftx_gpu with command-line: -i Neutral.in
Running on hosts (process indices):  gpu02 (0)
Divided in process groups (process indices):  0 (0)
gpuInit: Found compatible cuda device 0 'Tesla V100-SXM2-32GB'
gpuInit: Selected device 0
Resource initialization completed at t[s]: 8.54
Run totals: 1 processes, 44 threads, 1 GPUs

Input parsed successfully to the following command list (including defaults):

basis kpoint-dependent
coords-type Lattice
core-overlap-check vector
coulomb-interaction Slab 001
coulomb-truncation-embed 0 0 0
davidson-band-ratio 1.1
dump End State BoundCharge
dump-name common.$VAR
elec-cutoff 20 100
elec-eigen-algo Davidson
elec-ex-corr gga-PBEsol
elec-smearing Fermi 0.01
electronic-minimize \ dirUpdateScheme FletcherReeves \ linminMethod DirUpdateRecommended \ nIterations 100 \ history 15 \ knormThreshold 0 \ energyDiffThreshold 1e-08 \ nEnergyDiff 2 \ alphaTstart 1 \ alphaTmin 1e-10 \ updateTestStepSize yes \ alphaTreduceFactor 0.1 \ alphaTincreaseFactor 3 \ nAlphaAdjustMax 3 \ wolfeEnergy 0.0001 \ wolfeGradient 0.9 \ fdTest no
electronic-scf \ nIterations 50 \ energyDiffThreshold 1e-08 \ residualThreshold 1e-07 \ mixFraction 0.5 \ qMetric 0.8 \ history 10 \ nEigSteps 2 \ eigDiffThreshold 1e-08 \ mixedVariable Density \ qKerker 0.8 \ qKappa -1 \ verbose no \ mixFractionMag 1.5
exchange-regularization WignerSeitzTruncated
fluid LinearPCM 298.000000 1.013250
fluid-anion F- 1 MeanFieldLJ \ epsBulk 1 \ pMol 0 \ epsInf 1 \ Pvap 0 \ sigmaBulk 0 \ Rvdw 2.24877 \ Res 0 \ tauNuc 343133
fluid-cation Na+ 1 MeanFieldLJ \ epsBulk 1 \ pMol 0 \ epsInf 1 \ Pvap 0 \ sigmaBulk 0 \ Rvdw 2.19208 \ Res 0 \ tauNuc 343133
fluid-ex-corr lda-TF lda-PZ
fluid-gummel-loop 10 1.000000e-05
fluid-minimize \ dirUpdateScheme PolakRibiere \ linminMethod DirUpdateRecommended \ nIterations 400 \ history 15 \ knormThreshold 1e-11 \ energyDiffThreshold 0 \ nEnergyDiff 2 \ alphaTstart 1 \ alphaTmin 1e-10 \ updateTestStepSize yes \ alphaTreduceFactor 0.1 \ alphaTincreaseFactor 3 \ nAlphaAdjustMax 6 \ wolfeEnergy 0.0001 \ wolfeGradient 0.9 \ fdTest no
fluid-solvent H2O 55.338 ScalarEOS \ epsBulk 78.4 \ pMol 0.92466 \ epsInf 1.77 \ Pvap 1.06736e-10 \ sigmaBulk 4.62e-05 \ Rvdw 2.61727 \ Res 1.42 \ tauNuc 343133 \ poleEl 15 7 1
forces-output-coords Positions
initial-state common.$VAR
ion Pt   0.333333000000000  -0.333333000000000  -0.237676000000000 1
ion Pt  -0.333333000000000   0.333333000000000  -0.118838000000000 1
ion Pt   0.000000000000000   0.000000000000000   0.000000000000000 1
ion Pt   0.333333000000000  -0.333333000000000   0.118838000000000 1
ion Pt  -0.333333000000000   0.333333000000000   0.237676000000000 1
ion-species GBRV/$ID_pbesol.uspp
ion-width Ecut
ionic-minimize \ dirUpdateScheme L-BFGS \ linminMethod DirUpdateRecommended \ nIterations 0 \ history 15 \ knormThreshold 0.0001 \ energyDiffThreshold 1e-06 \ nEnergyDiff 2 \ alphaTstart 1 \ alphaTmin 1e-10 \ updateTestStepSize yes \ alphaTreduceFactor 0.1 \ alphaTincreaseFactor 3 \ nAlphaAdjustMax 3 \ wolfeEnergy 0.0001 \ wolfeGradient 0.9 \ fdTest no
kpoint   0.000000000000   0.000000000000   0.000000000000  1.00000000000000
kpoint-folding 12 12 1
latt-move-scale 1 1 1
latt-scale 1 1 1
lattice Hexagonal 5.23966 36
lattice-minimize \ dirUpdateScheme L-BFGS \ linminMethod DirUpdateRecommended \ nIterations 0 \ history 15 \ knormThreshold 0 \ energyDiffThreshold 1e-06 \ nEnergyDiff 2 \ alphaTstart 1 \ alphaTmin 1e-10 \ updateTestStepSize yes \ alphaTreduceFactor 0.1 \ alphaTincreaseFactor 3 \ nAlphaAdjustMax 3 \ wolfeEnergy 0.0001 \ wolfeGradient 0.9 \ fdTest no
lcao-params -1 1e-06 0.01
pcm-variant CANDLE
spintype no-spin
subspace-rotation-factor 1 yes
symmetries automatic
symmetry-threshold 0.0001

---------- Setting up symmetries ----------
Found 24 point-group symmetries of the bravais lattice
Found 12 space-group symmetries with basis
Applied RMS atom displacement 2.70575e-06 bohrs to make symmetries exact.

---------- Initializing the Grid ----------
R = 
[ 5.23966 -2.61983 0 ]
[ 0 4.53768 0 ]
[ 0 0 36 ]
unit cell volume = 855.932
G =
[ 1.19916 0.692335 -0 ]
[ 0 1.38467 0 ]
[ 0 -0 0.174533 ]
Minimum fftbox size, Smin = [ 24 24 164 ]
Chosen fftbox size, S = [ 24 24 168 ]

---------- Initializing tighter grid for wavefunction operations ----------
R = 
[ 5.23966 -2.61983 0 ]
[ 0 4.53768 0 ]
[ 0 0 36 ]
unit cell volume = 855.932
G =
[ 1.19916 0.692335 -0 ]
[ 0 1.38467 0 ]
[ 0 -0 0.174533 ]
Minimum fftbox size, Smin = [ 24 24 148 ]
Chosen fftbox size, S = [ 24 24 150 ]

---------- Exchange Correlation functional ----------
Initalized PBEsol GGA exchange.
Initalized PBEsol GGA correlation.

---------- Setting up pseudopotentials ----------
Width of ionic core gaussian charges (only for fluid interactions / plotting) set to 0.397384

Reading pseudopotential file '/gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/build_v100_nvhpc22.5/pseudopotentials/GBRV/pt_pbesol.uspp':
  Title: Pt.  Created by USPP 7.3.6 on 22-9-15
  Reference state energy: -104.899005.
  16 valence electrons in orbitals:
    |510>  occupation: 6  eigenvalue: -2.158847
    |520>  occupation: 9.5  eigenvalue: -0.336311
    |600>  occupation: 0  eigenvalue: -0.320324
    |610>  occupation: 0  eigenvalue: -0.119800
  lMax: 2  lLocal: 3  QijEcut: 5.5
  6 projectors sampled on a log grid with 745 points:
    l: 0  eig: -0.320324  rCut: 2.45
    l: 0  eig: 1.500000  rCut: 2.45
    l: 1  eig: -2.158847  rCut: 1.6
    l: 1  eig: -0.119800  rCut: 1.6
    l: 2  eig: -0.500000  rCut: 1.6
    l: 2  eig: -0.336311  rCut: 1.6
  Partial core density with radius 1.1
  Transforming core density to a uniform radial grid of dG=0.02 with 1620 points.
  Transforming local potential to a uniform radial grid of dG=0.02 with 1620 points.
  Transforming nonlocal projectors to a uniform radial grid of dG=0.02 with 432 points.
  Transforming density augmentations to a uniform radial grid of dG=0.02 with 1620 points.
  Transforming atomic orbitals to a uniform radial grid of dG=0.02 with 432 points.
  Core radius for overlap checks: 2.45 bohrs.

Initialized 1 species with 5 total atoms.

Folded 1 k-points by 12x12x1 to 144 k-points.

---------- Setting up k-points, bands, fillings ----------
Reduced to 19 k-points under symmetry.
Computing the number of bands and number of electrons
Calculating initial fillings.
nElectrons:  80.000000   nBands: 60   nStates: 19

----- Setting up reduced wavefunction bases (one per k-point) -----
average nbasis = 3657.132 , ideal nbasis = 3656.607

---------- Setting up coulomb interaction ----------
Fluid mode embedding: using embedded box, but periodic Coulomb kernel.
(Fluid response is responsible for (approximate) separation between periodic images.)
Setting up double-sized grid for truncated Coulomb potentials:
R = 
[ 5.23966 -2.61983 0 ]
[ 0 4.53768 0 ]
[ 0 0 72 ]
unit cell volume = 1711.86
G =
[ 1.19916 0.692335 -0 ]
[ 0 1.38467 0 ]
[ 0 -0 0.0872665 ]
Chosen fftbox size, S = [ 24 24 336 ]
Integer grid location selected as the embedding center:
   Grid: [ 0 0 0 ]
   Lattice: [ 0 0 0 ]
   Cartesian: [ 0 0 0 ]
Constructing Wigner-Seitz cell: 8 faces (6 quadrilaterals, 2 hexagons)
Range-separation parameter for embedded mesh potentials due to point charges: 0.589462 bohrs.

Initializing DFT-D2 calculator for fluid / solvation:
    Pt:  C6:  815.23 Eh-a0^6  R0: 3.326 a0 (WARNING: beyond Grimme's data set)

---------- Setting up ewald sum ----------
Optimum gaussian width for ewald sums = 3.649540 bohr.
Real space sum over 1805 unit cells with max indices [ 9 9 2 ]
Reciprocal space sum over 5103 terms with max indices [ 4 4 31 ]

---------- Allocating electronic variables ----------
Initializing wave functions:  linear combination of atomic orbitals
Pt pseudo-atom occupations:   s ( 0 )  p ( 6 0 )  d ( 10 )
Relative hermiticity error of 1.880569e-04 (>1e-10) encountered in diagonalize

Stack trace:
   0: /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/build_v100_nvhpc22.5/libjdftx_gpu.so(_Z10printStackb+0x27) [0x7f259239a7a7]
   1: /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/build_v100_nvhpc22.5/libjdftx_gpu.so(_Z14stackTraceExiti+0xd) [0x7f259239ab9d]
   2: /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/build_v100_nvhpc22.5/libjdftx_gpu.so(_ZNK6matrix11diagonalizeERS_R10diagMatrix+0x7b5) [0x7f25923a8fd5]
   3: /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/build_v100_nvhpc22.5/libjdftx_gpu.so(_ZN8ElecVars4LCAOEv+0xe66) [0x7f2592464fe6]
   4: /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/build_v100_nvhpc22.5/libjdftx_gpu.so(_ZN8ElecVars5setupERK10Everything+0x1c59) [0x7f259245ae49]
   5: /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/build_v100_nvhpc22.5/libjdftx_gpu.so(_ZN10Everything5setupEv+0xf36) [0x7f25924876e6]
   6: ../build_v100_nvhpc22.5/jdftx_gpu(main+0x6c3) [0x40c393]
   7: /lib64/libc.so.6(__libc_start_main+0xef) [0x7f25511962bd]
   8: ../build_v100_nvhpc22.5/jdftx_gpu(_start+0x2a) [0x40d41a]
Writing 'jdftx-stacktrace' (for use with script printStackTrace): done.
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
```

Debug stack trace:

printStack(bool) at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/core/Util.cpp:486
stackTraceExit(int) at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/core/Util.cpp:591
matrix::diagonalize(matrix&, diagMatrix&) const at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/core/matrixLinalg.cpp:70
std::vector<matrix, std::allocator<matrix> >::operator[](unsigned long) at /soft/packaging/spack-builds/linux-opensuse_leap15-x86_64/gcc-10.2.0/gcc-10.2.0-yudlyezca7twgd5o3wkkraur7wdbngdn/include/c++/10.2.0/bits/stl_vector.h:1046
 (inlined by) ElecVars::LCAO() at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/electronic/ElecVars_LCAO.cpp:240
ElecVars::setup(Everything const&) at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/electronic/ElecVars.cpp:205
Everything::setup() at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/electronic/Everything.cpp:147
main at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/jdftx.cpp:44
__libc_start_main at ??:?
_start at /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:122
shankar1729 commented 2 years ago

Hi Abhishek,

This looks most likely like a bug / API inconsistency in the cuSolver library. For some reason, there have been lots of API changes in cuSolver over the CUDA 11 series.

Could you try compiling against slightly earlier CUDA versions? You used 11.7; perhaps try one in the 11.1 - 11.4 range, if available. Also, just in case, check whether these issues are specific to this calculation (unlikely) or show up for any jdftx calculation.
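
For that last check, a throwaway single-atom run could help separate a calculation-specific problem from a general cuSolver one. This is only a sketch: the commands are copied from the parsed command list in the log above, and the one-atom geometry is illustrative rather than the tutorial's:

```bash
# hypothetical sanity.in: tiny Pt test reusing commands that appear in the log above
cat > sanity.in <<'EOF'
lattice Hexagonal 5.23966 36
ion Pt 0.0 0.0 0.0  1
ion-species GBRV/$ID_pbesol.uspp
elec-ex-corr gga-PBEsol
elec-cutoff 20 100
elec-smearing Fermi 0.01
kpoint-folding 2 2 1
dump End State
EOF
mpirun -np 1 ../build_v100_nvhpc22.5/jdftx_gpu -i sanity.in
```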

Best, Shankar

abagusetty commented 2 years ago

Hi @shankar1729, thanks for getting back on this. You are correct: I tried CUDA 10.2.89, which is the most relevant version for V100s, and the issue was resolved. I will try CUDA versions 11.1-11.4, or whichever is the default on Perlmutter, to verify the above.
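
For reference, a rebuild against one of those toolkits might look roughly like this (a sketch only: the toolkit path is a placeholder, the use of FindCUDA's CUDA_TOOLKIT_ROOT_DIR to select it is an assumption, and the GSL/MKL/FFTW flags stay as in the original build above):

```bash
mkdir -p build_v100_cuda11.x && cd build_v100_cuda11.x
# Placeholder toolkit path; keep the GSL/MKL/FFTW options from the original cmake call.
CC=$(which gcc) CXX=$(which g++) cmake \
  -D EnableCUDA=yes \
  -D EnableCuSolver=yes \
  -D CUDA_ARCH=compute_70 \
  -D CUDA_CODE=sm_70 \
  -D CUDA_TOOLKIT_ROOT_DIR=/path/to/cuda-11.x \
  ../jdftx/jdftx
make -j8
```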

I will keep this issue open for the moment.

shankar1729 commented 1 year ago

Fixed by https://github.com/shankar1729/jdftx/commit/af38c4254be73ed7fb061bae8037fd1ca78c138c