Update: I tried using just 1 MPI rank and 1 V100 by setting CUDA_VISIBLE_DEVICES=0.
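Roughly, the single-GPU run looked like this (a sketch only; it reuses the build path from the original command further below, and exact paths will differ per setup):

# Expose only the first V100 to the process and launch a single MPI rank
export CUDA_VISIBLE_DEVICES=0
mpirun -np 1 ../build_v100_nvhpc22.5/jdftx_gpu -i Neutral.in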
Debug stack trace:
printStack(bool) at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/core/Util.cpp:486
stackTraceExit(int) at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/core/Util.cpp:591
matrix::diagonalize(matrix&, diagMatrix&) const at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/core/matrixLinalg.cpp:70
std::vector<matrix, std::allocator<matrix> >::operator[](unsigned long) at /soft/packaging/spack-builds/linux-opensuse_leap15-x86_64/gcc-10.2.0/gcc-10.2.0-yudlyezca7twgd5o3wkkraur7wdbngdn/include/c++/10.2.0/bits/stl_vector.h:1046
(inlined by) ElecVars::LCAO() at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/electronic/ElecVars_LCAO.cpp:240
ElecVars::setup(Everything const&) at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/electronic/ElecVars.cpp:205
Everything::setup() at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/electronic/Everything.cpp:147
main at /gpfs/jlse-fs0/users/abagusetty/projects/JDFTX/jdftx/jdftx/jdftx.cpp:44
__libc_start_main at ??:?
_start at /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:122
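For reference, one way to confirm which cuSOLVER library the binary actually links against, and which nvcc it was built with, is standard Linux/CUDA tooling (not JDFTx-specific; the binary path is illustrative):

# Check the cuSOLVER shared library resolved at load time, and the compiler version
ldd ../build_v100_nvhpc22.5/jdftx_gpu | grep -i cusolver
nvcc --version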
Hi Abhishek,
This looks most likely like a bug or API inconsistency in the cuSOLVER library. For some reason, there have been a lot of API changes to cuSOLVER across the CUDA 11 series.
Could you try compiling against a slightly earlier CUDA version? You used 11.7; perhaps try one in the 11.1-11.4 range if available (an illustrative rebuild sketch follows below this note). Also, just in case, check whether these issues are specific to this calculation (unlikely) or occur in any JDFTx calculation.
Best, Shankar
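As an illustration of the suggestion above, rebuilding against an earlier toolkit might look roughly like this. This is only a sketch: the module name is site-specific, and EnableCUDA / CUDA_TOOLKIT_ROOT_DIR are the usual JDFTx/CMake knobs but should be checked against the JDFTx compilation docs for your version and source path:

# Load an earlier CUDA toolkit (any available version in the 11.1-11.4 range)
module load cuda/11.4
# Configure the GPU build against that toolkit and rebuild
cmake -D EnableCUDA=yes -D CUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME ../jdftx
make -j8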
Hi @shankar1729, thanks for getting back on this. You are correct: I did try CUDA 10.2.89 (the version most relevant to V100s) and the issue was resolved. I will also try CUDA versions 11.1-11.4, or whichever is the default on Perlmutter, to verify the above.
I will keep this issue open for the moment.
Hi!
I was trying to run this tutorial with the CUDA backend on 4 V100s per node and ran into some strange errors related to cuSolver Ztrtri. I am not sure whether the way I invoked mpirun binds the MPI ranks to the GPUs correctly, or whether jdftx handles that internally (see the binding sketch after the run command below). Just wondering if someone has come across the following error.
Ran with:
mpirun -np 4 ../build_v100_nvhpc22.5/jdftx_gpu -i Neutral.in
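Regarding the rank-to-GPU binding question above, one common way to make the mapping explicit is a small wrapper script. This is only a sketch assuming OpenMPI, which exports OMPI_COMM_WORLD_LOCAL_RANK (other launchers expose a different local-rank variable); jdftx_gpu may already select devices internally, in which case this is unnecessary:

#!/bin/bash
# bind_gpu.sh (hypothetical helper): expose exactly one GPU to each local MPI rank
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
exec "$@"

The run command would then become: mpirun -np 4 ./bind_gpu.sh ../build_v100_nvhpc22.5/jdftx_gpu -i Neutral.in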
Stack trace is below:
Build, dependency versions, etc.: