shankar1729 / jdftx

JDFTx: software for joint density functional theory
http://jdftx.org

Issue installing on Lassen at LLNL #292

Closed nkeilbart closed 1 year ago

nkeilbart commented 1 year ago

Hello,

I was asked to help some colleagues get JDFTx installed at LLNL. I have successfully built and run it on our Quartz and Ruby clusters, but I am having difficulty compiling it on our GPU cluster, Lassen. Here is the build script I am using:

git checkout v1.7.0

module load gcc/7.3.1 cmake/3.23.1 lapack/3.9.0-gcc-7.3.1 cuda/11.6.1 spectrum-mpi/rolling-release

mkdir build
cd build
CC=gcc CXX=g++ cmake \
  -D CMAKE_POLICY_DEFAULT_CMP0074=NEW \
  -D EnableCUDA=yes \
  -D EnableCuSolver=yes \
  -D CudaAwareMPI=yes \
  -D CUDA_ARCH=compute_70 \
  -D CUDA_CODE=sm_70 \
  -D EnableProfiling=yes \
  -D LAPACK_LIBRARIES="${LAPACK_DIR}/liblapack.so" \
  -D CMAKE_LIBRARY_PATH="${LD_LIBRARY_PATH//:/;}" \
  ../jdftx
make -j
make test

This compiles fully without any issues, and the tests start passing, although I believe they exercise the CPU version rather than the GPU one. This is the configuration output:

-- The C compiler identification is GNU 7.3.1
-- The CXX compiler identification is GNU 7.3.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/tce/packages/gcc/gcc-7.3.1/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/tce/packages/gcc/gcc-7.3.1/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/tcetmp/bin/git (found version "2.29.1")
-- Git revision hash: 1f477cce
-- Looking for gsl_integration_glfixed_point
-- Looking for gsl_integration_glfixed_point - found
-- Found GSL: /lib64/libgsl.so
-- Looking for pthread.h
CMake Warning (dev) at /usr/tce/packages/cmake/cmake-3.23.1/share/cmake/Modules/CheckIncludeFile.cmake:82 (message):
  Policy CMP0075 is not set: Include file check macros honor CMAKE_REQUIRED_LIBRARIES. Run "cmake --help-policy CMP0075" for policy details. Use the cmake_policy command to set the policy and suppress this warning.

  CMAKE_REQUIRED_LIBRARIES is set to:

    /lib64/libgsl.so;/lib64/libgslcblas.so;;m

  For compatibility with CMake 3.11 and below this check is ignoring it.
Call Stack (most recent call first):
  /usr/tce/packages/cmake/cmake-3.23.1/share/cmake/Modules/FindThreads.cmake:146 (CHECK_INCLUDE_FILE)
  CMakeLists.txt:69 (find_package)
This warning is for project developers. Use -Wno-dev to suppress it.

-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found FFTW3: /lib64/libfftw3_threads.so /lib64/libfftw3.so
-- Found LAPACK: /usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so
-- Found CBLAS: /usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libcblas.so
-- Found MPI_C: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpiprofilesupport.so (found version "3.1")
-- Found MPI_CXX: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpiprofilesupport.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- Performing Test HAS_NO_UNUSED_RESULT
-- Performing Test HAS_NO_UNUSED_RESULT - Success
-- Performing Test HAS_TEMPLATE_DEPTH
-- Performing Test HAS_TEMPLATE_DEPTH - Success
-- Found CUDA: /usr/tce/packages/cuda/cuda-11.6.1/nvidia (found version "11.6")
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- CUDA_LIBRARIES = /usr/tce/packages/cuda/cuda-11.6.1/nvidia/lib64/libcudart_static.a;Threads::Threads;dl;/usr/lib64/librt.so;/usr/tce/packages/cuda/cuda-11.6.1/nvidia/lib64/libcublas.so;/usr/tce/packages/cuda/cuda-11.6.1/nvidia/lib64/libcufft.so;/usr/tce/packages/cuda/cuda-11.6.1/lib64/libcublasLt.so;/usr/tce/packages/cuda/cuda-11.6.1/lib64/libcusolver.so;-fopenmp
-- CUDA_NVCC_FLAGS = -D_FORCE_INLINES;;-arch=compute_70;-code=sm_70;-DGPU_ENABLED;--compiler-options;-fpic
-- Found Doxygen: /usr/bin/doxygen (found version "1.8.5") found components: doxygen dot
-- Configuring done
-- Generating done

When I attempt to run the binary on some test files they provided, I get the following output from the stack trace.

/usr/gapps/qsg/codes/jdftx/lassen/v1.7.0/build/libjdftx_gpu.so(_Z10printStackb+0x4c) [0x20000037456c]
/usr/gapps/qsg/codes/jdftx/lassen/v1.7.0/build/libjdftx_gpu.so(_Z14stackTraceExiti+0x24) [0x2000003748f4]
/usr/gapps/qsg/codes/jdftx/lassen/v1.7.0/build/libjdftx_gpu.so(_Z15sigErrorHandleri+0x78) [0x2000003749a8]
[0x2000000504d8]
[0x1]
/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/pami_port/libcollectives.so.3(_ZN4CCMI9Protocols9Allreduce18HybridAsyncReduceTINS_17ConnectionManager14CommSeqConnMgrENS_8Executor13HybridReduceTIS4_EEE5startEv+0x1b8) [0x20003bd6f008]
/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/pami_port/libcollectives.so.3(_ZN4CCMI9Protocols9Allreduce22AsyncAllreduceFactoryTINS1_18HybridAsyncReduceTINS_17ConnectionManager14CommSeqConnMgrENS_8Executor13HybridReduceTIS5_EEEES5_16libcoll_reduce_tXadL_ZN7LibColl7Adapter6getKeyEjjjPPvEEE8generateESDSD+0x264) [0x20003bd35904]
/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/pami_port/libcollectives.so.3(_ZN7LibColl7Adapter20StaticCollSelAdviser10autoSelectE19libcoll_xfer_type_tP14libcoll_xfer_tb+0x1d0) [0x20003bcbdd30]
/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/pami_port/libcollectives.so.3(LIBCOLL_AutoSelect+0x58) [0x20003bca3a18]
/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/spectrum_mpi/mca_coll_ibm.so(start_libcoll_blocking_collective+0x3d8) [0x20003bbad428]
/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/spectrum_mpi/mca_coll_ibm.so(mca_coll_ibm_reduce+0x1f8) [0x20003bbb4a58]
/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/spectrum_mpi/mca_coll_basic.so(mca_coll_basic_allreduce_intra+0x108) [0x20003bb23368]
/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/spectrum_mpi/mca_coll_cuda.so(mca_coll_cuda_allreduce+0x100) [0x20003bb51f90]
/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/spectrum_mpi/mca_coll_ibm.so(mca_coll_ibm_allreduce+0x510) [0x20003bbaf490]
/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpi_ibm.so.3(PMPI_Allreduce+0x250) [0x200000eb2630]
/usr/gapps/qsg/codes/jdftx/lassen/v1.7.0/build/libjdftx_gpu.so(_ZNK6ExCorrclERKSt6vectorISt10shared_ptrI15ScalarFieldDataESaIS3_EEPS5_10IncludeTXCPS6_S8_P7matrix3IdE+0xeec) [0x20000044d63c]
/usr/gapps/qsg/codes/jdftx/lassen/v1.7.0/build/libjdftx_gpu.so(_ZNK6ExCorrclERKSt10shared_ptrI15ScalarFieldDataEPS2_10IncludeTXCPS3_S5_P7matrix3IdE+0x230) [0x20000044ffb0]
/usr/gapps/qsg/codes/jdftx/lassen/v1.7.0/build/libjdftx_gpu.so(_ZN7IonInfo6updateER8Energies+0x3dc) [0x20000047a63c]
/usr/gapps/qsg/codes/jdftx/lassen/v1.7.0/build/libjdftx_gpu.so(_ZN10Everything5setupEv+0xb94) [0x2000004419b4]
/usr/gapps/qsg/codes/jdftx/lassen/v1.7.0/build/jdftx_gpu(main+0x5d4) [0x10004044]
/lib64/libc.so.6(+0x25300) [0x200038955300]
/lib64/libc.so.6(__libc_start_main+0xc4) [0x2000389554f4]

Let me know what other information I can provide. Thanks.

Nathan

shankar1729 commented 1 year ago

Hi Nathan,

Does the Spectrum MPI linked here work correctly with GPU memory? That is the most likely cause of this error: a problem in the CUDA-aware MPI support.

Sometimes the CUDA support is partial: it works correctly for point-to-point communication but is broken for collectives. See if adding --mca pml ob1 to the mpirun command line fixes this.
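For reference, a minimal invocation would look something like the line below (the process count and input file name are just placeholders for whatever you normally run):

mpirun --mca pml ob1 -np 4 jdftx_gpu -i test.in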

Best, Shankar

nkeilbart commented 1 year ago

Hi Shankar,

I will give that a try, but I'm not sure whether it will work. Lassen has a specialized run command, lrun, that takes care of much of the resource management, so I'll have to see how that option fits in with it.
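One thing I may try, assuming Spectrum MPI honors the usual Open MPI convention of reading MCA parameters from the environment, is setting the parameter that way instead of on the command line, roughly like this (the task count and input file are just placeholders):

export OMPI_MCA_pml=ob1
lrun -n 4 jdftx_gpu -i test.in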

If it matters, other GPU-enabled codes like VASP and QE have not had issues with this particular MPI module either.

Nathan

shankar1729 commented 1 year ago

Hi Nathan,

From what I know, the GPU implementations in VASP and QE are somewhat more localized than ours and may not perform the full range of MPI operations directly on GPU memory. In particular, support for collective MPI operations on GPU buffers still seems spotty in some MPI implementations, and it could be that VASP/QE make these calls on CPU memory instead.

Best, Shankar

nkeilbart commented 1 year ago

Hi Shankar,

I was able to dig a little deeper, and it seems the build was correctly finding the right CUDA libraries. The issue appears to be the CUDA version the MPI itself was compiled against: I saw it was built against an older 10.1 release, so I loaded that module instead when compiling. That seems to have done the trick for now. Thanks for the response.
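For reference, the only change from the build script above was the cuda module; the load line now looks roughly like this (the exact module version string for the 10.1 toolkit on Lassen is approximate here):

module load gcc/7.3.1 cmake/3.23.1 lapack/3.9.0-gcc-7.3.1 cuda/10.1 spectrum-mpi/rolling-release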

Nathan