Closed nicolin closed 1 month ago
A couple of suggestions at this point:
- Try setting NVSHMEM_DISABLE_CUDA_VMM=1 (see https://docs.nvidia.com/nvshmem/api/gen/env.html).
- If none of those work, please share the output with NVSHMEM_DEBUG set (see https://docs.nvidia.com/nvshmem/api/gen/env.html).
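In case it helps, here is a minimal sketch of how those two settings might be passed to a run. Assumptions that are not from this thread: the value `INFO` for NVSHMEM_DEBUG (the linked page lists the accepted values), Open MPI's `-x` option to forward the variables to every rank, and the placeholder binary name `./hmc`. The command is only printed, so it can be checked before launching.

```shell
# Settings suggested above. INFO is an assumed debug level; check the
# environment-variable page linked in this thread for the accepted values.
export NVSHMEM_DISABLE_CUDA_VMM=1
export NVSHMEM_DEBUG=INFO

# Placeholder for the real application and its arguments (hypothetical).
app=./hmc

# With Open MPI, "-x VAR" forwards an already exported variable to all ranks.
launch_cmd="mpirun -np 4 -x NVSHMEM_DISABLE_CUDA_VMM -x NVSHMEM_DEBUG $app"

# Print the command instead of running it.
echo "$launch_cmd"
```

Other launchers (for example `srun`) forward the environment differently, so the `-x` part may not be needed there.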
Updating to 3.0 fixed it. Thanks
NVSHMEM built and installed in the home folder as per the instructions.
QUDA built with NVSHMEM:
cmake -DQUDA_GPU_ARCH=sm_80 \
  -DQUDA_BUILD_SHAREDLIB=ON \
  -DQUDA_BUILD_ALL_TESTS=OFF \
  -DQUDA_DIRAC_DEFAULT_OFF=ON \
  -DQUDA_DIRAC_CLOVER=ON \
  -DQUDA_DIRAC_WILSON=ON \
  -DQUDA_MULTIGRID=ON \
  -DQUDA_MAX_MULTI_BLAS_N=9 \
  -DQUDA_INTERFACE_MILC=OFF \
  -DQUDA_INTERFACE_CPS=OFF \
  -DQUDA_INTERFACE_TIFR=OFF \
  -DQUDA_QIO=OFF \
  -DQUDA_QDPJIT=OFF \
  -DQUDA_INTERFACE_QDPJIT=OFF \
  -DQUDA_INTERFACE_QDP=OFF \
  -DQUDA_QMP=OFF \
  -DQUDA_MPI=ON \
  -DQUDA_NVSHMEM=ON \
  -DQUDA_NVSHMEM_HOME=/home/dp006/dp006/dc-gove1/buildNVshem/NVSHEM \
  -DQMP_DIR=/home/dp006/dp006/dc-gove1/BenchOct/FASTSUM/QCDSolvers/libs/QMP/lib/cmake/QMP \
  -DLLVM_DIR=/home/dp006/dp006/dc-gove1/BenchOct/FASTSUM/QCDSolvers/libs/LLVM14/lib/cmake/llvm \
  -DLIBXML2_INCLUDE_DIR=/home/dp006/dp006/dc-gove1/BenchOct/FASTSUM/QCDSolvers/libs/LLVM14/include/llvm \
  -DLIBXML2_LIBRARY=/home/dp006/dp006/dc-gove1/BenchOct/FASTSUM/QCDSolvers/libs/libxml2/lib64/libxml2.so \
  -DLIBXML2_INCLUDE_DIR=/home/dp006/dp006/dc-gove1/BenchOct/FASTSUM/QCDSolvers/libs/libxml2/include/libxml2/libxml/ \
  -DQDPXX_DIR=/home/dp006/dp006/dc-gove1/BenchOct/FASTSUM/QCDSolvers/libs/QDPJIT/lib/cmake/QDPXX \
  -DCMAKE_BUILD_TYPE=RELEASE \
  -DCMAKE_INSTALL_PREFIX=/home/dp006/dp006/dc-gove1/BenchOct/FASTSUM/QCDSolvers/libs/QUDA_MPINVSHEM \
  -DQUDA_RESOURCE_PATH=/home/dp006/dp006/dc-gove1/BenchOct/Cases/QUDACACHE \
  -S /home/dp006/dp006/dc-gove1/BenchOct/FASTSUM/QCDSolvers/quda
-- Found Git: /usr/bin/git (found version "2.39.3")
--
-- QUDA 1.1.0 (882179286) **
-- cmake version: 3.27.4
-- Source location: /home/dp006/dp006/dc-gove1/BenchOct/FASTSUM/QCDSolvers/quda
-- Build location: /home/dp006/dp006/dc-gove1/BenchOct/FASTSUM/QCDSolvers/buildQUDA_MPINVSHEM
-- Build type: RELEASE
-- QUDA target: CUDA
-- The CXX compiler identification is GNU 9.3.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /mnt/lustre/tursafs1/apps/gcc/9.3.0/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- The C compiler identification is GNU 9.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /mnt/lustre/tursafs1/apps/gcc/9.3.0/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- CPM: Adding package Eigen@3.4.0 (3.4.0)
-- Found MPI_C: /mnt/lustre/tursafs1/apps/basestack/cuda-12.3/openmpi/4.1.5-cuda12.3-slurm/lib/libmpi.so (found version "3.1")
-- Found MPI_CXX: /mnt/lustre/tursafs1/apps/basestack/cuda-12.3/openmpi/4.1.5-cuda12.3-slurm/lib/libmpi_cxx.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- QUDA_mdw_fused_Ls=4,8,12,16,20
-- QUDA_MULTIGRID_NVEC_LIST=6,24,32
-- QUDA_MULTIGRID_MRHS_LIST=16
-- Found CUDAToolkit: /mnt/lustre/tursafs1/apps/cuda/12.3/include (found version "12.3.107")
-- The CUDA compiler identification is NVIDIA 12.3.107
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /mnt/lustre/tursafs1/apps/cuda/12.3/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- CUDA Compiler is/mnt/lustre/tursafs1/apps/cuda/12.3/bin/nvcc
-- Compiler ID is NVIDIA
-- CUDA Build Type: NVCC
-- Large kernel arguments supported: ON
-- Max number of rhs per kernel: 64
-- Heterogeneous atomics supported: ON
-- QUDA_MULTIGRID_MRHS_LIST=16
-- Performing Test QUDA_LINKER_COMPRESS
-- Performing Test QUDA_LINKER_COMPRESS - Success
-- Performing Test QUDA_COMPRESS_DEBUG
-- Performing Test QUDA_COMPRESS_DEBUG - Success
-- ctest will run on 1 processes
-- Configuring done (27.9s)
-- Generating done (2.8s)
When running using Slurm on Tursa (DiRAC UK):
filename: /lib/modules/4.18.0-477.27.1.el8_8.x86_64/extra/nv_peer_mem.ko
version: 1.3-0
license: Dual BSD/GPL
description: NVIDIA GPU memory plug-in
author: Yishai Hadas
rhelversion: 8.8
srcversion: CDFFB29AC90806C2FD1E591
depends: ib_core,nvidia
name: nv_peer_mem
vermagic: 4.18.0-477.27.1.el8_8.x86_64 SMP mod_unload modversions
parm: enable_dbg:enable debug tracing (int)
mpirun -np 4 /home/dp006/dp006/dc-gove1/BenchOct/FASTSUM/QCDSolvers/buildCHROMA_QUDA_MPINVSHEM/mainprogs/main/hmc -L :/home/dp006/dp006/dc-gove1/buildNVshem/NVSHEM/lib -lnvshmem_host -lnvshmem_device -i /home/dp006/dp006/dc-gove1/BenchOct/Cases/24x24x24x128/SimpleCase.xml -o /home/dp006/dp006/dc-gove1/BenchOct/Results/SimpleCase_NVSHEM_NGPU_4.xml -geom 1 1 1 4
WARNING: Init NVSHMEM
/home/dp006/dp006/dc-gove1/nvshmem_src_2.11.0-5/src/host/init/init.cu:nvshmemi_check_state_and_init:1062: nvshmem initialization failed, exiting
/home/dp006/dp006/dc-gove1/nvshmem_src_2.11.0-5/src/host/init/init.cu:nvshmemi_check_state_and_init:1062: nvshmem initialization failed, exiting
/home/dp006/dp006/dc-gove1/nvshmem_src_2.11.0-5/src/host/init/init.cu:nvshmemi_check_state_and_init:1062: nvshmem initialization failed, exiting
/home/dp006/dp006/dc-gove1/nvshmem_src_2.11.0-5/src/host/init/init.cu:nvshmemi_check_state_and_init:1062: nvshmem initialization failed, exiting
/home/dp006/dp006/dc-gove1/nvshmem_src_2.11.0-5/src/util/cs.cpp:23: non-zero status: 16: No such file or directory, exiting... mutex destroy failed
/home/dp006/dp006/dc-gove1/nvshmem_src_2.11.0-5/src/util/cs.cpp:23: non-zero status: 16: File exists, exiting... mutex destroy failed
/home/dp006/dp006/dc-gove1/nvshmem_src_2.11.0-5/src/util/cs.cpp:23: non-zero status: 16: File exists, exiting... mutex destroy failed
/home/dp006/dp006/dc-gove1/nvshmem_src_2.11.0-5/src/util/cs.cpp:23: non-zero status: 16: File exists, exiting... mutex destroy failed
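The source paths in these errors show which NVSHMEM tree the binary was built against (`nvshmem_src_2.11.0-5`). As a rough sketch, and assuming the source directory keeps the `nvshmem_src_<version>` naming seen in this log, the version could be pulled out of a log line like this:

```shell
# One of the error lines from the log above, copied verbatim.
line='/home/dp006/dp006/dc-gove1/nvshmem_src_2.11.0-5/src/host/init/init.cu:nvshmemi_check_state_and_init:1062: nvshmem initialization failed, exiting'

# Capture the dotted version that follows "nvshmem_src_" in the path.
version=$(printf '%s\n' "$line" | sed -n 's/.*nvshmem_src_\([0-9][0-9.]*\).*/\1/p')

# Keep only the part before the first dot (the major version).
major=${version%%.*}

echo "NVSHMEM source version: $version (major $major)"

# In this thread, moving to 3.0 resolved the failure.
if [ "$major" -lt 3 ]; then
  echo "Built against a pre-3.0 NVSHMEM"
fi
```

This only inspects the path embedded in the message; it does not query the installed library.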