lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

infinite hang when autotuning is disabled #1477

Closed amwe210 closed 4 months ago

amwe210 commented 5 months ago

With QUDA_ENABLE_TUNING set to 0, the program hangs without an error after the output file prints "MAKING PATH TABLES". When I enable tuning and set the cache file location with QUDA_RESOURCE_PATH, it instead hangs without an error after printing "cublasCreated successfully". While it is hanging, the program is actively using CPUs but not any GPU compute. Is anyone else seeing this?

maddyscientist commented 5 months ago

We haven't seen this issue. Hopefully this is easy to fix.

amwe210 commented 5 months ago

Thanks for the quick reply. I'm using two NVIDIA A100X GPUs with the compilers and libraries from the NVIDIA HPC SDK 24.1, gcc 11.4.0, and Ubuntu 22.04.4.

I'm running the MILC spectrum code from the NERSC10 Lattice QCD benchmark. My run script is:

export QUDA_ENABLE_TUNING=0
mpirun --mca btl_tcp_if_include ibs2 -np 2 -host ${HOST1},${HOST2} -x LD_LIBRARY_PATH ./ks_spectrum_hisq ./input_4864

cmake command:

cmake \
  -G "Unix Makefiles" \
  -DCMAKE_BUILD_TYPE=RELEASE \
  -DCMAKE_CXX_COMPILER=/opt/nvidia/hpc_sdk/Linux_x86_64/24.1/comm_libs/mpi/bin/mpiCC \
  -DCMAKE_C_COMPILER=/opt/nvidia/hpc_sdk/Linux_x86_64/24.1/comm_libs/mpi/bin/mpicc \
  -DCMAKE_Fortran_COMPILER=/opt/nvidia/hpc_sdk/Linux_x86_64/24.1/comm_libs/mpi/bin/mpifort \
  -DCMAKE_CUDA_COMPILER=/opt/nvidia/hpc_sdk/Linux_x86_64/24.1/compilers/bin/nvcc \
  -DQUDA_GPU_ARCH=sm_80 \
  -DQUDA_DIRAC_DEFAULT_OFF=ON \
  -DQUDA_DIRAC_STAGGERED=ON \
  -DQUDA_FORCE_HISQ=ON \
  -DQUDA_FORCE_GAUGE=ON \
  -DQUDA_MPI=ON \
  -DCMAKE_INSTALL_PREFIX=${QUDA_INSTALL_PREFIX} \
  ../

maddyscientist commented 4 months ago

Thanks for the info @amwe210. I think the next thing to do is to work out where it's hanging, and to confirm whether it's hanging in QUDA or in MILC. On a hanging job, can you attach gdb to the process and get a backtrace?
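For reference, one way to collect such a backtrace might look like the sketch below. The helper name `dump_backtraces` is hypothetical and is not part of QUDA or MILC; it only wraps gdb's standard `-p`, `-batch`, and `-ex` options. It assumes gdb is installed on the node where the rank is running, and on Ubuntu attaching to an already running process may need elevated privileges, depending on the system's ptrace settings.

```shell
# Hypothetical helper (not part of QUDA or MILC): print a backtrace of
# every thread in each process ID given as an argument, then detach.
dump_backtraces() {
  local pid
  for pid in "$@"; do
    echo "=== backtrace for PID ${pid} ==="
    # -batch makes gdb run the -ex command, detach, and exit on its own.
    gdb -p "${pid}" -batch -ex "thread apply all bt" 2>&1
  done
}

# Example usage on a node with a hanging rank. Note that pgrep -f matches
# the full command line, so it may also pick up the mpirun launcher:
#   dump_backtraces $(pgrep -f ks_spectrum_hisq)
```

For a multi-node job this would need to be run on each host, since gdb can only attach to local processes. Frames from QUDA versus frames from MILC in the output should indicate which side is stuck.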

What network are you running on?

Is this a regression versus a prior known good version of QUDA?

amwe210 commented 4 months ago

@maddyscientist I was able to find the problem, and it was unrelated to the QUDA install. There was an issue with conflicting MPI libraries on my system. Thank you for your assistance. I will close this issue, since QUDA is working as expected.
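For anyone who lands here with a similar hang, a quick check for mismatched MPI installations might look like the sketch below. This is only an assumption about how such a conflict could be spotted, not the procedure used in this thread. The function name `check_mpi_consistency` is hypothetical; it relies only on the standard `command -v` and `ldd` tools.

```shell
# Hypothetical helper: show which mpirun is first on PATH and which MPI
# shared libraries the dynamic loader would resolve for a given binary.
check_mpi_consistency() {
  local binary="$1"
  echo "mpirun on PATH: $(command -v mpirun || echo 'not found')"
  echo "MPI libraries resolved for ${binary}:"
  # ldd lists the shared libraries as resolved with the current
  # LD_LIBRARY_PATH; keep only the lines that mention MPI.
  ldd "${binary}" | grep -i mpi || echo "  (none found)"
}

# Example usage:
#   check_mpi_consistency ./ks_spectrum_hisq
```

If the launcher and the resolved libraries come from different install prefixes (for example, a system MPI versus the one bundled under /opt/nvidia/hpc_sdk), that would be one sign of a conflict like this.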