lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

QUDA doesn't link against UCX when NVSHMEM with UCX is used resulting in undefined references to ucp_* #1227

Open robert-mijakovic opened 2 years ago

robert-mijakovic commented 2 years ago

I'm compiling QUDA 1.1.0 using GCC 10.3.0, CUDA 11.3.1, OpenMPI, (external) Eigen 3.3.9, and (external) NVSHMEM 2.4.1 on CentOS 8.4. The build is configured with:

 cmake <other-options> -DQUDA_GPU_ARCH=sm_80 -DQUDA_MPI=ON -DQUDA_NVSHMEM=ON -DQUDA_NVSHMEM_HOME=$EBROOTNVSHMEM -DQUDA_DOWNLOAD_EIGEN=OFF -DPROPAGATED_FLAGS=" " -DMPIEXEC_EXECUTABLE="$(which srun)"

The build fails in the linking phase with undefined references to ucp_* symbols. NVSHMEM is compiled with the UCX transport layer.

$ /apps/GCCcore/10.3.0/bin/g++ -L/apps/NVSHMEM/2.4.1-gompi-2021a-CUDA-11.3.1/lib64 -L/apps/NVSHMEM/2.4.1-gompi-2021a-CUDA-11.3.1/lib -L/apps/CUDA/11.3.1/lib64 -L/apps/CUDA/11.3.1/lib -L/apps/Python/3.9.5-GCCcore-10.3.0/lib64 -L/apps/Python/3.9.5-GCCcore-10.3.0/lib -L/apps/FFTW/3.3.9-gompi-2021a/lib64 -L/apps/FFTW/3.3.9-gompi-2021a/lib -L/apps/ScaLAPACK/2.1.0-gompi-2021a-fb/lib64 -L/apps/ScaLAPACK/2.1.0-gompi-2021a-fb/lib -L/apps/FlexiBLAS/3.0.4-GCC-10.3.0/lib64 -L/apps/FlexiBLAS/3.0.4-GCC-10.3.0/lib -L/apps/GCCcore/10.3.0/lib64 -L/apps/GCCcore/10.3.0/lib -Wl,-rpath -Wl,/apps/hwloc/2.4.1-GCCcore-10.3.0/lib -Wl,-rpath -Wl,/usr/lib64 -Wl,-rpath -Wl,/apps/OpenMPI/4.1.1-GCC-10.3.0/lib -Wl,--enable-new-dtags -L/mnt/tier2/apps/hwloc/2.4.1-GCCcore-10.3.0/lib -L/usr/lib64 -L/mnt/tier2/apps/OpenMPI/4.1.1-GCC-10.3.0/lib CMakeFiles/hisq_stencil_test.dir/hisq_stencil_test.cpp.o -o hisq_stencil_test  -Wl,-rpath,/dev/shm/QUDA/1.1.0/foss-2021a-CUDA-11.3.1/easybuild_obj/lib::::::::::::::::::::::: libquda_test.a ../lib/libquda.so /apps/CUDA/11.3.1/lib/libcudart_static.a -lpthread -ldl /usr/lib64/librt.so /usr/lib64/libcuda.so /apps/CUDA/11.3.1/lib/libcublas.so /apps/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so -L/apps/NVSHMEM/2.4.1-gompi-2021a-CUDA-11.3.1/lib -lnvshmem  -L"/apps/CUDA/11.3.1/targets/x86_64-linux/lib/stubs" -L"/apps/CUDA/11.3.1/targets/x86_64-linux/lib"
../lib/libquda.so: error: undefined reference to 'ucp_rkey_destroy'
../lib/libquda.so: error: undefined reference to 'ucp_worker_flush_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_rkey_buffer_release'
../lib/libquda.so: error: undefined reference to 'ucp_rkey_pack'
../lib/libquda.so: error: undefined reference to 'ucp_worker_set_am_recv_handler'
../lib/libquda.so: error: undefined reference to 'ucp_worker_query'
../lib/libquda.so: error: undefined reference to 'ucp_worker_create'
../lib/libquda.so: error: undefined reference to 'ucp_mem_map'
../lib/libquda.so: error: undefined reference to 'ucp_init_version'
../lib/libquda.so: error: undefined reference to 'ucp_config_modify'
../lib/libquda.so: error: undefined reference to 'ucp_config_read'
../lib/libquda.so: error: undefined reference to 'ucp_am_send_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_put_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_am_data_release'
../lib/libquda.so: error: undefined reference to 'ucp_config_release'
../lib/libquda.so: error: undefined reference to 'ucp_cleanup'
../lib/libquda.so: error: undefined reference to 'ucp_worker_destroy'
../lib/libquda.so: error: undefined reference to 'ucp_request_check_status'
../lib/libquda.so: error: undefined reference to 'ucp_get_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_worker_fence'
../lib/libquda.so: error: undefined reference to 'ucp_atomic_op_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_worker_progress'
../lib/libquda.so: error: undefined reference to 'ucp_ep_rkey_unpack'
../lib/libquda.so: error: undefined reference to 'ucp_atomic_post'
../lib/libquda.so: error: undefined reference to 'ucp_request_free'
../lib/libquda.so: error: undefined reference to 'ucp_ep_close_nb'
../lib/libquda.so: error: undefined reference to 'ucp_ep_create'
../lib/libquda.so: error: undefined reference to 'ucp_worker_release_address'
../lib/libquda.so: error: undefined reference to 'ucp_worker_get_address'
../lib/libquda.so: error: undefined reference to 'ucp_mem_unmap'
collect2: error: ld returned 1 exit status

NVSHMEM is built with:

$ make -j 1 NVSHMEM_MPI_SUPPORT=1 NVSHMEM_UCX_SUPPORT=1 UCX_HOME=$EBROOTUCX NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80" NVSHMEM_USE_NCCL=1 NVSHMEM_PMIX_SUPPORT=1

The issue is that QUDA doesn't link against UCX, i.e., it does not pass -L$(UCX_HOME)/lib -lucs -lucp to the linker.

Looking into NVSHMEM's common.mk, I see that NVIDIA's intention is that codes using NVSHMEM should link against UCX themselves, i.e., they expect QUDA to link against it.

ifeq ($(NVSHMEM_UCX_SUPPORT), 1)
TESTLDFLAGS += -L$(UCX_HOME)/lib -lucs -lucp
endif

I would add the flags myself, but QUDA's CMakeLists.txt doesn't provide such an option.

mathiaswagner commented 2 years ago

Yes, QUDA does not yet support NVSHMEM built with UCX. QUDA uses CMake and NVSHMEM doesn't, so propagation of usage requirements is limited. You should be able to specify additional linker flags using CMAKE_EXE_LINKER_FLAGS.

robert-mijakovic commented 2 years ago

Thank you for the workaround. I have tested it and it worked well.
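
The thread does not record the exact command that was tested. As a minimal sketch of what the workaround might look like, the snippet below builds the UCX linker flags quoted from NVSHMEM's common.mk and prints a configure line that passes them through CMAKE_EXE_LINKER_FLAGS. The variable names UCX_HOME and UCX_LINK_FLAGS are illustrative, and it assumes EBROOTUCX (or UCX_HOME) points at the UCX installation that NVSHMEM was built against; `<other-options>` stands for the remaining options of the original configure line.

```shell
# Assumption: UCX_HOME or EBROOTUCX points at the UCX install used to build NVSHMEM.
UCX_HOME="${UCX_HOME:-${EBROOTUCX:-}}"

# The same flags that NVSHMEM's common.mk adds to TESTLDFLAGS when NVSHMEM_UCX_SUPPORT=1.
UCX_LINK_FLAGS="-L${UCX_HOME}/lib -lucs -lucp"

# Print the configure line for review instead of running it;
# <other-options> is a placeholder for the rest of the original options.
echo "cmake <other-options> -DQUDA_MPI=ON -DQUDA_NVSHMEM=ON -DQUDA_NVSHMEM_HOME=\$EBROOTNVSHMEM -DCMAKE_EXE_LINKER_FLAGS=\"${UCX_LINK_FLAGS}\""
```

CMAKE_EXE_LINKER_FLAGS applies to executables only, which matches the failure above: the undefined ucp_* references appear when the test executable hisq_stencil_test is linked against libquda.so.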