QUDA doesn't link against UCX when NVSHMEM with UCX is used resulting in undefined references to ucp_*

robert-mijakovic commented 2 years ago

I'm compiling QUDA 1.1.0 using GCC 10.3.0, CUDA 11.3.1, OpenMPI, (external) Eigen 3.3.9, and (external) NVSHMEM 2.4.1 on CentOS 8.4. The build is configured with:

$ cmake <other-options> -DQUDA_GPU_ARCH=sm_80 -DQUDA_MPI=ON -DQUDA_NVSHMEM=ON -DQUDA_NVSHMEM_HOME=$EBROOTNVSHMEM -DQUDA_DOWNLOAD_EIGEN=OFF -DPROPAGATED_FLAGS=" " -DMPIEXEC_EXECUTABLE="$(which srun)"

The build fails at link time with undefined references to ucp_* symbols. NVSHMEM is compiled with the UCX transport layer.

$ /apps/GCCcore/10.3.0/bin/g++ -L/apps/NVSHMEM/2.4.1-gompi-2021a-CUDA-11.3.1/lib64 -L/apps/NVSHMEM/2.4.1-gompi-2021a-CUDA-11.3.1/lib -L/apps/CUDA/11.3.1/lib64 -L/apps/CUDA/11.3.1/lib -L/apps/Python/3.9.5-GCCcore-10.3.0/lib64 -L/apps/Python/3.9.5-GCCcore-10.3.0/lib -L/apps/FFTW/3.3.9-gompi-2021a/lib64 -L/apps/FFTW/3.3.9-gompi-2021a/lib -L/apps/ScaLAPACK/2.1.0-gompi-2021a-fb/lib64 -L/apps/ScaLAPACK/2.1.0-gompi-2021a-fb/lib -L/apps/FlexiBLAS/3.0.4-GCC-10.3.0/lib64 -L/apps/FlexiBLAS/3.0.4-GCC-10.3.0/lib -L/apps/GCCcore/10.3.0/lib64 -L/apps/GCCcore/10.3.0/lib -Wl,-rpath -Wl,/apps/hwloc/2.4.1-GCCcore-10.3.0/lib -Wl,-rpath -Wl,/usr/lib64 -Wl,-rpath -Wl,/apps/OpenMPI/4.1.1-GCC-10.3.0/lib -Wl,--enable-new-dtags -L/mnt/tier2/apps/hwloc/2.4.1-GCCcore-10.3.0/lib -L/usr/lib64 -L/mnt/tier2/apps/OpenMPI/4.1.1-GCC-10.3.0/lib CMakeFiles/hisq_stencil_test.dir/hisq_stencil_test.cpp.o -o hisq_stencil_test  -Wl,-rpath,/dev/shm/QUDA/1.1.0/foss-2021a-CUDA-11.3.1/easybuild_obj/lib::::::::::::::::::::::: libquda_test.a ../lib/libquda.so /apps/CUDA/11.3.1/lib/libcudart_static.a -lpthread -ldl /usr/lib64/librt.so /usr/lib64/libcuda.so /apps/CUDA/11.3.1/lib/libcublas.so /apps/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so -L/apps/NVSHMEM/2.4.1-gompi-2021a-CUDA-11.3.1/lib -lnvshmem  -L"/apps/CUDA/11.3.1/targets/x86_64-linux/lib/stubs" -L"/apps/CUDA/11.3.1/targets/x86_64-linux/lib"
../lib/libquda.so: error: undefined reference to 'ucp_rkey_destroy'
../lib/libquda.so: error: undefined reference to 'ucp_worker_flush_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_rkey_buffer_release'
../lib/libquda.so: error: undefined reference to 'ucp_rkey_pack'
../lib/libquda.so: error: undefined reference to 'ucp_worker_set_am_recv_handler'
../lib/libquda.so: error: undefined reference to 'ucp_worker_query'
../lib/libquda.so: error: undefined reference to 'ucp_worker_create'
../lib/libquda.so: error: undefined reference to 'ucp_mem_map'
../lib/libquda.so: error: undefined reference to 'ucp_init_version'
../lib/libquda.so: error: undefined reference to 'ucp_config_modify'
../lib/libquda.so: error: undefined reference to 'ucp_config_read'
../lib/libquda.so: error: undefined reference to 'ucp_am_send_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_put_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_am_data_release'
../lib/libquda.so: error: undefined reference to 'ucp_config_release'
../lib/libquda.so: error: undefined reference to 'ucp_cleanup'
../lib/libquda.so: error: undefined reference to 'ucp_worker_destroy'
../lib/libquda.so: error: undefined reference to 'ucp_request_check_status'
../lib/libquda.so: error: undefined reference to 'ucp_get_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_worker_fence'
../lib/libquda.so: error: undefined reference to 'ucp_atomic_op_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_worker_progress'
../lib/libquda.so: error: undefined reference to 'ucp_ep_rkey_unpack'
../lib/libquda.so: error: undefined reference to 'ucp_atomic_post'
../lib/libquda.so: error: undefined reference to 'ucp_request_free'
../lib/libquda.so: error: undefined reference to 'ucp_ep_close_nb'
../lib/libquda.so: error: undefined reference to 'ucp_ep_create'
../lib/libquda.so: error: undefined reference to 'ucp_worker_release_address'
../lib/libquda.so: error: undefined reference to 'ucp_worker_get_address'
../lib/libquda.so: error: undefined reference to 'ucp_mem_unmap'
collect2: error: ld returned 1 exit status

NVSHMEM is built with:

$ make -j 1 NVSHMEM_MPI_SUPPORT=1 NVSHMEM_UCX_SUPPORT=1 UCX_HOME=$EBROOTUCX NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80" NVSHMEM_USE_NCCL=1 NVSHMEM_PMIX_SUPPORT=1

The issue is that QUDA doesn't link against UCX: the link line is missing -L$(UCX_HOME)/lib -lucs -lucp.
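One way to confirm this is to list the undefined UCX symbols in the shared library and check whether libucp or libucs appears among its recorded dependencies. The following is only a sketch: `filter_ucx` is a hypothetical helper name, and the library path is taken from the link line above and may differ in other build trees.

```shell
# Hypothetical helper: keep only undefined UCX symbols (ucp_* / ucs_*)
# from `nm` output read on stdin.
filter_ucx() {
  awk '$1 == "U" && $2 ~ /^uc[ps]_/ { print $2 }'
}

# Path taken from the link line above; adjust to your build tree.
lib=../lib/libquda.so
if [ -f "$lib" ]; then
  # Undefined ucp_*/ucs_* symbols that some other library must provide.
  nm -D --undefined-only "$lib" | filter_ucx
  # No output from this command means neither libucp nor libucs is
  # recorded as a dependency of the library.
  ldd "$lib" | grep -i 'libuc[ps]'
fi
```

If the first command prints ucp_* names and the second prints nothing, the UCX libraries have to be supplied when linking the executables.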

Looking into NVSHMEM's common.mk, I see that NVIDIA's intention is for codes that use NVSHMEM to link against UCX themselves, i.e., they expect QUDA to link against it:

ifeq ($(NVSHMEM_UCX_SUPPORT), 1)
TESTLDFLAGS += -L$(UCX_HOME)/lib -lucs -lucp
endif

I would add the flags myself, but QUDA's CMakeLists.txt doesn't provide such an option.

mathiaswagner commented 2 years ago

Yes, QUDA does not yet support NVSHMEM built with UCX. QUDA uses CMake and NVSHMEM doesn't, so propagation of usage requirements is limited. You should be able to specify additional linker flags using CMAKE_EXE_LINKER_FLAGS.
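A minimal sketch of that workaround follows. It assumes UCX is installed under $EBROOTUCX (the same variable used for the NVSHMEM build above) and that the UCX flags from common.mk are the ones needed. The fallback path and the variable name `UCX_LINK_FLAGS` are placeholders for illustration, and the exact flags used in the thread are not stated.

```shell
# Assumption: UCX lives under $EBROOTUCX, as in the NVSHMEM build above.
# /path/to/ucx is a placeholder used only when EBROOTUCX is unset.
UCX_HOME="${EBROOTUCX:-/path/to/ucx}"

# Same libraries that NVSHMEM's common.mk adds for its own tests.
UCX_LINK_FLAGS="-L${UCX_HOME}/lib -lucs -lucp"

# Print the extra configure option; append it to the original cmake command.
echo "-DCMAKE_EXE_LINKER_FLAGS=\"${UCX_LINK_FLAGS}\""
```

CMAKE_EXE_LINKER_FLAGS only affects executables, which matches the failure above: the undefined references are reported when linking hisq_stencil_test, not when building libquda.so.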

robert-mijakovic commented 2 years ago

Thank you for the workaround. I have tested it and it worked well.

lattice / quda

QUDA doesn't link against UCX when NVSHMEM with UCX is used resulting in undefined references to ucp_* #1227