openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License

Fortran OpenMP+ target offload: TL_CUDA DEBUG cannot create CUDA TL context without active CUDA context #952

Closed burlen closed 2 months ago

burlen commented 2 months ago

Hi All,

I'm trying to use UCC to get better MPI_Alltoall performance over NVLink in a Fortran code that uses OpenMP target offload. However, when I enable UCC, performance is worse. With UCC_LOG_LEVEL=trace I can see the error from the title, and UCC is apparently not used:

./run.sh | grep TL_CUDA
[1712070582.931909] [eos0143:1210996:0]     tl_cuda_lib.c:35   TL_CUDA DEBUG initialized lib object: 0x839c10
[1712070582.932056] [eos0143:1210996:0] tl_cuda_context.c:43   TL_CUDA DEBUG cannot create CUDA TL context without active CUDA context
[1712070582.932758] [eos0143:1210997:0]     tl_cuda_lib.c:35   TL_CUDA DEBUG initialized lib object: 0x66c380
[1712070582.932897] [eos0143:1210997:0] tl_cuda_context.c:43   TL_CUDA DEBUG cannot create CUDA TL context without active CUDA context
[1712070583.619655] [eos0143:1210996:0]     tl_cuda_lib.c:41   TL_CUDA DEBUG finalizing lib object: 0x839c10
[1712070583.620122] [eos0143:1210997:0]     tl_cuda_lib.c:41   TL_CUDA DEBUG finalizing lib object: 0x66c380

It seems that there is some interaction between UCC, nvfortran, and/or OpenMP target offload. I am using Open MPI, UCC, and nvfortran from the NVIDIA HPC SDK 24.03 downloaded from the NVIDIA web site. I have also reproduced the issue with recent releases of Open MPI (5.0.1), UCX (1.15.0), and UCC (1.2.0) compiled from source.

I have verified that UCC works on the system and delivers better performance using the OSU alltoall benchmark (version 7.3). osu_alltoall is C code that does not use OpenMP target offload, which is why this looks like an interaction between nvfortran and/or OpenMP target offload and UCC.
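
For context, the pattern in question looks roughly like the following. This is a minimal sketch, not the actual main.F90; it assumes a CUDA-aware MPI and uses OpenMP target data with use_device_addr (OpenMP 5.0; some compilers may want use_device_ptr instead) to pass device addresses to MPI_Alltoall.

program alltoall_offload
    use mpi
    implicit none
    integer, parameter :: n = 1024
    integer :: ierr, rank, nranks
    real(8), allocatable :: sendbuf(:), recvbuf(:)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

    allocate(sendbuf(n*nranks), recvbuf(n*nranks))
    sendbuf = real(rank, 8)

    ! Map the buffers to the device and hand their device addresses to MPI,
    ! so a CUDA-aware MPI/UCC can move the data GPU-to-GPU over NVLink.
    !$omp target data map(to: sendbuf) map(from: recvbuf)
    !$omp target data use_device_addr(sendbuf, recvbuf)
    call MPI_Alltoall(sendbuf, n, MPI_REAL8, recvbuf, n, MPI_REAL8, &
                      MPI_COMM_WORLD, ierr)
    !$omp end target data
    !$omp end target data

    deallocate(sendbuf, recvbuf)
    call MPI_Finalize(ierr)
end program alltoall_offload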

There is a 35-line reproducer here: https://github.com/burlen/ucctest. Description of the reproducer: Makefile (compiler flags, etc.), main.F90 (source code), run.sh (used to run it), and launch.sh (selects the GPU and IB device).

Steps to reproduce:

git clone https://github.com/burlen/ucctest.git
cd ucctest
make
./run.sh 

This requires the NVIDIA HPC SDK to be installed and in the PATH. The system I'm working on has 8 GPUs per node (connected with NVLink/NVSwitch) and two 56-core CPUs. You might need to tweak run.sh if yours is different.

burlen commented 2 months ago

cudaSetDevice should be called before MPI_Init for UCC to work correctly; it's not a Fortran/C issue. The osu_alltoall benchmark has logic to find MPI-implementation-specific environment variables that expose a process's node and world rank before initialization. It would be useful to have a standardized way of accessing this information, either via standardized environment variable names or through an MPI API that is legal to call before MPI_Init.
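
For anyone hitting the same error, a minimal sketch of the workaround, assuming Open MPI (which exports OMPI_COMM_WORLD_LOCAL_RANK to each process before MPI_Init) and nvfortran's cudafor module for the CUDA runtime API; other MPI implementations and launchers use different variable names.

program set_device_before_mpi
    use mpi
    use cudafor
    implicit none
    integer :: ierr, istat, local_rank, ndev
    character(len=16) :: env

    ! Node-local rank from the launcher's environment, available before MPI_Init.
    ! OMPI_COMM_WORLD_LOCAL_RANK is Open MPI specific (assumption for this sketch).
    call get_environment_variable("OMPI_COMM_WORLD_LOCAL_RANK", env, status=istat)
    local_rank = 0
    if (istat == 0) read(env, *) local_rank

    ! Select the GPU before MPI_Init, per the fix described above, so that
    ! UCC's TL_CUDA can create its CUDA TL context during initialization.
    istat = cudaGetDeviceCount(ndev)
    if (ndev > 0) istat = cudaSetDevice(mod(local_rank, ndev))

    call MPI_Init(ierr)
    ! ... MPI_Alltoall over device buffers as in the reproducer ...
    call MPI_Finalize(ierr)
end program set_device_before_mpi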