`cudaSetDevice` should be called before `MPI_Init` for UCC to work correctly. It's not a Fortran/C issue. The osu_alltoall benchmark has logic that finds MPI-implementation-specific environment variables exposing a process's node and world rank before initialization. It would be useful to have a standardized way of accessing this information, either via standardized environment variable names or through some MPI API that is legal to call before `MPI_Init`.
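A minimal sketch of that pattern in Fortran, assuming nvfortran's `cudafor` module and OpenMPI's `OMPI_COMM_WORLD_LOCAL_RANK` variable (other launchers expose different names, e.g. Slurm's `SLURM_LOCALID`):

```fortran
program select_gpu_before_init
  use mpi
  use cudafor          ! CUDA Fortran runtime module shipped with nvfortran
  implicit none
  integer :: ierr, istat, local_rank, ndev
  character(len=16) :: env

  ! OpenMPI-specific: exposes the node-local rank before MPI_Init.
  call get_environment_variable("OMPI_COMM_WORLD_LOCAL_RANK", env, status=istat)
  if (istat == 0) then
    read(env, *) local_rank
    istat = cudaGetDeviceCount(ndev)
    ! Bind this rank to its GPU *before* MPI_Init so UCC sees the right device.
    istat = cudaSetDevice(mod(local_rank, ndev))
  end if

  call MPI_Init(ierr)
  ! ... MPI_Alltoall etc. ...
  call MPI_Finalize(ierr)
end program select_gpu_before_init
```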
Hi All,
I'm trying to use UCC to get better MPI_Alltoall performance over NVLink in a Fortran code that uses OpenMP target offload. However, when I enable UCC, performance is worse. With `UCC_LOG_LEVEL=trace` I can see the above error, and UCC is apparently not used. It seems there is some interaction between UCC, nvfortran, and/or OpenMP offload. I am using OpenMPI, UCC, and nvfortran from the NVIDIA HPC SDK 24.03 downloaded from the NVIDIA web site. I have also reproduced the issue with recent releases of OpenMPI (5.0.1), UCX (1.15.0), and UCC (1.2.0) compiled from source.
I have verified that UCC works on this system and delivers better performance using the OSU alltoall benchmark (ver 7.3). osu_alltoall is a C code that does not use OpenMP target offload, which is why this looks like an interaction between nvfortran and/or OpenMP target offload and UCC.
There is a 35-line reproducer here: https://github.com/burlen/ucctest. It consists of `Makefile` (compiler flags etc.), `main.F90` (source code), `run.sh` (used to run it), and `launch.sh` (selects the GPU and IB device).
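For context, the core pattern the reproducer exercises looks roughly like this (an illustrative sketch, not the actual main.F90; buffer names and sizes are made up):

```fortran
program alltoall_offload_sketch
  use mpi
  implicit none
  integer, parameter :: n = 1024
  integer :: ierr, nranks
  real(8), allocatable :: sendbuf(:), recvbuf(:)

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)
  allocate(sendbuf(n*nranks), recvbuf(n*nranks))
  sendbuf = 1.0d0

  ! Map the buffers to the GPU, then pass their device addresses to MPI
  ! so a CUDA-aware MPI/UCC can move the data directly over NVLink.
  !$omp target data map(to: sendbuf) map(from: recvbuf)
  !$omp target data use_device_addr(sendbuf, recvbuf)
  call MPI_Alltoall(sendbuf, n, MPI_REAL8, recvbuf, n, MPI_REAL8, &
                    MPI_COMM_WORLD, ierr)
  !$omp end target data
  !$omp end target data

  call MPI_Finalize(ierr)
end program alltoall_offload_sketch
```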
Steps to reproduce:
This requires the NVIDIA HPC SDK to be installed and in the path. The system I'm working on has 8 GPUs per node (connected with NVLink/NVSwitch) and two 56-core CPUs. You might need to tweak run.sh if yours is different.