Open christgau opened 9 months ago
@christgau I think you're slurping in another component when configuring/building OMPI. In my case I also had to build the io-romio341 component as a DSO.
--enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda,coll-cuda,io-romio341
Can you try it, please?
@janjust Thanks for your input. I can confirm that adding io-romio341 to the list of MCA DSOs removes the dependency on libcudart from libmpi.
I'm not sure how obvious this is to others, so I suggest adding the full list (see above) to the documentation.
Besides that, with Open MPI built like that, the check code falsely reports that CUDA support is missing when MPI_Init is not called, even on a node with the CUDA runtime/driver installed. With the initial MPI_Init call added, everything works as expected:
non-gpu-node $ ./check-with-init
Compile time check:
This MPI library has CUDA-aware support.
[non-gpu-node:2426494] mca_base_component_repository_open: unable to open mca_accelerator_cuda: libcuda.so.1: cannot open shared object file: No such file or directory (ignored)
[non-gpu-node:2426494] mca_base_component_repository_open: unable to open mca_rcache_rgpusm: libcuda.so.1: cannot open shared object file: No such file or directory (ignored)
[non-gpu-node:2426494] mca_base_component_repository_open: unable to open mca_rcache_gpusm: libcuda.so.1: cannot open shared object file: No such file or directory (ignored)
[non-gpu-node:2426494] mca_base_component_repository_open: unable to open mca_btl_smcuda: libcuda.so.1: cannot open shared object file: No such file or directory (ignored)
Run time check:
This MPI library does not have CUDA-aware support.
gpu-node $ ./check-with-init
Compile time check:
This MPI library has CUDA-aware support.
Run time check:
This MPI library has CUDA-aware support.
gpu-node $ ./check-no-init
Compile time check:
This MPI library has CUDA-aware support.
Run time check:
This MPI library does not have CUDA-aware support.
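For reference, here is a minimal sketch of what check-with-init might look like. It assumes the program is the CUDA-aware test program from the Open MPI documentation with MPI_Init/MPI_Finalize added; the actual source of check-with-init is not shown in this thread. MPIX_CUDA_AWARE_SUPPORT and MPIX_Query_cuda_support() are Open MPI extensions from mpi-ext.h, so this will not build against other MPI implementations.

```c
/* check-with-init.c: sketch of the CUDA-aware check with MPI_Init added.
 * Assumption: modeled on the test program in the Open MPI documentation;
 * the real check-with-init source is not part of this thread. */
#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h> /* Open MPI extensions: MPIX_CUDA_AWARE_SUPPORT, MPIX_Query_cuda_support() */

int main(int argc, char *argv[])
{
    /* Initialize MPI before querying. As reported above, with the CUDA
     * components built as DSOs the run-time query only gave the expected
     * answer after MPI_Init had been called. */
    MPI_Init(&argc, &argv);

    printf("Compile time check:\n");
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("This MPI library has CUDA-aware support.\n");
#elif defined(MPIX_CUDA_AWARE_SUPPORT) && !MPIX_CUDA_AWARE_SUPPORT
    printf("This MPI library does not have CUDA-aware support.\n");
#else
    printf("This MPI library cannot determine if there is CUDA-aware support.\n");
#endif

    printf("Run time check:\n");
#if defined(MPIX_CUDA_AWARE_SUPPORT)
    /* The query returns 1 when CUDA-aware support is available at run time. */
    if (1 == MPIX_Query_cuda_support()) {
        printf("This MPI library has CUDA-aware support.\n");
    } else {
        printf("This MPI library does not have CUDA-aware support.\n");
    }
#else
    printf("This MPI library cannot determine if there is CUDA-aware support.\n");
#endif

    MPI_Finalize();
    return 0;
}
```

It can be built with the Open MPI compiler wrapper, for example mpicc check-with-init.c -o check-with-init, and run as a single process as in the transcripts above.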
Maybe docs should be updated accordingly as well.
I agree. In the meantime, I'll open a feature request issue for this.
Background information
What version of Open MPI are you using?
v5.0.1
Describe how Open MPI was installed
Open MPI was installed from the GitHub release tarball. Configuration was done using this command line:
Note that I added coll-cuda to the list of MCA DSOs. I'm not sure whether it is intentionally missing from the documentation. I also tried without coll-cuda first, but with the same outcome.
CUDA Toolkit version 12.3 was installed in CUDA_ROOT. UCX was built against that CUDA toolkit. On cluster nodes with the drivers installed, ucx_info -d reports the relevant CUDA and gdrcopy transports.
Remark: The host used for compilation has the CUDA toolkit and runtime installed, but not the driver. So using stubs appears to be the way to go in that case (see #12264).
Please describe the system on which you are running
Details of the problem
With Open MPI 4.1.4, I was able to build it such that one could compile and run binaries without needing the CUDA toolkit, runtime, or drivers on the node in use. However, with 5.0.1 configured as shown above, the linker warns about a missing libcudart when building a binary (even a basic MPI_Init/MPI_Finalize program):
With 4.1.4 I am able to compile and launch without those warnings/errors while having a CUDA-aware MPI. Although 4.1.4 was configured using --with-cuda=..., its libmpi did not depend on libcudart.
If I understood the SC'23 BoF slides correctly, with 5.x Open MPI intends to integrate (link?) plugins directly into libmpi. But with the --enable-mca-dso configure option I tried to put all CUDA-related components into DSOs and thus keep them out of libmpi. Nevertheless, libmpi has libcudart as a shared library dependency (see above). I also checked the symbols that libmpi needs, but it does not appear to require anything from libcudart:
So it appears to me that libmpi unnecessarily depends on libcudart. Is there a bug in the configure/compilation process, or is it no longer possible to build the Open MPI libraries such that one can compile applications without the CUDA runtime libraries being available? Given libmpi's dependency on libcudart, the statement from the documentation
does not appear to apply here. Or is there something wrong on my side?
Btw: The test program from the documentation may also deserve a call to MPI_Init in case one follows the DSO approach. Otherwise, it reports that there is no CUDA support (using OMPI v5.0.1 with CUDA toolkit 12.3 available for compilation/execution):