Hi, the feature you requested already exists in libfabric; you just need to configure libfabric with `--enable-cuda-dlopen` (along with `--with-cuda`). `--enable-cuda-dlopen` makes libfabric use dlopen to open the CUDA library at runtime and load the symbols from it.
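For reference, a minimal configure invocation of this kind might look as follows (the install prefix and CUDA path are placeholders; the two flags are the ones named above):

```sh
# Build libfabric with CUDA support, but resolve the CUDA libraries
# at runtime via dlopen rather than linking against them.
./configure --prefix=$HOME/opt/libfabric \
            --with-cuda=/usr/local/cuda \
            --enable-cuda-dlopen
make -j && make install
```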
Please clarify which OFI providers you are using. Note that the PSM3 provider, when built with CUDA support, allows jobs both with and without CUDA via the `PSM3_CUDA` env variable, and uses dlopen to load the CUDA libraries when requested.
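For example, something along these lines (the mpirun commands are placeholders; `PSM3_CUDA` is the variable mentioned above):

```sh
# One PSM3 build, two kinds of jobs:
PSM3_CUDA=1 mpirun ./gpu_app   # dlopen the CUDA libraries at startup
PSM3_CUDA=0 mpirun ./cpu_app   # the CUDA libraries are never touched
```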
Ok, that's great, I will try it out. From what I can see in the repos, EFA also supports CUDA, do you know if it will work in a similar way?
> EFA also supports CUDA, do you know if it will work in a similar way?
Yes.
The libfabric core defines a set of `cuda_ops` that other providers use for CUDA.
The code is:
https://github.com/ofiwg/libfabric/blob/a28c5f85d09da8244c87cc6c6df0868306c62de7/src/hmem_cuda.c#L46
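For readers following along, here is a much-simplified sketch of the pattern used there: a table of function pointers filled in via dlopen/dlsym at runtime, so no CUDA library is needed at link time. The real table covers far more entry points, and its pointer types come from the CUDA headers; plain `int` stands in for `cudaError_t` here.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

typedef int (*cuda_memcpy_fn)(void *dst, const void *src, size_t n, int kind);
typedef int (*cuda_malloc_fn)(void **ptr, size_t size);
typedef int (*cuda_free_fn)(void *ptr);

/* Function-pointer table, analogous in spirit to cuda_ops in hmem_cuda.c. */
static struct {
	cuda_memcpy_fn cudaMemcpy;
	cuda_malloc_fn cudaMalloc;
	cuda_free_fn   cudaFree;
} cuda_ops;

static void *cudart_handle;

static int cuda_hmem_dl_init(void)
{
	cudart_handle = dlopen("libcudart.so", RTLD_NOW);
	if (!cudart_handle) {
		/* No CUDA runtime on this node: CUDA HMEM support is simply
		 * reported as unavailable; the library still works. */
		fprintf(stderr, "dlopen(libcudart.so): %s\n", dlerror());
		return -1;
	}

	cuda_ops.cudaMemcpy = (cuda_memcpy_fn) dlsym(cudart_handle, "cudaMemcpy");
	cuda_ops.cudaMalloc = (cuda_malloc_fn) dlsym(cudart_handle, "cudaMalloc");
	cuda_ops.cudaFree   = (cuda_free_fn)   dlsym(cudart_handle, "cudaFree");
	if (!cuda_ops.cudaMemcpy || !cuda_ops.cudaMalloc || !cuda_ops.cudaFree) {
		dlclose(cudart_handle);
		return -1;
	}
	return 0;
}
```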
That's excellent, thanks a lot!
So the CUDA runtime is still required at compile time, due to needing the header files `cuda.h` and `cuda_runtime.h`; we are trying to work from a scenario where these are not available. I will see if I can come up with a patch.
I'm doubtful you can have runtime CUDA without using the headers at build time. For example, you need the CUDA headers to define the data structures and constants applicable to the CUDA functions which will be called. We certainly would not want to replicate portions of the CUDA headers into various OFI providers.
Well, you can have it, but it would require replicating the necessary parts of those header files (e.g., https://github.com/gcc-mirror/gcc/blob/master/include/cuda/cuda.h).
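To illustrate what "replicating the necessary parts" means in practice, here is the flavour of declarations such an internal header carries: only the handful of types and prototypes actually used get replicated. This is an illustrative excerpt in the style of GCC's internal `cuda.h`, not a proposal to add it to libfabric.

```c
#include <stddef.h>

/* Minimal stand-ins for CUDA driver API types. */
typedef int CUdevice;
typedef unsigned long long CUdeviceptr;   /* 64-bit platforms */
typedef enum { CUDA_SUCCESS = 0 } CUresult;

/* Prototypes for just the entry points the code actually calls. */
CUresult cuInit(unsigned int flags);
CUresult cuMemAlloc(CUdeviceptr *dptr, size_t bytesize);
CUresult cuMemFree(CUdeviceptr dptr);
```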
Just wanted to clarify our intent.
Our plan would not be to force this on the providers, but only within libfabric itself. We'd build `efa` and `psm3` without CUDA support and with `--enable-<provider>=dl`; that way we can then have another build of the providers with CUDA support and use `FI_PROVIDER_PATH` to have those picked up.
We use environment modules, so the base `libfabric` module would act as a non-CUDA build (but libfabric itself would be "CUDA-ready"), and then we would have a `libfabric-CUDA` module that can be loaded on top (which loads a dependent CUDA module and sets `FI_PROVIDER_PATH`) to enable CUDA-aware capabilities.
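Concretely, the effect of loading the (hypothetical) `libfabric-CUDA` module would be roughly the following; the paths and version are illustrative:

```sh
# Done by the libfabric-CUDA modulefile: the CUDA-enabled efa/psm3
# DSOs found here shadow the CUDA-less ones of the base installation.
module load CUDA
export FI_PROVIDER_PATH=/apps/libfabric-CUDA/1.18.0/lib/libfabric
```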
A few important considerations:
> Our plan would not be to force this on the providers, but only within libfabric itself. We'd build efa and psm3 without CUDA support and with `--enable-<provider>=dl`; that way we can then have another build of the providers with CUDA support and use `FI_PROVIDER_PATH` to have those picked up.
I'm with Todd on not copying in cuda*.h definitions for the core to use. Also, CUDA is just one of the many accelerators libfabric supports (see the list here). Whatever solution we come up with, let's make sure it is uniformly handled across all supported HMEM types.
> Our plan would not be to force this on the providers, but only within libfabric itself. We'd build efa and psm3 without CUDA support and with `--enable-<provider>=dl`; that way we can then have another build of the providers with CUDA support and use `FI_PROVIDER_PATH` to have those picked up.
One possible issue with this approach is that the EFA provider you built with CUDA support might not have shm support, hence it will work, but not efficiently. This is because the EFA provider uses the shm provider to implement its shared-memory support, and the shm provider might not be available when you build EFA as a standalone library.
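If it helps, one way to check what a given build actually exposes is `fi_info` (the `FI_PROVIDER_PATH` value is illustrative, matching the module sketch above):

```sh
# Does the standalone (CUDA-enabled) provider tree still offer shm?
FI_PROVIDER_PATH=/apps/libfabric-CUDA/1.18.0/lib/libfabric fi_info -p shm

# And confirm the EFA provider itself is picked up from that tree:
FI_PROVIDER_PATH=/apps/libfabric-CUDA/1.18.0/lib/libfabric fi_info -p efa
```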
Anything left to be done here? Note that if a DL provider is built with CUDA support, it doesn't require the libfabric core to be built in the same way: the HMEM support code (e.g. hmem.c, hmem_cuda.c) is compiled into the DL provider directly.
I will close this issue if no objection is heard by the end of this week.
Thanks for addressing this!
Is your feature request related to a problem? Please describe.
In EasyBuild we've been able to split out CUDA support in UCX into a separate (additional) plugin installation, and have tweaked our OpenMPI installation to essentially defer CUDA detection to runtime (by using an internal CUDA header for the configuration step, similar to what GCC does for their GPU offloading; see https://github.com/easybuilders/easybuild-easyconfigs/pull/15528).
Describe the solution you'd like
How hard would it be to do something similar with `libfabric`? Can we patch it to configure CUDA support with such an internal header file? Is there any cost to always configuring CUDA (there is in OpenMPI, but we have minimised this with an additional patch)? Can we leverage `FI_PROVIDER_PATH` to shadow the original providers of the main installation with CUDA-enabled alternates? Are there any obvious issues you see with this approach?
Additional context
We don't want to maintain CUDA-enabled and non-CUDA-enabled MPI toolchains; what we want is that when CUDA is required as a dependency, we automatically load `UCX-CUDA` and `libfabric-CUDA` as well, which triggers all available CUDA support in the MPI layer.
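Sketching the desired end state (the module names other than `UCX-CUDA` and `libfabric-CUDA` are placeholders, not part of any existing setup):

```sh
module load OpenMPI                    # base toolchain, CUDA-agnostic
module load CUDA                       # pulling in CUDA as a dependency...
module load UCX-CUDA libfabric-CUDA    # ...triggers the CUDA-enabled plugins
```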