@Akshay-Venkatesh Question for you about the CUDA support in Open MPI.
Per https://github.com/conda-forge/openmpi-feedstock/issues/42, we've been asked why there are `configure`-time checks for CUDA, but then we also `dlopen("libcuda.so.1", ...)` at runtime.

I.e., we do the `dlopen()` stuff because we didn't want to create a link-time dependency on `libcuda` -- e.g., if a cluster only has some nodes with GPUs (and the cluster only has the CUDA libraries installed on the nodes with GPUs).

But since we've taken that philosophy, why do we have `configure`-time tests for things like GDR? For example:

https://github.com/open-mpi/ompi/blob/548ed56befd5ecc843d8b3938bf272360003efee/opal/mca/common/cuda/common_cuda.c#L102-L111

Since we don't know / can't guarantee that the `libcuda` that `configure` checked is the same one that was `dlopen()`'ed, shouldn't we check for the things that `configure` is checking at run time, after we successfully `dlopen("libcuda.so.1", ...)`? I.e., shouldn't we `dlsym()` to see if the successfully-opened `libcuda.so.1` has the functionality that Open MPI is looking for?

FYI @leofang @dalcinl @jakirkham