open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Oddity: configure and dlopen checks for CUDA support #7334

Open jsquyres opened 4 years ago

jsquyres commented 4 years ago

@Akshay-Venkatesh Question for you about the CUDA support in Open MPI.

Per https://github.com/conda-forge/openmpi-feedstock/issues/42, we've been asked why there are configure-time checks for CUDA when we also dlopen("libcuda.so.1", ...) at runtime.

I.e., we use dlopen() because we didn't want to create a link-time dependency on libcuda: for example, a cluster may have GPUs (and the CUDA libraries installed) on only some of its nodes.

But since we've taken that philosophy, why do we have configure-time tests for things like GDR? For example:

https://github.com/open-mpi/ompi/blob/548ed56befd5ecc843d8b3938bf272360003efee/opal/mca/common/cuda/common_cuda.c#L102-L111

Since we don't know / can't guarantee that the libcuda that configure checked is the same one that was dlopen()'ed, shouldn't we check for the things that configure is checking at run time, after we successfully dlopen("libcuda.so.1", ...)? I.e., shouldn't we dlsym() to see if the successfully-opened libcuda.so.1 has the functionality that Open MPI is looking for?
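To make the question concrete, here is a minimal sketch of what such a run-time check could look like. It is not Open MPI code: `probe_symbols` and `cuda_has_pointer_attr_api` are hypothetical helpers, and the two driver-API symbol names are picked only for illustration, not taken from the configure tests.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper (not Open MPI code): open `libname` and record, for
 * each entry in `symbols`, whether dlsym() can resolve it.
 * Returns -1 if the library cannot be opened, otherwise the number of
 * symbols that were NOT found. found[i] is set to 1 (found) or 0 (missing). */
int probe_symbols(const char *libname, const char *const *symbols,
                  size_t nsymbols, int *found)
{
    assert(libname != NULL && symbols != NULL && found != NULL);

    void *handle = dlopen(libname, RTLD_LAZY | RTLD_LOCAL);
    if (handle == NULL) {
        return -1; /* library not installed on this node */
    }

    int missing = 0;
    for (size_t i = 0; i < nsymbols; ++i) {
        dlerror(); /* clear any stale error state */
        void *sym = dlsym(handle, symbols[i]);
        found[i] = (sym != NULL);
        if (!found[i]) {
            ++missing;
        }
    }

    /* A real component would keep the handle open and store the function
     * pointers; this sketch only answers "is the symbol there?". */
    dlclose(handle);
    return missing;
}

/* Illustrative use: returns 1 only if libcuda.so.1 can be opened AND it
 * exports both symbols. The symbol list is an assumption for this example. */
int cuda_has_pointer_attr_api(void)
{
    static const char *const wanted[] = {
        "cuPointerGetAttribute",
        "cuPointerSetAttribute",
    };
    int found[2] = {0, 0};
    return probe_symbols("libcuda.so.1", wanted, 2, found) == 0;
}
```

With something like this, a feature would be enabled based on the libcuda that was actually opened on the node, not on the one configure happened to see. Note that dlsym() can only detect exported functions; a configure test for an enum value or macro in a header would need a different run-time check (e.g., a version query).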

FYI @leofang @dalcinl @jakirkham

jakirkham commented 4 years ago

@Akshay-Venkatesh, would you have time to look at this? I think we would really benefit from your insight here 🙂

cc @kkraus14