open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

.dylib warning from dlopen() for libcuda on Linux #6896

Open cparrott73 opened 5 years ago

cparrott73 commented 5 years ago

.dylib warning from dlopen() for libcuda on linux

Open MPI v3.1.3

Open MPI was compiled with PGI 19.1 compilers from a source tarball downloaded from open-mpi.org. Open MPI is configured with CUDA support via the --with-cuda flag to ./configure.

Details of the problem

We have a user who is unhappy about the fact that Open MPI prints a warning about being unable to dlopen() libcuda.dylib on a Linux system when the runtime is trying to open the CUDA shared library. This can be seen in the following output:

--------------------------------------------------------------------------
The library attempted to open the following supporting CUDA libraries,
but each of them failed. CUDA-aware support is disabled.
libcuda.so.1: cannot open shared object file: No such file or directory
libcuda.dylib: cannot open shared object file: No such file or directory
/usr/lib64/libcuda.so.1: cannot open shared object file: No such file or directory
/usr/lib64/libcuda.dylib: cannot open shared object file: No such file or directory
If you are not interested in CUDA-aware support, then run with
--mca mpi_cuda_support 0 to suppress this message. If you are interested
in CUDA-aware support, then try setting LD_LIBRARY_PATH to the location
of libcuda.so.1 to get passed this issue.
--------------------------------------------------------------------------

I do understand why this is happening: Open MPI does not know whether the CUDA shared library on a given system ends in .so or .dylib, so it tries each candidate name in succession until one succeeds or all of them have failed. The messages themselves come from dlopen(). In essence, this is nothing more than a cosmetic issue. However, the user considers the .dylib lines a 'red herring' that might confuse users on a Linux system.
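To illustrate the behavior described above, here is a minimal sketch of a "try each candidate in order" loop. It is not Open MPI's actual code: `try_dlopen_candidates` and its parameters are hypothetical names chosen for this example. It collects the `dlerror()` text of every failed attempt, which is how a `.dylib` line can end up in the report on a Linux system.

```c
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not Open MPI's implementation: try each candidate
 * name in order, append the dlerror() text of every failure to errbuf,
 * and return the first handle that opens (NULL if none do). */
static void *try_dlopen_candidates(const char *const *names, size_t count,
                                   char *errbuf, size_t errlen)
{
    size_t used = 0;

    if (errlen > 0) {
        errbuf[0] = '\0';
    }
    for (size_t i = 0; i < count; ++i) {
        void *handle = dlopen(names[i], RTLD_LAZY);
        if (handle != NULL) {
            return handle;              /* first success wins */
        }
        const char *msg = dlerror();    /* text describing this failure */
        if (msg != NULL && used < errlen) {
            int n = snprintf(errbuf + used, errlen - used, "%s\n", msg);
            if (n > 0) {
                used += (size_t)n;      /* may exceed errlen if truncated */
            }
        }
    }
    return NULL;
}
```

Because every failed name contributes one line, the number of lines in the warning grows with the number of suffixes and directories tried, regardless of which suffix is native to the platform. On older glibc versions, linking this sketch may require `-ldl`.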

Would you consider perhaps a small change to test whether the file exists before trying to dlopen() it? That might help clear up this overall warning message a little bit. Though I would also understand if you chose not to, as I do appreciate that this is mainly a cosmetic issue.
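The existence test proposed above could look roughly like the following. This is only a sketch under stated assumptions: `worth_trying` and `filter_candidates` are hypothetical helpers, not functions in Open MPI, and the check is applied only to names containing a slash, since `dlopen()` resolves bare names through the dynamic linker's search path and a plain file check cannot rule those out.

```c
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 when calling dlopen() on name is
 * worthwhile, 0 when the attempt can be skipped. */
static int worth_trying(const char *name)
{
    /* A bare name such as "libcuda.so.1" is looked up via the dynamic
     * linker's search path, so we cannot pre-check it here. */
    if (strchr(name, '/') == NULL) {
        return 1;
    }
    /* A name containing a slash is treated as a path: skip it if the
     * file does not exist. */
    return access(name, F_OK) == 0;
}

/* Copy the candidates worth trying into out (which must have room for
 * count entries) and return how many were kept. */
static size_t filter_candidates(const char *const *names, size_t count,
                                const char **out)
{
    size_t kept = 0;

    for (size_t i = 0; i < count; ++i) {
        if (worth_trying(names[i])) {
            out[kept++] = names[i];
        }
    }
    return kept;
}
```

Under this sketch, the `/usr/lib64/libcuda.dylib` attempt would be skipped on a system without that file, but the bare `libcuda.dylib` attempt would still be made and reported.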

Thanks in advance.

Akshay-Venkatesh commented 5 years ago

Hi, Chris.

Is it possible that Open MPI was built on a system with a valid /usr/lib64/libcuda.so.1, and is now being used on another machine where the libcuda library isn't present?

The following lines of code are responsible for the messages you're seeing.

https://github.com/open-mpi/ompi/blob/master/opal/mca/common/cuda/common_cuda.c#L386
https://github.com/open-mpi/ompi/blob/master/opal/mca/common/cuda/common_cuda.c#L435
https://github.com/open-mpi/ompi/blob/master/opal/mca/common/cuda/help-mpi-common-cuda.txt#L164

I'm not sure it is a red herring, though. Do you say this because CUDA-aware MPI works as expected?

cparrott73 commented 5 years ago

Hi Akshay,

Yes, that is correct. This Open MPI build was linked against CUDA on our build system, and then installed to an NFS directory where it can be shared among various systems on our network. Some of the systems have GPUs and CUDA installed, while others do not. Obviously we do not see this warning on the CUDA-enabled systems, just the non-CUDA ones.