open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

libmpi.so is linked with the wrong libopen-pal.so #12567

SeyedMir opened this issue 6 months ago

SeyedMir commented 6 months ago

Background information

What version of Open MPI are you using? v5.0.3 tag of the git repo.

Describe how Open MPI was installed

Installed from git clone. Configured as below (after ./autogen.pl):

--enable-mpirun-prefix-by-default --with-cuda=$CUDA_HOME --with-cuda-libdir=$CUDA_HOME/lib64/stubs --with-ucx=$UCX_HOME --with-ucx-libdir=$UCX_HOME/lib --enable-mca-no-build=btl-uct --with-pmix=internal --with-hwloc=internal --with-libevent=internal --with-slurm

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

+6f81bfd163f3275d2b0630974968c82759dd4439 3rd-party/openpmix (v1.1.3-3983-g6f81bfd1)
+4f27008906d96845e22df6502d6a9a29d98dec83 3rd-party/prrte (psrvr-v2.0.0rc1-4746-g4f27008906)
 dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (heads/main)

Please describe the system on which you are running


Details of the problem

After building Open MPI, the resulting libmpi.so is linked against a pre-existing libopen-pal.so.40 on the system that does not provide the needed symbols. As a result, using mpicc leads to errors like the following:

./bin/mpicc test.c
/usr/bin/ld: /home/scratch.hmirsadeghi_sw/repos/ompi/_build_rel_v5.0.3/_install/lib/libmpi.so: undefined reference to `mca_common_sm_fini'
/usr/bin/ld: /home/scratch.hmirsadeghi_sw/repos/ompi/_build_rel_v5.0.3/_install/lib/libmpi.so: undefined reference to `opal_common_ucx_support_level'
/usr/bin/ld: /home/scratch.hmirsadeghi_sw/repos/ompi/_build_rel_v5.0.3/_install/lib/libmpi.so: undefined reference to `opal_finalize_set_domain'
/usr/bin/ld: /home/scratch.hmirsadeghi_sw/repos/ompi/_build_rel_v5.0.3/_install/lib/libmpi.so: undefined reference to `opal_built_with_rocm_support'

Using mpirun leads to the error below:

libmpi.so.40: undefined symbol: opal_smsc_base_framework

Some more details:

readelf -d libmpi.so | grep NEEDED | grep open-pal
 0x0000000000000001 (NEEDED)             Shared library: [libopen-pal.so.40]
ldd libmpi.so | grep open-pal
        libopen-pal.so.40 => /lib/x86_64-linux-gnu/libopen-pal.so.40

This happens despite the fact that the correct libopen-pal files are built and exist in the lib directory of the prefix:

libopen-pal.so       
libopen-pal.so.80    
libopen-pal.so.80.0.3

As a dirty workaround, I have to create a libopen-pal.so.40 symlink pointing to the correct libopen-pal.so in the installation lib path (I have already set LD_LIBRARY_PATH to the prefix's lib directory).
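
For illustration, the workaround amounts to roughly the following (the install prefix here is hypothetical; use the actual --prefix passed to configure):

OMPI_PREFIX=$HOME/ompi/_install                                            # hypothetical install prefix
export LD_LIBRARY_PATH=$OMPI_PREFIX/lib:$LD_LIBRARY_PATH                   # search the freshly built libraries first
ln -s $OMPI_PREFIX/lib/libopen-pal.so $OMPI_PREFIX/lib/libopen-pal.so.40   # satisfy the wrong .so.40 NEEDED entry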

So, my question is: why is libmpi.so linked against a libopen-pal.so.40 that does not provide the symbols it needs, and how can I avoid that?

ParticleTruthSeeker commented 2 weeks ago

An answer to this would be good. There seem to be a number of issues with the latest Open MPI packages; this is not confined to the git repo, as the same issue exists in the tarball.

bosilca commented 2 weeks ago

What's in your $LD_LIBRARY_PATH? ldd will pick the first shared library matching the requested name from LD_LIBRARY_PATH and the standard folders (/lib, /usr/lib, ...). Since your ldd picks the opal library from the standard path, it could indicate that your LD_LIBRARY_PATH is not correctly set. Here is a pointer to our FAQ covering this topic.
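
For reference, a quick check along those lines would look roughly like this (the install path is hypothetical):

echo $LD_LIBRARY_PATH                                    # the install's lib dir should come before any system path
export LD_LIBRARY_PATH=$HOME/ompi/_install/lib:$LD_LIBRARY_PATH
ldd $HOME/ompi/_install/lib/libmpi.so | grep open-pal    # should now resolve to the freshly built libopen-pal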

janjust commented 2 weeks ago

Just FYI, I just built the latest v5.0.x and I don't see this issue. My LD_LIBRARY_PATH is empty.

/global/home/users/tomislavj/ompi-build/install/lib
[tomislavj@thor001 lib]$ ldd libmpi.so | grep open-pal
        libopen-pal.so.80 => /global/home/users/tomislavj/ompi-build/install/lib/libopen-pal.so.80 (0x000015133b1e8000)

I used the same configure as @SeyedMir

ggouaillardet commented 2 weeks ago

@ParticleTruthSeeker please point to the other packaging issues.

ParticleTruthSeeker commented 1 week ago

Hi all, and thank you for taking the time to look into this issue. This problem originally surfaced because applications I built with my prior Open MPI installation kept complaining about being unable to access shared memory. When I finally tried to fix that, I ran into a host of issues across the various versions from 4.1.6 to 5.0.x.

The problems variously concern opal; with the 5.0.5 tarball I am now using on Debian 12, building the examples fails as shown below.

mpifort -g  ring_usempif08.f90  -o ring_usempif08
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[1]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[1]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
[user:190501] *** Process received signal ***
[user:190501] Signal: Segmentation fault (11)
[user:190501] Signal code: Address not mapped (1)
[user:190501] Failing at address: 0x28
[user:190501] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x3c050)[0x7fe1b1df0050]
[user:190501] [ 1] /usr/lib/x86_64-linux-gnu/libopen-pal.so.40(opal_mem_hooks_unregister_release+0x45)[0x7fe1b1fc4bc5]
[user:190501] [ 2] /lib64/ld-linux-x86-64.so.2(+0x112a)[0x7fe1b220d12a]
[user:190501] [ 3] /lib64/ld-linux-x86-64.so.2(+0x481e)[0x7fe1b221081e]
[user:190501] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x3e55d)[0x7fe1b1df255d]
[user:190501] [ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x3e69a)[0x7fe1b1df269a]
[user:190501] [ 6] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x27251)[0x7fe1b1ddb251]
[user:190501] [ 7] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7fe1b1ddb305]
[user:190501] [ 8] oshmem_info(+0x28a1)[0x558e67a968a1]
[user:190501] *** End of error message ***
Segmentation fault (core dumped)
[user:190513] *** Process received signal ***
[user:190513] Signal: Segmentation fault (11)
[user:190513] Signal code: Address not mapped (1)
[user:190513] Failing at address: 0x28
[user:190513] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x3c050)[0x7f0e499e5050]
[user:190513] [ 1] /usr/lib/x86_64-linux-gnu/libopen-pal.so.40(opal_mem_hooks_unregister_release+0x45)[0x7f0e49bb9bc5]
[user:190513] [ 2] /lib64/ld-linux-x86-64.so.2(+0x112a)[0x7f0e49e0212a]
[user:190513] [ 3] /lib64/ld-linux-x86-64.so.2(+0x481e)[0x7f0e49e0581e]
[user:190513] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x3e55d)[0x7f0e499e755d]
[user:190513] [ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x3e69a)[0x7f0e499e769a]
[user:190513] [ 6] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x27251)[0x7f0e499d0251]
[user:190513] [ 7] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f0e499d0305]
[user:190513] [ 8] oshmem_info(+0x28a1)[0x56154ebd28a1]
[user:190513] *** End of error message ***
Segmentation fault (core dumped)

This happens with my full LD_LIBRARY_PATH, where I put the install directory first, followed by :${LD_LIBRARY_PATH}. If I instead reset LD_LIBRARY_PATH to contain only the install directory's lib, I get the following error.

mpicc -g  hello_c.c  -o hello_c
#mpicc -g  ring_c.c  -o ring_c
mpicc -g  connectivity_c.c  -o connectivity_c
mpicc -g  spc_example.c  -o spc_example
make[1]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
mpifort -g  hello_mpifh.f  -o hello_mpifh
mpifort -g  ring_mpifh.f  -o ring_mpifh
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
mpifort -g  hello_usempi.f90  -o hello_usempi
mpifort -g  ring_usempi.f90  -o ring_usempi
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
mpifort -g  hello_usempif08.f90  -o hello_usempif08
mpifort -g  ring_usempif08.f90  -o ring_usempif08
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[1]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[1]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
[user:192039] *** Process received signal ***
[user:192039] Signal: Segmentation fault (11)
[user:192039] Signal code: Address not mapped (1)
[user:192039] Failing at address: 0x28
[user:192039] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3c050)[0x7ff89401a050]
[user:192039] [ 1] /usr/local/lib/libopen-pal.so.40(opal_mem_hooks_unregister_release+0x45)[0x7ff8941ed895]
[user:192039] [ 2] /lib64/ld-linux-x86-64.so.2(+0x112a)[0x7ff89449612a]
[user:192039] [ 3] /lib64/ld-linux-x86-64.so.2(+0x481e)[0x7ff89449981e]
[user:192039] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x3e55d)[0x7ff89401c55d]
[user:192039] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x3e69a)[0x7ff89401c69a]
[user:192039] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0x27251)[0x7ff894005251]
[user:192039] [ 7] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7ff894005305]
[user:192039] [ 8] oshmem_info(+0x28a1)[0x562dae0128a1]
[user:192039] *** End of error message ***
Segmentation fault (core dumped)
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
shmemcc -g  hello_oshmem_c.c  -o hello_oshmem
Cannot open configuration file /usr/local/share/openmpi/shmemcc-wrapper-data.txt
Error parsing data file shmemcc: Not found
make[2]: *** [Makefile:154: hello_oshmem] Error 243
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
shmemc++ -g  hello_oshmem_cxx.cc  -o hello_oshmemcxx
Cannot open configuration file /usr/local/share/openmpi/shmemc++-wrapper-data.txt
Error parsing data file shmemc++: Not found
make[2]: *** [Makefile:156: hello_oshmemcxx] Error 243
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
shmemcc -g  ring_oshmem_c.c  -o ring_oshmem
Cannot open configuration file /usr/local/share/openmpi/shmemcc-wrapper-data.txt
Error parsing data file shmemcc: Not found
make[2]: *** [Makefile:161: ring_oshmem] Error 243
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
shmemcc -g  oshmem_shmalloc.c  -o oshmem_shmalloc
Cannot open configuration file /usr/local/share/openmpi/shmemcc-wrapper-data.txt
Error parsing data file shmemcc: Not found
make[2]: *** [Makefile:166: oshmem_shmalloc] Error 243
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
shmemcc -g  oshmem_circular_shift.c  -o oshmem_circular_shift
Cannot open configuration file /usr/local/share/openmpi/shmemcc-wrapper-data.txt
Error parsing data file shmemcc: Not found
make[2]: *** [Makefile:169: oshmem_circular_shift] Error 243
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
shmemcc -g  oshmem_max_reduction.c  -o oshmem_max_reduction
Cannot open configuration file /usr/local/share/openmpi/shmemcc-wrapper-data.txt
Error parsing data file shmemcc: Not found
make[2]: *** [Makefile:172: oshmem_max_reduction] Error 243
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
shmemcc -g  oshmem_strided_puts.c  -o oshmem_strided_puts
Cannot open configuration file /usr/local/share/openmpi/shmemcc-wrapper-data.txt
Error parsing data file shmemcc: Not found
make[2]: *** [Makefile:175: oshmem_strided_puts] Error 243
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
shmemcc -g  oshmem_symmetric_data.c  -o oshmem_symmetric_data
Cannot open configuration file /usr/local/share/openmpi/shmemcc-wrapper-data.txt
Error parsing data file shmemcc: Not found
make[2]: *** [Makefile:178: oshmem_symmetric_data] Error 243
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[1]: *** [Makefile:102: oshmem] Error 2
make[1]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make: *** [Makefile:77: all] Error 2

This occurs no matter how I build Open MPI, even when I build it with an empty LD_LIBRARY_PATH. Either way, my runpaths are not being honored. I understand there is an rpath/runpath distinction, but I have defined LD_LIBRARY_PATH correctly.
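
To see which runpath actually got baked into a library, and which library the dynamic linker then resolves, something like this can help (the paths are only illustrative):

readelf -d /usr/local/lib/libmpi.so | grep -E 'RPATH|RUNPATH'   # DT_RPATH is searched before LD_LIBRARY_PATH, DT_RUNPATH after it
ldd /usr/local/lib/libmpi.so | grep open-pal                    # what actually gets loaded at run time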

In this instance it doesn't seem to have built OpenSHMEM.

devreal commented 1 week ago

Let's start with the basics:

1) What is the installation directory for Open MPI?
2) What is the output of ldd on your application binary?

I see two paths, /home/scratch.hmirsadeghi_sw/repos/ompi/_build_rel_v5.0.3/_install/ and /usr/local/lib/ with potential Open MPI libraries. That hints at a potential conflict between a system-installed Open MPI and the Open MPI you installed yourself.
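
For example (the binary name and paths are just placeholders):

which mpicc                                  # which installation the wrapper compiler comes from
mpicc --showme                               # underlying command line, including -L and rpath flags
ldd ./hello_c | grep -E 'libmpi|open-pal'    # which MPI libraries the application binary will load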

ggouaillardet commented 1 week ago

@ParticleTruthSeeker please open a new issue since this is a different one, and do provide the required information.

Assuming you are testing the MPI library you just built, if you want to run make from the examples directory, you first have to point your PATH and LD_LIBRARY_PATH at the new installation so that its wrapper compilers and libraries are the ones picked up.
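
For example (the install prefix is hypothetical):

export OMPI_PREFIX=$HOME/OpenMPI/install                   # hypothetical --prefix used at configure time
export PATH=$OMPI_PREFIX/bin:$PATH                         # so mpicc/mpifort/shmemcc come from the new build
export LD_LIBRARY_PATH=$OMPI_PREFIX/lib:$LD_LIBRARY_PATH   # so the matching libopen-pal is found
cd openmpi-5.0.5/examples && make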

ParticleTruthSeeker commented 1 week ago


Sure. I have, however, done the things you mention. It now appears not to be building libopen-rte.so.40 for some reason:

oshmem_info: symbol lookup error: /usr/lib/x86_64-linux-gnu/libopen-rte.so.40: undefined symbol: opal_hwloc_binding_policy
oshmem_info: symbol lookup error: /usr/lib/x86_64-linux-gnu/libopen-rte.so.40: undefined symbol: opal_hwloc_binding_policy

This is after manually making the symbolic link the original poster mentions.
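
One way to confirm which installation is actually being picked up (a sketch; the binary and library names may differ on your system):

which oshmem_info                                        # freshly built copy or a system-packaged one?
ldd $(which oshmem_info) | grep -E 'open-rte|open-pal'   # where the run-time libraries are coming from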

ggouaillardet commented 1 week ago

@ParticleTruthSeeker Like I said, open a new issue and provide all the required information if you need help.