SeyedMir opened this issue 6 months ago
An answer to this would be good. There seem to be a number of issues with the latest Open MPI packages, and this is not confined to the git repo: the same issue exists in the tarball.
What's in your $LD_LIBRARY_PATH? ldd will pick the first shared library matching the requested name from LD_LIBRARY_PATH and the standard folders (/lib, /usr/lib, ...). As your ldd picks the opal library from the standard path, it could indicate that your LD_LIBRARY_PATH is not correctly set. Here is a pointer to our FAQ covering this topic.
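For example, a quick way to see which copy the loader resolves (a sketch; $PREFIX here just stands for whatever --prefix you configured with):

echo "$LD_LIBRARY_PATH"                      # what the loader is told to search first
ldd "$PREFIX/lib/libmpi.so" | grep open-pal  # which copy actually gets resolved
ldconfig -p | grep libopen-pal               # system-wide candidates that can shadow it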
Just FYI, I just built the latest v5.0.x and I don't see this issue. My LD_LIBRARY_PATH is empty, and the install lib directory is /global/home/users/tomislavj/ompi-build/install/lib:
[tomislavj@thor001 lib]$ ldd libmpi.so | grep open-pal
libopen-pal.so.80 => /global/home/users/tomislavj/ompi-build/install/lib/libopen-pal.so.80 (0x000015133b1e8000)
I used the same configure line as @SeyedMir.
@ParticleTruthSeeker please point to the other packaging issues.
Hi all, and thank you for taking the time to look into this issue. This problem originates because applications I built using my prior Open MPI installation kept complaining about an inability to access shared memory, so I thought I might finally attempt to solve it. Instead, I have come across a host of issues in the various versions from 4.1.6 to 5.0.x.
The errors variously concern opal. With the tarball I am now using for 5.0.5 on Debian 12, it either fails while building the examples, as below:
mpifort -g ring_usempif08.f90 -o ring_usempif08
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[1]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[1]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
[user:190501] *** Process received signal ***
[user:190501] Signal: Segmentation fault (11)
[user:190501] Signal code: Address not mapped (1)
[user:190501] Failing at address: 0x28
[user:190501] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x3c050)[0x7fe1b1df0050]
[user:190501] [ 1] /usr/lib/x86_64-linux-gnu/libopen-pal.so.40(opal_mem_hooks_unregister_release+0x45)[0x7fe1b1fc4bc5]
[user:190501] [ 2] /lib64/ld-linux-x86-64.so.2(+0x112a)[0x7fe1b220d12a]
[user:190501] [ 3] /lib64/ld-linux-x86-64.so.2(+0x481e)[0x7fe1b221081e]
[user:190501] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x3e55d)[0x7fe1b1df255d]
[user:190501] [ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x3e69a)[0x7fe1b1df269a]
[user:190501] [ 6] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x27251)[0x7fe1b1ddb251]
[user:190501] [ 7] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7fe1b1ddb305]
[user:190501] [ 8] oshmem_info(+0x28a1)[0x558e67a968a1]
[user:190501] *** End of error message ***
Segmentation fault (core dumped)
[user:190513] *** Process received signal ***
[user:190513] Signal: Segmentation fault (11)
[user:190513] Signal code: Address not mapped (1)
[user:190513] Failing at address: 0x28
[user:190513] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x3c050)[0x7f0e499e5050]
[user:190513] [ 1] /usr/lib/x86_64-linux-gnu/libopen-pal.so.40(opal_mem_hooks_unregister_release+0x45)[0x7f0e49bb9bc5]
[user:190513] [ 2] /lib64/ld-linux-x86-64.so.2(+0x112a)[0x7f0e49e0212a]
[user:190513] [ 3] /lib64/ld-linux-x86-64.so.2(+0x481e)[0x7f0e49e0581e]
[user:190513] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x3e55d)[0x7f0e499e755d]
[user:190513] [ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x3e69a)[0x7f0e499e769a]
[user:190513] [ 6] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x27251)[0x7f0e499d0251]
[user:190513] [ 7] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f0e499d0305]
[user:190513] [ 8] oshmem_info(+0x28a1)[0x56154ebd28a1]
[user:190513] *** End of error message ***
Segmentation fault (core dumped)
This message appears with my full LD_LIBRARY_PATH, where I define the install directory's lib path followed by :${LD_LIBRARY_PATH}. If I then attempt to empty LD_LIBRARY_PATH by manually exporting only the install directory's lib path, I get the error shown after the sketch below.
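For reference, a sketch of the two setups just described (the prefix path is hypothetical):

# Hypothetical install prefix; substitute the --prefix passed to configure
PREFIX=$HOME/OpenMPI/install
# Setup 1: install lib directory prepended to the existing path
export LD_LIBRARY_PATH="$PREFIX/lib:$LD_LIBRARY_PATH"
# Setup 2: only the install lib directory exported
export LD_LIBRARY_PATH="$PREFIX/lib"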
mpicc -g hello_c.c -o hello_c
#mpicc -g ring_c.c -o ring_c
mpicc -g connectivity_c.c -o connectivity_c
mpicc -g spc_example.c -o spc_example
make[1]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
mpifort -g hello_mpifh.f -o hello_mpifh
mpifort -g ring_mpifh.f -o ring_mpifh
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
mpifort -g hello_usempi.f90 -o hello_usempi
mpifort -g ring_usempi.f90 -o ring_usempi
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
mpifort -g hello_usempif08.f90 -o hello_usempif08
mpifort -g ring_usempif08.f90 -o ring_usempif08
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[1]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[1]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
[user:192039] *** Process received signal ***
[user:192039] Signal: Segmentation fault (11)
[user:192039] Signal code: Address not mapped (1)
[user:192039] Failing at address: 0x28
[user:192039] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3c050)[0x7ff89401a050]
[user:192039] [ 1] /usr/local/lib/libopen-pal.so.40(opal_mem_hooks_unregister_release+0x45)[0x7ff8941ed895]
[user:192039] [ 2] /lib64/ld-linux-x86-64.so.2(+0x112a)[0x7ff89449612a]
[user:192039] [ 3] /lib64/ld-linux-x86-64.so.2(+0x481e)[0x7ff89449981e]
[user:192039] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x3e55d)[0x7ff89401c55d]
[user:192039] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x3e69a)[0x7ff89401c69a]
[user:192039] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0x27251)[0x7ff894005251]
[user:192039] [ 7] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7ff894005305]
[user:192039] [ 8] oshmem_info(+0x28a1)[0x562dae0128a1]
[user:192039] *** End of error message ***
Segmentation fault (core dumped)
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
shmemcc -g hello_oshmem_c.c -o hello_oshmem
Cannot open configuration file /usr/local/share/openmpi/shmemcc-wrapper-data.txt
Error parsing data file shmemcc: Not found
make[2]: *** [Makefile:154: hello_oshmem] Error 243
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
shmemc++ -g hello_oshmem_cxx.cc -o hello_oshmemcxx
Cannot open configuration file /usr/local/share/openmpi/shmemc++-wrapper-data.txt
Error parsing data file shmemc++: Not found
make[2]: *** [Makefile:156: hello_oshmemcxx] Error 243
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
shmemcc -g ring_oshmem_c.c -o ring_oshmem
Cannot open configuration file /usr/local/share/openmpi/shmemcc-wrapper-data.txt
Error parsing data file shmemcc: Not found
make[2]: *** [Makefile:161: ring_oshmem] Error 243
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
shmemcc -g oshmem_shmalloc.c -o oshmem_shmalloc
Cannot open configuration file /usr/local/share/openmpi/shmemcc-wrapper-data.txt
Error parsing data file shmemcc: Not found
make[2]: *** [Makefile:166: oshmem_shmalloc] Error 243
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
shmemcc -g oshmem_circular_shift.c -o oshmem_circular_shift
Cannot open configuration file /usr/local/share/openmpi/shmemcc-wrapper-data.txt
Error parsing data file shmemcc: Not found
make[2]: *** [Makefile:169: oshmem_circular_shift] Error 243
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
shmemcc -g oshmem_max_reduction.c -o oshmem_max_reduction
Cannot open configuration file /usr/local/share/openmpi/shmemcc-wrapper-data.txt
Error parsing data file shmemcc: Not found
make[2]: *** [Makefile:172: oshmem_max_reduction] Error 243
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
shmemcc -g oshmem_strided_puts.c -o oshmem_strided_puts
Cannot open configuration file /usr/local/share/openmpi/shmemcc-wrapper-data.txt
Error parsing data file shmemcc: Not found
make[2]: *** [Makefile:175: oshmem_strided_puts] Error 243
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[2]: Entering directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
shmemcc -g oshmem_symmetric_data.c -o oshmem_symmetric_data
Cannot open configuration file /usr/local/share/openmpi/shmemcc-wrapper-data.txt
Error parsing data file shmemcc: Not found
make[2]: *** [Makefile:178: oshmem_symmetric_data] Error 243
make[2]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make[1]: *** [Makefile:102: oshmem] Error 2
make[1]: Leaving directory '/home/user/OpenMPI/openmpi-5.0.5/examples'
make: *** [Makefile:77: all] Error 2
This occurs no matter which way I build Open MPI, even when I build it with an empty LD_LIBRARY_PATH. Either way, my runpaths are not being obeyed. I understand there is an rpath/runpath issue, but I have defined LD_LIBRARY_PATH correctly.
In this instance it doesn't seem to have built OpenSHMEM.
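To check whether a runpath was actually embedded, something like the following should show it (the binary name is just illustrative; LD_DEBUG is a glibc facility):

readelf -d ./hello_c | grep -E 'RPATH|RUNPATH'   # DT_RPATH/DT_RUNPATH entries, if any
LD_DEBUG=libs ./hello_c 2>&1 | grep open-pal     # trace the loader's actual search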
Let's start with the basics:
1) What is the installation directory for Open MPI?
2) What is the output of ldd on your application binary?
I see two paths with potential Open MPI libraries: /home/scratch.hmirsadeghi_sw/repos/ompi/_build_rel_v5.0.3/_install/ and /usr/local/lib/. That hints at a potential conflict between a system-installed Open MPI and the Open MPI you installed yourself.
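A quick way to enumerate the candidates and confirm such a conflict (a sketch; it assumes Debian-style package tooling, which matches the Debian 12 system reported above):

dpkg -l | grep -i openmpi                                          # distro-installed Open MPI packages
find /usr/lib /usr/local/lib -name 'libopen-pal.so*' 2>/dev/null   # every copy the loader might pick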
@ParticleTruthSeeker please open a new issue since this is a different one, and do provide the required information.
Assuming you are testing the MPI library you just built: if you want to run make from the examples directory, you first have to make install, update your $PATH to use the newly installed MPI wrappers, and make sure $LD_LIBRARY_PATH does not point to the previous install.
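In shell terms, roughly (the prefix path is hypothetical; use whatever --prefix you configured with):

PREFIX=$HOME/ompi-install
make -j"$(nproc)" && make install
# New wrappers first in PATH, new libraries first for the loader
export PATH="$PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$PREFIX/lib:$LD_LIBRARY_PATH"
cd examples && make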
Sure. I have, however, done the things you mention. It now appears not to be building libopen-rte.so.40 for some reason:
oshmem_info: symbol lookup error: /usr/lib/x86_64-linux-gnu/libopen-rte.so.40: undefined symbol: opal_hwloc_binding_policy
oshmem_info: symbol lookup error: /usr/lib/x86_64-linux-gnu/libopen-rte.so.40: undefined symbol: opal_hwloc_binding_policy
This is after manually creating the symbolic link the original poster mentions.
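One way to narrow this down is to check which libopen-pal the failing libopen-rte pulls in, and whether that copy exports the missing symbol (a sketch; it assumes the Debian-packaged libraries in /usr/lib are the ones involved):

ldd /usr/lib/x86_64-linux-gnu/libopen-rte.so.40 | grep open-pal
nm -D /usr/lib/x86_64-linux-gnu/libopen-pal.so.40 | grep opal_hwloc_binding_policy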
@ParticleTruthSeeker Like I said, open a new issue and provide all the required information if you need help.
Background information

What version of Open MPI are you using?
The v5.0.3 tag of the git repo.

Describe how Open MPI was installed
Installed from a git clone. Configured as below (after running ./autogen.pl):

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running
Details of the problem

After building Open MPI, the resulting libmpi.so is linked against an existing libopen-pal.so.40 on the system which does not provide the needed symbols. As a result, using mpicc leads to errors, and using mpirun leads to the error below:

libmpi.so.40: undefined symbol: opal_smsc_base_framework

Some more details:

This happens despite the fact that the correct libopen-pal files are built and exist in the lib directory of the prefix.

As a dirty workaround, I have to create a libopen-pal.so.40 symlink to the correct libopen-pal.so in the installation lib path (I have already set LD_LIBRARY_PATH to the prefix lib).

So, my question is: why is libmpi.so linked with a libopen-pal.so.40 that does not provide the symbols it needs, and how can I avoid that?
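For reference, the recorded dependency and the workaround can both be expressed roughly as follows (paths are illustrative; $PREFIX stands for the configure --prefix):

objdump -p "$PREFIX/lib/libmpi.so" | grep -E 'NEEDED|RPATH|RUNPATH'   # SONAMEs and any runpath baked into libmpi.so
ln -s "$PREFIX/lib/libopen-pal.so" "$PREFIX/lib/libopen-pal.so.40"    # the dirty workaround described above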