BerndDoser opened this issue 4 years ago
This seems like a problem with the UCX installation (or with EasyBuild). Can you double-check that UCX and Open MPI were installed correctly, and run the following?
ldd <ompi_dir>/lib/openmpi/mca_pml_ucx.so
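Another quick sanity check, assuming the ompi_info binary from this build is on your PATH, is to see whether the UCX components were compiled in at all:
ompi_info | grep -i ucx    # should list the pml/ucx component if Open MPI was built with UCX support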
ldd shows:
# ldd /hits/basement/its/doserbd/easybuild/haswell/apps/OpenMPI/3.1.4-GCC-8.3.0/lib/openmpi/mca_pml_ucx.so
linux-vdso.so.1 => (0x00007ffe89bbd000)
libmpi.so.40 => /hits/basement/its/doserbd/easybuild/haswell/apps/OpenMPI/3.1.4-GCC-8.3.0/lib/libmpi.so.40 (0x00007f97cd648000)
libopen-rte.so.40 => /hits/basement/its/doserbd/easybuild/haswell/apps/OpenMPI/3.1.4-GCC-8.3.0/lib/libopen-rte.so.40 (0x00007f97cd58d000)
libopen-pal.so.40 => /hits/basement/its/doserbd/easybuild/haswell/apps/OpenMPI/3.1.4-GCC-8.3.0/lib/libopen-pal.so.40 (0x00007f97cd47b000)
libucp.so.0 => /lib64/libucp.so.0 (0x00007f97cd233000)
libuct.so.0 => /lib64/libuct.so.0 (0x00007f97ccfaf000)
libucm.so.0 => /lib64/libucm.so.0 (0x00007f97ccd9d000)
libucs.so.0 => /lib64/libucs.so.0 (0x00007f97cca49000)
librt.so.1 => /lib64/librt.so.1 (0x00007f97cc841000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007f97cc63e000)
libhwloc.so.5 => /hits/basement/its/doserbd/easybuild/haswell/apps/hwloc/1.11.12-GCCcore-8.3.0/lib/libhwloc.so.5 (0x00007f97cc5fc000)
libnuma.so.1 => /hits/basement/its/doserbd/easybuild/haswell/apps/numactl/2.0.12-GCCcore-8.3.0/lib/libnuma.so.1 (0x00007f97cc5ef000)
libpciaccess.so.0 => /hits/basement/its/doserbd/easybuild/haswell/apps/libpciaccess/0.14-GCCcore-8.3.0/lib/libpciaccess.so.0 (0x00007f97cd561000)
libxml2.so.2 => /hits/basement/its/doserbd/easybuild/haswell/apps/libxml2/2.9.9-GCCcore-8.3.0/lib/libxml2.so.2 (0x00007f97cc486000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f97cc282000)
libz.so.1 => /hits/basement/its/doserbd/easybuild/haswell/apps/zlib/1.2.11-GCCcore-8.3.0/lib/libz.so.1 (0x00007f97cc269000)
liblzma.so.5 => /hits/basement/its/doserbd/easybuild/haswell/apps/XZ/5.2.4-GCCcore-8.3.0/lib/liblzma.so.5 (0x00007f97cc242000)
libm.so.6 => /lib64/libm.so.6 (0x00007f97cbf40000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f97cbd24000)
libc.so.6 => /lib64/libc.so.6 (0x00007f97cb957000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00007f97cb73d000)
libmlx5.so.1 => /usr/lib64/libmlx5.so.1 (0x00007f97cb4e2000)
libibcm.so.1 => /usr/lib64/libibcm.so.1 (0x00007f97cb2dc000)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00007f97cb0c0000)
/lib64/ld-linux-x86-64.so.2 (0x00007f97cd53d000)
libnl-route-3.so.200 => /usr/lib64/libnl-route-3.so.200 (0x00007f97cae52000)
libnl-3.so.200 => /usr/lib64/libnl-3.so.200 (0x00007f97cac31000)
Interesting, it looks like all the files are found... is this on haswell-094?
Damn, you're right. That was on the master node. On haswell-094 the UCX libraries are missing:
linux-vdso.so.1 => (0x00007ffe63f9f000)
libmpi.so.40 => /hits/basement/its/doserbd/easybuild/haswell/apps/OpenMPI/3.1.4-GCC-8.3.0/lib/libmpi.so.40 (0x00007fdd3885d000)
libopen-rte.so.40 => /hits/basement/its/doserbd/easybuild/haswell/apps/OpenMPI/3.1.4-GCC-8.3.0/lib/libopen-rte.so.40 (0x00007fdd387a3000)
libopen-pal.so.40 => /hits/basement/its/doserbd/easybuild/haswell/apps/OpenMPI/3.1.4-GCC-8.3.0/lib/libopen-pal.so.40 (0x00007fdd38690000)
libucp.so.0 => not found
libuct.so.0 => not found
libucm.so.0 => not found
libucs.so.0 => not found
librt.so.1 => /lib64/librt.so.1 (0x00007fdd38488000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007fdd38285000)
libhwloc.so.5 => /hits/basement/its/doserbd/easybuild/haswell/apps/hwloc/1.11.12-GCCcore-8.3.0/lib/libhwloc.so.5 (0x00007fdd38243000)
libnuma.so.1 => /hits/basement/its/doserbd/easybuild/haswell/apps/numactl/2.0.12-GCCcore-8.3.0/lib/libnuma.so.1 (0x00007fdd3878b000)
libpciaccess.so.0 => /hits/basement/its/doserbd/easybuild/haswell/apps/libpciaccess/0.14-GCCcore-8.3.0/lib/libpciaccess.so.0 (0x00007fdd38781000)
libxml2.so.2 => /hits/basement/its/doserbd/easybuild/haswell/apps/libxml2/2.9.9-GCCcore-8.3.0/lib/libxml2.so.2 (0x00007fdd380da000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fdd37ed6000)
libz.so.1 => /lib64/libz.so.1 (0x00007fdd37cc0000)
liblzma.so.5 => /hits/basement/its/doserbd/easybuild/haswell/apps/XZ/5.2.4-GCCcore-8.3.0/lib/liblzma.so.5 (0x00007fdd37c99000)
libm.so.6 => /lib64/libm.so.6 (0x00007fdd37997000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fdd3777b000)
libc.so.6 => /lib64/libc.so.6 (0x00007fdd373ae000)
/lib64/ld-linux-x86-64.so.2 (0x00007fdd38752000)
Could you please tell me how I can install the missing libraries?
Well, there are multiple options. How did you install them on the master node?
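For example, since the ldd output shows the UCX libraries coming from /lib64 on the master node, one option (just a sketch, not tested on this cluster) would be to install the same system package on the compute nodes; another would be to build UCX through EasyBuild on the shared prefix so the compute nodes pick it up via the module environment:
# option 1: system package on the compute node (package name is the usual one on CentOS/RHEL,
# or it may come from the MOFED installation)
yum install ucx
# option 2: EasyBuild-managed UCX on the shared prefix (easyconfig name is illustrative),
# then rebuild Open MPI with UCX listed as a dependency in its easyconfig so it links against
# that UCX instead of /lib64
eb UCX-1.5.2-GCCcore-8.3.0.eb --robot
eb OpenMPI-3.1.4-GCC-8.3.0.eb --rebuild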
Good question. I didn't install them myself. But I think the issue is solved anyway. Thank you very much for your quick help.
Now I have installed OpenMPI without UCX support. The line
mca_base_component_repository_open: unable to open mca_pml_ucx: libucp.so.0: cannot open shared object file: No such file or directory (ignored)
has vanished, but the remaining error messages are still there:
[haswell-103][[26717,1],10][btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success
[haswell-103][[26717,1],12][btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: haswell-103
Local device: mlx4_0
--------------------------------------------------------------------------
[haswell-103][[26717,1],8][btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success
[haswell-103][[26717,1],14][btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success
[haswell-103][[26717,1],13][btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success
Any ideas?
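Two checks that might narrow this down (the device and host names are taken from the output above; ./xhpl stands in for the actual benchmark command, so this is only a sketch): query the HCA attributes directly on the affected node, and see whether the job runs cleanly when the openib BTL is excluded:
ibv_devinfo -d mlx4_0 -v                    # run on haswell-103; queries the same device attributes openib failed on
mpirun --mca btl ^openib ./xhpl             # exclude the openib BTL to confirm it is the source of the warnings
mpirun --mca btl_base_verbose 100 ./xhpl    # more detail from the BTL framework if the error persists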
We ran into this error with a MOFED installation that used the upstream libraries instead of the Mellanox legacy libraries, combined with a version of Open MPI that had been built against the legacy libraries.
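If it helps to confirm which verbs stack a node is actually running, something along these lines (package names are the common ones and may differ per distribution) shows whether the Mellanox OFED legacy libraries or the upstream rdma-core libraries are installed:
ofed_info -s                                      # prints the MOFED version if Mellanox OFED is installed
rpm -qa | grep -E 'rdma-core|libibverbs|libmlx'   # shows which packages provide libibverbs / libmlx4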
@keenandr over a year later this simple little statement helped me debug a similar error. Thank you.
Background information
HPL benchmarks on a cluster with Mellanox InfiniBand show warnings, and the execution times are slower than expected.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v3.1.4
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Installed with EasyBuild 4.1.1 using the module 'HPL/2.3-foss-2019b'.
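The underlying EasyBuild command was presumably along the lines of the following (easyconfig name inferred from the module name):
eb HPL-2.3-foss-2019b.eb --robot    # pulls in OpenMPI/3.1.4-GCC-8.3.0 and the rest of the foss-2019b toolchain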
Please describe the system on which you are running
Details of the problem
The execution of the Slurm script shows the following error message:
I have seen bug report #5810, which looks similar, but that issue should have been fixed in OpenMPI 3.1.4.
Many thanks in advance for your help!
Best regards, Bernd