open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Error initializing an OpenFabrics device using Mellanox InfiniBand #7461

Open BerndDoser opened 4 years ago

BerndDoser commented 4 years ago

Background information

HPL benchmarks on a cluster with Mellanox InfiniBand show warnings, and the execution time is slower than expected.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v3.1.4

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed with EasyBuild 4.1.1 module 'HPL/2.3-foss-2019b'

Please describe the system on which you are running


Details of the problem

The execution of the Slurm script

ml HPL/2.3-foss-2019b
mpirun xhpl

shows the following error message:

[haswell-094][[62851,1],3][btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   haswell-094
  Local device: mlx4_0
--------------------------------------------------------------------------
[haswell-094][[62851,1],1][btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success
[haswell-094][[62851,1],2][btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success
[haswell-094][[62851,1],0][btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success
[haswell-094.cluster.intern:06943] mca_base_component_repository_open: unable to open mca_pml_ucx: libucp.so.0: cannot open shared object file: No such file or directory (ignored)
[haswell-094.cluster.intern:06947] mca_base_component_repository_open: unable to open mca_pml_ucx: libucp.so.0: cannot open shared object file: No such file or directory (ignored)
[haswell-094.cluster.intern:06944] mca_base_component_repository_open: unable to open mca_pml_ucx: libucp.so.0: cannot open shared object file: No such file or directory (ignored)
[haswell-094][[62851,1],7][btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success

I have seen bug report #5810, which looks similar, but that should be fixed in OpenMPI 3.1.4.

Many thanks in advance for your help!

Best regards, Bernd

yosefe commented 4 years ago

Seems like a problem with the UCX installation (or with EasyBuild). Can you double-check that UCX and OpenMPI were installed, and run ldd <ompi_dir>/lib/openmpi/mca_pml_ucx.so?

BerndDoser commented 4 years ago

ldd shows:

# ldd /hits/basement/its/doserbd/easybuild/haswell/apps/OpenMPI/3.1.4-GCC-8.3.0/lib/openmpi/mca_pml_ucx.so 
    linux-vdso.so.1 =>  (0x00007ffe89bbd000)
    libmpi.so.40 => /hits/basement/its/doserbd/easybuild/haswell/apps/OpenMPI/3.1.4-GCC-8.3.0/lib/libmpi.so.40 (0x00007f97cd648000)
    libopen-rte.so.40 => /hits/basement/its/doserbd/easybuild/haswell/apps/OpenMPI/3.1.4-GCC-8.3.0/lib/libopen-rte.so.40 (0x00007f97cd58d000)
    libopen-pal.so.40 => /hits/basement/its/doserbd/easybuild/haswell/apps/OpenMPI/3.1.4-GCC-8.3.0/lib/libopen-pal.so.40 (0x00007f97cd47b000)
    libucp.so.0 => /lib64/libucp.so.0 (0x00007f97cd233000)
    libuct.so.0 => /lib64/libuct.so.0 (0x00007f97ccfaf000)
    libucm.so.0 => /lib64/libucm.so.0 (0x00007f97ccd9d000)
    libucs.so.0 => /lib64/libucs.so.0 (0x00007f97cca49000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f97cc841000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00007f97cc63e000)
    libhwloc.so.5 => /hits/basement/its/doserbd/easybuild/haswell/apps/hwloc/1.11.12-GCCcore-8.3.0/lib/libhwloc.so.5 (0x00007f97cc5fc000)
    libnuma.so.1 => /hits/basement/its/doserbd/easybuild/haswell/apps/numactl/2.0.12-GCCcore-8.3.0/lib/libnuma.so.1 (0x00007f97cc5ef000)
    libpciaccess.so.0 => /hits/basement/its/doserbd/easybuild/haswell/apps/libpciaccess/0.14-GCCcore-8.3.0/lib/libpciaccess.so.0 (0x00007f97cd561000)
    libxml2.so.2 => /hits/basement/its/doserbd/easybuild/haswell/apps/libxml2/2.9.9-GCCcore-8.3.0/lib/libxml2.so.2 (0x00007f97cc486000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f97cc282000)
    libz.so.1 => /hits/basement/its/doserbd/easybuild/haswell/apps/zlib/1.2.11-GCCcore-8.3.0/lib/libz.so.1 (0x00007f97cc269000)
    liblzma.so.5 => /hits/basement/its/doserbd/easybuild/haswell/apps/XZ/5.2.4-GCCcore-8.3.0/lib/liblzma.so.5 (0x00007f97cc242000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f97cbf40000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f97cbd24000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f97cb957000)
    libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00007f97cb73d000)
    libmlx5.so.1 => /usr/lib64/libmlx5.so.1 (0x00007f97cb4e2000)
    libibcm.so.1 => /usr/lib64/libibcm.so.1 (0x00007f97cb2dc000)
    librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00007f97cb0c0000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f97cd53d000)
    libnl-route-3.so.200 => /usr/lib64/libnl-route-3.so.200 (0x00007f97cae52000)
    libnl-3.so.200 => /usr/lib64/libnl-3.so.200 (0x00007f97cac31000)

yosefe commented 4 years ago

Interesting, looks like all files are found. Was this run on haswell-094?

BerndDoser commented 4 years ago

Damn, you're right. It was on the master node. On haswell-094 they are missing:

    linux-vdso.so.1 =>  (0x00007ffe63f9f000)
    libmpi.so.40 => /hits/basement/its/doserbd/easybuild/haswell/apps/OpenMPI/3.1.4-GCC-8.3.0/lib/libmpi.so.40 (0x00007fdd3885d000)
    libopen-rte.so.40 => /hits/basement/its/doserbd/easybuild/haswell/apps/OpenMPI/3.1.4-GCC-8.3.0/lib/libopen-rte.so.40 (0x00007fdd387a3000)
    libopen-pal.so.40 => /hits/basement/its/doserbd/easybuild/haswell/apps/OpenMPI/3.1.4-GCC-8.3.0/lib/libopen-pal.so.40 (0x00007fdd38690000)
    libucp.so.0 => not found
    libuct.so.0 => not found
    libucm.so.0 => not found
    libucs.so.0 => not found
    librt.so.1 => /lib64/librt.so.1 (0x00007fdd38488000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00007fdd38285000)
    libhwloc.so.5 => /hits/basement/its/doserbd/easybuild/haswell/apps/hwloc/1.11.12-GCCcore-8.3.0/lib/libhwloc.so.5 (0x00007fdd38243000)
    libnuma.so.1 => /hits/basement/its/doserbd/easybuild/haswell/apps/numactl/2.0.12-GCCcore-8.3.0/lib/libnuma.so.1 (0x00007fdd3878b000)
    libpciaccess.so.0 => /hits/basement/its/doserbd/easybuild/haswell/apps/libpciaccess/0.14-GCCcore-8.3.0/lib/libpciaccess.so.0 (0x00007fdd38781000)
    libxml2.so.2 => /hits/basement/its/doserbd/easybuild/haswell/apps/libxml2/2.9.9-GCCcore-8.3.0/lib/libxml2.so.2 (0x00007fdd380da000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007fdd37ed6000)
    libz.so.1 => /lib64/libz.so.1 (0x00007fdd37cc0000)
    liblzma.so.5 => /hits/basement/its/doserbd/easybuild/haswell/apps/XZ/5.2.4-GCCcore-8.3.0/lib/liblzma.so.5 (0x00007fdd37c99000)
    libm.so.6 => /lib64/libm.so.6 (0x00007fdd37997000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fdd3777b000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fdd373ae000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fdd38752000)

Could you please tell me how I can install the missing libraries?
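
The mix-up above (running ldd on the master node rather than on the node named in the error) can be avoided by running the check on the compute node itself and filtering for unresolved entries. A minimal sketch, assuming a Slurm cluster where `srun -w <node>` is allowed; the helper name `missing_libs` is hypothetical, and `<ompi_dir>` is the same placeholder used earlier in this thread:

```shell
#!/bin/sh
# missing_libs (hypothetical helper): read ldd output on stdin and print
# the name of every shared library the dynamic loader could not resolve.
# It only parses text; it does not run ldd itself.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Example (assumes Slurm; runs ldd on the compute node, not the master):
#   srun -w haswell-094 ldd <ompi_dir>/lib/openmpi/mca_pml_ucx.so | missing_libs
```

Empty output means every dependency of the component resolved on that node.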

yosefe commented 4 years ago

Well, there are multiple options. How did you install them on the master node?

BerndDoser commented 4 years ago

Good question. I didn't install them myself. But I think the issue is solved anyway. Thank you very much for your fast help.

BerndDoser commented 4 years ago

Now, I have installed OpenMPI without UCX support. The line

mca_base_component_repository_open: unable to open mca_pml_ucx: libucp.so.0: cannot open shared object file: No such file or directory (ignored)

has vanished, but the remaining error message is still there:

[haswell-103][[26717,1],10][btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success
[haswell-103][[26717,1],12][btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   haswell-103
  Local device: mlx4_0
--------------------------------------------------------------------------
[haswell-103][[26717,1],8][btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success
[haswell-103][[26717,1],14][btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success
[haswell-103][[26717,1],13][btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success

Any ideas?

keenandr commented 4 years ago

We ran into this error when running MOFED with the upstream libraries instead of the Mellanox legacy libraries, together with a version of OpenMPI that had been built against the legacy libraries.
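
One way to look for this kind of mismatch is to check which InfiniBand userspace libraries the openib component resolves to on the affected node. A rough sketch only: `verbs_libs` is a hypothetical helper, and the component file name `mca_btl_openib.so` is an assumption based on the `mca_pml_ucx.so` naming seen earlier in this thread.

```shell
#!/bin/sh
# verbs_libs (hypothetical helper): read ldd output on stdin and print the
# verbs-related libraries together with the path each one resolved to,
# or "not found" if the loader could not resolve it.
verbs_libs() {
    awk '$1 ~ /^lib(ibverbs|mlx4|mlx5|rdmacm)/ {
        print $1, ($3 == "not" ? "not found" : $3)
    }'
}

# Example (assumes Slurm and the assumed component file name):
#   srun -w haswell-103 ldd <ompi_dir>/lib/openmpi/mca_btl_openib.so | verbs_libs
```

Comparing the printed paths with the libraries shipped by the installed driver stack can show whether the build and the runtime agree. Device-specific provider libraries may be loaded at run time rather than linked directly, so they will not necessarily appear in this output.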

rarensu commented 2 years ago

@keenandr Over a year later, this simple little statement helped me debug a similar error. Thank you.