Open richardbeare opened 5 months ago
Do you need to run with the OFI provider? The problem with the workaround is that your local communications now use sockets instead of shared memory, which carries a significant performance penalty.
I assume shared memory communication is the default? How do I figure out why it behaves differently on two machines that supposedly run identical OSes?
I don't think the OFI provider can handle two types of network simultaneously. I heard there is an ongoing effort to do so, but I'm not sure if it is part of any release. @hppritcha might have more info.
I think the OFI problem in #8305 is misleading you. Can you post the output of ompi_info from both the laptop and cloud VM installs?
I noticed that the ordering of some entries is different, but couldn't see anything beyond that.
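Since the two dumps differ mainly in ordering, an order-insensitive comparison of the MCA component lines is easier to trust than reading them side by side. The following is a minimal sketch, assuming each `ompi_info` output has been saved as text; the function names `mca_components` and `compare_installs` are made up for illustration and are not part of Open MPI.

```python
import re

# Matches lines such as "MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.4)".
MCA_LINE = re.compile(r"^\s*MCA\s+(\S+):\s+(\S+)\s+\((.*)\)\s*$")


def mca_components(ompi_info_text):
    """Return {(framework, component): version_string} parsed from ompi_info text."""
    found = {}
    for line in ompi_info_text.splitlines():
        match = MCA_LINE.match(line)
        if match:
            framework, component, versions = match.groups()
            found[(framework, component)] = versions
    return found


def compare_installs(text_a, text_b):
    """Report components missing on either side or listed with different versions."""
    a, b = mca_components(text_a), mca_components(text_b)
    shared = set(a) & set(b)
    return {
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        "version_mismatch": sorted(key for key in shared if a[key] != b[key]),
    }


if __name__ == "__main__":
    # Hypothetical file names; save each machine's ompi_info output first.
    with open("laptop_ompi_info.txt") as fa, open("vm_ompi_info.txt") as fb:
        print(compare_installs(fa.read(), fb.read()))
```

If all three lists come back empty, both installs expose the same components at the same versions, and any behavioural difference has to come from the runtime environment rather than from the build.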
Package: Open MPI richardb@duvel Distribution
Open MPI: 4.1.4
Open MPI repo revision: v4.1.4
Open MPI release date: May 26, 2022
Open RTE: 4.1.4
Open RTE repo revision: v4.1.4
Open RTE release date: May 26, 2022
OPAL: 4.1.4
OPAL repo revision: v4.1.4
OPAL release date: May 26, 2022
MPI API: 3.1.0
Ident string: 4.1.4
Prefix: /slowdata/richardb/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0
Configured architecture: x86_64-pc-linux-gnu
Configure host: duvel
Configured by: richardb
Configured on: Sat Mar 30 11:19:09 UTC 2024
Configure host: duvel
Configure command line: '--prefix=/slowdata/richardb/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0' '--build=x86_64-pc-linux-gnu' '--host=x86_64-pc-linux-gnu' '--with-cuda=internal' '--enable-mpirun-prefix-by-default' '--enable-shared' '--with-hwloc=/slowdata/richardb/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0' '--with-libevent=/slowdata/richardb/easybuild/software/libevent/2.1.12-GCCcore-12.2.0' '--with-ofi=/slowdata/richardb/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0' '--with-pmix=/slowdata/richardb/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0' '--with-ucx=/slowdata/richardb/easybuild/software/UCX/1.13.1-GCCcore-12.2.0' '--with-ucc=/slowdata/richardb/easybuild/software/UCC/1.1.0-GCCcore-12.2.0' '--without-verbs' 'build_alias=x86_64-pc-linux-gnu' 'host_alias=x86_64-pc-linux-gnu' 'CC=gcc' 'CFLAGS=-O2 -ftree-vectorize -march=native -fno-math-errno' 'LDFLAGS=-L/slowdata/richardb/easybuild/software/UCC/1.1.0-GCCcore-12.2.0/lib64 -L/slowdata/richardb/easybuild/software/UCC/1.1.0-GCCcore-12.2.0/lib -L/slowdata/richardb/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0/lib64 -L/slowdata/richardb/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0/lib -L/slowdata/richardb/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib64 -L/slowdata/richardb/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib -L/slowdata/richardb/easybuild/software/UCX/1.13.1-GCCcore-12.2.0/lib64 -L/slowdata/richardb/easybuild/software/UCX/1.13.1-GCCcore-12.2.0/lib -L/slowdata/richardb/easybuild/software/libevent/2.1.12-GCCcore-12.2.0/lib64 -L/slowdata/richardb/easybuild/software/libevent/2.1.12-GCCcore-12.2.0/lib -L/slowdata/richardb/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0/lib64 -L/slowdata/richardb/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0/lib -L/slowdata/richardb/easybuild/software/zlib/1.2.12-GCCcore-12.2.0/lib64 -L/slowdata/richardb/easybuild/software/zlib/1.2.12-GCCcore-12.2.0/lib -L/slowdata/richardb/easybuild/software/Perl/5.36.0-GCCcore-12.2.0/lib64 
-L/slowdata/richardb/easybuild/software/Perl/5.36.0-GCCcore-12.2.0/lib -L/slowdata/richardb/easybuild/software/pkgconf/1.9.3-GCCcore-12.2.0/lib64 -L/slowdata/richardb/easybuild/software/pkgconf/1.9.3-GCCcore-12.2.0/lib -L/slowdata/richardb/easybuild/software/GCCcore/12.2.0/lib64 -L/slowdata/richardb/easybuild/software/GCCcore/12.2.0/lib' 'LIBS=-lm -lpthread' 'CPPFLAGS=-I/slowdata/richardb/easybuild/software/UCC/1.1.0-GCCcore-12.2.0/include -I/slowdata/richardb/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0/include -I/slowdata/richardb/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/include -I/slowdata/richardb/easybuild/software/UCX/1.13.1-GCCcore-12.2.0/include -I/slowdata/richardb/easybuild/software/libevent/2.1.12-GCCcore-12.2.0/include -I/slowdata/richardb/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0/include -I/slowdata/richardb/easybuild/software/zlib/1.2.12-GCCcore-12.2.0/include -I/slowdata/richardb/easybuild/software/pkgconf/1.9.3-GCCcore-12.2.0/include' 'CXX=g++' 'CXXFLAGS=-O2 -ftree-vectorize -march=native -fno-math-errno' 'FC=gfortran' 'FCFLAGS=-O2 -ftree-vectorize -march=native -fno-math-errno' 
'PKG_CONFIG_PATH=/slowdata/richardb/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/UCX/1.13.1-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/libevent/2.1.12-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/libpciaccess/0.17-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/libxml2/2.10.3-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/XZ/5.2.7-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/numactl/2.0.16-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/zlib/1.2.12-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/OpenSSL/1.1/lib/pkgconfig:/slowdata/richardb/easybuild/software/libreadline/8.2-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/expat/2.4.9-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/pkgconf/1.9.3-GCCcore-12.2.0/lib/pkgconfig' '--no-create' '--no-recursion'
Built by: richardb
Built on: Sat 30 Mar 2024 11:25:55 UTC
Built host: duvel
C bindings: yes
C++ bindings: no
Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
Fort mpi_f08 compliance: The mpi_f08 module is available, but due to limitations in the gfortran compiler and/or Open MPI, does not support the following: array subsections, direct passthru (where possible) to underlying Open MPI's C functionality
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: gcc
C compiler absolute: /slowdata/richardb/easybuild/software/GCCcore/12.2.0/bin/gcc
C compiler family name: GNU
C compiler version: 12.2.0
C++ compiler: g++
C++ compiler absolute: /slowdata/richardb/easybuild/software/GCCcore/12.2.0/bin/g++
Fort compiler: gfortran
Fort compiler abs: /slowdata/richardb/easybuild/software/GCCcore/12.2.0/bin/gfortran
Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
Fort 08 assumed shape: yes
Fort optional args: yes
Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
Fort STORAGE_SIZE: yes
Fort BIND(C) (all): yes
Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): yes
Fort TYPE,BIND(C): yes
Fort T,BIND(C,name="a"): yes
Fort PRIVATE: yes
Fort PROTECTED: yes
Fort ABSTRACT: yes
Fort ASYNCHRONOUS: yes
Fort PROCEDURE: yes
Fort USE...ONLY: yes
Fort C_FUNLOC: yes
Fort f08 using wrappers: yes
Fort MPI_SIZEOF: yes
C profiling: yes
C++ profiling: no
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: yes
C++ exceptions: no
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
Sparse Groups: no
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
dl support: yes
Heterogeneous support: no
mpirun default --prefix: yes
MPI_WTIME support: native
Symbol vis. support: yes
Host topology support: yes
IPv6 support: no
MPI1 compatibility: no
MPI extensions: affinity, cuda, pcollreq
FT Checkpoint support: no (checkpoint thread: no)
C/R Enabled Debugging: no
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.1.4)
MCA btl: ofi (MCA v2.1.0, API v3.1.0, Component v4.1.4)
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.4)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.4)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.4)
MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA event: external (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA hwloc: external (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA pmix: ext3x (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA rcache: rgpusm (MCA v2.1.0, API v3.3.0, Component v4.1.4)
MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.1.4)
MCA rcache: gpusm (MCA v2.1.0, API v3.3.0, Component v4.1.4)
MCA reachable: netlink (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA ess: env (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA odls: default (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA odls: pspawn (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA schizo: jsm (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA state: app (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA state: novm (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA state: tool (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA state: orted (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: monitoring (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: adapt (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: cuda (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: han (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: ucc (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA op: avx (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component v4.1.4)
MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v4.1.4)
MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component v4.1.4)
Package: Open MPI ubuntu@easybuild Distribution
Open MPI: 4.1.4
Open MPI repo revision: v4.1.4
Open MPI release date: May 26, 2022
Open RTE: 4.1.4
Open RTE repo revision: v4.1.4
Open RTE release date: May 26, 2022
OPAL: 4.1.4
OPAL repo revision: v4.1.4
OPAL release date: May 26, 2022
MPI API: 3.1.0
Ident string: 4.1.4
Prefix: /home/ubuntu/.local/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0
Configured architecture: x86_64-pc-linux-gnu
Configure host: easybuild
Configured by: ubuntu
Configured on: Sat Mar 30 15:29:09 UTC 2024
Configure host: easybuild
Configure command line: '--prefix=/home/ubuntu/.local/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0' '--build=x86_64-pc-linux-gnu' '--host=x86_64-pc-linux-gnu' '--with-cuda=internal' '--enable-mpirun-prefix-by-default' '--enable-shared' '--with-hwloc=/home/ubuntu/.local/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0' '--with-libevent=/home/ubuntu/.local/easybuild/software/libevent/2.1.12-GCCcore-12.2.0' '--with-ofi=/home/ubuntu/.local/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0' '--with-pmix=/home/ubuntu/.local/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0' '--with-ucx=/home/ubuntu/.local/easybuild/software/UCX/1.13.1-GCCcore-12.2.0' '--with-ucc=/home/ubuntu/.local/easybuild/software/UCC/1.1.0-GCCcore-12.2.0' '--without-verbs' 'build_alias=x86_64-pc-linux-gnu' 'host_alias=x86_64-pc-linux-gnu' 'CC=gcc' 'CFLAGS=-O2 -ftree-vectorize -march=native -fno-math-errno' 'LDFLAGS=-L/home/ubuntu/.local/easybuild/software/UCC/1.1.0-GCCcore-12.2.0/lib64 -L/home/ubuntu/.local/easybuild/software/UCC/1.1.0-GCCcore-12.2.0/lib -L/home/ubuntu/.local/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0/lib64 -L/home/ubuntu/.local/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0/lib -L/home/ubuntu/.local/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib64 -L/home/ubuntu/.local/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib -L/home/ubuntu/.local/easybuild/software/UCX/1.13.1-GCCcore-12.2.0/lib64 -L/home/ubuntu/.local/easybuild/software/UCX/1.13.1-GCCcore-12.2.0/lib -L/home/ubuntu/.local/easybuild/software/libevent/2.1.12-GCCcore-12.2.0/lib64 -L/home/ubuntu/.local/easybuild/software/libevent/2.1.12-GCCcore-12.2.0/lib -L/home/ubuntu/.local/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0/lib64 -L/home/ubuntu/.local/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0/lib -L/home/ubuntu/.local/easybuild/software/zlib/1.2.12-GCCcore-12.2.0/lib64 -L/home/ubuntu/.local/easybuild/software/zlib/1.2.12-GCCcore-12.2.0/lib -L/home/ubuntu/.local/easybuild/software/Perl/5.36.0-GCCcore-12.2.0/lib64 
-L/home/ubuntu/.local/easybuild/software/Perl/5.36.0-GCCcore-12.2.0/lib -L/home/ubuntu/.local/easybuild/software/pkgconf/1.9.3-GCCcore-12.2.0/lib64 -L/home/ubuntu/.local/easybuild/software/pkgconf/1.9.3-GCCcore-12.2.0/lib -L/home/ubuntu/.local/easybuild/software/GCCcore/12.2.0/lib64 -L/home/ubuntu/.local/easybuild/software/GCCcore/12.2.0/lib' 'LIBS=-lm -lpthread' 'CPPFLAGS=-I/home/ubuntu/.local/easybuild/software/UCC/1.1.0-GCCcore-12.2.0/include -I/home/ubuntu/.local/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0/include -I/home/ubuntu/.local/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/include -I/home/ubuntu/.local/easybuild/software/UCX/1.13.1-GCCcore-12.2.0/include -I/home/ubuntu/.local/easybuild/software/libevent/2.1.12-GCCcore-12.2.0/include -I/home/ubuntu/.local/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0/include -I/home/ubuntu/.local/easybuild/software/zlib/1.2.12-GCCcore-12.2.0/include -I/home/ubuntu/.local/easybuild/software/pkgconf/1.9.3-GCCcore-12.2.0/include' 'CXX=g++' 'CXXFLAGS=-O2 -ftree-vectorize -march=native -fno-math-errno' 'FC=gfortran' 'FCFLAGS=-O2 -ftree-vectorize -march=native -fno-math-errno' 
'PKG_CONFIG_PATH=/home/ubuntu/.local/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/UCX/1.13.1-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/libevent/2.1.12-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/libpciaccess/0.17-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/libxml2/2.10.3-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/XZ/5.2.7-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/numactl/2.0.16-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/zlib/1.2.12-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/OpenSSL/1.1/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/libreadline/8.2-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/expat/2.4.9-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/pkgconf/1.9.3-GCCcore-12.2.0/lib/pkgconfig' '--no-create' '--no-recursion'
Built by: ubuntu
Built on: Sat 30 Mar 2024 15:35:46 UTC
Built host: easybuild
C bindings: yes
C++ bindings: no
Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
Fort mpi_f08 compliance: The mpi_f08 module is available, but due to limitations in the gfortran compiler and/or Open MPI, does not support the following: array subsections, direct passthru (where possible) to underlying Open MPI's C functionality
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: gcc
C compiler absolute: /home/ubuntu/.local/easybuild/software/GCCcore/12.2.0/bin/gcc
C compiler family name: GNU
C compiler version: 12.2.0
C++ compiler: g++
C++ compiler absolute: /home/ubuntu/.local/easybuild/software/GCCcore/12.2.0/bin/g++
Fort compiler: gfortran
Fort compiler abs: /home/ubuntu/.local/easybuild/software/GCCcore/12.2.0/bin/gfortran
Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
Fort 08 assumed shape: yes
Fort optional args: yes
Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
Fort STORAGE_SIZE: yes
Fort BIND(C) (all): yes
Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): yes
Fort TYPE,BIND(C): yes
Fort T,BIND(C,name="a"): yes
Fort PRIVATE: yes
Fort PROTECTED: yes
Fort ABSTRACT: yes
Fort ASYNCHRONOUS: yes
Fort PROCEDURE: yes
Fort USE...ONLY: yes
Fort C_FUNLOC: yes
Fort f08 using wrappers: yes
Fort MPI_SIZEOF: yes
C profiling: yes
C++ profiling: no
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: yes
C++ exceptions: no
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
Sparse Groups: no
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
dl support: yes
Heterogeneous support: no
mpirun default --prefix: yes
MPI_WTIME support: native
Symbol vis. support: yes
Host topology support: yes
IPv6 support: no
MPI1 compatibility: no
MPI extensions: affinity, cuda, pcollreq
FT Checkpoint support: no (checkpoint thread: no)
C/R Enabled Debugging: no
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.4)
MCA btl: ofi (MCA v2.1.0, API v3.1.0, Component v4.1.4)
MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.1.4)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.4)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.4)
MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA event: external (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA hwloc: external (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA pmix: ext3x (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.1.4)
MCA rcache: gpusm (MCA v2.1.0, API v3.3.0, Component v4.1.4)
MCA rcache: rgpusm (MCA v2.1.0, API v3.3.0, Component v4.1.4)
MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA reachable: netlink (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA ess: env (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA odls: pspawn (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA odls: default (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA schizo: jsm (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA state: tool (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA state: novm (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA state: orted (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA state: app (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: adapt (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: monitoring (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: cuda (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: ucc (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA coll: han (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA op: avx (MCA v2.1.0, API v1.0.0, Component v4.1.4)
MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.1.4)
MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component v4.1.4)
MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component v4.1.4)
MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v4.1.4)
MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component v4.1.4)
Sorry for the delay in responding. I see you actually do have both UCX and OFI (libfabric) installed on the systems. To make sure we aren't debugging a problem in one of these, could you rerun with
stuff preceding mpirun --verbose -n 8 --mca pml ob1 --mca btl self,vader /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
and see if the test runs on both systems?
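The `--mca` selections requested above can also be supplied through the environment, in the same way the runs in this thread set `OMPI_MCA_rmaps_base_oversubscribe=1`: Open MPI reads an MCA parameter `NAME` from the variable `OMPI_MCA_NAME`. A minimal sketch of that mapping follows; the helper name `mca_flags_to_env` is hypothetical, and it only handles the simple `--mca NAME VALUE` form.

```python
def mca_flags_to_env(argv):
    """Translate `--mca NAME VALUE` triples into OMPI_MCA_NAME=VALUE entries.

    Only the plain three-token form is recognised; other mpirun options
    are skipped untouched.
    """
    env = {}
    i = 0
    while i < len(argv):
        if argv[i] in ("--mca", "-mca") and i + 2 < len(argv):
            env["OMPI_MCA_" + argv[i + 1]] = argv[i + 2]
            i += 3
        else:
            i += 1
    return env


if __name__ == "__main__":
    cmd = ["mpirun", "-n", "8", "--mca", "pml", "ob1",
           "--mca", "btl", "self,vader", "./mpi_test_ring_c"]
    for name, value in mca_flags_to_env(cmd).items():
        print(f"export {name}={value}")
```

Values given on the `mpirun` command line take precedence over the environment, so the explicit flags are the safer choice when making sure a specific transport is under test.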
On the laptop:
OMPI_MCA_rmaps_base_oversubscribe=1 /slowdata/richardb/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun -n 8 --mca pml ob1 --mca btl self,vader /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 2 exiting
Process 3 exiting
Process 4 exiting
Process 5 exiting
Process 6 exiting
Process 7 exiting
On the VM:
OMPI_MCA_rmaps_base_oversubscribe=1 ~/.local/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun -n 8 --mca pml ob1 --mca btl self,vader ~/.local/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 2 exiting
Process 4 exiting
Process 5 exiting
Process 1 exiting
Process 3 exiting
Process 7 exiting
Process 6 exiting
Thanks. Now it would be useful to see if ucx transport works for both systems. Could you try first
stuff preceding mpirun --verbose -n 8 --mca pml ucx /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
and then
stuff preceding mpirun --verbose -n 8 --mca pml cm /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
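Because one of these selections may simply hang, it can help to run each under a time limit so a hang is reported instead of blocking the terminal. This is a minimal sketch, assuming `mpirun` is on `PATH` and the ring test sits in the current directory; `run_with_timeout`, `PML_CASES` and the 60-second limit are illustrative choices, not anything Open MPI provides.

```python
import shutil
import subprocess
import sys

# The three PML selections requested in this thread.
PML_CASES = {
    "ob1+vader": ["--mca", "pml", "ob1", "--mca", "btl", "self,vader"],
    "ucx": ["--mca", "pml", "ucx"],
    "cm": ["--mca", "pml", "cm"],
}


def run_with_timeout(cmd, seconds):
    """Run cmd and return (status, output); status is "ok", "failed" or "hung"."""
    try:
        done = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                              text=True, timeout=seconds)
    except subprocess.TimeoutExpired as exc:
        out = exc.stdout or ""
        if isinstance(out, bytes):  # partial output may arrive undecoded
            out = out.decode(errors="replace")
        return "hung", out
    return ("ok" if done.returncode == 0 else "failed"), done.stdout


if __name__ == "__main__":
    mpirun = shutil.which("mpirun")
    if mpirun is None:
        sys.exit("mpirun not found on PATH")
    for label, flags in PML_CASES.items():
        status, _ = run_with_timeout([mpirun, "-n", "8", *flags, "./mpi_test_ring_c"], 60)
        print(f"{label}: {status}")
```

On timeout only the `mpirun` process itself is killed, so leftover ranks may need to be cleaned up by hand before the next run.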
OMPI_MCA_rmaps_base_oversubscribe=1 /slowdata/richardb/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun --verbose -n 8 --mca pml ucx /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 5 exiting
Process 6 exiting
Process 7 exiting
Process 2 exiting
Process 3 exiting
Process 4 exiting
Process 1 exiting
On the laptop, the cm run hangs:
OMPI_MCA_rmaps_base_oversubscribe=1 /slowdata/richardb/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun --verbose -n 8 --mca pml cm /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
OMPI_MCA_rmaps_base_oversubscribe=1 ~/.local/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun --verbose -n 8 --mca pml ucx ~/.local/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.
This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.
Host: easybuild
Framework: pml
--------------------------------------------------------------------------
[easybuild:210384] PML ucx cannot be selected
[easybuild:210373] PML ucx cannot be selected
[easybuild:210369] 2 more processes have sent help message help-mca-base.txt / find-available:none found
[easybuild:210369] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
OMPI_MCA_rmaps_base_oversubscribe=1 ~/.local/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun --verbose -n 8 --mca pml cm ~/.local/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 2 exiting
Process 3 exiting
Process 5 exiting
Process 4 exiting
Process 6 exiting
Process 7 exiting
Thanks! This points to an issue using OFI libfabric on your laptop. I'm not sure this is worth pursuing further, as we generally do not recommend using OFI libfabric on a single node. Nevertheless, to look further into this, can you report which release of libfabric EasyBuild installed on your laptop? I'm assuming that's how OFI libfabric got installed on the laptop.
Thanks. There was also a system version of libfabric, installed as a dependency of a system vtk, I think. I had hoped that removing it and rebuilding would solve the issue, but no luck.
However, I may have some clues.
The libfabric version installed by EasyBuild is the same on both the laptop and the cloud VM:
easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib/libfabric.so
easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib/libfabric.so.1.19.1
On the VM:
cat .local/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib/pkgconfig/libfabric.pc
prefix=/home/ubuntu/.local/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include
Name: libfabric
Description: OFI-WG libfabric
URL: https://github.com/ofiwg/libfabric.git
Version: 1.16.1
Requires:
Cflags: -I${includedir}
Libs: -L${libdir} -lfabric
Libs.private: -lrt -lnuma -libverbs -luuid -lefa -latomic -lpthread -ldl -lm
Requires.private:
On the laptop:
cat /slowdata/richardb/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib/pkgconfig/libfabric.pc
prefix=/slowdata/richardb/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include
Name: libfabric
Description: OFI-WG libfabric
URL: https://github.com/ofiwg/libfabric.git
Version: 1.16.1
Requires:
Cflags: -I${includedir}
Libs: -L${libdir} -lfabric
Libs.private: -lrt -lnuma -libverbs -luuid -lefa -latomic -lpthread -ldl -lm -lcudart -lcuda
Requires.private:
So the laptop version is linking cuda stuff, which may explain the problems...
I'll see if I can figure out how to turn it off.
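The difference between the two `libfabric.pc` files is easy to miss by eye. A short Python sketch can pull out the `Libs.private` flags and diff them; `private_libs` is a hypothetical helper name, and the two strings are copied from the files quoted above.

```python
def private_libs(pc_text):
    """Return the set of -l flags on the Libs.private line of a .pc file."""
    for line in pc_text.splitlines():
        if line.startswith("Libs.private:"):
            flags = line.split(":", 1)[1].split()
            return {flag for flag in flags if flag.startswith("-l")}
    return set()


# Libs.private lines as quoted in this thread.
vm_pc = (
    "Libs.private: -lrt -lnuma -libverbs -luuid -lefa "
    "-latomic -lpthread -ldl -lm"
)
laptop_pc = (
    "Libs.private: -lrt -lnuma -libverbs -luuid -lefa "
    "-latomic -lpthread -ldl -lm -lcudart -lcuda"
)

# Flags present only in the laptop build.
extra_on_laptop = sorted(private_libs(laptop_pc) - private_libs(vm_pc))
print(extra_on_laptop)  # ['-lcuda', '-lcudart']
```

In practice the two strings would be read from the `.pc` files on each machine instead of being pasted in.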
No luck, I'm afraid. I've added a --with-cuda=no to the libfabric configuration, and cuda doesn't show up as a dependency in pkgconfig. Which comes back to the question as to why cm is selected as the default on this laptop.
I now have another potential cause of the problem - the laptop has docker installed, which leads to a bunch of network interfaces being available. I notice that the fi_info command is reporting entries like:
provider: tcp;ofi_rxm
fabric: 172.17.0.0/16
domain: docker0
version: 116.10
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: 192.168.122.0/24
domain: virbr0
version: 116.10
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: 192.168.0.0/24
domain: wlp59s0
version: 116.10
type: FI_EP_RDM
where the last one is the real wifi interface.
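To see at a glance which interfaces libfabric is offering, the `key: value` output can be grouped into one record per provider entry. This is a minimal Python sketch over the text quoted above. `parse_fi_info` is a hypothetical helper, and treating domains that start with `docker` or `virbr` as virtual bridges is a naming heuristic assumed here, not something `fi_info` reports.

```python
def parse_fi_info(text):
    """Split 'key: value' lines into one dict per 'provider:' entry."""
    records = []
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "provider":
            records.append({})  # a provider line starts a new entry
        if records:
            records[-1][key] = value
    return records


# Entries as quoted in this thread.
SAMPLE = """\
provider: tcp;ofi_rxm
fabric: 172.17.0.0/16
domain: docker0
version: 116.10
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: 192.168.122.0/24
domain: virbr0
version: 116.10
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: 192.168.0.0/24
domain: wlp59s0
version: 116.10
type: FI_EP_RDM
"""

records = parse_fi_info(SAMPLE)
domains = [record["domain"] for record in records]
print(domains)  # ['docker0', 'virbr0', 'wlp59s0']

# Assumed heuristic: bridge interfaces created by docker or libvirt.
virtual = [d for d in domains if d.startswith(("docker", "virbr"))]
print(virtual)  # ['docker0', 'virbr0']
```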
Finally, the ring test completes if I turn off wifi and plug in an ethernet cable, although there are warnings:
OMPI_MCA_rmaps_base_oversubscribe=1 /slowdata/richardb/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun --verbose -n 8 /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
duvel:rank5.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank5: PSM3 can't open nic unit: 1 (err=23)
duvel:rank5: PSM3 can't open nic unit: 1 (err=23)
duvel:rank5.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank5: PSM3 can't open nic unit: 1 (err=23)
duvel:rank5.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank5.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank5: PSM3 can't open nic unit: 1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: duvel
Location: mtl_ofi_component.c:512
Error: Invalid argument (22)
--------------------------------------------------------------------------
duvel:rank1.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank1: PSM3 can't open nic unit: 1 (err=23)
duvel:rank1: PSM3 can't open nic unit: 1 (err=23)
duvel:rank1.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank1: PSM3 can't open nic unit: 1 (err=23)
duvel:rank1.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank0.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank0: PSM3 can't open nic unit: 1 (err=23)
duvel:rank0.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank0: PSM3 can't open nic unit: 1 (err=23)
duvel:rank0: PSM3 can't open nic unit: 1 (err=23)
duvel:rank0.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank3: PSM3 can't open nic unit: 1 (err=23)
duvel:rank3.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank3.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank3: PSM3 can't open nic unit: 1 (err=23)
duvel:rank3.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank3: PSM3 can't open nic unit: 1 (err=23)
duvel:rank6.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank6: PSM3 can't open nic unit: 1 (err=23)
duvel:rank6: PSM3 can't open nic unit: 1 (err=23)
duvel:rank6.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank6.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank6: PSM3 can't open nic unit: 1 (err=23)
duvel:rank7.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank7: PSM3 can't open nic unit: 1 (err=23)
duvel:rank7.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank7: PSM3 can't open nic unit: 1 (err=23)
duvel:rank7.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank7: PSM3 can't open nic unit: 1 (err=23)
duvel:rank2.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank2: PSM3 can't open nic unit: 1 (err=23)
duvel:rank2: PSM3 can't open nic unit: 1 (err=23)
duvel:rank2.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank2.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank2: PSM3 can't open nic unit: 1 (err=23)
duvel:rank4.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank4: PSM3 can't open nic unit: 1 (err=23)
duvel:rank4: PSM3 can't open nic unit: 1 (err=23)
duvel:rank4.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank4: PSM3 can't open nic unit: 1 (err=23)
duvel:rank4.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank1: PSM3 can't open nic unit: 1 (err=23)
duvel:rank1.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank0: PSM3 can't open nic unit: 1 (err=23)
duvel:rank0.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank3: PSM3 can't open nic unit: 1 (err=23)
duvel:rank3.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank6.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank6: PSM3 can't open nic unit: 1 (err=23)
duvel:rank7: PSM3 can't open nic unit: 1 (err=23)
duvel:rank7.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank2.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank2: PSM3 can't open nic unit: 1 (err=23)
duvel:rank4.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank4: PSM3 can't open nic unit: 1 (err=23)
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 2 exiting
Process 4 exiting
Process 5 exiting
Process 3 exiting
Process 7 exiting
Process 6 exiting
[duvel:3056825] 7 more processes have sent help message help-mtl-ofi.txt / OFI call fail
[duvel:3056825] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Enable pml_base_verbose by setting it to 50 and you shall get your answer.
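The output produced at that verbosity is long and interleaved across ranks, so a small filter helps when reading it. The Python sketch below assumes log lines of the form `[host:pid] select: ...`, as Open MPI 4.1.4 prints them in this thread. `summarize_pml_selection` is a hypothetical helper, not an Open MPI tool.

```python
import re
from collections import defaultdict

LINE = re.compile(r"^\[[^:\]]+:(?P<pid>\d+)\] (?P<msg>.*)$")
INIT = re.compile(r"select: initializing pml component (\S+)")
PRIO = re.compile(r"select: init returned priority (\d+)")
SELECTED = re.compile(r"select: component (\S+) selected$")


def summarize_pml_selection(log_text):
    """Return (priorities, selected) keyed by process ID.

    priorities[pid] maps component name -> priority it reported;
    selected[pid] is the component that process finally chose.
    """
    priorities = defaultdict(dict)
    selected = {}
    current = {}  # pid -> component currently being initialized
    for raw in log_text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue
        pid, msg = match["pid"], match["msg"]
        init = INIT.match(msg)
        if init:
            current[pid] = init.group(1)
            continue
        prio = PRIO.match(msg)
        if prio and pid in current:
            # The priority line follows the "initializing" line of the same pid.
            priorities[pid][current.pop(pid)] = int(prio.group(1))
            continue
        chosen = SELECTED.match(msg)
        if chosen:
            selected[pid] = chosen.group(1)
    return dict(priorities), selected


# A few lines for one process, in the format seen in this thread.
SAMPLE_LOG = """\
[duvel:3794600] select: initializing pml component cm
[duvel:3794600] select: init returned priority 25
[duvel:3794600] select: initializing pml component ucx
[duvel:3794600] select: init returned priority 19
[duvel:3794600] select: initializing pml component ob1
[duvel:3794600] select: init returned priority 20
[duvel:3794600] selected cm best priority 25
[duvel:3794600] select: component cm selected
[duvel:3794600] select: component ucx not selected / finalized
"""

priorities, selected = summarize_pml_selection(SAMPLE_LOG)
print(priorities)  # {'3794600': {'cm': 25, 'ucx': 19, 'ob1': 20}}
print(selected)    # {'3794600': 'cm'}
```

Tracking the component per process ID is what lets the priorities be matched up even when lines from several ranks are interleaved.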
On wifi:
OMPI_MCA_rmaps_base_oversubscribe=1 /slowdata/richardb/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun --verbose -n 8 --mca pml_base_verbose 50 /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
[duvel:3794600] mca: base: components_register: registering framework pml components
[duvel:3794600] mca: base: components_register: found loaded component v
[duvel:3794600] mca: base: components_register: component v register function successful
[duvel:3794600] mca: base: components_register: found loaded component monitoring
[duvel:3794600] mca: base: components_register: component monitoring register function successful
[duvel:3794600] mca: base: components_register: found loaded component cm
[duvel:3794600] mca: base: components_register: component cm register function successful
[duvel:3794600] mca: base: components_register: found loaded component ucx
[duvel:3794600] mca: base: components_register: component ucx register function successful
[duvel:3794600] mca: base: components_register: found loaded component ob1
[duvel:3794600] mca: base: components_register: component ob1 register function successful
[duvel:3794600] mca: base: components_open: opening pml components
[duvel:3794600] mca: base: components_open: found loaded component v
[duvel:3794600] mca: base: components_open: component v open function successful
[duvel:3794600] mca: base: components_open: found loaded component monitoring
[duvel:3794600] mca: base: components_open: component monitoring open function successful
[duvel:3794600] mca: base: components_open: found loaded component cm
[duvel:3794600] mca: base: components_open: component cm open function successful
[duvel:3794600] mca: base: components_open: found loaded component ucx
[duvel:3794600] mca: base: components_open: component ucx open function successful
[duvel:3794600] mca: base: components_open: found loaded component ob1
[duvel:3794600] mca: base: components_open: component ob1 open function successful
[duvel:3794600] select: component v not in the include list
[duvel:3794600] select: component monitoring not in the include list
[duvel:3794600] select: initializing pml component cm
[duvel:3794600] select: init returned priority 25
[duvel:3794600] select: initializing pml component ucx
[duvel:3794600] select: init returned priority 19
[duvel:3794600] select: initializing pml component ob1
[duvel:3794600] select: init returned priority 20
[duvel:3794600] selected cm best priority 25
[duvel:3794600] select: component cm selected
[duvel:3794600] select: component ucx not selected / finalized
[duvel:3794600] select: component ob1 not selected / finalized
[duvel:3794600] mca: base: close: component v closed
[duvel:3794600] mca: base: close: unloading component v
[duvel:3794600] mca: base: close: component monitoring closed
[duvel:3794600] mca: base: close: unloading component monitoring
[duvel:3794600] mca: base: close: component ucx closed
[duvel:3794600] mca: base: close: unloading component ucx
[duvel:3794600] mca: base: close: component ob1 closed
[duvel:3794600] mca: base: close: unloading component ob1
[duvel:3794599] mca: base: components_register: registering framework pml components
[duvel:3794599] mca: base: components_register: found loaded component v
[duvel:3794599] mca: base: components_register: component v register function successful
[duvel:3794599] mca: base: components_register: found loaded component monitoring
[duvel:3794599] mca: base: components_register: component monitoring register function successful
[duvel:3794599] mca: base: components_register: found loaded component cm
[duvel:3794599] mca: base: components_register: component cm register function successful
[duvel:3794599] mca: base: components_register: found loaded component ucx
[duvel:3794599] mca: base: components_register: component ucx register function successful
[duvel:3794599] mca: base: components_register: found loaded component ob1
[duvel:3794599] mca: base: components_register: component ob1 register function successful
[duvel:3794599] mca: base: components_open: opening pml components
[duvel:3794599] mca: base: components_open: found loaded component v
[duvel:3794599] mca: base: components_open: component v open function successful
[duvel:3794599] mca: base: components_open: found loaded component monitoring
[duvel:3794599] mca: base: components_open: component monitoring open function successful
[duvel:3794599] mca: base: components_open: found loaded component cm
[duvel:3794599] mca: base: components_open: component cm open function successful
[duvel:3794599] mca: base: components_open: found loaded component ucx
[duvel:3794598] mca: base: components_register: registering framework pml components
[duvel:3794598] mca: base: components_register: found loaded component v
[duvel:3794598] mca: base: components_register: component v register function successful
[duvel:3794598] mca: base: components_register: found loaded component monitoring
[duvel:3794598] mca: base: components_register: component monitoring register function successful
[duvel:3794598] mca: base: components_register: found loaded component cm
[duvel:3794598] mca: base: components_register: component cm register function successful
[duvel:3794598] mca: base: components_register: found loaded component ucx
[duvel:3794598] mca: base: components_register: component ucx register function successful
[duvel:3794598] mca: base: components_register: found loaded component ob1
[duvel:3794598] mca: base: components_register: component ob1 register function successful
[duvel:3794598] mca: base: components_open: opening pml components
[duvel:3794598] mca: base: components_open: found loaded component v
[duvel:3794598] mca: base: components_open: component v open function successful
[duvel:3794598] mca: base: components_open: found loaded component monitoring
[duvel:3794598] mca: base: components_open: component monitoring open function successful
[duvel:3794598] mca: base: components_open: found loaded component cm
[duvel:3794598] mca: base: components_open: component cm open function successful
[duvel:3794598] mca: base: components_open: found loaded component ucx
[duvel:3794602] mca: base: components_register: registering framework pml components
[duvel:3794602] mca: base: components_register: found loaded component v
[duvel:3794602] mca: base: components_register: component v register function successful
[duvel:3794602] mca: base: components_register: found loaded component monitoring
[duvel:3794602] mca: base: components_register: component monitoring register function successful
[duvel:3794602] mca: base: components_register: found loaded component cm
[duvel:3794602] mca: base: components_register: component cm register function successful
[duvel:3794602] mca: base: components_register: found loaded component ucx
[duvel:3794602] mca: base: components_register: component ucx register function successful
[duvel:3794602] mca: base: components_register: found loaded component ob1
[duvel:3794597] mca: base: components_register: registering framework pml components
[duvel:3794597] mca: base: components_register: found loaded component v
[duvel:3794597] mca: base: components_register: component v register function successful
[duvel:3794597] mca: base: components_register: found loaded component monitoring
[duvel:3794602] mca: base: components_register: component ob1 register function successful
[duvel:3794602] mca: base: components_open: opening pml components
[duvel:3794602] mca: base: components_open: found loaded component v
[duvel:3794597] mca: base: components_register: component monitoring register function successful
[duvel:3794597] mca: base: components_register: found loaded component cm
[duvel:3794597] mca: base: components_register: component cm register function successful
[duvel:3794597] mca: base: components_register: found loaded component ucx
[duvel:3794597] mca: base: components_register: component ucx register function successful
[duvel:3794602] mca: base: components_open: component v open function successful
[duvel:3794602] mca: base: components_open: found loaded component monitoring
[duvel:3794602] mca: base: components_open: component monitoring open function successful
[duvel:3794602] mca: base: components_open: found loaded component cm
[duvel:3794597] mca: base: components_register: found loaded component ob1
[duvel:3794597] mca: base: components_register: component ob1 register function successful
[duvel:3794597] mca: base: components_open: opening pml components
[duvel:3794597] mca: base: components_open: found loaded component v
[duvel:3794602] mca: base: components_open: component cm open function successful
[duvel:3794602] mca: base: components_open: found loaded component ucx
[duvel:3794597] mca: base: components_open: component v open function successful
[duvel:3794597] mca: base: components_open: found loaded component monitoring
[duvel:3794597] mca: base: components_open: component monitoring open function successful
[duvel:3794597] mca: base: components_open: found loaded component cm
[duvel:3794597] mca: base: components_open: component cm open function successful
[duvel:3794597] mca: base: components_open: found loaded component ucx
[duvel:3794601] mca: base: components_register: registering framework pml components
[duvel:3794601] mca: base: components_register: found loaded component v
[duvel:3794601] mca: base: components_register: component v register function successful
[duvel:3794601] mca: base: components_register: found loaded component monitoring
[duvel:3794601] mca: base: components_register: component monitoring register function successful
[duvel:3794601] mca: base: components_register: found loaded component cm
[duvel:3794601] mca: base: components_register: component cm register function successful
[duvel:3794601] mca: base: components_register: found loaded component ucx
[duvel:3794601] mca: base: components_register: component ucx register function successful
[duvel:3794601] mca: base: components_register: found loaded component ob1
[duvel:3794601] mca: base: components_register: component ob1 register function successful
[duvel:3794601] mca: base: components_open: opening pml components
[duvel:3794601] mca: base: components_open: found loaded component v
[duvel:3794601] mca: base: components_open: component v open function successful
[duvel:3794601] mca: base: components_open: found loaded component monitoring
[duvel:3794601] mca: base: components_open: component monitoring open function successful
[duvel:3794601] mca: base: components_open: found loaded component cm
[duvel:3794601] mca: base: components_open: component cm open function successful
[duvel:3794601] mca: base: components_open: found loaded component ucx
[duvel:3794595] mca: base: components_register: registering framework pml components
[duvel:3794595] mca: base: components_register: found loaded component v
[duvel:3794595] mca: base: components_register: component v register function successful
[duvel:3794599] mca: base: components_open: component ucx open function successful
[duvel:3794599] mca: base: components_open: found loaded component ob1
[duvel:3794599] mca: base: components_open: component ob1 open function successful
[duvel:3794595] mca: base: components_register: found loaded component monitoring
[duvel:3794595] mca: base: components_register: component monitoring register function successful
[duvel:3794595] mca: base: components_register: found loaded component cm
[duvel:3794595] mca: base: components_register: component cm register function successful
[duvel:3794595] mca: base: components_register: found loaded component ucx
[duvel:3794595] mca: base: components_register: component ucx register function successful
[duvel:3794595] mca: base: components_register: found loaded component ob1
[duvel:3794595] mca: base: components_register: component ob1 register function successful
[duvel:3794595] mca: base: components_open: opening pml components
[duvel:3794595] mca: base: components_open: found loaded component v
[duvel:3794596] mca: base: components_register: registering framework pml components
[duvel:3794596] mca: base: components_register: found loaded component v
[duvel:3794596] mca: base: components_register: component v register function successful
[duvel:3794596] mca: base: components_register: found loaded component monitoring
[duvel:3794595] mca: base: components_open: component v open function successful
[duvel:3794595] mca: base: components_open: found loaded component monitoring
[duvel:3794595] mca: base: components_open: component monitoring open function successful
[duvel:3794595] mca: base: components_open: found loaded component cm
[duvel:3794596] mca: base: components_register: component monitoring register function successful
[duvel:3794596] mca: base: components_register: found loaded component cm
[duvel:3794596] mca: base: components_register: component cm register function successful
[duvel:3794596] mca: base: components_register: found loaded component ucx
[duvel:3794595] mca: base: components_open: component cm open function successful
[duvel:3794595] mca: base: components_open: found loaded component ucx
[duvel:3794596] mca: base: components_register: component ucx register function successful
[duvel:3794596] mca: base: components_register: found loaded component ob1
[duvel:3794596] mca: base: components_register: component ob1 register function successful
[duvel:3794596] mca: base: components_open: opening pml components
[duvel:3794596] mca: base: components_open: found loaded component v
[duvel:3794596] mca: base: components_open: component v open function successful
[duvel:3794596] mca: base: components_open: found loaded component monitoring
[duvel:3794596] mca: base: components_open: component monitoring open function successful
[duvel:3794596] mca: base: components_open: found loaded component cm
[duvel:3794596] mca: base: components_open: component cm open function successful
[duvel:3794596] mca: base: components_open: found loaded component ucx
[duvel:3794602] mca: base: components_open: component ucx open function successful
[duvel:3794602] mca: base: components_open: found loaded component ob1
[duvel:3794602] mca: base: components_open: component ob1 open function successful
[duvel:3794597] mca: base: components_open: component ucx open function successful
[duvel:3794597] mca: base: components_open: found loaded component ob1
[duvel:3794597] mca: base: components_open: component ob1 open function successful
[duvel:3794598] mca: base: components_open: component ucx open function successful
[duvel:3794598] mca: base: components_open: found loaded component ob1
[duvel:3794598] mca: base: components_open: component ob1 open function successful
[duvel:3794599] select: component v not in the include list
[duvel:3794599] select: component monitoring not in the include list
[duvel:3794599] select: initializing pml component cm
[duvel:3794597] select: component v not in the include list
[duvel:3794597] select: component monitoring not in the include list
[duvel:3794597] select: initializing pml component cm
[duvel:3794602] select: component v not in the include list
[duvel:3794602] select: component monitoring not in the include list
[duvel:3794602] select: initializing pml component cm
[duvel:3794599] select: init returned priority 25
[duvel:3794599] select: initializing pml component ucx
[duvel:3794601] mca: base: components_open: component ucx open function successful
[duvel:3794601] mca: base: components_open: found loaded component ob1
[duvel:3794601] mca: base: components_open: component ob1 open function successful
[duvel:3794597] select: init returned priority 25
[duvel:3794597] select: initializing pml component ucx
[duvel:3794602] select: init returned priority 25
[duvel:3794602] select: initializing pml component ucx
[duvel:3794598] select: component v not in the include list
[duvel:3794598] select: component monitoring not in the include list
[duvel:3794598] select: initializing pml component cm
[duvel:3794595] mca: base: components_open: component ucx open function successful
[duvel:3794595] mca: base: components_open: found loaded component ob1
[duvel:3794595] mca: base: components_open: component ob1 open function successful
[duvel:3794596] mca: base: components_open: component ucx open function successful
[duvel:3794596] mca: base: components_open: found loaded component ob1
[duvel:3794596] mca: base: components_open: component ob1 open function successful
[duvel:3794598] select: init returned priority 25
[duvel:3794598] select: initializing pml component ucx
[duvel:3794595] select: component v not in the include list
[duvel:3794595] select: component monitoring not in the include list
[duvel:3794595] select: initializing pml component cm
[duvel:3794601] select: component v not in the include list
[duvel:3794601] select: component monitoring not in the include list
[duvel:3794601] select: initializing pml component cm
[duvel:3794595] select: init returned priority 25
[duvel:3794595] select: initializing pml component ucx
[duvel:3794596] select: component v not in the include list
[duvel:3794596] select: component monitoring not in the include list
[duvel:3794596] select: initializing pml component cm
[duvel:3794596] select: init returned priority 25
[duvel:3794596] select: initializing pml component ucx
[duvel:3794601] select: init returned priority 25
[duvel:3794601] select: initializing pml component ucx
[duvel:3794599] select: init returned priority 19
[duvel:3794599] select: initializing pml component ob1
[duvel:3794599] select: init returned priority 20
[duvel:3794599] selected cm best priority 25
[duvel:3794599] select: component cm selected
[duvel:3794602] select: init returned priority 19
[duvel:3794602] select: initializing pml component ob1
[duvel:3794602] select: init returned priority 20
[duvel:3794602] selected cm best priority 25
[duvel:3794602] select: component cm selected
[duvel:3794597] select: init returned priority 19
[duvel:3794597] select: initializing pml component ob1
[duvel:3794597] select: init returned priority 20
[duvel:3794597] selected cm best priority 25
[duvel:3794597] select: component cm selected
[duvel:3794599] select: component ucx not selected / finalized
[duvel:3794599] select: component ob1 not selected / finalized
[duvel:3794599] mca: base: close: component v closed
[duvel:3794599] mca: base: close: unloading component v
[duvel:3794599] mca: base: close: component monitoring closed
[duvel:3794599] mca: base: close: unloading component monitoring
[duvel:3794599] mca: base: close: component ucx closed
[duvel:3794599] mca: base: close: unloading component ucx
[duvel:3794599] mca: base: close: component ob1 closed
[duvel:3794599] mca: base: close: unloading component ob1
[duvel:3794602] select: component ucx not selected / finalized
[duvel:3794602] select: component ob1 not selected / finalized
[duvel:3794602] mca: base: close: component v closed
[duvel:3794602] mca: base: close: unloading component v
[duvel:3794602] mca: base: close: component monitoring closed
[duvel:3794602] mca: base: close: unloading component monitoring
[duvel:3794597] select: component ucx not selected / finalized
[duvel:3794597] select: component ob1 not selected / finalized
[duvel:3794597] mca: base: close: component v closed
[duvel:3794597] mca: base: close: unloading component v
[duvel:3794597] mca: base: close: component monitoring closed
[duvel:3794597] mca: base: close: unloading component monitoring
[duvel:3794602] mca: base: close: component ucx closed
[duvel:3794602] mca: base: close: unloading component ucx
[duvel:3794602] mca: base: close: component ob1 closed
[duvel:3794602] mca: base: close: unloading component ob1
[duvel:3794597] mca: base: close: component ucx closed
[duvel:3794597] mca: base: close: unloading component ucx
[duvel:3794597] mca: base: close: component ob1 closed
[duvel:3794597] mca: base: close: unloading component ob1
[duvel:3794595] select: init returned priority 19
[duvel:3794595] select: initializing pml component ob1
[duvel:3794595] select: init returned priority 20
[duvel:3794595] selected cm best priority 25
[duvel:3794595] select: component cm selected
[duvel:3794598] select: init returned priority 19
[duvel:3794598] select: initializing pml component ob1
[duvel:3794598] select: init returned priority 20
[duvel:3794598] selected cm best priority 25
[duvel:3794598] select: component cm selected
[duvel:3794596] select: init returned priority 19
[duvel:3794596] select: initializing pml component ob1
[duvel:3794596] select: init returned priority 20
[duvel:3794596] selected cm best priority 25
[duvel:3794596] select: component cm selected
[duvel:3794595] select: component ucx not selected / finalized
[duvel:3794595] select: component ob1 not selected / finalized
[duvel:3794595] mca: base: close: component v closed
[duvel:3794595] mca: base: close: unloading component v
[duvel:3794595] mca: base: close: component monitoring closed
[duvel:3794595] mca: base: close: unloading component monitoring
[duvel:3794595] mca: base: close: component ucx closed
[duvel:3794595] mca: base: close: unloading component ucx
[duvel:3794595] mca: base: close: component ob1 closed
[duvel:3794595] mca: base: close: unloading component ob1
[duvel:3794596] select: component ucx not selected / finalized
[duvel:3794596] select: component ob1 not selected / finalized
[duvel:3794596] mca: base: close: component v closed
[duvel:3794596] mca: base: close: unloading component v
[duvel:3794596] mca: base: close: component monitoring closed
[duvel:3794596] mca: base: close: unloading component monitoring
[duvel:3794598] select: component ucx not selected / finalized
[duvel:3794598] select: component ob1 not selected / finalized
[duvel:3794598] mca: base: close: component v closed
[duvel:3794598] mca: base: close: unloading component v
[duvel:3794596] mca: base: close: component ucx closed
[duvel:3794596] mca: base: close: unloading component ucx
[duvel:3794598] mca: base: close: component monitoring closed
[duvel:3794598] mca: base: close: unloading component monitoring
[duvel:3794596] mca: base: close: component ob1 closed
[duvel:3794596] mca: base: close: unloading component ob1
[duvel:3794598] mca: base: close: component ucx closed
[duvel:3794598] mca: base: close: unloading component ucx
[duvel:3794598] mca: base: close: component ob1 closed
[duvel:3794598] mca: base: close: unloading component ob1
[duvel:3794601] select: init returned priority 19
[duvel:3794601] select: initializing pml component ob1
[duvel:3794601] select: init returned priority 20
[duvel:3794601] selected cm best priority 25
[duvel:3794601] select: component cm selected
[duvel:3794601] select: component ucx not selected / finalized
[duvel:3794601] select: component ob1 not selected / finalized
[duvel:3794601] mca: base: close: component v closed
[duvel:3794601] mca: base: close: unloading component v
[duvel:3794601] mca: base: close: component monitoring closed
[duvel:3794601] mca: base: close: unloading component monitoring
[duvel:3794601] mca: base: close: component ucx closed
[duvel:3794601] mca: base: close: unloading component ucx
[duvel:3794601] mca: base: close: component ob1 closed
[duvel:3794601] mca: base: close: unloading component ob1
[duvel:3794601] check:select: checking my pml cm against process [[61709,1],0] pml cm
[duvel:3794595] check:select: PML check not necessary on self
[duvel:3794600] check:select: checking my pml cm against process [[61709,1],0] pml cm
[duvel:3794597] check:select: checking my pml cm against process [[61709,1],0] pml cm
[duvel:3794602] check:select: checking my pml cm against process [[61709,1],0] pml cm
[duvel:3794599] check:select: checking my pml cm against process [[61709,1],0] pml cm
[duvel:3794596] check:select: checking my pml cm against process [[61709,1],0] pml cm
[duvel:3794598] check:select: checking my pml cm against process [[61709,1],0] pml cm
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
On wired network:
OMPI_MCA_rmaps_base_oversubscribe=1 /slowdata/richardb/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun --verbose -n 8 --mca pml_base_verbose 50 /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
duvel:rank2.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank2: PSM3 can't open nic unit: 1 (err=23)
duvel:rank2.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank2: PSM3 can't open nic unit: 1 (err=23)
duvel:rank2.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank2: PSM3 can't open nic unit: 1 (err=23)
duvel:rank5.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank5: PSM3 can't open nic unit: 1 (err=23)
duvel:rank5: PSM3 can't open nic unit: 1 (err=23)
duvel:rank5.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank7.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank7: PSM3 can't open nic unit: 1 (err=23)
duvel:rank3.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank3: PSM3 can't open nic unit: 1 (err=23)
duvel:rank5: PSM3 can't open nic unit: 1 (err=23)
duvel:rank5.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank7.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank7: PSM3 can't open nic unit: 1 (err=23)
duvel:rank3.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank3: PSM3 can't open nic unit: 1 (err=23)
duvel:rank7.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank7: PSM3 can't open nic unit: 1 (err=23)
duvel:rank3.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank3: PSM3 can't open nic unit: 1 (err=23)
duvel:rank0.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank0: PSM3 can't open nic unit: 1 (err=23)
duvel:rank1.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank1: PSM3 can't open nic unit: 1 (err=23)
duvel:rank6.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank6: PSM3 can't open nic unit: 1 (err=23)
duvel:rank4.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank4: PSM3 can't open nic unit: 1 (err=23)
duvel:rank0.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank0: PSM3 can't open nic unit: 1 (err=23)
duvel:rank1: PSM3 can't open nic unit: 1 (err=23)
duvel:rank1.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank6.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank6: PSM3 can't open nic unit: 1 (err=23)
duvel:rank4.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank4: PSM3 can't open nic unit: 1 (err=23)
duvel:rank1.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank1: PSM3 can't open nic unit: 1 (err=23)
duvel:rank0.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank0: PSM3 can't open nic unit: 1 (err=23)
duvel:rank6.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank4.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank6: PSM3 can't open nic unit: 1 (err=23)
duvel:rank4: PSM3 can't open nic unit: 1 (err=23)
[duvel:3794239] mca: base: components_register: registering framework pml components
[duvel:3794239] mca: base: components_register: found loaded component v
[duvel:3794239] mca: base: components_register: component v register function successful
[duvel:3794239] mca: base: components_register: found loaded component monitoring
[duvel:3794239] mca: base: components_register: component monitoring register function successful
[duvel:3794239] mca: base: components_register: found loaded component cm
[duvel:3794239] mca: base: components_register: component cm register function successful
[duvel:3794239] mca: base: components_register: found loaded component ucx
[duvel:3794239] mca: base: components_register: component ucx register function successful
[duvel:3794239] mca: base: components_register: found loaded component ob1
[duvel:3794239] mca: base: components_register: component ob1 register function successful
[duvel:3794239] mca: base: components_open: opening pml components
[duvel:3794239] mca: base: components_open: found loaded component v
[duvel:3794239] mca: base: components_open: component v open function successful
[duvel:3794239] mca: base: components_open: found loaded component monitoring
[duvel:3794239] mca: base: components_open: component monitoring open function successful
[duvel:3794239] mca: base: components_open: found loaded component cm
[duvel:3794239] mca: base: components_open: component cm open function successful
[duvel:3794239] mca: base: components_open: found loaded component ucx
[duvel:3794241] mca: base: components_register: registering framework pml components
[duvel:3794241] mca: base: components_register: found loaded component v
[duvel:3794241] mca: base: components_register: component v register function successful
[duvel:3794241] mca: base: components_register: found loaded component monitoring
[duvel:3794241] mca: base: components_register: component monitoring register function successful
[duvel:3794241] mca: base: components_register: found loaded component cm
[duvel:3794241] mca: base: components_register: component cm register function successful
[duvel:3794237] mca: base: components_register: registering framework pml components
[duvel:3794237] mca: base: components_register: found loaded component v
[duvel:3794241] mca: base: components_register: found loaded component ucx
[duvel:3794237] mca: base: components_register: component v register function successful
[duvel:3794237] mca: base: components_register: found loaded component monitoring
[duvel:3794237] mca: base: components_register: component monitoring register function successful
[duvel:3794237] mca: base: components_register: found loaded component cm
[duvel:3794237] mca: base: components_register: component cm register function successful
[duvel:3794241] mca: base: components_register: component ucx register function successful
[duvel:3794241] mca: base: components_register: found loaded component ob1
[duvel:3794237] mca: base: components_register: found loaded component ucx
[duvel:3794241] mca: base: components_register: component ob1 register function successful
[duvel:3794241] mca: base: components_open: opening pml components
[duvel:3794241] mca: base: components_open: found loaded component v
[duvel:3794237] mca: base: components_register: component ucx register function successful
[duvel:3794237] mca: base: components_register: found loaded component ob1
[duvel:3794237] mca: base: components_register: component ob1 register function successful
[duvel:3794237] mca: base: components_open: opening pml components
[duvel:3794237] mca: base: components_open: found loaded component v
[duvel:3794241] mca: base: components_open: component v open function successful
[duvel:3794241] mca: base: components_open: found loaded component monitoring
[duvel:3794241] mca: base: components_open: component monitoring open function successful
[duvel:3794241] mca: base: components_open: found loaded component cm
[duvel:3794235] mca: base: components_register: registering framework pml components
[duvel:3794235] mca: base: components_register: found loaded component v
[duvel:3794235] mca: base: components_register: component v register function successful
[duvel:3794235] mca: base: components_register: found loaded component monitoring
[duvel:3794235] mca: base: components_register: component monitoring register function successful
[duvel:3794235] mca: base: components_register: found loaded component cm
[duvel:3794238] mca: base: components_register: registering framework pml components
[duvel:3794238] mca: base: components_register: found loaded component v
[duvel:3794238] mca: base: components_register: component v register function successful
[duvel:3794235] mca: base: components_register: component cm register function successful
[duvel:3794235] mca: base: components_register: found loaded component ucx
[duvel:3794238] mca: base: components_register: found loaded component monitoring
[duvel:3794238] mca: base: components_register: component monitoring register function successful
[duvel:3794237] mca: base: components_open: component v open function successful
[duvel:3794237] mca: base: components_open: found loaded component monitoring
[duvel:3794237] mca: base: components_open: component monitoring open function successful
[duvel:3794237] mca: base: components_open: found loaded component cm
[duvel:3794238] mca: base: components_register: found loaded component cm
[duvel:3794235] mca: base: components_register: component ucx register function successful
[duvel:3794238] mca: base: components_register: component cm register function successful
[duvel:3794238] mca: base: components_register: found loaded component ucx
[duvel:3794235] mca: base: components_register: found loaded component ob1
[duvel:3794235] mca: base: components_register: component ob1 register function successful
[duvel:3794241] mca: base: components_open: component cm open function successful
[duvel:3794241] mca: base: components_open: found loaded component ucx
[duvel:3794238] mca: base: components_register: component ucx register function successful
[duvel:3794235] mca: base: components_open: opening pml components
[duvel:3794235] mca: base: components_open: found loaded component v
[duvel:3794238] mca: base: components_register: found loaded component ob1
[duvel:3794238] mca: base: components_register: component ob1 register function successful
[duvel:3794238] mca: base: components_open: opening pml components
[duvel:3794238] mca: base: components_open: found loaded component v
[duvel:3794237] mca: base: components_open: component cm open function successful
[duvel:3794237] mca: base: components_open: found loaded component ucx
[duvel:3794235] mca: base: components_open: component v open function successful
[duvel:3794235] mca: base: components_open: found loaded component monitoring
[duvel:3794235] mca: base: components_open: component monitoring open function successful
[duvel:3794235] mca: base: components_open: found loaded component cm
[duvel:3794238] mca: base: components_open: component v open function successful
[duvel:3794238] mca: base: components_open: found loaded component monitoring
[duvel:3794238] mca: base: components_open: component monitoring open function successful
[duvel:3794238] mca: base: components_open: found loaded component cm
[duvel:3794234] mca: base: components_register: registering framework pml components
[duvel:3794234] mca: base: components_register: found loaded component v
[duvel:3794234] mca: base: components_register: component v register function successful
[duvel:3794234] mca: base: components_register: found loaded component monitoring
[duvel:3794235] mca: base: components_open: component cm open function successful
[duvel:3794235] mca: base: components_open: found loaded component ucx
[duvel:3794234] mca: base: components_register: component monitoring register function successful
[duvel:3794234] mca: base: components_register: found loaded component cm
[duvel:3794234] mca: base: components_register: component cm register function successful
[duvel:3794238] mca: base: components_open: component cm open function successful
[duvel:3794238] mca: base: components_open: found loaded component ucx
[duvel:3794234] mca: base: components_register: found loaded component ucx
[duvel:3794234] mca: base: components_register: component ucx register function successful
[duvel:3794234] mca: base: components_register: found loaded component ob1
[duvel:3794240] mca: base: components_register: registering framework pml components
[duvel:3794240] mca: base: components_register: found loaded component v
[duvel:3794234] mca: base: components_register: component ob1 register function successful
[duvel:3794240] mca: base: components_register: component v register function successful
[duvel:3794234] mca: base: components_open: opening pml components
[duvel:3794234] mca: base: components_open: found loaded component v
[duvel:3794240] mca: base: components_register: found loaded component monitoring
[duvel:3794240] mca: base: components_register: component monitoring register function successful
[duvel:3794240] mca: base: components_register: found loaded component cm
[duvel:3794240] mca: base: components_register: component cm register function successful
[duvel:3794240] mca: base: components_register: found loaded component ucx
[duvel:3794240] mca: base: components_register: component ucx register function successful
[duvel:3794240] mca: base: components_register: found loaded component ob1
[duvel:3794240] mca: base: components_register: component ob1 register function successful
[duvel:3794234] mca: base: components_open: component v open function successful
[duvel:3794234] mca: base: components_open: found loaded component monitoring
[duvel:3794234] mca: base: components_open: component monitoring open function successful
[duvel:3794234] mca: base: components_open: found loaded component cm
[duvel:3794240] mca: base: components_open: opening pml components
[duvel:3794240] mca: base: components_open: found loaded component v
[duvel:3794240] mca: base: components_open: component v open function successful
[duvel:3794240] mca: base: components_open: found loaded component monitoring
[duvel:3794240] mca: base: components_open: component monitoring open function successful
[duvel:3794240] mca: base: components_open: found loaded component cm
[duvel:3794234] mca: base: components_open: component cm open function successful
[duvel:3794234] mca: base: components_open: found loaded component ucx
[duvel:3794240] mca: base: components_open: component cm open function successful
[duvel:3794240] mca: base: components_open: found loaded component ucx
[duvel:3794236] mca: base: components_register: registering framework pml components
[duvel:3794236] mca: base: components_register: found loaded component v
[duvel:3794236] mca: base: components_register: component v register function successful
[duvel:3794236] mca: base: components_register: found loaded component monitoring
[duvel:3794236] mca: base: components_register: component monitoring register function successful
[duvel:3794236] mca: base: components_register: found loaded component cm
[duvel:3794236] mca: base: components_register: component cm register function successful
[duvel:3794236] mca: base: components_register: found loaded component ucx
[duvel:3794236] mca: base: components_register: component ucx register function successful
[duvel:3794236] mca: base: components_register: found loaded component ob1
[duvel:3794236] mca: base: components_register: component ob1 register function successful
[duvel:3794236] mca: base: components_open: opening pml components
[duvel:3794236] mca: base: components_open: found loaded component v
[duvel:3794236] mca: base: components_open: component v open function successful
[duvel:3794236] mca: base: components_open: found loaded component monitoring
[duvel:3794236] mca: base: components_open: component monitoring open function successful
[duvel:3794236] mca: base: components_open: found loaded component cm
[duvel:3794236] mca: base: components_open: component cm open function successful
[duvel:3794236] mca: base: components_open: found loaded component ucx
[duvel:3794239] mca: base: components_open: component ucx open function successful
[duvel:3794239] mca: base: components_open: found loaded component ob1
[duvel:3794239] mca: base: components_open: component ob1 open function successful
[duvel:3794238] mca: base: components_open: component ucx open function successful
[duvel:3794238] mca: base: components_open: found loaded component ob1
[duvel:3794238] mca: base: components_open: component ob1 open function successful
[duvel:3794234] mca: base: components_open: component ucx open function successful
[duvel:3794234] mca: base: components_open: found loaded component ob1
[duvel:3794234] mca: base: components_open: component ob1 open function successful
[duvel:3794235] mca: base: components_open: component ucx open function successful
[duvel:3794235] mca: base: components_open: found loaded component ob1
[duvel:3794235] mca: base: components_open: component ob1 open function successful
[duvel:3794240] mca: base: components_open: component ucx open function successful
[duvel:3794240] mca: base: components_open: found loaded component ob1
[duvel:3794240] mca: base: components_open: component ob1 open function successful
[duvel:3794241] mca: base: components_open: component ucx open function successful
[duvel:3794241] mca: base: components_open: found loaded component ob1
[duvel:3794241] mca: base: components_open: component ob1 open function successful
[duvel:3794237] mca: base: components_open: component ucx open function successful
[duvel:3794237] mca: base: components_open: found loaded component ob1
[duvel:3794237] mca: base: components_open: component ob1 open function successful
[duvel:3794239] select: component v not in the include list
[duvel:3794239] select: component monitoring not in the include list
[duvel:3794239] select: initializing pml component cm
[duvel:3794238] select: component v not in the include list
[duvel:3794238] select: component monitoring not in the include list
[duvel:3794238] select: initializing pml component cm
duvel:rank4.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank5.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank5: PSM3 can't open nic unit: 1 (err=23)
duvel:rank4: PSM3 can't open nic unit: 1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: duvel
Location: mtl_ofi_component.c:512
Error: Invalid argument (22)
--------------------------------------------------------------------------
[duvel:3794235] select: component v not in the include list
[duvel:3794235] select: component monitoring not in the include list
[duvel:3794235] select: initializing pml component cm
[duvel:3794234] select: component v not in the include list
[duvel:3794234] select: component monitoring not in the include list
[duvel:3794234] select: initializing pml component cm
[duvel:3794240] select: component v not in the include list
[duvel:3794241] select: component v not in the include list
[duvel:3794241] select: component monitoring not in the include list
[duvel:3794241] select: initializing pml component cm
[duvel:3794240] select: component monitoring not in the include list
[duvel:3794240] select: initializing pml component cm
[duvel:3794237] select: component v not in the include list
[duvel:3794237] select: component monitoring not in the include list
[duvel:3794237] select: initializing pml component cm
[duvel:3794239] select: init returned failure for component cm
[duvel:3794239] select: initializing pml component ucx
[duvel:3794238] select: init returned failure for component cm
[duvel:3794238] select: initializing pml component ucx
[duvel:3794236] mca: base: components_open: component ucx open function successful
[duvel:3794236] mca: base: components_open: found loaded component ob1
[duvel:3794236] mca: base: components_open: component ob1 open function successful
duvel:rank1.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank0.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank1: PSM3 can't open nic unit: 1 (err=23)
duvel:rank0: PSM3 can't open nic unit: 1 (err=23)
duvel:rank6.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank3.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank7.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank3: PSM3 can't open nic unit: 1 (err=23)
duvel:rank6: PSM3 can't open nic unit: 1 (err=23)
duvel:rank7: PSM3 can't open nic unit: 1 (err=23)
[duvel:3794235] select: init returned failure for component cm
[duvel:3794234] select: init returned failure for component cm
[duvel:3794234] select: initializing pml component ucx
[duvel:3794235] select: initializing pml component ucx
[duvel:3794237] select: init returned failure for component cm
[duvel:3794237] select: initializing pml component ucx
[duvel:3794241] select: init returned failure for component cm
[duvel:3794241] select: initializing pml component ucx
[duvel:3794240] select: init returned failure for component cm
[duvel:3794240] select: initializing pml component ucx
[duvel:3794236] select: component v not in the include list
[duvel:3794236] select: component monitoring not in the include list
[duvel:3794236] select: initializing pml component cm
duvel:rank2.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank2: PSM3 can't open nic unit: 1 (err=23)
[duvel:3794236] select: init returned failure for component cm
[duvel:3794236] select: initializing pml component ucx
[duvel:3794239] select: init returned priority 19
[duvel:3794239] select: initializing pml component ob1
[duvel:3794239] select: init returned priority 20
[duvel:3794239] selected ob1 best priority 20
[duvel:3794239] select: component ob1 selected
[duvel:3794238] select: init returned priority 19
[duvel:3794238] select: initializing pml component ob1
[duvel:3794238] select: init returned priority 20
[duvel:3794238] selected ob1 best priority 20
[duvel:3794238] select: component ob1 selected
[duvel:3794239] select: component ucx not selected / finalized
[duvel:3794239] mca: base: close: component v closed
[duvel:3794239] mca: base: close: unloading component v
[duvel:3794239] mca: base: close: component monitoring closed
[duvel:3794239] mca: base: close: unloading component monitoring
[duvel:3794239] mca: base: close: component cm closed
[duvel:3794239] mca: base: close: unloading component cm
[duvel:3794239] mca: base: close: component ucx closed
[duvel:3794239] mca: base: close: unloading component ucx
[duvel:3794238] select: component ucx not selected / finalized
[duvel:3794238] mca: base: close: component v closed
[duvel:3794238] mca: base: close: unloading component v
[duvel:3794238] mca: base: close: component monitoring closed
[duvel:3794238] mca: base: close: unloading component monitoring
[duvel:3794238] mca: base: close: component cm closed
[duvel:3794238] mca: base: close: unloading component cm
[duvel:3794238] mca: base: close: component ucx closed
[duvel:3794238] mca: base: close: unloading component ucx
[duvel:3794234] select: init returned priority 19
[duvel:3794234] select: initializing pml component ob1
[duvel:3794234] select: init returned priority 20
[duvel:3794234] selected ob1 best priority 20
[duvel:3794234] select: component ob1 selected
[duvel:3794234] select: component ucx not selected / finalized
[duvel:3794234] mca: base: close: component v closed
[duvel:3794234] mca: base: close: unloading component v
[duvel:3794234] mca: base: close: component monitoring closed
[duvel:3794234] mca: base: close: unloading component monitoring
[duvel:3794234] mca: base: close: component cm closed
[duvel:3794234] mca: base: close: unloading component cm
[duvel:3794234] mca: base: close: component ucx closed
[duvel:3794234] mca: base: close: unloading component ucx
[duvel:3794241] select: init returned priority 19
[duvel:3794241] select: initializing pml component ob1
[duvel:3794241] select: init returned priority 20
[duvel:3794241] selected ob1 best priority 20
[duvel:3794241] select: component ob1 selected
[duvel:3794240] select: init returned priority 19
[duvel:3794240] select: initializing pml component ob1
[duvel:3794240] select: init returned priority 20
[duvel:3794240] selected ob1 best priority 20
[duvel:3794240] select: component ob1 selected
[duvel:3794237] select: init returned priority 19
[duvel:3794237] select: initializing pml component ob1
[duvel:3794237] select: init returned priority 20
[duvel:3794237] selected ob1 best priority 20
[duvel:3794237] select: component ob1 selected
[duvel:3794235] select: init returned priority 19
[duvel:3794235] select: initializing pml component ob1
[duvel:3794235] select: init returned priority 20
[duvel:3794235] selected ob1 best priority 20
[duvel:3794235] select: component ob1 selected
[duvel:3794236] select: init returned priority 19
[duvel:3794236] select: initializing pml component ob1
[duvel:3794236] select: init returned priority 20
[duvel:3794236] selected ob1 best priority 20
[duvel:3794236] select: component ob1 selected
[duvel:3794236] select: component ucx not selected / finalized
[duvel:3794236] mca: base: close: component v closed
[duvel:3794236] mca: base: close: unloading component v
[duvel:3794236] mca: base: close: component monitoring closed
[duvel:3794236] mca: base: close: unloading component monitoring
[duvel:3794236] mca: base: close: component cm closed
[duvel:3794236] mca: base: close: unloading component cm
[duvel:3794236] mca: base: close: component ucx closed
[duvel:3794236] mca: base: close: unloading component ucx
[duvel:3794241] select: component ucx not selected / finalized
[duvel:3794241] mca: base: close: component v closed
[duvel:3794241] mca: base: close: unloading component v
[duvel:3794241] mca: base: close: component monitoring closed
[duvel:3794241] mca: base: close: unloading component monitoring
[duvel:3794241] mca: base: close: component cm closed
[duvel:3794241] mca: base: close: unloading component cm
[duvel:3794240] select: component ucx not selected / finalized
[duvel:3794240] mca: base: close: component v closed
[duvel:3794240] mca: base: close: unloading component v
[duvel:3794240] mca: base: close: component monitoring closed
[duvel:3794240] mca: base: close: unloading component monitoring
[duvel:3794237] select: component ucx not selected / finalized
[duvel:3794237] mca: base: close: component v closed
[duvel:3794237] mca: base: close: unloading component v
[duvel:3794235] select: component ucx not selected / finalized
[duvel:3794235] mca: base: close: component v closed
[duvel:3794235] mca: base: close: unloading component v
[duvel:3794237] mca: base: close: component monitoring closed
[duvel:3794237] mca: base: close: unloading component monitoring
[duvel:3794235] mca: base: close: component monitoring closed
[duvel:3794235] mca: base: close: unloading component monitoring
[duvel:3794240] mca: base: close: component cm closed
[duvel:3794240] mca: base: close: unloading component cm
[duvel:3794241] mca: base: close: component ucx closed
[duvel:3794241] mca: base: close: unloading component ucx
[duvel:3794237] mca: base: close: component cm closed
[duvel:3794237] mca: base: close: unloading component cm
[duvel:3794235] mca: base: close: component cm closed
[duvel:3794235] mca: base: close: unloading component cm
[duvel:3794240] mca: base: close: component ucx closed
[duvel:3794240] mca: base: close: unloading component ucx
[duvel:3794237] mca: base: close: component ucx closed
[duvel:3794237] mca: base: close: unloading component ucx
[duvel:3794235] mca: base: close: component ucx closed
[duvel:3794235] mca: base: close: unloading component ucx
[duvel:3794240] check:select: checking my pml ob1 against process [[62116,1],0] pml ob1
[duvel:3794237] check:select: checking my pml ob1 against process [[62116,1],0] pml ob1
[duvel:3794239] check:select: checking my pml ob1 against process [[62116,1],0] pml ob1
[duvel:3794236] check:select: checking my pml ob1 against process [[62116,1],0] pml ob1
[duvel:3794234] check:select: PML check not necessary on self
[duvel:3794241] check:select: checking my pml ob1 against process [[62116,1],0] pml ob1
[duvel:3794238] check:select: checking my pml ob1 against process [[62116,1],0] pml ob1
[duvel:3794235] check:select: checking my pml ob1 against process [[62116,1],0] pml ob1
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 5 exiting
Process 4 exiting
Process 6 exiting
Process 2 exiting
Process 3 exiting
Process 7 exiting
[duvel:3794235] mca: base: close: component ob1 closed
[duvel:3794235] mca: base: close: unloading component ob1
[duvel:3794240] mca: base: close: component ob1 closed
[duvel:3794240] mca: base: close: unloading component ob1
[duvel:3794236] mca: base: close: component ob1 closed
[duvel:3794236] mca: base: close: unloading component ob1
[duvel:3794239] mca: base: close: component ob1 closed
[duvel:3794239] mca: base: close: unloading component ob1
[duvel:3794241] mca: base: close: component ob1 closed
[duvel:3794241] mca: base: close: unloading component ob1
[duvel:3794237] mca: base: close: component ob1 closed
[duvel:3794237] mca: base: close: unloading component ob1
[duvel:3794234] mca: base: close: component ob1 closed
[duvel:3794234] mca: base: close: unloading component ob1
[duvel:3794238] mca: base: close: component ob1 closed
[duvel:3794238] mca: base: close: unloading component ob1
[duvel:3794230] 7 more processes have sent help message help-mtl-ofi.txt / OFI call fail
[duvel:3794230] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
In both cases the CM PML is selected. However, on the wired network all processes exit before completing the CM selection, with the error:
PSM3 can't open nic unit: 1 (err=23)
So, back to case 1, the OFI provider failed to correctly initialize the wired network.
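To check which PML each rank actually ended up with, the `check:select` lines in the verbose output can be tallied per process. The following is only a minimal sketch: it assumes the verbose output has been saved as text and that the lines have the same shape as those posted in this thread. The function name `selected_pmls` is made up for illustration.

```python
import re

# Matches lines of the form seen in this thread, e.g.
# [duvel:3794240] check:select: checking my pml ob1 against process [[62116,1],0] pml ob1
SELECT_RE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] check:select: "
    r"checking my pml (?P<mine>\S+) against process \S+ pml (?P<theirs>\S+)"
)


def selected_pmls(log_text):
    """Map each PID to the PML name it reported in its check:select line.

    Lines that do not match (program output, the "PML check not necessary
    on self" line, component close messages) are ignored.
    """
    result = {}
    for line in log_text.splitlines():
        match = SELECT_RE.match(line.strip())
        if match:
            result[int(match.group("pid"))] = match.group("mine")
    return result
```

Running this on the logs from both machines and comparing the two dictionaries should show quickly whether the ranks chose `cm` on one machine and `ob1` on the other.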
Although it seems kind of old, this may be related to https://github.com/ofiwg/libfabric/issues/6710
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
OpenMPI_4.1.4-GCC-12.2.0
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
This installation is part of the foss-2022b easybuild source package
Please describe the system on which you are running
Ubuntu 22.04. One cloud VM, one laptop (Dell). No networking, purely local tests.
Details of the problem
I am testing easybuild on two Ubuntu 22.04 systems: one laptop and one cloud VM. The install and test process works as expected on the cloud VM. On the laptop the ring_c test hangs, with 8 processes running at 100% load each.
#8305 points out that modifying as follows fixes the issue:
I haven't been able to figure out what causes the differences in default configuration between the systems.
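One way to narrow down configuration differences is to compare the `ompi_info` output from the two machines while ignoring line order, since the entries may simply be listed in a different order. The sketch below assumes each machine's output (for example from `ompi_info --all`) was saved to a text file. The helper names and the file names in the usage note are hypothetical.

```python
def load_entries(text):
    """Return the set of non-empty lines, with surrounding whitespace stripped."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def compare(a_text, b_text):
    """Return (lines only in a, lines only in b), each sorted.

    Comparing sets makes the result independent of the order in which
    the entries were printed.
    """
    a, b = load_entries(a_text), load_entries(b_text)
    return sorted(a - b), sorted(b - a)


def compare_files(path_a, path_b):
    """Same as compare(), but reads the two dumps from files."""
    with open(path_a) as fa, open(path_b) as fb:
        return compare(fa.read(), fb.read())
```

For example, `compare_files("laptop.txt", "vm.txt")` returns the entries unique to each dump. Lines that differ only by hostname, user name, or build date are expected; anything else, such as an MCA parameter value or a detected component, is a candidate explanation for the different behaviour.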