open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

OpenMPI doesn't utilise OpenUCX in MPI communications #11419

Closed vasslavich closed 1 year ago

vasslavich commented 1 year ago

Hi, colleagues! I want to run MPI_Send/Recv via OpenMPI + OpenUCX.

Open MPI commit hash: d5f9e2e78dff3e8be3d9dd70451681dd07093ec4
UCX commit hash: e163451a3afa68fa9c879428471f5a0e66cc459c

$ git submodule status
 415d7044c478b0910c9fbb0f36af700b9483c493 3rd-party/openpmix (v1.1.3-3769-g415d7044)
 dc6ccf65b3356ae7c70bc3a37b4249f03d43966e 3rd-party/prrte (psrvr-v2.0.0rc1-4569-gdc6ccf65b3)
 237ceff1a8ed996d855d69f372be9aaea44919ea config/oac (237ceff)

I'm working on a desktop running Ubuntu 20.04.

UCX was configured per https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX#ucx-installation:

$ ../contrib/configure-release --enable-mt --enable-logging --prefix=$PWD/install
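After installing, it can be worth checking which transports and devices UCX itself detects before wiring it into Open MPI. A sketch using the `ucx_info` tool that ships with UCX (the install path is the one from this report; adjust it to your prefix):

```shell
# Show the UCX build configuration and version.
~/devzone/openucx/build-release/install/bin/ucx_info -v

# List detected transports/devices; shared-memory transports (sysv, posix,
# cma) should appear for intra-node traffic even without InfiniBand hardware.
~/devzone/openucx/build-release/install/bin/ucx_info -d | grep -E 'Transport|Device'
```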

OpenMPI was configured via:

$ ../configure CFLAGS=-Wno-cpp --disable-werror --enable-logging --with-ucx=/home/user/devzone/openucx/build-release/install/ --with-ucx-libdir=/home/user/devzone/openucx/build-release/install/lib/ucx --prefix=$PWD/install

OSU microbenchmark was built via:

$ wget -c https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.0.1.tar.gz -O - | tar -xz
$ mkdir build && cd build
$ ../configure CC=/home/user/devzone/openmpi/build-release/install/bin/mpicc CXX=/home/user/devzone/openmpi/build-release/install/bin/mpicxx

If I use the following command (see below), the MPI_Send/Recv stack doesn't use UCX routines for P2P communication. It calls process_vm_readv directly from Open MPI's software stack without going through UCX. My MPI program works:

$ ~/devzone/openmpi/build-release/install/bin/mpirun -n 2 --mca orte_base_help_aggregate 0 --mca pml_base_verbose 10 --mca mtl_base_verbose 10 -x OMPI_MCA_pml_ucx_verbose=10 -x UCX_LOG_LEVEL=func -x UCX_PROTO_ENABLE=y -x UCX_PROTO_INFO=y --map-by node ~/devzone/OSU/ompi-ucx/osu-micro-benchmarks-7.0.1/build-release/c/mpi/pt2pt/osu_bw

Do I understand correctly that UCX (UCP/UCT) can provide the transport for Open MPI's communications? If so, I want MPI_Send/Recv to use the UCX stack for MPI communication, UCT's shared-memory module for example. I've added -mca pml ucx to the command, i.e. ~/devzone/openmpi/build-release/install/bin/mpirun -n 2 -mca pml ucx, and it results in these errors:

[lnx-user-vv:476510] select: initializing pml component ucx
[lnx-user-vv:476510] ../../../../../opal/mca/common/ucx/common_ucx.c:312 self/memory: did not match transport list
[lnx-user-vv:476510] ../../../../../opal/mca/common/ucx/common_ucx.c:312 tcp/lo: did not match transport list
[lnx-user-vv:476510] ../../../../../opal/mca/common/ucx/common_ucx.c:312 tcp/eno1: did not match transport list
[lnx-user-vv:476510] ../../../../../opal/mca/common/ucx/common_ucx.c:312 sysv/memory: did not match transport list
[lnx-user-vv:476510] ../../../../../opal/mca/common/ucx/common_ucx.c:312 posix/memory: did not match transport list
[lnx-user-vv:476510] ../../../../../opal/mca/common/ucx/common_ucx.c:312 cma/memory: did not match transport list
[lnx-user-vv:476510] ../../../../../opal/mca/common/ucx/common_ucx.c:317 support level is none
[lnx-user-vv:476510] select: init returned failure for component ucx

No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      lnx-user-vv
  Framework: pml

Thanks a lot for any suggestions!

jjhursey commented 1 year ago

I encountered something like this a while back when trying to test UCX on a machine that only had TCP (so no Mellanox IB devices). IIRC, the problem was that, by default, Open MPI will not use UCX for TCP-only and/or shared-memory-only communication. So when I tried to force it with -mca pml ucx, it would fail since it could not find an interface to use.

The default settings are here.

The workaround was to set the following MCA options in one of these ways:

  1. Via mpirun, add these two CLI options: -mca opal_common_ucx_tls all -mca opal_common_ucx_devices all
  2. Via environment variables, add the following before calling mpirun:
    export OMPI_MCA_opal_common_ucx_tls=all
    export OMPI_MCA_opal_common_ucx_devices=all
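Putting the second option together, the end-to-end invocation might look like this (a sketch; the mpirun path is the one from this thread and ./osu_bw is a placeholder for the benchmark binary):

```shell
# Widen the UCX transport/device filters, then force the UCX PML.
export OMPI_MCA_opal_common_ucx_tls=all
export OMPI_MCA_opal_common_ucx_devices=all
~/devzone/openmpi/build-release/install/bin/mpirun -n 2 -mca pml ucx ./osu_bw
```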

Give that a try and see if it helps. If not, the @open-mpi/ucx folks might have more suggestions.

vasslavich commented 1 year ago

Thank you @jjhursey! I've tried both ways, but unfortunately it didn't solve the problem. It seems there's some component-type mismatch at the line https://github.com/open-mpi/ompi/blob/c0e3a7c64f9e904ec336572869227f52f9ad0246/ompi/mca/pml/base/pml_base_select.c#L98: found_pml never becomes true.

yosefe commented 1 year ago

It probably happens because of:

    support_level = opal_common_ucx_support_level(ompi_pml_ucx.ucp_context);
    if (support_level == OPAL_COMMON_UCX_SUPPORT_NONE) {
        return NULL;
    }

@vasslavich can you please add -mca opal_common_ucx_tls all -mca opal_common_ucx_devices all to the mpirun command, so the full command is as below, and post the output?

~/devzone/openmpi/build-release/install/bin/mpirun \
        -n 2 --map-by node \
        --mca orte_base_help_aggregate 0 \
        --mca pml_base_verbose 10 --mca mtl_base_verbose 10 --mca pml_ucx_verbose 10 \
        --mca opal_common_ucx_tls all --mca opal_common_ucx_devices all \
        -x UCX_LOG_LEVEL=info -x UCX_PROTO_ENABLE=y -x UCX_PROTO_INFO=y \
        ~/devzone/OSU/ompi-ucx/osu-micro-benchmarks-7.0.1/build-release/c/mpi/pt2pt/osu_bw
vasslavich commented 1 year ago

Thank you for your attention! I collected this output on another desktop, so the paths differ slightly.

ompi_info's output

``` $ ~/devzone/OMPI-EXT/build-release/install/bin/ompi_info Package: Open MPI user@lnx-user-vv Distribution Open MPI: 5.1.0a1 Open MPI repo revision: d5f9e2e78d Open MPI release date: Unreleased developer copy MPI API: 3.1.0 Ident string: 5.1.0a1 Prefix: /home/user/devzone/OMPI-EXT/build-release/install Configured architecture: x86_64-pc-linux-gnu Configured by: user Configured on: Thu Feb 16 11:41:35 UTC 2023 Configure host: lnx-user-vv Configure command line: 'CFLAGS=-Wno-cpp' '--disable-werror' '--enable-logging' '--with-ucx=/home/user/devzone/UCX-EXT/build-release/install/' '--with-ucx-libdir=/home/user/devzone/UCX-EXT/build-release/install/lib/' '--prefix=/home/user/devzone/OMPI-EXT/build-release/install' Built by: user Built on: Чт 16 фев 2023 11:45:25 UTC Built host: lnx-user-vv C bindings: yes Fort mpif.h: yes (all) Fort use mpi: yes (full: ignore TKR) Fort use mpi size: deprecated-ompi-info-value Fort use mpi_f08: yes Fort mpi_f08 compliance: The mpi_f08 module is available, but due to limitations in the gfortran compiler and/or Open MPI, does not support the following: array subsections, direct passthru (where possible) to underlying Open MPI's C functionality Fort mpi_f08 subarrays: no Java bindings: no Wrapper compiler rpath: runpath C compiler: gcc C compiler absolute: /bin/gcc C compiler family name: GNU C compiler version: 9.4.0 C++ compiler: g++ C++ compiler absolute: /bin/g++ Fort compiler: gfortran Fort compiler abs: /bin/gfortran Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::) Fort 08 assumed shape: yes Fort optional args: yes Fort INTERFACE: yes Fort ISO_FORTRAN_ENV: yes Fort STORAGE_SIZE: yes Fort BIND(C) (all): yes Fort ISO_C_BINDING: yes Fort SUBROUTINE BIND(C): yes Fort TYPE,BIND(C): yes Fort T,BIND(C,name="a"): yes Fort PRIVATE: yes Fort ABSTRACT: yes Fort ASYNCHRONOUS: yes Fort PROCEDURE: yes Fort USE...ONLY: yes Fort C_FUNLOC: yes Fort f08 using wrappers: yes Fort MPI_SIZEOF: yes C profiling: yes Fort mpif.h profiling: yes Fort 
use mpi profiling: yes Fort use mpi_f08 prof: yes Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, Event lib: yes) Sparse Groups: no Internal debug support: no MPI interface warnings: yes MPI parameter check: runtime Memory profiling support: no Memory debugging support: no dl support: yes Heterogeneous support: no MPI_WTIME support: native Symbol vis. support: yes Host topology support: yes IPv6 support: no MPI extensions: affinity, cuda, ftmpi, rocm Fault Tolerance support: yes FT MPI support: yes MPI_MAX_PROCESSOR_NAME: 256 MPI_MAX_ERROR_STRING: 256 MPI_MAX_OBJECT_NAME: 64 MPI_MAX_INFO_KEY: 36 MPI_MAX_INFO_VAL: 256 MPI_MAX_PORT_NAME: 1024 MPI_MAX_DATAREP_STRING: 128 MCA accelerator: null (MCA v2.1.0, API v1.0.0, Component v5.1.0) MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.1.0) MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.1.0) MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.1.0) MCA btl: uct (MCA v2.1.0, API v3.3.0, Component v5.1.0) MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v5.1.0) MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA mpool: hugepage (MCA v2.1.0, API v3.1.0, Component v5.1.0) MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component v5.1.0) MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v5.1.0) MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA shmem: sysv (MCA v2.1.0, API v2.0.0, 
Component v5.1.0) MCA smsc: cma (MCA v2.1.0, API v1.0.0, Component v5.1.0) MCA threads: pthreads (MCA v2.1.0, API v1.0.0, Component v5.1.0) MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA bml: r2 (MCA v2.1.0, API v2.1.0, Component v5.1.0) MCA coll: adapt (MCA v2.1.0, API v2.4.0, Component v5.1.0) MCA coll: basic (MCA v2.1.0, API v2.4.0, Component v5.1.0) MCA coll: han (MCA v2.1.0, API v2.4.0, Component v5.1.0) MCA coll: inter (MCA v2.1.0, API v2.4.0, Component v5.1.0) MCA coll: libnbc (MCA v2.1.0, API v2.4.0, Component v5.1.0) MCA coll: self (MCA v2.1.0, API v2.4.0, Component v5.1.0) MCA coll: sync (MCA v2.1.0, API v2.4.0, Component v5.1.0) MCA coll: tuned (MCA v2.1.0, API v2.4.0, Component v5.1.0) MCA coll: ftagree (MCA v2.1.0, API v2.4.0, Component v5.1.0) MCA coll: sm (MCA v2.1.0, API v2.4.0, Component v5.1.0) MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA hook: comm_method (MCA v2.1.0, API v1.0.0, Component v5.1.0) MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA io: romio341 (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.1.0) MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.1.0) MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v5.1.0) MCA part: persist (MCA v2.1.0, API v4.0.0, Component v5.1.0) MCA pml: cm (MCA v2.1.0, API v2.1.0, Component v5.1.0) MCA pml: ob1 (MCA v2.1.0, API v2.1.0, Component v5.1.0) MCA pml: ucx (MCA v2.1.0, API v2.1.0, Component v5.1.0) MCA pml: v (MCA v2.1.0, API v2.1.0, Component v5.1.0) MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component v5.1.0) MCA sharedfp: sm (MCA 
v2.1.0, API v2.0.0, Component v5.1.0) MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v5.1.0) MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component v5.1.0) MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component v5.1.0) ```

Here is the output of the command line from the comment above:

``` $ ~/devzone/OMPI-EXT/build-release/install/bin/mpirun -n 2 --map-by node --mca orte_base_help_aggregate 0 --mca pml_base_verbose 10 --mca mtl_base_verbose 10 --mca pml_ucx_verbose 10 --mca opal_common_ucx_tls all --mca opal_common_ucx_devices all -x UCX_LOG_LEVEL=info -x UCX_PROTO_ENABLE=y -x UCX_PROTO_INFO=y ~/devzone/OSU/ompi-ucx-ext/osu-micro-benchmarks-7.0.1/build-release/c/mpi/pt2pt/osu_bw [lnx-user-vv:1266489] mca: base: components_register: registering framework pml components [lnx-user-vv:1266489] mca: base: components_register: found loaded component cm [lnx-user-vv:1266489] mca: base: components_register: component cm register function successful [lnx-user-vv:1266489] mca: base: components_register: found loaded component ob1 [lnx-user-vv:1266489] mca: base: components_register: component ob1 register function successful [lnx-user-vv:1266489] mca: base: components_register: found loaded component ucx [lnx-user-vv:1266489] mca: base: components_register: component ucx register function successful [lnx-user-vv:1266489] mca: base: components_register: found loaded component v [lnx-user-vv:1266489] mca: base: components_register: component v register function successful [lnx-user-vv:1266489] mca: base: components_open: opening pml components [lnx-user-vv:1266489] mca: base: components_open: found loaded component cm [lnx-user-vv:1266489] mca: base: components_register: registering framework mtl components [lnx-user-vv:1266489] mca: base: components_open: opening mtl components [lnx-user-vv:1266489] mca: base: close: component cm closed [lnx-user-vv:1266489] mca: base: close: unloading component cm [lnx-user-vv:1266489] mca: base: components_open: found loaded component ob1 [lnx-user-vv:1266489] mca: base: components_open: component ob1 open function successful [lnx-user-vv:1266489] mca: base: components_open: found loaded component ucx [lnx-user-vv:1266489] ../../../../../opal/mca/common/ucx/common_ucx.c:156 using OPAL memory hooks as external events 
[lnx-user-vv:1266489] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:211 mca_pml_ucx_open: UCX version 1.15.0 [1676965826.465573] [lnx-user-vv:1266489:0] ucp_context.c:2121 UCX INFO Version 1.15.0 (loaded from /home/user/devzone/UCX-EXT/build-release/install/lib/libucp.so.0) [lnx-user-vv:1266490] mca: base: components_register: registering framework pml components [lnx-user-vv:1266490] mca: base: components_register: found loaded component cm [lnx-user-vv:1266490] mca: base: components_register: component cm register function successful [lnx-user-vv:1266490] mca: base: components_register: found loaded component ob1 [lnx-user-vv:1266490] mca: base: components_register: component ob1 register function successful [lnx-user-vv:1266490] mca: base: components_register: found loaded component ucx [lnx-user-vv:1266490] mca: base: components_register: component ucx register function successful [lnx-user-vv:1266490] mca: base: components_register: found loaded component v [lnx-user-vv:1266490] mca: base: components_register: component v register function successful [lnx-user-vv:1266490] mca: base: components_open: opening pml components [lnx-user-vv:1266490] mca: base: components_open: found loaded component cm [lnx-user-vv:1266490] mca: base: components_register: registering framework mtl components [lnx-user-vv:1266490] mca: base: components_open: opening mtl components [lnx-user-vv:1266490] mca: base: close: component cm closed [lnx-user-vv:1266490] mca: base: close: unloading component cm [lnx-user-vv:1266490] mca: base: components_open: found loaded component ob1 [lnx-user-vv:1266490] mca: base: components_open: component ob1 open function successful [lnx-user-vv:1266490] mca: base: components_open: found loaded component ucx [lnx-user-vv:1266490] ../../../../../opal/mca/common/ucx/common_ucx.c:156 using OPAL memory hooks as external events [lnx-user-vv:1266490] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:211 mca_pml_ucx_open: UCX version 1.15.0 [1676965826.480449] 
[lnx-user-vv:1266490:0] ucp_context.c:2121 UCX INFO Version 1.15.0 (loaded from /home/user/devzone/UCX-EXT/build-release/install/lib/libucp.so.0) [lnx-user-vv:1266489] mca: base: components_open: component ucx open function successful [lnx-user-vv:1266489] mca: base: components_open: found loaded component v [lnx-user-vv:1266489] mca: base: components_open: component v open function successful [lnx-user-vv:1266489] select: initializing pml component ob1 [lnx-user-vv:1266489] select: init returned priority 20 [lnx-user-vv:1266489] select: initializing pml component ucx [lnx-user-vv:1266489] ../../../../../opal/mca/common/ucx/common_ucx.c:312 self/memory: did not match transport list [lnx-user-vv:1266489] ../../../../../opal/mca/common/ucx/common_ucx.c:312 tcp/lo: did not match transport list [lnx-user-vv:1266489] ../../../../../opal/mca/common/ucx/common_ucx.c:312 tcp/eno1: did not match transport list [lnx-user-vv:1266489] ../../../../../opal/mca/common/ucx/common_ucx.c:312 sysv/memory: did not match transport list [lnx-user-vv:1266489] ../../../../../opal/mca/common/ucx/common_ucx.c:312 posix/memory: did not match transport list [lnx-user-vv:1266489] ../../../../../opal/mca/common/ucx/common_ucx.c:312 cma/memory: did not match transport list [lnx-user-vv:1266489] ../../../../../opal/mca/common/ucx/common_ucx.c:317 support level is none [lnx-user-vv:1266489] select: init returned failure for component ucx [lnx-user-vv:1266489] select: component v not in the include list [lnx-user-vv:1266489] selected ob1 best priority 20 [lnx-user-vv:1266489] select: component ob1 selected [lnx-user-vv:1266489] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:286 mca_pml_ucx_close [lnx-user-vv:1266489] mca: base: close: component ucx closed [lnx-user-vv:1266489] mca: base: close: unloading component ucx [lnx-user-vv:1266489] mca: base: close: component v closed [lnx-user-vv:1266489] mca: base: close: unloading component v [lnx-user-vv:1266490] mca: base: components_open: component ucx 
open function successful [lnx-user-vv:1266490] mca: base: components_open: found loaded component v [lnx-user-vv:1266490] mca: base: components_open: component v open function successful [lnx-user-vv:1266490] select: initializing pml component ob1 [lnx-user-vv:1266490] select: init returned priority 20 [lnx-user-vv:1266490] select: initializing pml component ucx [lnx-user-vv:1266490] ../../../../../opal/mca/common/ucx/common_ucx.c:312 self/memory: did not match transport list [lnx-user-vv:1266490] ../../../../../opal/mca/common/ucx/common_ucx.c:312 tcp/lo: did not match transport list [lnx-user-vv:1266490] ../../../../../opal/mca/common/ucx/common_ucx.c:312 tcp/eno1: did not match transport list [lnx-user-vv:1266490] ../../../../../opal/mca/common/ucx/common_ucx.c:312 sysv/memory: did not match transport list [lnx-user-vv:1266490] ../../../../../opal/mca/common/ucx/common_ucx.c:312 posix/memory: did not match transport list [lnx-user-vv:1266490] ../../../../../opal/mca/common/ucx/common_ucx.c:312 cma/memory: did not match transport list [lnx-user-vv:1266490] ../../../../../opal/mca/common/ucx/common_ucx.c:317 support level is none [lnx-user-vv:1266490] select: init returned failure for component ucx [lnx-user-vv:1266490] select: component v not in the include list [lnx-user-vv:1266490] selected ob1 best priority 20 [lnx-user-vv:1266490] select: component ob1 selected [lnx-user-vv:1266490] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:286 mca_pml_ucx_close [lnx-user-vv:1266490] mca: base: close: component ucx closed [lnx-user-vv:1266490] mca: base: close: unloading component ucx [lnx-user-vv:1266490] mca: base: close: component v closed [lnx-user-vv:1266490] mca: base: close: unloading component v [lnx-user-vv:1266489] check:select: PML check not necessary on self [lnx-user-vv:1266490] check:select: checking my pml ob1 against process [[33713,1],0] pml ob1 # OSU MPI Bandwidth Test v7.0 # Size Bandwidth (MB/s) 1 13.59 2 25.63 4 48.84 8 108.59 16 212.99 32 420.44 64 
647.68 128 894.93 256 1505.38 512 2709.00 1024 5529.78 2048 9594.64 4096 6143.60 8192 9147.19 16384 12053.15 32768 16074.50 65536 18638.40 131072 19869.75 262144 19293.30 524288 19049.99 1048576 19331.92 2097152 18644.67 4194304 14899.71 [lnx-user-vv:1266490] mca: base: close: component ob1 closed [lnx-user-vv:1266490] mca: base: close: unloading component ob1 [lnx-user-vv:1266489] mca: base: close: component ob1 closed [lnx-user-vv:1266489] mca: base: close: unloading component ob1 ```

And the same command line with -mca pml ucx:

``` $ ~/devzone/OMPI-EXT/build-release/install/bin/mpirun -n 2 -mca pml ucx --map-by node --mca orte_base_help_aggregate 0 --mca pml_base_verbose 10 --mca mtl_base_verbose 10 --mca pml_ucx_verbose 10 --mca opal_common_ucx_tls all --mca opal_common_ucx_devices all -x UCX_LOG_LEVEL=info -x UCX_PROTO_ENABLE=y -x UCX_PROTO_INFO=y ~/devzone/OSU/ompi-ucx-ext/osu-micro-benchmarks-7.0.1/build-release/c/mpi/pt2pt/osu_bw [lnx-user-vv:1266345] mca: base: components_register: registering framework pml components [lnx-user-vv:1266345] mca: base: components_register: found loaded component ucx [lnx-user-vv:1266346] mca: base: components_register: registering framework pml components [lnx-user-vv:1266346] mca: base: components_register: found loaded component ucx [lnx-user-vv:1266346] mca: base: components_register: component ucx register function successful [lnx-user-vv:1266346] mca: base: components_open: opening pml components [lnx-user-vv:1266346] mca: base: components_open: found loaded component ucx [lnx-user-vv:1266345] mca: base: components_register: component ucx register function successful [lnx-user-vv:1266345] mca: base: components_open: opening pml components [lnx-user-vv:1266345] mca: base: components_open: found loaded component ucx [lnx-user-vv:1266346] ../../../../../opal/mca/common/ucx/common_ucx.c:156 using OPAL memory hooks as external events [lnx-user-vv:1266346] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:211 mca_pml_ucx_open: UCX version 1.15.0 [lnx-user-vv:1266345] ../../../../../opal/mca/common/ucx/common_ucx.c:156 using OPAL memory hooks as external events [lnx-user-vv:1266345] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:211 mca_pml_ucx_open: UCX version 1.15.0 [1676965655.992331] [lnx-user-vv:1266345:0] ucp_context.c:2121 UCX INFO Version 1.15.0 (loaded from /home/user/devzone/UCX-EXT/build-release/install/lib/libucp.so.0) [1676965655.992334] [lnx-user-vv:1266346:0] ucp_context.c:2121 UCX INFO Version 1.15.0 (loaded from 
/home/user/devzone/UCX-EXT/build-release/install/lib/libucp.so.0) [lnx-user-vv:1266346] mca: base: components_open: component ucx open function successful [lnx-user-vv:1266345] mca: base: components_open: component ucx open function successful [lnx-user-vv:1266346] select: initializing pml component ucx [lnx-user-vv:1266346] ../../../../../opal/mca/common/ucx/common_ucx.c:312 self/memory: did not match transport list [lnx-user-vv:1266346] ../../../../../opal/mca/common/ucx/common_ucx.c:312 tcp/lo: did not match transport list [lnx-user-vv:1266346] ../../../../../opal/mca/common/ucx/common_ucx.c:312 tcp/eno1: did not match transport list [lnx-user-vv:1266346] ../../../../../opal/mca/common/ucx/common_ucx.c:312 sysv/memory: did not match transport list [lnx-user-vv:1266346] ../../../../../opal/mca/common/ucx/common_ucx.c:312 posix/memory: did not match transport list [lnx-user-vv:1266346] ../../../../../opal/mca/common/ucx/common_ucx.c:312 cma/memory: did not match transport list [lnx-user-vv:1266346] ../../../../../opal/mca/common/ucx/common_ucx.c:317 support level is none [lnx-user-vv:1266346] select: init returned failure for component ucx [lnx-user-vv:1266345] select: initializing pml component ucx [lnx-user-vv:1266345] ../../../../../opal/mca/common/ucx/common_ucx.c:312 self/memory: did not match transport list [lnx-user-vv:1266345] ../../../../../opal/mca/common/ucx/common_ucx.c:312 tcp/lo: did not match transport list [lnx-user-vv:1266345] ../../../../../opal/mca/common/ucx/common_ucx.c:312 tcp/eno1: did not match transport list [lnx-user-vv:1266345] ../../../../../opal/mca/common/ucx/common_ucx.c:312 sysv/memory: did not match transport list [lnx-user-vv:1266345] ../../../../../opal/mca/common/ucx/common_ucx.c:312 posix/memory: did not match transport list [lnx-user-vv:1266345] ../../../../../opal/mca/common/ucx/common_ucx.c:312 cma/memory: did not match transport list [lnx-user-vv:1266345] ../../../../../opal/mca/common/ucx/common_ucx.c:317 support level is 
none [lnx-user-vv:1266345] select: init returned failure for component ucx -------------------------------------------------------------------------- No components were able to be opened in the pml framework. This typically means that either no components of this type were installed, or none of the installed components can be loaded. Sometimes this means that shared libraries required by these components are unable to be found/loaded. Host: lnx-user-vv Framework: pml -------------------------------------------------------------------------- ```

yosefe commented 1 year ago

@vasslavich can you please try with --mca pml_ucx_tls any --mca pml_ucx_devices any; it seems the opal_common prefix is not working as expected:

~/devzone/openmpi/build-release/install/bin/mpirun \
        -n 2 --map-by node \
        --mca orte_base_help_aggregate 0 \
        --mca pml_base_verbose 10 --mca mtl_base_verbose 10 --mca pml_ucx_verbose 10 \
        --mca pml_ucx_tls any --mca pml_ucx_devices any \
        -x UCX_LOG_LEVEL=info -x UCX_PROTO_ENABLE=y -x UCX_PROTO_INFO=y \
        ~/devzone/OSU/ompi-ucx/osu-micro-benchmarks-7.0.1/build-release/c/mpi/pt2pt/osu_bw
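As a way to confirm which variable names this build actually registers (the pml_ucx_* vs. opal_common_ucx_* prefixes), ompi_info can dump the MCA parameters; a sketch, assuming the install prefix from this thread:

```shell
# List all MCA parameters of the ucx PML, including the tls/devices filters.
~/devzone/OMPI-EXT/build-release/install/bin/ompi_info --param pml ucx --level 9

# Also grep the full parameter dump for the shared common_ucx variables.
~/devzone/OMPI-EXT/build-release/install/bin/ompi_info --all --level 9 | grep opal_common_ucx
```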
vasslavich commented 1 year ago

Thanks, I did that:

output

``` [lnx-user-vv:1273429] mca: base: components_register: registering framework pml components [lnx-user-vv:1273429] mca: base: components_register: found loaded component cm [lnx-user-vv:1273429] mca: base: components_register: component cm register function successful [lnx-user-vv:1273429] mca: base: components_register: found loaded component ob1 [lnx-user-vv:1273430] mca: base: components_register: registering framework pml components [lnx-user-vv:1273430] mca: base: components_register: found loaded component cm [lnx-user-vv:1273430] mca: base: components_register: component cm register function successful [lnx-user-vv:1273430] mca: base: components_register: found loaded component ob1 [lnx-user-vv:1273429] mca: base: components_register: component ob1 register function successful [lnx-user-vv:1273429] mca: base: components_register: found loaded component ucx [lnx-user-vv:1273430] mca: base: components_register: component ob1 register function successful [lnx-user-vv:1273430] mca: base: components_register: found loaded component ucx [lnx-user-vv:1273429] mca: base: components_register: component ucx register function successful [lnx-user-vv:1273429] mca: base: components_register: found loaded component v [lnx-user-vv:1273429] mca: base: components_register: component v register function successful [lnx-user-vv:1273429] mca: base: components_open: opening pml components [lnx-user-vv:1273429] mca: base: components_open: found loaded component cm [lnx-user-vv:1273429] mca: base: components_register: registering framework mtl components [lnx-user-vv:1273429] mca: base: components_open: opening mtl components [lnx-user-vv:1273429] mca: base: close: component cm closed [lnx-user-vv:1273429] mca: base: close: unloading component cm [lnx-user-vv:1273429] mca: base: components_open: found loaded component ob1 [lnx-user-vv:1273429] mca: base: components_open: component ob1 open function successful [lnx-user-vv:1273429] mca: base: components_open: found loaded 
component ucx [lnx-user-vv:1273430] mca: base: components_register: component ucx register function successful [lnx-user-vv:1273430] mca: base: components_register: found loaded component v [lnx-user-vv:1273430] mca: base: components_register: component v register function successful [lnx-user-vv:1273430] mca: base: components_open: opening pml components [lnx-user-vv:1273430] mca: base: components_open: found loaded component cm [lnx-user-vv:1273430] mca: base: components_register: registering framework mtl components [lnx-user-vv:1273430] mca: base: components_open: opening mtl components [lnx-user-vv:1273430] mca: base: close: component cm closed [lnx-user-vv:1273430] mca: base: close: unloading component cm [lnx-user-vv:1273430] mca: base: components_open: found loaded component ob1 [lnx-user-vv:1273430] mca: base: components_open: component ob1 open function successful [lnx-user-vv:1273430] mca: base: components_open: found loaded component ucx [lnx-user-vv:1273429] ../../../../../opal/mca/common/ucx/common_ucx.c:156 using OPAL memory hooks as external events [lnx-user-vv:1273429] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:211 mca_pml_ucx_open: UCX version 1.15.0 [lnx-user-vv:1273430] ../../../../../opal/mca/common/ucx/common_ucx.c:156 using OPAL memory hooks as external events [lnx-user-vv:1273430] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:211 mca_pml_ucx_open: UCX version 1.15.0 [1677048795.495328] [lnx-user-vv:1273429:0] ucp_context.c:2121 UCX INFO Version 1.15.0 (loaded from /home/user/devzone/UCX-EXT/build-release/install/lib/libucp.so.0) [1677048795.495335] [lnx-user-vv:1273430:0] ucp_context.c:2121 UCX INFO Version 1.15.0 (loaded from /home/user/devzone/UCX-EXT/build-release/install/lib/libucp.so.0) [lnx-user-vv:1273430] mca: base: components_open: component ucx open function successful [lnx-user-vv:1273430] mca: base: components_open: found loaded component v [lnx-user-vv:1273430] mca: base: components_open: component v open function successful 
```
[lnx-user-vv:1273430] select: initializing pml component ob1
[lnx-user-vv:1273430] select: init returned priority 20
[lnx-user-vv:1273430] select: initializing pml component ucx
[lnx-user-vv:1273430] ../../../../../opal/mca/common/ucx/common_ucx.c:242 ucx is enabled on any transport or device
[lnx-user-vv:1273430] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:302 mca_pml_ucx_init
[lnx-user-vv:1273429] mca: base: components_open: component ucx open function successful
[lnx-user-vv:1273429] mca: base: components_open: found loaded component v
[lnx-user-vv:1273429] mca: base: components_open: component v open function successful
[1677048795.543409] [lnx-user-vv:1273430:0] parser.c:2001 UCX INFO UCX_* env variables: UCX_LOG_LEVEL=info UCX_PROTO_ENABLE=y UCX_PROTO_INFO=y
[lnx-user-vv:1273430] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:124 Pack remote worker address, size 80
[lnx-user-vv:1273430] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:124 Pack local worker address, size 216
[lnx-user-vv:1273430] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:367 created ucp context 0x55758f62bff0, worker 0x7f47bc02e010
[lnx-user-vv:1273430] ../../../../../ompi/mca/pml/ucx/pml_ucx_component.c:147 returning priority 51
[lnx-user-vv:1273430] select: init returned priority 51
[lnx-user-vv:1273430] select: component v not in the include list
[lnx-user-vv:1273430] selected ucx best priority 51
[lnx-user-vv:1273430] select: component ucx selected
[lnx-user-vv:1273430] select: component ob1 not selected / finalized
[lnx-user-vv:1273430] mca: base: close: component ob1 closed
[lnx-user-vv:1273430] mca: base: close: unloading component ob1
[lnx-user-vv:1273430] mca: base: close: component v closed
[lnx-user-vv:1273430] mca: base: close: unloading component v
[lnx-user-vv:1273429] select: initializing pml component ob1
[lnx-user-vv:1273429] select: init returned priority 20
[lnx-user-vv:1273429] select: initializing pml component ucx
[lnx-user-vv:1273429] ../../../../../opal/mca/common/ucx/common_ucx.c:242 ucx is enabled on any transport or device
[lnx-user-vv:1273429] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:302 mca_pml_ucx_init
[1677048795.547561] [lnx-user-vv:1273429:0] parser.c:2001 UCX INFO UCX_* env variables: UCX_LOG_LEVEL=info UCX_PROTO_ENABLE=y UCX_PROTO_INFO=y
[lnx-user-vv:1273429] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:124 Pack remote worker address, size 80
[lnx-user-vv:1273429] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:124 Pack local worker address, size 216
[lnx-user-vv:1273429] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:367 created ucp context 0x5619b5cbbfe0, worker 0x7f75c8021010
[lnx-user-vv:1273429] ../../../../../ompi/mca/pml/ucx/pml_ucx_component.c:147 returning priority 51
[lnx-user-vv:1273429] select: init returned priority 51
[lnx-user-vv:1273429] select: component v not in the include list
[lnx-user-vv:1273429] selected ucx best priority 51
[lnx-user-vv:1273429] select: component ucx selected
[lnx-user-vv:1273429] select: component ob1 not selected / finalized
[lnx-user-vv:1273429] mca: base: close: component ob1 closed
[lnx-user-vv:1273429] mca: base: close: unloading component ob1
[lnx-user-vv:1273429] mca: base: close: component v closed
[lnx-user-vv:1273429] mca: base: close: unloading component v
[lnx-user-vv:1273429] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:192 Got proc 0 address, size 216
[lnx-user-vv:1273429] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:432 connecting to proc. 0
[lnx-user-vv:1273430] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:192 Got proc 1 address, size 216
[lnx-user-vv:1273430] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:432 connecting to proc. 1
[1677048795.548509] [lnx-user-vv:1273429:0] +---------------------------+---------------------------------------------------------+
[1677048795.548515] [lnx-user-vv:1273429:0] | 0x5619b5cbbfe0 self cfg#0 | tagged message by ucp_tag_send* from host memory        |
[1677048795.548516] [lnx-user-vv:1273429:0] +---------------------------+-------------------------------------------+-------------+
[1677048795.548517] [lnx-user-vv:1273429:0] |                   0..6868 | eager short                               | self/memory |
[1677048795.548518] [lnx-user-vv:1273429:0] |                 6869..inf | (?) rendezvous zero-copy read from remote | cma/memory  |
[1677048795.548519] [lnx-user-vv:1273429:0] +---------------------------+-------------------------------------------+-------------+
[1677048795.548511] [lnx-user-vv:1273430:0] +---------------------------+---------------------------------------------------------+
[1677048795.548515] [lnx-user-vv:1273430:0] | 0x55758f62bff0 self cfg#0 | tagged message by ucp_tag_send* from host memory        |
[1677048795.548517] [lnx-user-vv:1273430:0] +---------------------------+-------------------------------------------+-------------+
[1677048795.548518] [lnx-user-vv:1273430:0] |                   0..6868 | eager short                               | self/memory |
[1677048795.548519] [lnx-user-vv:1273430:0] |                 6869..inf | (?) rendezvous zero-copy read from remote | cma/memory  |
[1677048795.548520] [lnx-user-vv:1273430:0] +---------------------------+-------------------------------------------+-------------+
[lnx-user-vv:1273429] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:192 Got proc 1 address, size 216
[lnx-user-vv:1273429] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:432 connecting to proc. 1
[lnx-user-vv:1273430] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:192 Got proc 0 address, size 216
[lnx-user-vv:1273430] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:432 connecting to proc. 0
[1677048795.549666] [lnx-user-vv:1273429:0] +---------------------------------+---------------------------------------------------------+
[1677048795.549671] [lnx-user-vv:1273429:0] | 0x5619b5cbbfe0 intra-node cfg#1 | tagged message by ucp_tag_send* from host memory        |
[1677048795.549674] [lnx-user-vv:1273429:0] +---------------------------------+-------------------------------------------+-------------+
[1677048795.549675] [lnx-user-vv:1273429:0] |                           0..92 | eager short                               | sysv/memory |
[1677048795.549677] [lnx-user-vv:1273429:0] |                        93..8248 | eager copy-in copy-out                    | sysv/memory |
[1677048795.549679] [lnx-user-vv:1273429:0] |                     8249..10434 | multi-frag eager copy-in copy-out         | sysv/memory |
[1677048795.549682] [lnx-user-vv:1273429:0] |                      10435..inf | (?) rendezvous zero-copy read from remote | cma/memory  |
[1677048795.549683] [lnx-user-vv:1273429:0] +---------------------------------+-------------------------------------------+-------------+
[1677048795.549690] [lnx-user-vv:1273430:0] +---------------------------------+---------------------------------------------------------+
[1677048795.549696] [lnx-user-vv:1273430:0] | 0x55758f62bff0 intra-node cfg#1 | tagged message by ucp_tag_send* from host memory        |
[1677048795.549698] [lnx-user-vv:1273430:0] +---------------------------------+-------------------------------------------+-------------+
[1677048795.549700] [lnx-user-vv:1273430:0] |                           0..92 | eager short                               | sysv/memory |
[1677048795.549701] [lnx-user-vv:1273430:0] |                        93..8248 | eager copy-in copy-out                    | sysv/memory |
[1677048795.549703] [lnx-user-vv:1273430:0] |                     8249..10434 | multi-frag eager copy-in copy-out         | sysv/memory |
[1677048795.549704] [lnx-user-vv:1273430:0] |                      10435..inf | (?) rendezvous zero-copy read from remote | cma/memory  |
[1677048795.549706] [lnx-user-vv:1273430:0] +---------------------------------+-------------------------------------------+-------------+
# OSU MPI Bandwidth Test v7.0
# Size      Bandwidth (MB/s)
[1677048795.550919] [lnx-user-vv:1273429:0] +---------------------------------+------------------------------------------------------------+
[1677048795.550924] [lnx-user-vv:1273429:0] | 0x5619b5cbbfe0 intra-node cfg#1 | tagged message by ucp_tag_send*(multi) from host memory    |
[1677048795.550925] [lnx-user-vv:1273429:0] +---------------------------------+----------------------------------------------+-------------+
[1677048795.550926] [lnx-user-vv:1273429:0] |                           0..92 | eager short                                  | sysv/memory |
[1677048795.550927] [lnx-user-vv:1273429:0] |                        93..8248 | eager copy-in copy-out                       | sysv/memory |
[1677048795.550928] [lnx-user-vv:1273429:0] |                     8249..15491 | multi-frag eager copy-in copy-out            | sysv/memory |
[1677048795.550929] [lnx-user-vv:1273429:0] |                      15492..inf | (?) rendezvous zero-copy read from remote    | cma/memory  |
[1677048795.550930] [lnx-user-vv:1273429:0] +---------------------------------+----------------------------------------------+-------------+
1                      12.35
2                      24.90
4                      49.64
8                      97.81
16                    194.38
32                    372.19
64                    762.27
128                  1114.25
256                  2180.67
512                  3681.70
1024                 6178.84
2048                 9752.58
4096                13728.26
8192                18042.38
[1677048795.566183] [lnx-user-vv:1273430:0] +---------------------------------+------------------------------------------------------------+
[1677048795.566188] [lnx-user-vv:1273430:0] | 0x55758f62bff0 intra-node cfg#1 | rendezvous data fetch(multi) into host memory from host    |
[1677048795.566190] [lnx-user-vv:1273430:0] +---------------------------------+----------------------------------------------+-------------+
[1677048795.566192] [lnx-user-vv:1273430:0] |                               0 | no data fetch                                |             |
[1677048795.566194] [lnx-user-vv:1273430:0] |                        1..13053 | (?) fragmented copy-in copy-out              | sysv/memory |
[1677048795.566195] [lnx-user-vv:1273430:0] |                      13054..inf | zero-copy read from remote                   | cma/memory  |
[1677048795.566196] [lnx-user-vv:1273430:0] +---------------------------------+----------------------------------------------+-------------+
[1677048795.567121] [lnx-user-vv:1273429:0] ucp_ep.c:1490 UCX DIAG ep 0x7f75c8001040: error 'Connection reset by remote peer' on tcp/lo will not be handled since no error callback is installed
[lnx-user-vv:00000] *** An error occurred in MPI_Waitall
[lnx-user-vv:00000] *** reported by process [2031943681,0]
[lnx-user-vv:00000] *** on communicator MPI_COMM_WORLD
[lnx-user-vv:00000] *** MPI_ERR_INTERN: internal error
[lnx-user-vv:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lnx-user-vv:00000] *** and MPI will try to terminate your MPI job as well)
```

yosefe commented 1 year ago

@vasslavich according to the log, adding `--mca pml_ucx_tls any --mca pml_ucx_devices any` enabled UCX.
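For reference, the full command line with those two MCA parameters added would look roughly like this. This is a sketch built from the paths in the report above; the script only prints the command so the flags can be inspected before running, and the paths must of course be adjusted to the local installation:

```shell
#!/bin/sh
# Paths taken from the original report; adjust for your own build tree.
MPIRUN=~/devzone/openmpi/build-release/install/bin/mpirun
BENCH=~/devzone/OSU/ompi-ucx/osu-micro-benchmarks-7.0.1/build-release/c/mpi/pt2pt/osu_bw

# Assemble the mpirun invocation with the suggested UCX selection flags.
CMD="$MPIRUN -n 2 \
  --mca pml_ucx_tls any \
  --mca pml_ucx_devices any \
  --mca pml_base_verbose 10 \
  -x UCX_LOG_LEVEL=info -x UCX_PROTO_ENABLE=y -x UCX_PROTO_INFO=y \
  --map-by node $BENCH"

# Print rather than execute, so the flags can be checked first.
echo "$CMD"
```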

vasslavich commented 1 year ago

Hmm, @yosefe, thank you a lot! Could you please suggest which arguments I should pass to `mpirun` to get the UCX logs? I am interested in:

Thank you!
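For context, this is roughly how I raise UCX verbosity at the moment (a sketch; the command is only printed, not executed, and `UCX_LOG_LEVEL` values such as `error`, `warn`, `info`, `debug`, and `trace` are the levels UCX documents — whether any of them prints what I need is exactly my question):

```shell
#!/bin/sh
# UCX environment variables passed through mpirun with -x.
UCX_FLAGS="-x UCX_LOG_LEVEL=debug -x UCX_PROTO_ENABLE=y -x UCX_PROTO_INFO=y"

# Print the resulting command line instead of running it.
echo "mpirun -n 2 $UCX_FLAGS --map-by node ./osu_bw"
```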

yosefe commented 1 year ago
vasslavich commented 1 year ago

@yosefe, thank you a lot!