open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Is UCX working with MPI-Sessions? #12566

Closed. TimEllersiek closed this issue 1 week ago.

TimEllersiek commented 5 months ago

UCX and MPI-Sessions

When I try to use Open MPI with UCX on our small university cluster, I get an error message saying that the MPI Sessions features are not supported by UCX (the cluster uses an InfiniBand interconnect). However, when I install everything on my local machine (Arch Linux), it all seems to work fine. So I'm wondering: are MPI Sessions supported by UCX or not?

Source Code (main.c):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void function_my_session_errhandler(MPI_Session *foo, int *bar, ...) {
    fprintf(stderr, "my error handler called here with error %d\n", *bar);
}

void function_check_print_error(char *format, int rc) {
    if (MPI_SUCCESS != rc) {
        fprintf(stderr, format, rc);
        abort();
    }
}

int main(int argc, char *argv[]) {
    MPI_Session session;
    MPI_Errhandler errhandler;
    MPI_Group group;
    MPI_Comm comm_world, comm_self;
    MPI_Info info;
    int rc, npsets, one = 1, sum;

    rc = MPI_Session_create_errhandler(function_my_session_errhandler, &errhandler);
    function_check_print_error("Error handler creation failed with rc = %d\n", rc);

    rc = MPI_Info_create(&info);
    function_check_print_error("Info creation failed with rc = %d\n", rc);

    rc = MPI_Info_set(info, "thread_level", "MPI_THREAD_MULTIPLE");
    function_check_print_error("Info key/val set failed with rc = %d\n", rc);

    rc = MPI_Session_init(info, errhandler, &session);
    function_check_print_error("Session initialization failed with rc = %d\n", rc);

    rc = MPI_Session_get_num_psets(session, MPI_INFO_NULL, &npsets);
    function_check_print_error(" with rc = %d\n", rc);

    for (int i = 0; i < npsets; i++) {
        int psetlen = 0;
        char pset_name[256];

        /* First call (with NULL buffer) queries the name length, second call retrieves the name. */
        MPI_Session_get_nth_pset(session, MPI_INFO_NULL, i, &psetlen, NULL);
        MPI_Session_get_nth_pset(session, MPI_INFO_NULL, i, &psetlen, pset_name);
        fprintf(stderr, "  PSET %d: %s (len: %d)\n", i, pset_name, psetlen);
    }

    rc = MPI_Group_from_session_pset(session, "mpi://WORLD", &group);
    function_check_print_error("Could not get a group for mpi://WORLD. rc = %d\n", rc);

    rc = MPI_Comm_create_from_group(group, "my_world", MPI_INFO_NULL, MPI_ERRORS_RETURN, &comm_world);
    function_check_print_error("Could not create Communicator my_world. rc = %d\n", rc);

    MPI_Group_free(&group);

    MPI_Allreduce(&one, &sum, 1, MPI_INT, MPI_SUM, comm_world);

    fprintf(stderr, "World Comm Sum (1): %d\n", sum);

    rc = MPI_Group_from_session_pset(session, "mpi://SELF", &group);
    function_check_print_error("Could not get a group for mpi://SELF. rc = %d\n", rc);

    MPI_Comm_create_from_group(group, "myself", MPI_INFO_NULL, MPI_ERRORS_RETURN, &comm_self);
    MPI_Group_free(&group);

    MPI_Allreduce(&one, &sum, 1, MPI_INT, MPI_SUM, comm_self);

    fprintf(stderr, "Self Comm Sum (1): %d\n", sum);

    MPI_Errhandler_free(&errhandler);
    MPI_Info_free(&info);
    MPI_Comm_free(&comm_world);
    MPI_Comm_free(&comm_self);
    MPI_Session_finalize(&session);

    return 0;
}
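
For reference, the handler above only prints the raw integer code (52 in the failing run below). A variant that also decodes the code via the standard MPI_Error_string routine could look like this (a sketch of mine, not part of the original reproducer; the function name is illustrative):

void function_my_session_errhandler_verbose(MPI_Session *foo, int *bar, ...) {
    char msg[MPI_MAX_ERROR_STRING];
    int len = 0;
    /* Translate the MPI error code into a human-readable string. */
    MPI_Error_string(*bar, msg, &len);
    fprintf(stderr, "my error handler called here with error %d: %s\n", *bar, msg);
}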

Commands used to compile and run

mpicc -o main main.c
mpirun -np 1 -mca osc ucx ./main
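
If it is unclear which PML Open MPI actually selects at run time, the standard MCA verbosity parameter can be used to watch the selection (a debugging aid, not part of the original report):

mpirun -np 1 -mca pml ucx -mca pml_base_verbose 10 ./main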

Console Output Uni-Cluster:

$ mpirun -np 1 -mca pml ucx main
  PSET 0: mpi://WORLD (len: 12)
  PSET 1: mpi://SELF (len: 11)
  PSET 2: mpix://SHARED (len: 14)
Could not create Communicator my_world. rc = 52
[nv46:97180] *** Process received signal ***
[nv46:97180] Signal: Aborted (6)
[nv46:97180] Signal code:  (-6)
--------------------------------------------------------------------------
Your application has invoked an MPI function that is not supported in
this environment.

  MPI function: MPI_Comm_from_group/MPI_Intercomm_from_groups
  Reason:       The PML being used - ucx - does not support MPI sessions related features
--------------------------------------------------------------------------
[nv46:97180] [ 0] /usr/lib/libc.so.6(+0x3c770)[0x72422de41770]
[nv46:97180] [ 1] /usr/lib/libc.so.6(+0x8d32c)[0x72422de9232c]
[nv46:97180] [ 2] /usr/lib/libc.so.6(gsignal+0x18)[0x72422de416c8]
[nv46:97180] [ 3] /usr/lib/libc.so.6(abort+0xd7)[0x72422de294b8]
[nv46:97180] [ 4] main(+0x12f4)[0x6239e33802f4]
[nv46:97180] [ 5] main(+0x1585)[0x6239e3380585]
[nv46:97180] [ 6] /usr/lib/libc.so.6(+0x25cd0)[0x72422de2acd0]
[nv46:97180] [ 7] /usr/lib/libc.so.6(__libc_start_main+0x8a)[0x72422de2ad8a]
[nv46:97180] [ 8] main(+0x1165)[0x6239e3380165]
[nv46:97180] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 97180 on node nv46 exited on
signal 6 (Aborted).
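
A possible workaround until this is fixed (my assumption, not suggested in the thread): since the message blames the ucx PML specifically, forcing the ob1 PML should let the sessions calls go through, at the cost of bypassing UCX's InfiniBand path:

mpirun -np 1 -mca pml ob1 ./main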

Console Output Local:

$ mpirun -np 1 -mca osc ucx main
  PSET 0: mpi://WORLD (len: 12)
  PSET 1: mpi://SELF (len: 11)
  PSET 2: mpix://SHARED (len: 14)
  World Comm Sum (1): 1
  Self Comm Sum (1): 1

Installation

Small Uni-Cluster

UCX Output

Output of configure-release:

configure:           ASAN check:   no
configure:         Multi-thread:   disabled
configure:            MPI tests:   disabled
configure:          VFS support:   yes
configure:        Devel headers:   no
configure: io_demo CUDA support:   no
configure:             Bindings:   < >
configure:          UCS modules:   < fuse >
configure:          UCT modules:   < ib rdmacm cma >
configure:         CUDA modules:   < >
configure:         ROCM modules:   < >
configure:           IB modules:   < >
configure:          UCM modules:   < >
configure:         Perf modules:   < >

Output of ucx_info after make install:

$UCXFOLDER/myinstall/bin/ucx_info -v
# Library version: 1.17.0
# Library path: ${HOME}/itoyori/ucx/myinstall/lib/libucs.so.0
# API headers version: 1.17.0
# Git branch 'master', revision a48ad8f
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=${HOME}/itoyori/ucx/myinstall --without-go

OpenMPI Output

Output of configure:

Open MPI configuration:
-----------------------
Version: 5.0.3
MPI Standard Version: 3.1
Build MPI C bindings: yes
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
Build MPI Java bindings (experimental): no
Build Open SHMEM support: yes
Debug build: no
Platform file: (none)

Miscellaneous
-----------------------
Atomics: GCC built-in style atomics
Fault Tolerance support: mpi
HTML docs and man pages: installing packaged docs
hwloc: external
libevent: external
Open UCC: no
pmix: external
PRRTE: external
Threading Package: pthreads

Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no (not found)
Open UCX: yes
OpenFabrics OFI Libfabric: yes (pkg-config: default search paths)
Portals4: no (not found)
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes

Accelerators
-----------------------
CUDA support: no
ROCm support: no

OMPIO File Systems
-----------------------
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no (not found)
Lustre: no (not found)
PVFS2/OrangeFS: no

Local

UCX Output

Output of configure-release:

configure: =========================================================
configure: UCX build configuration:
configure:         Build prefix:   ${HOME}/ucx/myinstall
configure:    Configuration dir:   ${prefix}/etc/ucx
configure:   Preprocessor flags:   -DCPU_FLAGS="" -I${abs_top_srcdir}/src -I${abs_top_builddir} -I${abs_top_builddir}/src
configure:           C compiler:   gcc -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch -Wno-pointer-sign -Werror-implicit-function-declaration -Wno-format-zero-length -Wnested-externs -Wshadow -Werror=declaration-after-statement
configure:         C++ compiler:   g++ -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch
configure:         Multi-thread:   disabled
configure:            MPI tests:   disabled
configure:          VFS support:   yes
configure:        Devel headers:   no
configure: io_demo CUDA support:   no
configure:             Bindings:   < >
configure:          UCS modules:   < fuse >
configure:          UCT modules:   < cma >
configure:         CUDA modules:   < >
configure:         ROCM modules:   < >
configure:           IB modules:   < >
configure:          UCM modules:   < >
configure:         Perf modules:   < >
configure: =========================================================

Output of ucx_info after make install:

$UCXFOLDER/myinstall/bin/ucx_info -v
# Library version: 1.16.0
# Library path: ${HOME}/ucx/myinstall/lib/libucs.so.0
# API headers version: 1.16.0
# Git branch '', revision e4bb802
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=${HOME}/ucx/myinstall --without-go

OpenMPI Output

Output of configure:

Open MPI configuration:
-----------------------
Version: 5.0.3
MPI Standard Version: 3.1
Build MPI C bindings: yes
Build MPI Fortran bindings: no
Build MPI Java bindings (experimental): no
Build Open SHMEM support: yes
Debug build: no
Platform file: (none)

Miscellaneous
-----------------------
Atomics: GCC built-in style atomics
Fault Tolerance support: mpi
HTML docs and man pages: installing packaged docs
hwloc: internal
libevent: external
Open UCC: no
pmix: internal
PRRTE: internal
Threading Package: pthreads

Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no (not found)
Open UCX: yes
OpenFabrics OFI Libfabric: no (not found)
Portals4: no (not found)
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes

Accelerators
-----------------------
CUDA support: no
ROCm support: no

OMPIO File Systems
-----------------------
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no (not found)
Lustre: no (not found)
PVFS2/OrangeFS: no

MPI and UCX Installation

Folder structure:

${HOME}/ucx
${HOME}/openmpi-5.0.3

Install OpenUCX

cd ${HOME}
git clone https://github.com/openucx/ucx.git
cd ucx
git checkout v1.16.0
export UCXFOLDER=${HOME}/ucx
./autogen.sh
./contrib/configure-release --prefix=$UCXFOLDER/myinstall --without-go

Install:

make -j32
make install
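
As a sanity check that the InfiniBand transports were actually built and detected, the standard ucx_info tool can list them (the exact transport names depend on the hardware):

$UCXFOLDER/myinstall/bin/ucx_info -d | grep -i transport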

OpenMPI

cd ${HOME}
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.3.tar.gz
tar xfvz openmpi-5.0.3.tar.gz
export MPIFOLDER=${HOME}/openmpi-5.0.3
cd $MPIFOLDER
./configure --disable-io-romio --with-io-romio-flags=--without-ze --disable-sphinx --prefix="$MPIFOLDER/myinstall" --with-ucx="$UCXFOLDER/myinstall" 2>&1 | tee config.out

Install:

make -j32 all 2>&1 | tee make.out
make install 2>&1 | tee install.out
export OMPI="${MPIFOLDER}/myinstall"
export PATH=$OMPI/bin:$PATH
export LD_LIBRARY_PATH=$OMPI/lib:$LD_LIBRARY_PATH
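
To verify that this freshly built Open MPI is the one on the PATH and that it was compiled with UCX support, the standard tools can be used (listed here as a hedged sanity check, not part of the original report):

which mpirun
ompi_info | grep -i ucx
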
janjust commented 5 months ago

> So I'm wondering whether the MPI-Sessions are supported by UCX or not?

Yes, that's the case: MPI Sessions are not supported by UCX.
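
Until the UCX PML gains sessions support, an application that has to run on such clusters could detect the failure and fall back to the World Model; MPI-4 explicitly allows both models in the same process. A sketch of mine (function and tag names are illustrative, not an official recommendation from this thread):

/* Create a "world" communicator, preferring the Sessions Model but
 * falling back to MPI_Init (World Model) if the PML rejects sessions. */
static int get_world_comm(MPI_Session session, MPI_Comm *comm) {
    MPI_Group group;
    int rc = MPI_Group_from_session_pset(session, "mpi://WORLD", &group);
    if (MPI_SUCCESS == rc) {
        rc = MPI_Comm_create_from_group(group, "fallback_world", MPI_INFO_NULL,
                                        MPI_ERRORS_RETURN, comm);
        MPI_Group_free(&group);
    }
    if (MPI_SUCCESS != rc) {
        MPI_Init(NULL, NULL);                  /* World Model fallback */
        rc = MPI_Comm_dup(MPI_COMM_WORLD, comm);
    }
    return rc;
}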

hppritcha commented 5 months ago

It's on the feature list for the next major release.

TimEllersiek commented 5 months ago

Thanks for the answers.

devreal commented 5 months ago

Let's keep this open until it's fixed. Other people will probably run into this too.

jprotze commented 3 months ago

Is there any way to run Open MPI 5 with sessions?

hppritcha commented 1 week ago

closed via #12723

hppritcha commented 1 week ago

No plans currently to push these changes back to v5.0.x branch.