open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Spawning process not working with multiple MPI processes #9316

Open goduck777 opened 3 years ago

goduck777 commented 3 years ago

Background information

Calling MPI_Comm_spawn from a job with multiple MPI processes can lead to an error.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

I have tried v4.1.x, v5.0.x, and master. For UCX, I have tried 1.10 and 1.11.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From a source tarball of a daily snapshot.

Please describe the system on which you are running


Details of the problem

With the latest versions of Open MPI and UCX, MPI_Comm_spawn can lead to a UCX error when Open MPI is run with np larger than 1.

One can reproduce this with a simple test, taken from #6833.

Parent code:

#define _GNU_SOURCE
#include <sched.h>

#include <stdio.h>
#include <mpi.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>

int
main(int argc, char *argv[])
{
    int provided, n, rank, size;
    MPI_Comm intercomm, universe;
    MPI_Win win;
    int *buf;
    MPI_Aint bufsize;
    int disp = sizeof(int);
    int k;

    //MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);
    //assert(provided == MPI_THREAD_MULTIPLE);
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1,
        MPI_INFO_NULL, 0, MPI_COMM_WORLD,
        &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Intercomm_merge(intercomm, 0, &universe);
    MPI_Comm_size(universe, &k);

    bufsize = sizeof(int);

    MPI_Win_allocate_shared(bufsize, 1, MPI_INFO_NULL, universe, &buf, &win);

    buf[0] = 666;

    MPI_Barrier(universe);

    /* Worker runs now */

    MPI_Barrier(universe);

    assert(buf[0] == 555);
    MPI_Finalize();

    return 0;
}

Worker code:

#include <mpi.h>
#include <stddef.h>
#include <assert.h>

int
main(int argc, char *argv[])
{
    MPI_Comm parent, universe;
    int *buf;
    size_t bufsize;
    int rank, disp;
    char hostname[100];
    MPI_Win win;
    MPI_Aint asize;

    MPI_Init(&argc, &argv);

    MPI_Comm_get_parent(&parent);

    assert(parent != MPI_COMM_NULL);

    MPI_Intercomm_merge(parent, 0, &universe);

    MPI_Win_allocate_shared(0, 1, MPI_INFO_NULL, universe, &buf, &win);
    MPI_Win_shared_query(win, MPI_PROC_NULL, &asize, &disp, &buf);

    MPI_Barrier(universe);

    assert(buf[0] == 666);

    buf[0] = 555;

    MPI_Barrier(universe);

    MPI_Finalize();
    return 0;
}
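
A note on the worker's window calls: because the worker allocates zero bytes, MPI_Win_shared_query with rank MPI_PROC_NULL returns the segment of the lowest rank that allocated a non-zero size, i.e. the parent's buffer. For reference, the same shared-window pattern works without any spawning. The following is only an illustrative single-node sketch (not part of the reproducer; the file name shared_nospawn.c is hypothetical), using the same barrier-only synchronization as above:

#include <mpi.h>
#include <assert.h>

int
main(int argc, char *argv[])
{
    int rank, size, disp_unit;
    int *buf;
    MPI_Aint qsize, bufsize;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Rank 0 contributes one int, everyone else contributes 0 bytes,
       mirroring the parent/worker split in the reproducer. */
    bufsize = (rank == 0) ? (MPI_Aint)sizeof(int) : 0;
    MPI_Win_allocate_shared(bufsize, 1, MPI_INFO_NULL,
                            MPI_COMM_WORLD, &buf, &win);

    /* With MPI_PROC_NULL, query the segment of the lowest rank that
       allocated a non-zero size, i.e. rank 0's int. */
    MPI_Win_shared_query(win, MPI_PROC_NULL, &qsize, &disp_unit, &buf);

    if (rank == 0)
        buf[0] = 666;
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == size - 1)
        buf[0] = 555;
    MPI_Barrier(MPI_COMM_WORLD);

    assert(buf[0] == 555);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}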

Makefile:

CC=mpicc

CFLAGS=-g -O0

BIN=main worker

all: $(BIN)

clean:
    rm -f $(BIN) *.o

If I run mpirun --oversubscribe -n 1 ./main, I get the correct result.

If I run mpirun --oversubscribe -n 2 ./main, I get the following error:

[1629999678.721002] [traverse:4085807:0]          wireup.c:1038 UCX  ERROR   old: am_lane 0 wireup_msg_lane 2 cm_lane <none> reachable_mds 0x1eb ep_check_map 0x0
[1629999678.721038] [traverse:4085807:0]          wireup.c:1048 UCX  ERROR   old: lane[0]:  0:posix/memory.0 md[0]          -> md[0]/posix/sysdev[255] am am_bw#0
[1629999678.721046] [traverse:4085807:0]          wireup.c:1048 UCX  ERROR   old: lane[1]: 19:cma/memory.0 md[8]            -> md[8]/cma/sysdev[255] rma_bw#1
[1629999678.721054] [traverse:4085807:0]          wireup.c:1048 UCX  ERROR   old: lane[2]: 10:rc_mlx5/mlx5_0:1.0 md[6]      -> md[6]/ib/sysdev[0] rma_bw#2 wireup
[1629999678.721061] [traverse:4085807:0]          wireup.c:1048 UCX  ERROR   old: lane[3]:  8:cuda_ipc/cuda.0 md[5]         -> md[5]/cuda_ipc/sysdev[255] rma_bw#0
[1629999678.721067] [traverse:4085807:0]          wireup.c:1038 UCX  ERROR   new: am_lane 0 wireup_msg_lane 2 cm_lane <none> reachable_mds 0x1eb ep_check_map 0x0
[1629999678.721073] [traverse:4085807:0]          wireup.c:1048 UCX  ERROR   new: lane[0]:  0:posix/memory.0 md[0]          -> md[0]/posix/sysdev[255] am am_bw#0
[1629999678.721080] [traverse:4085807:0]          wireup.c:1048 UCX  ERROR   new: lane[1]: 10:rc_mlx5/mlx5_0:1.0 md[6]      -> md[6]/ib/sysdev[0] rma_bw#1
[1629999678.721086] [traverse:4085807:0]          wireup.c:1048 UCX  ERROR   new: lane[2]: 15:rc_mlx5/mlx5_2:1.0 md[7]      -> md[7]/ib/sysdev[1] rma_bw#2 wireup
[1629999678.721092] [traverse:4085807:0]          wireup.c:1048 UCX  ERROR   new: lane[3]:  8:cuda_ipc/cuda.0 md[5]         -> md[5]/cuda_ipc/sysdev[255] rma_bw#0
[1629999678.721294] [traverse:4085806:0]          wireup.c:1038 UCX  ERROR   old: am_lane 0 wireup_msg_lane 2 cm_lane <none> reachable_mds 0x1eb ep_check_map 0x0
[1629999678.721308] [traverse:4085806:0]          wireup.c:1048 UCX  ERROR   old: lane[0]:  0:posix/memory.0 md[0]          -> md[0]/posix/sysdev[255] am am_bw#0
[1629999678.721316] [traverse:4085806:0]          wireup.c:1048 UCX  ERROR   old: lane[1]: 19:cma/memory.0 md[8]            -> md[8]/cma/sysdev[255] rma_bw#1
[1629999678.721323] [traverse:4085806:0]          wireup.c:1048 UCX  ERROR   old: lane[2]: 10:rc_mlx5/mlx5_0:1.0 md[6]      -> md[6]/ib/sysdev[0] rma_bw#2 wireup
[1629999678.721329] [traverse:4085806:0]          wireup.c:1048 UCX  ERROR   old: lane[3]:  8:cuda_ipc/cuda.0 md[5]         -> md[5]/cuda_ipc/sysdev[255] rma_bw#0
[1629999678.721335] [traverse:4085806:0]          wireup.c:1038 UCX  ERROR   new: am_lane 0 wireup_msg_lane 2 cm_lane <none> reachable_mds 0x1eb ep_check_map 0x0
[1629999678.721342] [traverse:4085806:0]          wireup.c:1048 UCX  ERROR   new: lane[0]:  0:posix/memory.0 md[0]          -> md[0]/posix/sysdev[255] am am_bw#0
[1629999678.721348] [traverse:4085806:0]          wireup.c:1048 UCX  ERROR   new: lane[1]: 10:rc_mlx5/mlx5_0:1.0 md[6]      -> md[6]/ib/sysdev[0] rma_bw#1
[1629999678.721354] [traverse:4085806:0]          wireup.c:1048 UCX  ERROR   new: lane[2]: 15:rc_mlx5/mlx5_2:1.0 md[7]      -> md[7]/ib/sysdev[1] rma_bw#2 wireup

This works with the older Open MPI (4.0.x) and UCX 1.8.
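
To help narrow this down, a spawn-only variant with no shared window exercises just the MPI_Comm_spawn / MPI_Intercomm_merge path. This is only an illustrative sketch (not part of the original reproducer); the binary spawns one extra copy of itself, so with a hypothetical file name spawn_only.c it can be run as mpirun --oversubscribe -n 2 ./spawn_only:

#include <mpi.h>
#include <stdio.h>

int
main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm, universe;
    int rank, usize;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent side: collective over MPI_COMM_WORLD, rooted at rank 0,
           spawning a single extra copy of this binary. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        MPI_Intercomm_merge(intercomm, 0, &universe);
    } else {
        /* Child side: merge with the parents. */
        MPI_Intercomm_merge(parent, 1, &universe);
    }

    MPI_Comm_rank(universe, &rank);
    MPI_Comm_size(universe, &usize);
    MPI_Barrier(universe);
    printf("rank %d of %d in merged communicator\n", rank, usize);

    MPI_Comm_free(&universe);
    MPI_Finalize();
    return 0;
}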

karasevb commented 3 years ago

@dmitrygx is this still relevant, or can we close it?

dmitrygx commented 3 years ago

@dmitrygx is this still relevant, or can we close it?

yes, we can close it

dmitrygx commented 3 years ago

@dmitrygx is this still relevant, or can we close it?

yes, we can close it

Ah, we still have a problem (https://github.com/openucx/ucx/issues/7316#issuecomment-912972449); we need to investigate it.