open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

mpirun segfault in mpi_comm_connect/accept on multimachine jobs #6818

Open · zack-hable opened this issue 5 years ago

zack-hable commented 5 years ago

Background information

What version of Open MPI are you using?

Describe how Open MPI was installed

Please describe the system on which you are running


Details of the problem

Steps to reproduce

Observed Behavior

When a server job submitted via SGE (the same happens with Slurm and Torque) spans multiple machines, attempting to connect with a client results in a segfault in the mpirun executable on the client side. However, when the job runs on a single machine, the connection works (use -pe mpi 1 in the Server/Client.job files).

Expected Behavior

Each client should connect to every server instance and then send "Hello World!" to one of them (i.e. clients 1 and 2 both connect to servers 1 and 2, but client 1 sends only to server 1 and client 2 sends only to server 2).

Notes

Other Versions

I've also tested other versions of Open MPI (same install procedure as above). Of those, v1.10.7 works as expected, but I cannot use that version because it will not support EFA networking.

Output

client.out

0 WS: 2
1 WS: 2
0 connecting to name worker0
1 connecting to name worker0
[ip-YYY-YYY-YYY-YYY:20938] *** Process received signal ***
[ip-YYY-YYY-YYY-YYY:20938] Signal: Segmentation fault (11)
[ip-YYY-YYY-YYY-YYY:20938] Signal code: Address not mapped (1)
[ip-YYY-YYY-YYY-YYY:20938] Failing at address: (nil)
[ip-YYY-YYY-YYY-YYY:20938] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x151ccad775e0]
[ip-YYY-YYY-YYY-YYY:20938] [ 1] /lib64/libc.so.6(+0x8c421)[0x151ccaa27421]
[ip-YYY-YYY-YYY-YYY:20938] [ 2] /lib64/libc.so.6(__strdup+0xe)[0x151ccaa2712e]
[ip-YYY-YYY-YYY-YYY:20938] [ 3] /shared/mpi/lib/libopen-rte.so.40(orte_rml_base_parse_uris+0x15)[0x151ccc046e85]
[ip-YYY-YYY-YYY-YYY:20938] [ 4] /shared/mpi/lib/libopen-rte.so.40(+0x45902)[0x151ccc00d902]
[ip-YYY-YYY-YYY-YYY:20938] [ 5] /shared/mpi/lib/libopen-rte.so.40(pmix_server_keyval_client+0x4c5)[0x151ccc0118d5]
[ip-YYY-YYY-YYY-YYY:20938] [ 6] /shared/mpi/lib/libopen-rte.so.40(orte_rml_base_process_msg+0x40b)[0x151ccc04734b]
[ip-YYY-YYY-YYY-YYY:20938] [ 7] /shared/mpi/lib/libopen-pal.so.40(opal_libevent2022_event_base_loop+0xd99)[0x151ccbd35bf9]
[ip-YYY-YYY-YYY-YYY:20938] [ 8] /shared/mpi/lib/libopen-rte.so.40(orte_daemon+0x11e4)[0x151ccc0016a4]
[ip-YYY-YYY-YYY-YYY:20938] [ 9] /shared/mpi/bin/orted[0x40076b]
[ip-YYY-YYY-YYY-YYY:20938] [10] /lib64/libc.so.6(__libc_start_main+0xf5)[0x151cca9bd445]
[ip-YYY-YYY-YYY-YYY:20938] [11] /shared/mpi/bin/orted[0x4007a7]
[ip-YYY-YYY-YYY-YYY:20938] *** End of error message ***
[ip-ZZZ-ZZZ-ZZZ:21020] *** Process received signal ***
[ip-ZZZ-ZZZ-ZZZ:21020] Signal: Segmentation fault (11)
[ip-ZZZ-ZZZ-ZZZ:21020] Signal code: Address not mapped (1)
[ip-ZZZ-ZZZ-ZZZ:21020] Failing at address: (nil)
[ip-ZZZ-ZZZ-ZZZ:21020] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x14b556b4a5e0]
[ip-ZZZ-ZZZ-ZZZ:21020] [ 1] /lib64/libc.so.6(+0x8c421)[0x14b5567fa421]
[ip-ZZZ-ZZZ-ZZZ:21020] [ 2] /lib64/libc.so.6(__strdup+0xe)[0x14b5567fa12e]
[ip-ZZZ-ZZZ-ZZZ:21020] [ 3] /shared/mpi/lib/libopen-rte.so.40(orte_rml_base_parse_uris+0x15)[0x14b557e19e85]
[ip-ZZZ-ZZZ-ZZZ:21020] [ 4] /shared/mpi/lib/libopen-rte.so.40(+0x45902)[0x14b557de0902]
[ip-ZZZ-ZZZ-ZZZ:21020] [ 5] /shared/mpi/lib/libopen-rte.so.40(pmix_server_keyval_client+0x4c5)[0x14b557de48d5]
[ip-ZZZ-ZZZ-ZZZ:21020] [ 6] /shared/mpi/lib/libopen-rte.so.40(orte_rml_base_process_msg+0x40b)[0x14b557e1a34b]
[ip-ZZZ-ZZZ-ZZZ:21020] [ 7] /shared/mpi/lib/libopen-pal.so.40(opal_libevent2022_event_base_loop+0xd99)[0x14b557b08bf9]
[ip-ZZZ-ZZZ-ZZZ:21020] [ 8] /shared/mpi/bin/mpirun[0x40118a]
[ip-ZZZ-ZZZ-ZZZ:21020] [ 9] /lib64/libc.so.6(__libc_start_main+0xf5)[0x14b556790445]
[ip-ZZZ-ZZZ-ZZZ:21020] [10] /shared/mpi/bin/mpirun[0x400dfe]
[ip-ZZZ-ZZZ-ZZZ:21020] *** End of error message ***

server.out

0 publishing name: worker0
1 publishing name: worker1
0 waiting for connect!
1 waiting for connect!
[ip-XXX-XXX-XXX-XXX:12543] *** Process received signal ***
[ip-XXX-XXX-XXX-XXX:12543] Signal: Segmentation fault (11)
[ip-XXX-XXX-XXX-XXX:12543] Signal code: Address not mapped (1)
[ip-XXX-XXX-XXX-XXX:12543] Failing at address: (nil)
[ip-XXX-XXX-XXX-XXX:12543] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x1462f96ea5e0]
[ip-XXX-XXX-XXX-XXX:12543] [ 1] /lib64/libc.so.6(+0x8c421)[0x1462f939a421]
[ip-XXX-XXX-XXX-XXX:12543] [ 2] /lib64/libc.so.6(__strdup+0xe)[0x1462f939a12e]
[ip-XXX-XXX-XXX-XXX:12543] [ 3] /shared/mpi/lib/libopen-rte.so.40(orte_rml_base_parse_uris+0x15)[0x1462fa9b9e85]
[ip-XXX-XXX-XXX-XXX:12543] [ 4] /shared/mpi/lib/libopen-rte.so.40(+0x45902)[0x1462fa980902]
[ip-XXX-XXX-XXX-XXX:12543] [ 5] /shared/mpi/lib/libopen-rte.so.40(pmix_server_keyval_client+0x4c5)[0x1462fa9848d5]
[ip-XXX-XXX-XXX-XXX:12543] [ 6] /shared/mpi/lib/libopen-rte.so.40(orte_rml_base_process_msg+0x40b)[0x1462fa9ba34b]
[ip-XXX-XXX-XXX-XXX:12543] [ 7] /shared/mpi/lib/libopen-pal.so.40(opal_libevent2022_event_base_loop+0xd99)[0x1462fa6a8bf9]
[ip-XXX-XXX-XXX-XXX:12543] [ 8] /shared/mpi/bin/mpirun[0x40118a]
[ip-XXX-XXX-XXX-XXX:12543] [ 9] /lib64/libc.so.6(__libc_start_main+0xf5)[0x1462f9330445]
[ip-XXX-XXX-XXX-XXX:12543] [10] /shared/mpi/bin/mpirun[0x400dfe]
[ip-XXX-XXX-XXX-XXX:12543] *** End of error message ***

Source Files

client.cpp

#include <mpi.h>
#include <string>
#include <iostream>
#include <sstream>
#include <vector>

using namespace std;

namespace patch
{
    template < typename T > std::string to_string( const T& n )
    {
        std::ostringstream stm ;
        stm << n ;
        return stm.str() ;
    }
}

int main(int argc, char** argv) {
    // start MPI
    MPI_Init(NULL, NULL);

    // get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    cout << world_rank << " WS: " << patch::to_string(world_size) << endl;    

    vector<MPI_Comm> comms;
    for (int i=0; i<world_size; i++) {
        // set service name
        string name = "worker"+patch::to_string(i);
        // get the port
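        // (note: with --ompi-server "$URI" on the mpirun command line, this
        // lookup is resolved by the external ompi-server, which is what lets
        // a separate mpirun job find the published worker<i> names)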
        char port[MPI_MAX_PORT_NAME];
        MPI_Lookup_name(name.c_str(), MPI_INFO_NULL, port); 
        cout << world_rank << " connecting to name " << name << endl;
        // connect to the port
        MPI_Comm remote;
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &remote);

        comms.push_back(remote);
    }
    // each rank sends a message to its matching server (rank i -> worker<i>)
    std::string data("Hello World!");

    cout << world_rank << " sending to worker" << world_rank << endl;
    MPI_Send(&(data[0]), data.size(), MPI_CHAR, 0, 0, comms[world_rank]);
    // free/disconnect from all comms
    for (int i = 0; i < static_cast<int>(comms.size()); i++) {
        cout << world_rank << " freeing comm: " << i << endl;
        MPI_Comm_disconnect(&(comms[i]));
    }

    // exit
    MPI_Finalize();
    return 0;
}
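
For what it's worth, a variant of the lookup/connect step with explicit error checking is sketched below. This is an illustrative sketch, not part of the original reproducer; it assumes the same worker<i> naming as client.cpp and switches MPI_COMM_WORLD to MPI_ERRORS_RETURN so a failed lookup can be told apart from a crash during the connect itself:

#include <mpi.h>
#include <iostream>
#include <string>

// Sketch only: look up a published name and connect to it, checking return
// codes instead of relying on the default errors-are-fatal handler.
static MPI_Comm checked_connect(const std::string& name) {
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    char port[MPI_MAX_PORT_NAME];
    int rc = MPI_Lookup_name(name.c_str(), MPI_INFO_NULL, port);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        MPI_Error_string(rc, msg, &len);
        std::cerr << "lookup of " << name << " failed: " << msg << std::endl;
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    MPI_Comm remote = MPI_COMM_NULL;
    rc = MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &remote);
    if (rc != MPI_SUCCESS) {
        // in the failing runs the segfault happens inside mpirun/orted, so
        // this branch may never be reached, but it rules out a bad port string
        std::cerr << "connect to " << name << " failed" << std::endl;
        MPI_Abort(MPI_COMM_WORLD, rc);
    }
    return remote;
}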

server.cpp

#include <mpi.h>
#include <string>
#include <iostream>
#include <sstream>

using std::cout;
using std::endl;
using std::string;

namespace patch
{
    template < typename T > std::string to_string( const T& n )
    {
        std::ostringstream stm ;
        stm << n ;
        return stm.str() ;
    }
}

int main(int argc, char** argv) {
    // start MPI
    MPI_Init(NULL, NULL);

    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    // make port
    char port[MPI_MAX_PORT_NAME];
    MPI_Open_port(MPI_INFO_NULL, port);
    // publish name
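    // (note: the "ompi_global_scope" info key asks Open MPI to publish the
    // name with global scope, so a different mpirun job pointed at the same
    // ompi-server can find it with MPI_Lookup_name)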
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "ompi_global_scope", "true");
    string name = "worker"+patch::to_string(world_rank);
    cout << world_rank << " publishing name: " << name << endl;
    MPI_Publish_name(name.c_str(), info, port);
    cout << world_rank << " waiting for connect!" << endl;
    // receive connection
    MPI_Comm comm;
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &comm);
    cout << world_rank << " connect established" << endl;

    // receive message
    string data;
    data.resize(100);
    MPI_Status stat;
    cout << world_rank << " waiting for remote: " << 0  << endl;
    MPI_Recv(&(data[0]), 12, MPI_CHAR, MPI_ANY_SOURCE, 0, comm, &stat);
    // output result
    cout << world_rank << " " << data << " from " << patch::to_string(stat.MPI_SOURCE) <<  endl;

    // unpublish name
    MPI_Unpublish_name(name.c_str(), info, port);
    // close port
    MPI_Close_port(port);

    cout << world_rank << " disconnecting from comm..." << endl;
    // close comm
    MPI_Comm_disconnect(&comm);  

    cout << world_rank << " has finished!" << endl;
    // exit
    MPI_Finalize();
    return 0;
}
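
To narrow things down, a single-job sketch that exercises MPI_Comm_connect/MPI_Comm_accept without ompi-server, MPI_Publish_name, or MPI_Lookup_name is shown below. This is illustrative only and not part of the original report; it assumes at least two ranks and passes the port string over MPI_COMM_WORLD instead of publishing it:

#include <mpi.h>
#include <iostream>

// Sketch only: rank 0 accepts and rank 1 connects inside a single mpirun
// job, so the ompi-server/publish/lookup path is taken out of the picture.
// Requires at least 2 ranks (e.g. spread across two hosts to match the
// failing multi-machine case).
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char port[MPI_MAX_PORT_NAME] = {0};
    MPI_Comm inter = MPI_COMM_NULL;

    if (rank == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);
        // hand the port string directly to rank 1 instead of publishing it
        MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        // blocks until rank 1's MPI_Comm_connect arrives
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);

        char buf[16] = {0};
        MPI_Recv(buf, 12, MPI_CHAR, 0, 0, inter, MPI_STATUS_IGNORE);
        std::cout << "rank 0 received: " << buf << std::endl;
        MPI_Close_port(port);
    } else if (rank == 1) {
        MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);

        char msg[] = "Hello World!";
        MPI_Send(msg, 12, MPI_CHAR, 0, 0, inter);
    }

    if (inter != MPI_COMM_NULL) {
        MPI_Comm_disconnect(&inter);
    }
    MPI_Finalize();
    return 0;
}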

Client.job

#!/bin/sh
#$ -cwd
#$ -N Client
#$ -pe mpi 2
#$ -o /shared/tmpMPI/client.out
#$ -j y

sudo yum remove openmpi openmpi-devel -y # remove default openmpi on machine
export LD_LIBRARY_PATH=/shared/mpi/lib:/opt/amazon/efa/lib64
/shared/mpi/bin/mpirun --ompi-server "$URI" /shared/tmpMPI/client

Server.job

#!/bin/sh
#$ -cwd
#$ -N Server
#$ -pe mpi 2
#$ -o /shared/tmpMPI/server.out
#$ -j y

sudo yum remove openmpi openmpi-devel -y  # remove default OpenMPI 3.1.4 on instance
export LD_LIBRARY_PATH=/shared/mpi/lib:/opt/amazon/efa/lib64
export OMPI_MCA_pmix_server_max_wait=-1
/shared/mpi/bin/mpirun --ompi-server "$URI" /shared/tmpMPI/server
wckzhang commented 1 year ago

Sorry for the very late response. If this is still applicable, I have a few questions about the issue:

  1. It looks like you're building with EFA support; are you aware that t2.micro does not support EFA networking?
  2. I also notice from "v1.10.7: This works as expected (I cannot use this version as it will not support EFA networking)" that you require EFA networking. Do you see this issue on EFA instance types, or just on t2.micro over TCP?
  3. The 4.1.x release series has picked up many bug fixes since this issue was created; does a newer release resolve it?