upperwal / EntangledMPI

Fault Tolerance framework for High Performance Computing [Supports ULFM, replication and checkpointing]
MIT License

Create a global job communicator #6

Closed upperwal closed 6 years ago

upperwal commented 6 years ago

The global job communicator spans all nodes in a single job and can be used for communication among the nodes within that job.

Problem: the MPI_COMM_WORLD.nodeID -> global_job_comm.nodeID mapping is not available.

Possible solution: use MPI_Comm_set_attr on MPI_COMM_WORLD to store global_job_comm.nodeID, and vice versa.

upperwal commented 6 years ago

A more elegant solution is to use the MPI_Group_translate_ranks function.

upperwal commented 6 years ago

Added world_job_comm to the Node struct:

typedef struct Nodes {
    int job_id;
    int rank;
    int age;
    int jobs_count;
    enum NodeTransitState node_transit_state;
    enum NodeCheckpointMaster node_checkpoint_master;
    MPI_Comm rep_mpi_comm_world;    // Duplicate of MPI_COMM_WORLD
    MPI_Comm world_job_comm;        // Communicator to all nodes in a job.
    MPI_Comm active_comm;           // Communicator of nodes, one from each job. So these can be called active nodes.
} Node;