mpi-forum / mpi-forum-historic

Migration of old MPI Forum Trac Tickets to GitHub. New issues belong on mpi-forum/mpi-issues.
http://www.mpi-forum.org

MPI_Comm_create_endpoints Proposal #380

Open mpiforumbot opened 8 years ago

mpiforumbot commented 8 years ago

Originally by jdinan on 2013-07-12 15:20:28 -0500


Overview

This proposal introduces a new communicator creation function that can be used to create additional ranks, or endpoints, at an existing MPI process. These new endpoints behave the same as processes and can be associated with threads, allowing threads to fully participate in MPI operations. In contrast to this approach, ticket #288 proposed a static interface, where endpoints were generated when the MPI execution was launched.

Proposed New Function

See attached PDF for updated proposal.
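
For quick reference, the sketch below shows the general shape of the proposed C binding as it is used in later comments on this ticket; this is an inference, and the authoritative binding and semantics are in the attached PDF.

int MPI_Comm_create_endpoints(MPI_Comm parent_comm, int my_num_ep,
                              MPI_Info info, MPI_Comm new_comm_handles[]);
/* Collective over parent_comm: each caller requests my_num_ep endpoint ranks
   at its own process and receives one communicator handle per endpoint; all
   returned handles refer to the same endpoints communicator. */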

mpiforumbot commented 8 years ago

Originally by jdinan on 2013-07-12 15:43:24 -0500


Attachment added: Endpoints Proposal 3-13-13.pptx (229.1 KiB) Endpoints presentation from the March 2013 meeting

mpiforumbot commented 8 years ago

Originally by jdinan on 2013-07-15 14:16:13 -0500


Updates from 7/15 WG meeting. Cleaned up typos and added advice to users.

mpiforumbot commented 8 years ago

Originally by jdinan on 2013-07-15 15:22:28 -0500


Info hint moved to ticket #381.

Added new error class for endpoints-related errors.

mpiforumbot commented 8 years ago

Originally by jdinan on 2013-07-29 16:01:49 -0500


Updates from 7/29 WG meeting.

mpiforumbot commented 8 years ago

Originally by jdinan on 2013-07-29 17:33:05 -0500


We discussed several potential mechanisms that would allow implementations to limit the number of endpoints:

  1. Include only a my_num_ep argument; implementations generate an error when the requested number of endpoints cannot be created:
    a. Provide an attribute that indicates the maximum available endpoints and warn users about the race between querying and requesting resources.
    b. Remove the attribute and rely only on the error mechanism (or on returning MPI_COMM_NULL?).
  2. Include both requested and provided arguments in MPI_COMM_CREATE_ENDPOINTS.
  3. Add MPI_EP_Alloc(requested, provided) to simultaneously query and reserve endpoints. MPI_EP_Release(num) could be used to release some or all endpoints if the provided number is not workable for the user (see the sketch after this list).
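
To make option 3 concrete, here is a minimal sketch using the MPI_EP_Alloc/MPI_EP_Release names from the list above; these routines and their argument conventions are purely illustrative and were never adopted.

int requested = 8, provided;

MPI_EP_Alloc(requested, &provided);   /* query and reserve endpoints in one call */
if (provided < 4) {                   /* this application needs at least 4 */
    MPI_EP_Release(provided);         /* not enough: release everything reserved */
    /* fall back to a single-endpoint code path */
} else if (provided > 4) {
    MPI_EP_Release(provided - 4);     /* keep only the 4 endpoints that will be used */
}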
mpiforumbot commented 8 years ago

Originally by jdinan on 2013-08-19 15:07:00 -0500


Integrated feedback from Rajeev.

mpiforumbot commented 8 years ago

Originally by jdinan on 2013-09-14 12:20:50 -0500


Marked attribute as "to be removed." The WG discussed this interface and identified a race between checking the attribute value and requesting endpoints. The preferred mechanism is to rely on MPI errors when endpoints communicator creation fails.
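
A minimal sketch of the preferred mechanism from the application side, assuming the proposed MPI_Comm_create_endpoints binding and an error handler that returns instead of aborting:

MPI_Comm new_comm_handles[8];
int rc;

/* With MPI_ERRORS_RETURN installed, a request that cannot be satisfied is
   reported through the return code rather than aborting the application. */
MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
rc = MPI_Comm_create_endpoints(MPI_COMM_WORLD, 8, MPI_INFO_NULL, new_comm_handles);
if (rc != MPI_SUCCESS) {
    /* fewer than 8 endpoints are available here; fall back to an ordinary
       threaded code path that uses MPI_COMM_WORLD directly */
}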

mpiforumbot commented 8 years ago

Originally by jdinan on 2013-10-03 08:29:34 -0500


Attachment added: Endpoints - EuroMPI 2013.pptx (3709.5 KiB) Slide from EuroMPI '13 presentation

mpiforumbot commented 8 years ago

Originally by jdinan on 2013-10-03 08:29:49 -0500


Attachment added: endpoints.pdf (176.7 KiB) EuroMPI '13 paper

mpiforumbot commented 8 years ago

Originally by jdinan on 2013-12-12 09:34:44 -0600


Attachment added: EP Plenary -- 12-11-2013.pptx (744.9 KiB) Endpoints plenary presentation - 12/11/2013

mpiforumbot commented 8 years ago

Originally by jdinan on 2014-02-10 12:28:05 -0600


Attachment added: mpi-report.pdf (2699.9 KiB) Formal proposal for ticket #380 (SVN located at trunk/working-groups/mpi31/ticket-380)

mpiforumbot commented 8 years ago

Originally by balaji on 2014-03-04 18:41:18 -0600


The Forum requested that we redo this ticket with the following changes:

  1. Add MPI_COMM_COMPARE changes.
  2. Add a TOPO_TEST equivalent to check whether the communicator is an endpoints communicator.
  3. Add a way to figure out how many local endpoints a given communicator has (an illustrative sketch follows this list).
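
Purely to illustrate items 2 and 3, a hedged sketch with made-up MPIX_ routine names; neither routine appears in the proposal.

int is_endpoints_comm, num_local_endpoints;

/* hypothetical queries, named here only for illustration */
MPIX_Comm_test_endpoints(comm, &is_endpoints_comm);        /* item 2: analogous to MPI_TOPO_TEST */
MPIX_Comm_num_local_endpoints(comm, &num_local_endpoints); /* item 3 */
if (is_endpoints_comm)
    printf("endpoints communicator with %d local endpoints\n", num_local_endpoints);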
mpiforumbot commented 8 years ago

Originally by balaji on 2014-03-04 22:36:37 -0600


Attachment added: mpi-report.2.pdf (4233.9 KiB)

mpiforumbot commented 8 years ago

Originally by balaji on 2014-03-04 22:45:55 -0600


Attachment added: mpi-report.3.pdf (4235.7 KiB)

mpiforumbot commented 8 years ago

Originally by balaji on 2014-03-04 22:46:37 -0600


Uploaded new pdf with fixes to Fortran voodoo, as suggested by JeffS.

mpiforumbot commented 8 years ago

Originally by jsquyres on 2014-03-05 08:47:57 -0600


New PDF looks good.

mpiforumbot commented 8 years ago

Originally by jdinan on 2014-03-05 11:27:41 -0600


Attachment added: mpi-report.4.pdf (2700.3 KiB) Updated to include missing change markers for the comm_compare advice to users.

mpiforumbot commented 8 years ago

Originally by dholmes on 2014-05-19 13:42:45 -0500


We need a way to resolve ambiguities introduced by having multiple end-points in situations where the MPI library must choose a single end-point from several that could "match". Two examples follow.

const int my_num_ep = 2; // NB: could be any non-negative integer
MPI_COMM parent, children[my_num_ep];
MPI_GROUP parentGroup, childGroup;
int ranks1[1], ranks2[1];
parent = <your-favourite-communicator>; // NB: could be an end-points communicator handle!

MPI_COMM_CREATE_ENDPOINTS(parent, my_num_ep, MPI_INFO_NULL, &children);
MPI_COMM_GROUP(parent, &parentGroup);
MPI_COMM_GROUP(children[0], &childGroup);

ranks1[0] = 0;
MPI_GROUP_TRANSLATE_RANKS(parentGroup, 1, ranks1, childGroup, &ranks2);
// ranks2[0] could take any value between 0 and (my_num_ep-1)
// proposal 1: ranks2[0] should be set to MPI_PROC_NULL because there is no obvious correspondence
// proposal 2: ranks2[0] should be set to MPI_UNDEFINED because there are multiple correct answers
// proposal 3: ranks2[0] should be set to MPI_AMBIGUOUS because there are multiple correct answers
// proposal 4: ranks2[0] should be set to the unique rank that retained the identity of the parent

const int my_num_ep = 2; // NB: could be any non-negative integer
MPI_COMM parent, children[my_num_ep];
MPI_GROUP parentGroup, childGroup, unionGroup;
int ranks1[1], ranks2[1];
parent = <your-favourite-communicator>; // NB: could be an end-points communicator handle!

MPI_COMM_CREATE_ENDPOINTS(parent, my_num_ep, MPI_INFO_NULL, &children);
MPI_COMM_GROUP(parent, &parentGroup);
MPI_COMM_GROUP(children[0], &childGroup);

MPI_GROUP_UNION(parentGroup, childGroup, &unionGroup);
// unionGroup could contain 1 rank, my_num_ep ranks, or (my_num_ep+1) ranks
// is an end-point "the same" as its parent?
// the identity of a group member is not defined clearly enough to answer this sort of question
// proposal 1: end-points all retain the identity of their parent despite introducing ambiguity
// proposal 2: end-points all have unique identities, which are distinct even from their parent's
// proposal 3: one end-point retains the identity of its parent, the rest get unique identities

The simplest resolution to this seems to be to designate one of the new end-point communicator handles returned by each call to MPI_COMM_CREATE_ENDPOINTS as special, in that it retains the identity of the parent (possibly end-point) communicator handle.

Here is an initial suggestion for text to add to the MPI Standard as part of the end-points proposal. Between lines 8 and 9 on page 245:

The group associated with new_comm is a superset of the group associated with parent_comm. The communicator handle with an index of 0 in the new_comm array of handles represents the same group member as the parent_comm communicator handle. All communicator handles with an index in new_comm greater than 0 represent new group members.

Rationale: This defines unambiguous responses for operations that compare the constituents of groups, such as MPI_GROUP_TRANSLATE_RANKS, MPI_GROUP_UNION, and MPI_INTERCOMM_MERGE.
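
To make the consequence concrete, here is a small sketch that continues the first example above (reusing parent, children, parentGroup, and childGroup from there); it only illustrates the proposed text and is not part of it.

int parent_rank, translated, ep0_rank;

MPI_COMM_RANK(parent, &parent_rank);
MPI_COMM_RANK(children[0], &ep0_rank);

// under the proposed text, translating this process's parent rank into the
// endpoints group must yield the rank held by children[0]
MPI_GROUP_TRANSLATE_RANKS(parentGroup, 1, &parent_rank, childGroup, &translated);
assert(translated == ep0_rank);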

mpiforumbot commented 8 years ago

Originally by jdinan on 2014-05-19 20:55:40 -0500


Attachment added: mpi-report.5.pdf (2699.8 KiB) Ticket #380 proposal: Updated communicator comparison text

mpiforumbot commented 8 years ago

Originally by jdinan on 2014-08-11 11:12:28 -0500


Attachment added: mpi-report.6.pdf (2699.6 KiB) Ticket #380 proposal

mpiforumbot commented 8 years ago

Originally by jdinan on 2014-08-11 11:45:46 -0500


Attachment added: EP Plenary -- 05-2014.pptx (497.2 KiB) Endpoints plenary slides from the June 2014 meeting.

mpiforumbot commented 8 years ago

Originally by jdinan on 2014-08-11 11:46:32 -0500


Attachment added: EP Plenary -- 05-2014.pdf (541.0 KiB) Endpoints plenary slides from the June 2014 meeting. (PDF)

mpiforumbot commented 8 years ago

Originally by jdinan on 2014-08-11 12:20:29 -0500


Attachment added: mpi-report.7.pdf (2699.6 KiB) Updated proposal. Sentences were reordered to merge inter/intracommunicator text into the same paragraph and improve readability.

mpiforumbot commented 8 years ago

Originally by jdinan on 2014-09-02 17:05:13 -0500


Attachment added: mpi-report.8.pdf (2699.6 KiB) Updated proposal for vote at the September 2014 meeting

mpiforumbot commented 8 years ago

Originally by jdinan on 2014-11-17 11:30:17 -0600


Attachment added: mpi-report.9.pdf (2699.6 KiB) Updated with feedback from September '14 meeting: s/return error/raise an exception/

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2014-12-10 08:43:56 -0600


I would not put MPI_COMM_CREATE_ENDPOINTS into the middle of Section 6.4. It would be better to put it at the end of Section 6.4, but it would be best to have a new Section 8.9, "Additional Endpoints", i.e., after the MPI "Startup" Sections 8.7 and 8.8.

With such a new section, it is usual to write introductory text.

Here is my proposal:

With the startup methods such as mpiexec (see Section 8.8) or MPI_COMM_SPAWN, both together with MPI_INIT, each execution stream (abbreviated as OS process) is also one MPI process. A group or communicator is represented with a handle within each OS process, which represents information about a group of processes and a communication context, and additionally, within each OS process, the information about which rank in the group of processes is the associated own rank.

MPI_COMM_CREATE_ENDPOINTS can create within each OS process additional MPI processes. MPI_COMM_CREATE_ENDPOINTS does not start any additional application OS processes or threads. The calling MPI process together with the newly started MPI processes are named endpoints, and abbreviated as ranks. The created set of communicator handles represent the same communicator, which consists of the whole set of endpoints defined by all processes within the group of a parent communicator, but each communicator handle represents another associated own rank.

These communicator handles can be used, for example, by several operating system threads to identify each thread with its own rank.


Additionally, I would add an example with MPI+OpenMP

Example 8.16 Using MPI_COMM_CREATE_ENDPOINTS together with OpenMP

int provided, my_parent_rank, my_num_ep;

MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);
if (provided < MPI_THREAD_MULTIPLE) MPI_Abort(MPI_COMM_WORLD, 1);

MPI_Comm_rank(MPI_COMM_WORLD, &my_parent_rank);

my_num_ep = ... /* for the output below, "my_parent_rank+2" is used */

MPI_Comm new_comm_handles[my_num_ep];   /* C99 VLA; could also be allocated with malloc */
MPI_Comm_create_endpoints(MPI_COMM_WORLD, my_num_ep, MPI_INFO_NULL, new_comm_handles);

#pragma omp parallel num_threads(my_num_ep)
{
   int thread_rank = omp_get_thread_num();
   int new_comm_index = thread_rank;
   int my_rank;
   MPI_Comm_rank(new_comm_handles[new_comm_index], &my_rank);
   printf("my_parent_rank=%d my_num_ep=%d new_comm_index=%d my_rank=%d (thread_rank=%d)\n",
           my_parent_rank,   my_num_ep,   new_comm_index,   my_rank,    thread_rank);
}

If started on 3 processes with my_num_ep values 2, 3, and 4, and sorted by my_rank, the following output would be expected:

my_parent_rank=0 my_num_ep=2 new_comm_index=0 my_rank=0 (thread_rank=0)
my_parent_rank=0 my_num_ep=2 new_comm_index=1 my_rank=1 (thread_rank=1)
my_parent_rank=1 my_num_ep=3 new_comm_index=0 my_rank=2 (thread_rank=0)
my_parent_rank=1 my_num_ep=3 new_comm_index=1 my_rank=3 (thread_rank=1)
my_parent_rank=1 my_num_ep=3 new_comm_index=2 my_rank=4 (thread_rank=2)
my_parent_rank=2 my_num_ep=4 new_comm_index=0 my_rank=5 (thread_rank=0)
my_parent_rank=2 my_num_ep=4 new_comm_index=1 my_rank=6 (thread_rank=1)
my_parent_rank=2 my_num_ep=4 new_comm_index=2 my_rank=7 (thread_rank=2)
my_parent_rank=2 my_num_ep=4 new_comm_index=3 my_rank=8 (thread_rank=3)

The relation between my_parent_rank, my_num_ep, new_comm_index, and my_rank is given by the definition of MPI_COMM_CREATE_ENDPOINTS. The relation between new_comm_index and thread_rank is defined by the application code in this example.
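
As a reading aid only (assuming the endpoint ranks are ordered first by parent rank and then by new_comm_index, as the output above suggests), the resulting rank can be computed as follows; num_ep_per_parent simply holds the values used in this example.

int num_ep_per_parent[3] = {2, 3, 4};       /* endpoints created by parents 0, 1, 2 */
int expected_rank = 0;
for (int p = 0; p < my_parent_rank; p++)
    expected_rank += num_ep_per_parent[p];  /* endpoints owned by lower parent ranks */
expected_rank += new_comm_index;            /* e.g. parent 2, index 1 -> (2+3)+1 = 6 */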

mpiforumbot commented 8 years ago

Originally by jdinan on 2014-12-15 11:08:31 -0600


Attachment added: EP Feedback - Dec 2014.pdf (2997.8 KiB) Formal proposal marked with feedback gathered during the Dec. 2014 meeting

mpiforumbot commented 8 years ago

Originally by jhammond on 2015-01-15 17:55:15 -0600


Replying to RolfRabenseifner:

With the startup methods such as mpiexec (see Section 8.8) or MPI_COMM_SPAWN, both together with MPI_INIT, each execution stream (abbreviated as OS process) is also one MPI process. A group or communicator is represented with a handle within each OS process, which represents information about a group of processes and a communication context, and additionally, within each OS process, the information about which rank in the group of processes is the associated own rank.

This is overly specific about a particular manner of implementation and I don't think it is appropriate to include in the standard.

MPI_COMM_CREATE_ENDPOINTS can create within each OS process additional MPI processes.

No. It creates MPI endpoints within an MPI process.

MPI_COMM_CREATE_ENDPOINTS does not start any additional application OS processes or threads.

This should be obvious and need not be said. In the pedantic limit, perhaps an "advice to users" can allude to it.

The calling MPI process together with the newly started MPI processes are named endpoints, and abbreviated as ranks.

Is the notion that processes are abbreviated as ranks present anywhere else in the standard? It is not necessary or appropriate to introduce it here.

The created set of communicator handles represent the same communicator, which consists of the whole set of endpoints defined by all processes within the group of a parent communicator, but each communicator handle represents another associated own rank.

I have no opinion on this text.

These communicator handles can be used, for example, by several operating system threads to identify each thread with its own rank.

This sort of implementation prescription is inappropriate except, perhaps, in any "advice to..."

mpiforumbot commented 8 years ago

Originally by jdinan on 2015-01-26 09:49:13 -0600


I don't share Jeff's negative opinion of these changes. I think there are aspects of MPI_COMM_CREATE_ENDPOINTS that are obvious to us today but may not be clear to someone reading the specification ten years from now. There is certainly text like what Rolf suggested that was added in MPI-1 and has been valuable to folks like me who became involved much later.

My only suggestion is that we could pursue this as a separate change/vote in the Forum, so that we can move ahead with the main body of the endpoints proposal.

Replying to jhammond:

Replying to RolfRabenseifner:

With the startup methods such as mpiexec (see Section 8.8) or MPI_COMM_SPAWN, both together with MPI_INIT, each execution stream (abbreviated as OS process) is also one MPI process. A group or communicator is represented with a handle within each OS process, which represents information about a group of processes and a communication context, and additionally, within each OS process, the information about which rank in the group of processes is the associated own rank.

This is overly specific about a particular manner of implementation and I don't think it is appropriate to include in the standard.

mpiforumbot commented 8 years ago

Originally by jdinan on 2015-02-27 17:19:50 -0600


Attachment added: mpi-report.10.pdf (2703.0 KiB) Draft for feedback at the March 2015 meeting

mpiforumbot commented 8 years ago

Originally by dholmes on 2015-06-29 08:20:20 -0500


Attachment added: Outstanding issues with endpoints - May 2015.pdf (248.4 KiB)

jdinan commented 7 years ago

This ticket was migrated to: mpi-forum/mpi-issues#56