mpi-forum / mpi-forum-historic

Migration of old MPI Forum Trac Tickets to GitHub. New issues belong on mpi-forum/mpi-issues.
http://www.mpi-forum.org

Fix Scalability Issues in Graph Topology Interface #33

Closed mpiforumbot closed 8 years ago

mpiforumbot commented 8 years ago

Originally by traff on 2008-10-10 08:27:53 -0500


Fix Scalability Issues in Graph Topology Interface

Author: Torsten Hoefler, Jesper Larsson Traeff, with comments from Rolf Rabenseifner, Hubert Ritzdorf, and Rajeev Thakur

Status

Sep 8 2008 Straw votes

For a scalable description

|| Yes || Maybe || No for 2.2 || Abstain ||
|| 7 || 20 || 0 || 3 ||

For adding info to weight of edges

|| Yes || Maybe || No for 2.2 || Abstain ||
|| 4 || 22 || 1 || 6 ||

October 2008 Straw Vote

|| Yes || Maybe || No for 2.2 || Abstain ||
|| 23 || 0 || 0 || 5 ||

April 09 Straw Votes

Fully flexible version vs. only adjacent version.

|| only flexible || only adjacent || both ||
|| 9 || 2 || 8 ||

Using MPI_Info arguments for any assertions (that change the semantics) is not in the spirit of MPI; thus, we cannot use Info to assert that each process specifies the whole graph, nor to assert whether all processes specified reorder or not.

Reorder collective or not?

|| collective || not collective ||
|| 5 || 6 ||

The vote is close, but "not collective" is also more consistent with the current standard. We can easily change this in MPI-3 (while keeping backward compatibility).

Description

The graph interface in MPI-1 has some serious (and unnecessary) scalability issues that prohibit efficient usage on large-scale machines.

The new scalable interface consists of two constructor functions (an adjacent and a fully general variant), complemented by two inquiry functions. The concrete proposal introduces two new constants and entails a few other changes related to the topology functionality.

See additional proposal/discussion in Ticket #147.

History

This interface was introduced in MPI-1 and was modeled on an older parallel library (Alexander Supalov can provide details); it has not been changed since. Currently, each process has to know the whole topology during graph creation, and every process can query the neighbors of every other process in the communicator with a local operation. This makes it necessary to store the whole topology locally at every process, which might be a prohibitive waste of memory resources.

The proposal is to introduce new functionality for specifying topologies by distributed graphs, and deprecate the current MPI 2.1 graph topology functionality.

Proposed Solution

Page 15, line 5, add:

MPI_UNWEIGHTED

Page 195, line 27, add:

MPI::Distgraphcomm MPI::Distgraphcomm::Dup() const

Page 195, line 35, add:

MPI::Distgraphcomm MPI::Distgraphcomm::Clone() const

Page 242:15-22 - remove sentence

"Edges in the communication graph are not weighted, so that processes are either simply connected or not connected at all."

and (outdated) rationale.

Page 243, line 3: Replace

"The functions MPI_GRAPH_CREATE and MPI_CART_CREATE are used"

by

"The functions MPI_GRAPH_CREATE, MPI_DIST_GRAPH_CREATE_ADJACENT, MPI_DIST_GRAPH_CREATE and MPI_CART_CREATE are used"

Page 243, line 8, replace:

"All input arguments must have identical values on all processes of the group of comm_old. A new communicator ..."

by:

"For MPI_GRAPH_CREATE and MPI_CART_CREATE, all input arguments must have identical values on all processes of the group of comm_old. For MPI_DIST_GRAPH_CREATE_ADJACENT and MPI_DIST_GRAPH_CREATE the input communication graph is distributed across the calling processes. Therefore the processes provide different values for the arguments specifying the graph. However, all processes must give the same value for reorder and the info argument. In all cases, a new communicator ..."

Page 243, line 33 (before last sentence), add:

For distributed graphs, the functions MPI_DIST_GRAPH_NEIGHBORS_COUNT and MPI_DIST_GRAPH_NEIGHBORS can be used to extract the neighbors of the calling node.

Page 247, after line 48, add new section:

7.5.4 Distributed (Graph) Constructor

The general graph constructor assumes that each process passes the full (global) communication graph to the call. This limits the scalability of this constructor. With the distributed graph interface, the communication graph is specified in a fully distributed fashion. Each process specifies only the part of the communication graph of which it is aware. Typically, this could be the set of processes from which the process will eventually receive or get data, or the set of processes to which the process will send or put data, or some combination of such edges. Two different interfaces can be used to create a distributed graph topology. MPI_DIST_GRAPH_CREATE_ADJACENT creates a distributed graph communicator with each process specifying all of its incoming and outgoing (adjacent) edges in the logical communication graph and thus requires minimal communication during creation. MPI_DIST_GRAPH_CREATE provides full flexibility, and processes can indicate that communication will occur between other pairs of processes.

To provide better possibilities for optimization by the MPI library, the distributed graph constructors permit weighted communication edges and take an info argument that can further influence process reordering or other optimizations performed by the MPI library. For example, hints can be provided on how edge weights are to be interpreted, the quality of the reordering, and/or the time permitted for the MPI library to process the graph.

#!c
MPI_DIST_GRAPH_CREATE_ADJACENT(comm_old, indegree, sources, sourceweights, 
outdegree, destinations, destweights, info, reorder, comm_dist_graph)
 IN      comm_old        input communicator (handle)
 IN      indegree        size of sources and sourceweights arrays 
                         (non-negative integer)
 IN      sources         ranks of processes for which the calling 
                         process is a destination 
                         (array of non-negative integers)
 IN      sourceweights   weights of the edges into the calling 
                         process (array of non-negative integers)
 IN      outdegree       size of destinations and destweights arrays 
                         (non-negative integer)
 IN      destinations    ranks of processes for which the calling 
                         process is a source
                         (array of non-negative integers)
 IN      destweights     weights of the edges out of the calling 
                         process (array of non-negative integers)
 IN      info            hints on optimization and interpretation of 
                         weights (handle)
 IN      reorder         the ranks may be reordered (true) or 
                         not (false) (logical)
 OUT     comm_dist_graph communicator with distributed graph 
                        topology (handle)

int MPI_Dist_graph_create_adjacent(MPI_Comm comm_old, 
          int indegree, int sources[], int sourceweights[],
          int outdegree, int destinations[], int destweights[],
          MPI_Info info, int reorder, MPI_Comm *comm_dist_graph)

MPI_DIST_GRAPH_CREATE_ADJACENT(COMM_OLD, INDEGREE, SOURCES, SOURCEWEIGHTS, 
OUTDEGREE, DESTINATIONS, DESTWEIGHTS, INFO, REORDER, COMM_DIST_GRAPH, 
IERROR)
    INTEGER COMM_OLD, INDEGREE, SOURCES(*), SOURCEWEIGHTS(*), 
    OUTDEGREE, DESTINATIONS(*), DESTWEIGHTS(*), INFO, 
    COMM_DIST_GRAPH, IERROR
    LOGICAL REORDER

MPI::Distgraphcomm MPI::Intracomm::Dist_graph_create_adjacent(int indegree, 
                   const int sources[], const int sourceweights[],
                   int outdegree, const int destinations[],
                   const int destweights[],
                   const MPI::Info& info, bool reorder) const

MPI::Distgraphcomm MPI::Intracomm::Dist_graph_create_adjacent(int indegree, 
                   const int sources[], int outdegree, 
                   const int destinations[], const MPI::Info& info, 
                   bool reorder) const

MPI_DIST_GRAPH_CREATE_ADJACENT returns a handle to a new communicator to which the distributed graph topology information is attached.
Each process passes all information about the edges to its neighbors in the virtual distributed graph topology. The calling processes must ensure that each edge of the graph is described in the source and in the destination process with the same weights. If there are multiple edges for a given (source,dest) pair, then the sequence of the weights of these edges does not matter. The complete communication topology is the combination of all edges shown in the sources arrays of all processes in comm_old, which must be identical to the combination of all edges shown in the destinations arrays. Source and destination ranks must be process ranks of comm_old. This allows a fully distributed specification of the communication graph. Isolated processes (i.e., processes with no outgoing or incoming edges, that is, processes that have specified indegree and outdegree as zero and that thus do not occur as source or destination rank in the graph specification) are allowed.

The call creates a new communicator comm_dist_graph of distributed graph topology type to which topology information has been attached. The number of processes in comm_dist_graph is identical to the number of processes in comm_old. The call to MPI_DIST_GRAPH_CREATE_ADJACENT is collective.

Weights are specified as non-negative integers and can be used to influence the process remapping strategy and other internal MPI optimizations. For instance, approximate count arguments of later communication calls along specific edges could be used as their edge weights. Multiplicity of edges can likewise indicate more intense communication between pairs of processes. However, the exact meaning of edge weights is not specified by the MPI standard and is left to the implementation. In C or Fortran, an application can supply the special value MPI_UNWEIGHTED for both weight arrays to indicate that all edges have the same (effectively no) weight. In C++, this constant does not exist and the weight arguments may be omitted from the argument list. It is erroneous to supply MPI_UNWEIGHTED, or in C++ omit the weight arrays, for some but not all processes' sourceweights and destweights arrays in comm_old. Note that MPI_UNWEIGHTED is not a special weight value; rather it is a special value for the total array argument. In C, one would expect it to be a NULL. In Fortran, MPI_UNWEIGHTED is an object like MPI_BOTTOM (not usable for initialization or assignment). See Section 2.5.4.
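
The following is a minimal, purely illustrative sketch (not part of the proposed standard text; variable names are assumptions): a bidirectional ring in which each process lists its left and right neighbor both as source and as destination and supplies MPI_UNWEIGHTED for both weight arrays.

#!c
/* Illustrative sketch: bidirectional ring with the adjacent constructor.
   Assumes at least three processes so that the two neighbors differ. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Comm ring;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int nbrs[2] = { (rank + size - 1) % size, (rank + 1) % size };

    /* symmetric graph: sources == destinations on every process;
       MPI_UNWEIGHTED indicates that all edges have the same weight */
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   2, nbrs, MPI_UNWEIGHTED,
                                   2, nbrs, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0, &ring);

    MPI_Comm_free(&ring);
    MPI_Finalize();
    return 0;
}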

The meaning of the info and reorder arguments is defined in the description of the following routine.

#!c
MPI_DIST_GRAPH_CREATE(comm_old, n, sources, degrees, 
destinations, weights, info, reorder, comm_dist_graph)
 IN  comm_old        input communicator (handle)
 IN  n               number of source nodes for which this process
                     specifies edges (non-negative integer)
 IN  sources         array containing the n source nodes for which
                     this process specifies edges 
                     (array of non-negative integers)
 IN  degrees         array specifying the number of destinations for
                     each source node in the source node array 
                     (array of non-negative integers)
 IN  destinations    destination nodes for the source nodes in the 
                     source node array (array of non-negative 
                     integers)
 IN  weights         weights for source to destination edges 
                     (array of non-negative integers)
 IN  info            hints on optimization and interpretation of 
                     weights (handle)
 IN  reorder         the process may be reordered (true) or 
                     not (false) (logical)
 OUT comm_dist_graph communicator with distributed graph topology 
                   added (handle)

int MPI_Dist_graph_create(MPI_Comm comm_old, int n, 
          int sources[], int degrees[], int destinations[],
          int weights[], MPI_Info info, int reorder, 
          MPI_Comm *comm_dist_graph)

MPI_DIST_GRAPH_CREATE(COMM_OLD, N, SOURCES, DEGREES, 
DESTINATIONS, WEIGHTS, INFO, REORDER, COMM_DIST_GRAPH, IERROR)
    INTEGER COMM_OLD, N, SOURCES(*), DEGREES(*), DESTINATIONS(*), 
    WEIGHTS(*), INFO, COMM_DIST_GRAPH, IERROR
    LOGICAL REORDER

MPI::Distgraphcomm MPI::Intracomm::Dist_graph_create(int n, 
                   const int sources[], const int degrees[],
                   const int destinations[], const int weights[], 
                   const MPI::Info& info, bool reorder) const

MPI::Distgraphcomm MPI::Intracomm::Dist_graph_create(int n, 
                   const int sources[], const int degrees[],
                   const int destinations[], const MPI::Info& info, 
                   bool reorder) const

MPI_DIST_GRAPH_CREATE returns a handle to a new communicator to which the distributed graph topology information is attached. Concretely, each process calls the constructor with a set of directed (source,destination) communication edges as described below. Every process passes an array of n source nodes in the sources array. For each source node, a non-negative number of destination nodes is specified in the degrees array. The destination nodes are stored in the corresponding consecutive segment of the destinations array. More precisely if the i-th node in sources is s, this specifies degrees[i] edges (s,d) with d of the j-th such edge stored in destinations[degrees[0]+...+degrees[i-1]+j]. The weight of this edge is stored in weights[degrees[0]+...+degrees[i-1]+j]. Both the sources and the destinations arrays may contain the same node more than once, and the order in which nodes are listed as destinations or sources is not significant. Similarly, different processes may specify edges with the same source and destination nodes. Source and destination nodes must be process ranks of comm_old. Different processes may specify different numbers of source and destination nodes, as well as different source to destination edges. This allows a fully distributed specification of the communication graph. Isolated processes (i.e., processes with no outgoing or incoming edges, that is, processes that do not occur as source or destination node in the graph specification) are allowed.

The call creates a new communicator comm_dist_graph of distributed graph topology type to which topology information has been attached. The number of processes in comm_dist_graph is identical to the number of processes in comm_old. The call to MPI_Dist_graph_create is collective.

If reorder = false, all processes will have the same rank in comm_dist_graph as in comm_old. If reorder = true then the MPI library is free to remap to other processes (of comm_old) in order to improve communication on the edges of the communication graph. The weight associated with each edge is a hint to the MPI library about the amount or intensity of communication on that edge, and may be used to compute a "best" reordering.

Weights are specified as non-negative integers and can be used to influence the process remapping strategy and other internal MPI optimizations. For instance, approximate count arguments of later communication calls along specific edges could be used as their edge weights. Multiplicity of edges can likewise indicate more intense communication between pairs of processes. However, the exact meaning of edge weights is not specified by the MPI standard and is left to the implementation. In C or Fortran, an application can supply the special value MPI_UNWEIGHTED for the weight array to indicate that all edges have the same (effectively no) weight. In C++, this constant does not exist and the weights argument may be omitted from the argument list. It is erroneous to supply MPI_UNWEIGHTED, or in C++ omit the weight arrays, for some but not all processes of comm_old. Note that MPI_UNWEIGHTED is not a special weight value; rather it is a special value for the total array argument. In C, one would expect it to be a NULL. In Fortran, MPI_UNWEIGHTED is an object like MPI_BOTTOM (not usable for initialization or assignment). See Section 2.5.4.

The meaning of the weights argument can be influenced by the info argument. Info arguments can be used to guide the mapping; possible options include minimizing the maximum number of edges between processes on different SMP nodes, or minimizing the sum of all such edges. An MPI implementation is not obliged to follow specific hints, and it is legal for an MPI implementation not to do any reordering. An MPI implementation may specify more info key-value pairs. All processes must specify the same set of key-value info pairs.
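
As a purely hypothetical illustration of such hints (the key name below is invented for this sketch and is not defined by this proposal; an implementation is free to ignore keys it does not recognize):

#!c
/* Hypothetical sketch: "edge_weights_are_msg_sizes" is an invented key,
   not part of this proposal. */
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "edge_weights_are_msg_sizes", "true");
/* ... pass info to MPI_Dist_graph_create or
       MPI_Dist_graph_create_adjacent ... */
MPI_Info_free(&info);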

Advice to implementors. MPI implementations must document any additionally supported key-value info pairs. MPI_INFO_NULL is always valid, and may indicate the default creation of the distributed graph topology to the MPI library.

An implementation does not explicitly need to construct the topology from its distributed parts. However, all processes can construct the full topology from the distributed specification and use this in a call to MPI_GRAPH_CREATE to create the topology. This may serve as a reference implementation of the functionality, and may be acceptable for small communicators. However, a scalable high-quality implementation would save the topology graph in a distributed way. (End of advice to implementors.)
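
To make this advice concrete, here is a hedged sketch of such a non-scalable fallback for the adjacent variant (the function name and structure are assumptions, not part of the proposal): every process contributes its outgoing edges, the full adjacency is gathered everywhere, and the existing MPI-2.1 constructor is called.

#!c
#include <stdlib.h>
#include <mpi.h>

/* Sketch: build the global index/edges arrays expected by MPI_Graph_create
   from the per-process adjacent specification (outdegree/destinations).
   A scalable, high-quality implementation would not do this. */
static int dist_graph_via_graph_create(MPI_Comm comm_old, int outdegree,
                                       int destinations[], int reorder,
                                       MPI_Comm *comm_graph)
{
    int size, i, err, nedges = 0;
    MPI_Comm_size(comm_old, &size);

    int *alldegrees = malloc(size * sizeof(int));
    int *index      = malloc(size * sizeof(int));  /* cumulative degrees */
    int *displs     = malloc(size * sizeof(int));  /* receive offsets    */

    MPI_Allgather(&outdegree, 1, MPI_INT, alldegrees, 1, MPI_INT, comm_old);
    for (i = 0; i < size; i++) {
        displs[i] = nedges;
        nedges   += alldegrees[i];
        index[i]  = nedges;
    }

    int *edges = malloc(nedges * sizeof(int));
    MPI_Allgatherv(destinations, outdegree, MPI_INT,
                   edges, alldegrees, displs, MPI_INT, comm_old);

    err = MPI_Graph_create(comm_old, size, index, edges, reorder, comm_graph);

    free(alldegrees); free(index); free(displs); free(edges);
    return err;
}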

Example 7.3
As for Example 7.2, assume there are four processes 0, 1, 2, 3 with the following adjacency matrix and unit edge weights:

|| process || neighbors ||
|| 0 || 1, 3 ||
|| 1 || 0 ||
|| 2 || 3 ||
|| 3 || 0, 2 ||

With MPI_DIST_GRAPH_CREATE, this graph could be constructed in many different ways. One way would be that each process specifies its outgoing edges. The arguments per process would be:

|| process || n || sources || degrees || destinations || weights ||
|| 0 || 1 || 0 || 2 || 1,3 || 1,1 ||
|| 1 || 1 || 1 || 1 || 0 || 1 ||
|| 2 || 1 || 2 || 1 || 3 || 1 ||
|| 3 || 1 || 3 || 2 || 0,2 || 1,1 ||

Another way would be to pass the whole graph on process 0, which could be done with the following arguments per process:

|| process || n || sources || degrees || destinations || weights ||
|| 0 || 4 || 0,1,2,3 || 2,1,1,2 || 1,3,0,3,0,2 || 1,1,1,1,1,1 ||
|| 1 || 0 || - || - || - || - ||
|| 2 || 0 || - || - || - || - ||
|| 3 || 0 || - || - || - || - ||

In both cases above, the application could supply MPI_UNWEIGHTED instead of explicitly providing identical weights.
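
For illustration only, the second ("whole graph on process 0") variant might be coded as in the following sketch; the surrounding MPI setup and the variable names are assumptions, not part of the proposed text:

#!c
/* Illustrative sketch of Example 7.3, second variant: only rank 0
   contributes edges, all other ranks pass n = 0.  reorder = 0 is an
   arbitrary choice here. */
int rank;
int sources[4]      = {0, 1, 2, 3};
int degrees[4]      = {2, 1, 1, 2};
int destinations[6] = {1, 3, 0, 3, 0, 2};
int weights[6]      = {1, 1, 1, 1, 1, 1};
MPI_Comm comm_dist_graph;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Dist_graph_create(MPI_COMM_WORLD, (rank == 0) ? 4 : 0, sources, degrees,
                      destinations, weights, MPI_INFO_NULL, 0,
                      &comm_dist_graph);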

MPI_DIST_GRAPH_CREATE_ADJACENT could be used to specify this graph using the following arguments:

|| process || indegree || sources || sourceweights || outdegree || destinations || destweights ||
|| 0 || 2 || 1,3 || 1,1 || 2 || 1,3 || 1,1 ||
|| 1 || 1 || 0 || 1 || 1 || 0 || 1 ||
|| 2 || 1 || 3 || 1 || 1 || 3 || 1 ||
|| 3 || 2 || 0,2 || 1,1 || 2 || 0,2 || 1,1 ||

Example 7.4
A two-dimensional PxQ torus where all processes communicate along the dimensions and along the diagonal edges. This cannot be modelled with Cartesian topologies, but can easily be captured with MPI_DIST_GRAPH_CREATE as shown in the following code. In this example, the communication along the dimensions is twice as heavy as the communication along the diagonals:

#!c
/*
Input:     dimensions P, Q
Condition: number of processes equal to P*Q; otherwise only 
           ranks smaller than P*Q participate
*/
int rank, x, y;
int sources[1], degrees[1];
int destinations[8], weights[8];
MPI_Comm comm_dist_graph;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

/* get x and y dimension */
y=rank/P; x=rank%P;

/* get my communication partners along x dimension */
destinations[0] = P*y+(x+1)%P; weights[0] = 2;
destinations[1] = P*y+(P+x-1)%P; weights[1] = 2;

/* get my communication partners along y dimension */
destinations[2] = P*((y+1)%Q)+x; weights[2] = 2;
destinations[3] = P*((Q+y-1)%Q)+x; weights[3] = 2;

/* get my communication partners along diagonals */
destinations[4] = P*((y+1)%Q)+(x+1)%P; weights[4] = 1;
destinations[5] = P*((Q+y-1)%Q)+(x+1)%P; weights[5] = 1;
destinations[6] = P*((y+1)%Q)+(P+x-1)%P; weights[6] = 1;
destinations[7] = P*((Q+y-1)%Q)+(P+x-1)%P; weights[7] = 1;

sources[0] = rank;
degrees[0] = 8;
MPI_Dist_graph_create(MPI_COMM_WORLD, 1, sources, degrees, destinations,
  weights, MPI_INFO_NULL, 1, &comm_dist_graph);

Page 248, line 21, add (i.e., after MPI_GRAPH):

MPI_DIST_GRAPH distributed graph topology

Page 252, line 6, add:

MPI_DIST_GRAPH_NEIGHBORS_COUNT and MPI_DIST_GRAPH_NEIGHBORS provide adjacency information for a distributed graph topology.

#!c
MPI_DIST_GRAPH_NEIGHBORS_COUNT(comm, indegree, outdegree, weighted)
 IN      comm          communicator with distributed graph topology (handle)
 OUT     indegree      number of edges into this process (non-negative 
                       integer)
 OUT     outdegree     number of edges out of this process (non-negative 
                       integer)
 OUT     weighted      false if MPI_UNWEIGHTED was supplied during creation, 
                       true otherwise (logical)

int MPI_Dist_graph_neighbors_count(MPI_Comm comm, int *indegree, 
    int *outdegree, int *weighted)

MPI_DIST_GRAPH_NEIGHBORS_COUNT(COMM, INDEGREE, OUTDEGREE, WEIGHTED, IERROR)
  INTEGER COMM, INDEGREE, OUTDEGREE, IERROR
  LOGICAL WEIGHTED

void MPI::Distgraphcomm::Get_dist_neighbors_count(int rank, int indegree[], 
    int outdegree[], bool& weighted) const

#!c
MPI_DIST_GRAPH_NEIGHBORS(comm, maxindegree, sources, sourceweights, 
                         maxoutdegree, destinations, destweights)
 IN      comm           communicator with distributed graph topology (handle)
 IN      maxindegree    size of sources and sourceweights arrays 
                        (non-negative integer)
 OUT     sources        processes for which the calling process is a 
                        destination (array of non-negative integers)
 OUT     sourceweights  weights of the edges into the calling process 
                        (array of non-negative integers)
 IN      maxoutdegree   size of destinations and destweights arrays 
                        (non-negative integer)
 OUT     destinations   processes for which the calling process is a source
                        (array of non-negative integers)
 OUT     destweights    weights of the edges out of the calling process 
                        (array of non-negative integers)

int MPI_Dist_graph_neighbors(MPI_Comm comm, int maxindegree, int sources[], 
                             int sourceweights[], int maxoutdegree, 
                             int destinations[], int destweights[])

MPI_DIST_GRAPH_NEIGHBORS(COMM, MAXINDEGREE, SOURCES, SOURCEWEIGHTS, 
                         MAXOUTDEGREE, DESTINATIONS, DESTWEIGHTS, IERROR)
  INTEGER COMM, MAXINDEGREE, SOURCES(*), SOURCEWEIGHTS(*), MAXOUTDEGREE, 
                         DESTINATIONS(*), DESTWEIGHTS(*), IERROR

void MPI::Distgraphcomm::Get_dist_neighbors(int maxindegree, int sources[],
                                        int sourceweights[], 
                                        int maxoutdegree, 
                                        int destinations[],
                                        int destweights[])

These calls are local. The number of edges into and out of the process returned by MPI_DIST_GRAPH_NEIGHBORS_COUNT are the total number of such edges given in the call to MPI_DIST_GRAPH_CREATE_ADJACENT or MPI_DIST_GRAPH_CREATE (potentially by processes other than the calling process in the case of MPI_DIST_GRAPH_CREATE). Multiply defined edges are all counted and returned by MPI_DIST_GRAPH_NEIGHBORS in some order. If MPI_UNWEIGHTED is supplied for sourceweights or destweights or both, or if MPI_UNWEIGHTED was supplied during the construction of the graph, then no weight information is returned in that array or those arrays. The only requirement on the order of values in sources and destinations is that two calls to the routine with the same input argument comm will return the same sequence of edges. If maxindegree or maxoutdegree is smaller than the numbers returned by MPI_DIST_GRAPH_NEIGHBORS_COUNT, then only the first part of the full list is returned. Note that the order of returned edges need not be identical to the order that was provided in the creation of comm in the case that MPI_DIST_GRAPH_CREATE_ADJACENT was used.
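
A short usage sketch (the variable and function names around the two MPI calls are assumptions, not part of the proposal): first size the arrays with the count call, then retrieve the adjacency lists.

#!c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Sketch: query the local adjacency lists of a communicator dg that
   carries a distributed graph topology. */
void print_local_adjacency(MPI_Comm dg)
{
    int indegree, outdegree, weighted;
    MPI_Dist_graph_neighbors_count(dg, &indegree, &outdegree, &weighted);

    int *srcs = malloc(indegree  * sizeof(int));
    int *srcw = malloc(indegree  * sizeof(int));
    int *dsts = malloc(outdegree * sizeof(int));
    int *dstw = malloc(outdegree * sizeof(int));

    /* weight arrays are only filled in if the graph was created with weights */
    MPI_Dist_graph_neighbors(dg, indegree, srcs, srcw,
                             outdegree, dsts, dstw);

    printf("indegree=%d outdegree=%d weighted=%d\n",
           indegree, outdegree, weighted);

    free(srcs); free(srcw); free(dsts); free(dstw);
}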

Advice to implementors. Since the query calls are defined to be local, each process needs to store the list of its neighbors with incoming and outgoing edges. Communication is required at the collective MPI_DIST_GRAPH_CREATE call in order to compute the neighbor lists for each process from the distributed graph specification. (End of advice to implementors.)

Page 450, line 13, add:

class Distgraphcomm : public Intracomm {...};

Page 456, line 1:

replace: "five" with "six"

Page 456, line 5:

replace "MPI::Cartcomm and MPI::Graphcomm are derived from MPI::Intracomm."

with

MPI::Cartcomm, MPI::Graphcomm, and MPI::Distgraphcomm are derived from MPI::Intracomm.

Page 457, line 11, add:

Distgraphcomm& Distgraphcomm::Clone() const

Page 457, line 34, replace:

// Cartcomm and Graphcomm are similarly defined

with:

// Cartcomm, Graphcomm, and Distgraphcomm are similarly defined

Page 462, line 36, replace: "MPI_STATUS_IGNORE, MPI_STATUSES_IGNORE, MPI_ERRCODES_IGNORE,"

with

"MPI_STATUS_IGNORE, MPI_STATUSES_IGNORE, MPI_ERRCODES_IGNORE, MPI_UNWEIGHTED,"

Page 465, line 14, 15: addressed in #104 (the replacement text there is "// Cartcomm, Graphcomm and Distgraphcomm are similarly defined").

Page 497 (Appendix A), line 9, add:

MPI_DIST_GRAPH MPI::DIST_GRAPH

Page 499, after line 12:

add "MPI_UNWEIGHTED Not defined in C++"

Page 500, line 6, add:

MPI::Distgraphcomm

Page 550, line 27, add:

Distgraphcomm& Distgraphcomm::Clone() const

Page 551, line 13, add:

Distgraphcomm& Distgraphcomm::Dup() const

Functions to Deprecate (not included in this ticket but kept for reference!)

The following functions may be deprecated by this proposal (for MPI 2.2 or MPI 3.0). The proposal is not dependent on whether this happens or not, and the discussion on this should be kept separate.

#!c
MPI_GRAPH_CREATE(comm_old, nnodes, index, edges, reorder, comm_graph)
 IN  comm_old        input communicator (handle)
 IN  nnodes          number of nodes in graph (integer)
 IN  index           array of integers describing node degrees (see below)
 IN  edges           array of integers describing graph edges (see below)
 IN  reorder         ranking may be reordered (true) or not (false) (logical)
 OUT comm_graph      communicator with graph topology added (handle)
#!c
MPI_GRAPH_MAP(comm, nnodes, index, edges, newrank)
 IN      comm                      input communicator (handle)
 IN      nnodes                    number of graph nodes (integer)
 IN      index                     integer array specifying the graph structure, see MPI_GRAPH_CREATE
 IN      edges                     integer array specifying the graph structure
 OUT     newrank                   reordered rank of the calling process;
                                   MPI_UNDEFINED if the calling process does not belong to graph (integer)
#!c
MPI_GRAPH_NEIGHBORS(comm, rank, maxneighbors, neighbors)
 IN      comm                    communicator with graph topology (handle)
 IN      rank                    rank of process in group of comm (integer)
 IN      maxneighbors            size of array neighbors (integer)
 OUT     neighbors               ranks of processes that are neighbors to specified process (array of integer)
#!c
MPI_GRAPH_NEIGHBORS_COUNT(comm, rank, nneighbors)
 IN      comm                 communicator with graph topology (handle)
 IN      rank                 rank of process in group of comm (integer)
 OUT     nneighbors           number of neighbors of specified process (integer)
#!c
MPI_GRAPH_GET(comm, maxindex, maxedges, index, edges)
 IN      comm                     communicator with graph structure (handle)
 IN      maxindex                 length of vector index in the calling program
                                  (integer)
 IN      maxedges                 length of vector edges in the calling program
                                  (integer)
 OUT     index                    array of integers containing the graph structure (for
                                  details see the definition of MPI_GRAPH_CREATE)
 OUT     edges                    array of integers containing the graph structure
#!c
MPI_GRAPHDIMS_GET(comm, nnodes, nedges)
 IN      comm                    communicator for group with graph structure (handle)
 OUT     nnodes                  number of nodes in graph (integer) (same as number
                                 of processes in the group)
 OUT     nedges                  number of edges in graph (integer)

Since MPI_DIST_GRAPH_CREATE is collective, it would be a small matter to compute the total number of edges. Thus, MPI_GRAPHDIMS_GET could be kept. Note that nnodes is equal to the number of processes in comm.
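
For example (a hedged sketch; outdegree and comm_dist_graph are assumed to come from the calls defined above), a user can compute and cache the total edge count with a single reduction:

#!c
/* Sum the per-process outdegrees to obtain the total number of edges
   in the distributed graph. */
int total_edges;
MPI_Allreduce(&outdegree, &total_edges, 1, MPI_INT, MPI_SUM, comm_dist_graph);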

Possible additional functionality (not included in this ticket but kept for future reference)

To be discussed:

#!c
MPI_DIST_EDGE_COUNT(comm, edges)
 IN comm              communicator with distributed graph topology (handle)
 OUT edges            total number of edges in distributed graph (integer)

Impact on Implementations

Implementations have to add the new functions, which can easily be implemented on top of the existing general graph interface. Such a trivial (non-scalable) implementation requires little work.

Impact on Applications / Users

Users who want to use the new functionality will have to change calls to graph creation and neighbor query functions. They have to introduce explicit communication of neighbor lists if they query other processes' neighbors (which is unlikely).

Alternative Solutions

For more discussion/connection to topological collectives, see [wiki:TopoColl]

Keep for later (maybe introduction in MPI-3): Advice to users. Specifying reorder = false on only some processes can support heterogeneous application-to-hardware mapping. For example, suppose an application does visualization or heavy I/O on rank 0, and that only this process has the corresponding visualization or I/O capabilities. If this rank should be part of a distributed graph topology, it is useful to specify that this rank will not be remapped. Allowing remapping of other ranks can then exploit features of the interconnection network topology. Another scenario is an application where some ranks have large state information that would be (too) expensive to migrate. Setting reorder = false for these ranks guarantees that migration will not be necessary since these processes will preserve their rank in the distributed graph communicator. (End of advice to users.)

Advice to implementors. The reordering optimization problem must account for the constraint that processes with reorder = false must remain fixed. This could be done by eliminating the fixed nodes and all their edges, and solving the optimization problem on the reduced graph. However, better results may be possible by considering the entire graph. (End of advice to implementors.)

Entry for the Change Log

Section 7.5.3 on page 247. New functions for a scalable distributed graph topology interface have been added. In this section, the functions MPI_DIST_GRAPH_CREATE_ADJACENT and MPI_DIST_GRAPH_CREATE, the constant MPI_UNWEIGHTED, and the derived C++ class MPI::Distgraphcomm were added.

Section 7.5.4, page 252. For the scalable distributed graph topology interface, the functions MPI_DIST_GRAPH_NEIGHBORS_COUNT and MPI_DIST_GRAPH_NEIGHBORS and the constant MPI_DIST_GRAPH were added.

mpiforumbot commented 8 years ago

Originally by gropp on 2008-10-26 10:34:52 -0500


A specific proposal (all text, page and line numbers) as well as an open source implementation is required.

mpiforumbot commented 8 years ago

Originally by htor on 2009-01-10 09:48:38 -0600


fixed typo in description

mpiforumbot commented 8 years ago

Originally by htor on 2009-01-10 10:06:46 -0600


Updated proposal.

mpiforumbot commented 8 years ago

Originally by htor on 2009-02-04 12:56:06 -0600


please review!

mpiforumbot commented 8 years ago

Originally by moody20 on 2009-02-04 18:55:14 -0600


A very important improvement, but there's a lot of new material here... it'll take some time to ponder it all.

mpiforumbot commented 8 years ago

Originally by bronis on 2009-02-04 19:05:01 -0600


OK, initial review comments:

  1. The weights look like they could be confusing and the hints seem to make this unnecessarily complex; what is the default for weights and hints? Are there any limitations on the values that can be supplied as weights?
  2. Some examples are definitely needed;
  3. This phrase:

For each source node a non-negative number of target nodes are given as specified in the degrees array

should be reworded slightly to:

For each source node a non-negative number of target nodes are specified in the degrees array

  1. "i'th" and "j'th" should not use an apostrophe; instead, superscript the "th";
  2. "e.g." needs to be folowed by a comma -- "e.g.,";
  3. Change "non-decreasing" to "monotonically increasing";
  4. Change "the comm_old communicator" to "the communicator comm_old";
  5. To what order does: "be the list in increasing order of different nodes that appear in either sources or targets arrays of the calling processes" refer? Order of appearance in the arrays? Rank in comm_old? I do not know the answer so it needs to be clearer;
  6. Is the size of comm_graph always equal to the size of comm_old? If so, this should be stated clearly early in the description. If not, does the call return a copy of MPI_COMM_SELF to nodes that are not the source or target of any edge? MPI_COMM_NULL? If not, how can reorder be set to false?
  7. Change "and maybe be used" to "and may be used"
  8. What does this: "Edge weights can for instance be taken from the count arguments later used in communication calls." mean? Who does the taking? How can the application or the implementation take the weights from later calls? It would be best to reword this to use active voice and then the meaning of later might be clearer;
  9. I think the arguments need to be annotated as "non-negative" or such when appropriate; types are also needed for the arguments of MPI_DIST_GRAPH_CREATE;
  10. Create an advice to implementers (or maybe users) that captures this thought: "However, the exact meaning of edge weights is not specified by the MPI standard, and left to the implementation." Include what might be reasonable. Perhaps most of the weights discussion should go into that advice;
  11. Where are the actual function prototypes (i.e., the language bindings)?
  12. Please change: "It is not mandatory for an MPI implementation to actually construct explicitly the topology from its distributed parts." to "An implementation need not explicitly construct the topology from its distributed parts."
  13. Delete "Although it defeats the purpose of the distributed graph interface,"; instead end that advice to implementers with a statement about "However, a high quality implementation would..."
  14. Descriptions of MPI_DIST_NEIGHBORS and MPI_DIST_NEIGHBORS_COUNT are needed;
  15. I do not understand this comment: "Since MPI_Graph_create is collective, it would be a small matter to compute the total number of edges. Thus, this functionality could be kept. Note that nnodes is equal to the number of processes in comm." To which nnodes do you refer? What does it mean to keep this functionality?
mpiforumbot commented 8 years ago

Originally by rsthakur on 2009-02-08 15:25:28 -0600


This will benefit from having an example like Example 7.2 pg 246-247

mpiforumbot commented 8 years ago

Originally by rsthakur on 2009-02-08 15:48:17 -0600


The proposal should specify what are the sizes of the targets and weights arrays.

When reading the first paragraph beginning with "The virtual communication topology is specified...", I found myself asking: "What is the weight of an edge supposed to mean and what are valid values of weights?" This is somewhat answered two paragraphs below, but it is still not clear what values of weights one should use and how to choose them. An example would be good.

Instead of "Source and target nodes must all be greater than or equal to 0 and smaller than the size of the comm_old communicator." perhaps one can say "Source and target nodes must be members of comm_old."

I feel this should be moved to MPI 3.

I will need some time to digest this.

mpiforumbot commented 8 years ago

Originally by moody20 on 2009-02-09 16:09:31 -0600


I also suggest we move this to 3.0. It's very important we make this change, but it's just too much to understand it all before the 2.2 deadline.

mpiforumbot commented 8 years ago

Originally by htor on 2009-02-28 21:24:44 -0600


Replying to bronis:

OK, initial review comments:

  1. The weights look like they could be confusing and the hints seem to make this unnecessarily complex; what is the default for weights and hints? Are there any limitations on the values that can supplied as weights? there are no defaults (it's a usual input array). Limits are simply 0..2^{sizeof(int)-1} (the positive integer range). Negative weights would not make too much sense in a communication context (how would one define negative communication?). It's now clarified!
  2. Some examples are definitely needed; done, let me know if you want more
  3. This phrase:

For each source node a non-negative number of target nodes are given as specified in the degrees array

should be reworded slightly to:

For each source node a non-negative number of target nodes are specified in the degrees array fixed

  1. "i'th" and "j'th" should not use an apostrophe; instead, superscript the "th"; fixed
  2. "e.g." needs to be folowed by a comma -- "e.g.,"; fixed
  3. Change "non-decreasing" to "monotonically increasing"; I don't see a difference (both phrases are established in literature about discrete mathematics). Is there any good reason for your proposal?
  4. Change "the comm_old communicator" to "the communicator comm_old"; fixed
  5. To what order does: "be the list in increasing order of different nodes that appear in either sources or targets arrays of the calling processes" refer? Order of appearance in the arrays? Rank in comm_old? I do not know the answer so it needs to be clearer; fixed
  6. Is the size of comm_graph always equal to the size of comm_old? If so, this should be stated clearly early in the description. If not, does the call return a copy of MPI_COMM_SELF to nodes that are not the source or target of any edge? MPI_COMM_NULL? If not, how can reorder be set to false? fixed
  7. Change "and maybe be used" to "and may be used" fixed
  8. What does this: "Edge weights can for instance be taken from the count arguments later used in communication calls." mean? Who does the taking? How can the application or the implementation take the weights from later calls? It would be best to reword this to use active voice and then meaning of later might be clearer; fixed
  9. I think the arguments need to be annotated as "non-negative" or such when appropriate; types are also needed for the arguments of MPI_DIST_GRAPH_CREATE; fixed
  10. Create and advice to implementers (or maybe users) that captures this thought: "However, the exact meaning of edge weights is not specified by the MPI standard, and left to the implementation." Include what might be reasonable. Perhaps most of the weights discussion should go into that advice; fixed (I added it to the main text, we can argue if this should be an advice - but the interface is not optional or such, but it could be easily filled with unit weights).
  11. Where are the actual function prototypes (i.e., the language bindings)? fixed
  12. Please change: "It is not mandatory for an MPI implementation to actually construct explicitly the topology from its distributed parts." to "An implementation need not explicitly construct the topology from its distributed parts." fixed
  13. Delete "Although it defeats the purpose of the distributed graph interface,"; instead end that advice to implementers with a statement about "However, a high quality implementation would..." fixed
  14. Descriptions of MPI_DIST_NEIGHBORS and MPI_DIST_NEIGHBORS_COUNT are needed; why? The model functions (MPI_Graph_neighbor{,_count}) are not described either (they're described in the intro).
  15. I do not understand this comment: "Since MPI_Graph_create is collective, it would be a small matter to compute the total number of edges. Thus, this functionality could be kept. Note that nnodes is equal to the number of processes in comm." To which nnodes do you refer? What does it mean to keep this functionality? this (old) interface doesn't really make too much sense anymore. This could also be implemented by the user after graph construction (trivially).
mpiforumbot commented 8 years ago

Originally by htor on 2009-02-28 21:27:02 -0600


Replying to rsthakur:

The proposal should specify what are the sizes of the targets and weights arrays. this should be more clear now.

When reading the first paragraph beginning with "The virtual communication topology is specified...", I found myself asking: "What is the weight of an edge supposed to mean and what are valid values of weights?" This is somewhat answered two paragraphs below, but it is still not clear what values of weights one should use and how to choose them. An example would be good. yes, I hope the new version makes it more clear (there is a separate paragraph just about weights)

Instead of "Source and target nodes must all be greater than or equal to 0 and smaller than the size of the comm_old communicator." perhaps one can say "Source and target nodes must be members of comm_old." fixed

mpiforumbot commented 8 years ago

Originally by htor on 2009-02-28 21:29:50 -0600


Attachment added: virtual_graph.c (6.1 KiB)

mpiforumbot commented 8 years ago

Originally by htor on 2009-02-28 21:31:02 -0600


Implementation finished and attached and proposal updated. Please review!

Thanks, Torsten

mpiforumbot commented 8 years ago

Originally by bronis on 2009-03-01 12:46:55 -0600


I'll start with substantive comments and then provide editing comments.

  1. You state "If reorder is set to false then process i in comm_graph will be the same process as s_i in the calling communicator comm_old." I still do not see how you enforce this if some process j < i is omitted from comm_graph.
  2. Your example demonstrates why I think there needs to be some constant that indicates that the weights are all identical; I suggest MPI_IDENTICAL_WEIGHTS. That was my point in asking the question in the first place. I think the common case will be that the user does not care about the weights used. In fact, I think you could make a case for omitting them but having an easy default would be about as good and would allow the flexibility of specifying them in the less common case where the user is willing to specify ones with meaning.
  3. I think you may want to add a statement to the advice to implementers about what must be stored locally. In particular, each process must locally store the lists of sources, source weights, targets and target weights since the MPI_DIST_GRAPH_NEIGHBORS function is defined with local semantics.

Editing comments:

  1. The FORTRAN prototype has the wrong function name. You need to change "MPI_GRAPH_CREATE" to "MPI_DIST_GRAPH_CREATE".
  2. Is there any good reason to omit "const" from the C binding? You have it in the C++ binding. The incongruity seems odd. I suggest you add it where appropriate.
  3. Please change "the order in which nodes are listed as targets or source" to "the order in which nodes are listed as targets or sources" (add an 's' to "source" to be consistent).
  4. Please change "Processes not in K are returned some processes are returned MPI_COMM_NULL" to "Processes not in K are returned MPI_COMM_NULL".
  5. Please change "The new distributed graph communicator comm_graph consist of k processes." to "The new distributed graph communicator comm_graph consists of k processes." (add an 's' to "consist" for agreement).
  6. Please change "implementation not to do any reordering at all" to "implementation not to do any reordering" (delete "at all", which is unnecessary).
  7. The statement "The boolean reorder must take the same value on all processes in the call." reads oddly. No similar statement appears for the other topology communicators. If you insist on retaining it, I suggest you rewrite it as "It is erroneous to specify different values of reorder for matching calls of MPI_DIST_GRAPH_CREATE."
  8. Please change "topology which visits" to "topology that visits" since the clause is required.
  9. The 'H' in "Hamiltonian" should be upper case.
  10. Something seems wrong in this C++ prototype:

std::pair<int,int> MPI::Graphcomm::Get_dist_neighbors_count(int rank) const

Shouldn't it just be:

void MPI::Graphcomm::Get_dist_neighbors_count(int indegre[], int outdegree[]) const

  1. Please change "MPI_DIST_NEIGHBORS" to "MPI_DIST_GRAPH_NEIGHBORS" in the language neutral prototype.
  2. Please change "array of non-negative integer" to "array of non-negative integers" (add an 's') four times in the language neutral description of MPI_DIST_GRAPH_NEIGHBORS.
  3. Please change "targetweights" to "targetweights" (add an 's').
  4. Please change "processes that the calling process have as target" to "processes for which the calling process is a source" so it is symmetric with the description of sources.
  5. In the statements "Since MPI_Graph_create is collective, it would be a small matter to compute the total number of edges. Thus, this functionality could be kept. Note that nnodes is equal to the number of processes in comm." I think you meant "MPI_DIST_GRAPH_CREATE". In any event, "this functionality" is ambiguous. Did you mean "Since MPI_DIST_GRAPH_CREATE is collective, it would be a small matter to compute the total number of edges. Thus, MPI_GRAPHDIMS_GET could be kept. Note that nnodes is equal to the number of processes in comm." If we did keep it, wouldn't MPI_DIST_EDGE_COUNT be unnecessary?
mpiforumbot commented 8 years ago

Originally by traff on 2009-03-04 08:47:00 -0600


Some clarification/improvement still needed. I'll try to make my update in the next few days

Jesper

mpiforumbot commented 8 years ago

Originally by htor on 2009-03-04 17:01:53 -0600


Replying to bronis:

I'll start with substantive comments and then provide editing comments.

  1. You state "If reorder is set to false then process i in comm_graph will be the same process as s_i in the calling communicator comm_old." I still do not see how you enforce this if some process j < i is omitted from comm_graph. oh yes, I see. I fixed it - but now it reads a bit more confusing (as I talk about ordinal ordering). However, it should be precisely defined.
  2. Your example demonstrates why I think there needs to be some constant that indicates that the weights are all identical; I suggest MPI_IDENTICAL_WEIGHTS. That was my point in asking the question in the first place. I think the common case will be that the user does not care about the weights used. In fact, I think you could make a case for omitting them but having an easy default would be about as good and would allow the flexibility of specifying them in the less common case where the user is willing to specify ones with meaning. good point - I added MPI_UNWEIGHTED (is shorter) and updated the example
  3. I think you may want to add a statement to the adviceto implementers about whatr must be stored locally. In particular, each process must locally store the lists of sources, source weights, targets and target weights since that the MPI_DIST_GRAPH_NEIGHBORS function is defined with local semantics. ok, done

Editing comments:

  1. The FORTRAN prototype has the wrong function name. You need to change "MPI_GRAPH_CREATE" to "MPI_DIST_GRAPH_CREATE". fixed
  2. Is there any good reason to omit "const" from the C binding? You have it in the C++ binding. The incongruity seems odd. I suggest you add it where appropriate. hmm, the const proposal isn't through yet. So I just stay consistent with the rest of MPI (this needs to be updated if the proposal makes it).
  3. Please change "the order in which nodes are listed as targets or source" to "the order in which nodes are listed as targets or sources" (add an 's' to "source" to be consistent). fixed
  4. Please change "Processes not in K are returned some processes are returned MPI_COMM_NULL" to "Processes not in K are returned MPI_COMM_NULL". done
  5. Please change "The new distributed graph communicator comm_graph consist of k processes." to "The new distributed graph communicator comm_graph consists of k processes." (add an 's' to "consist" for agreement). fixed
  6. Please change "implementation not to do any reordering at all" to "implementation not to do any reordering" (delete "at all", which is unnecessary). done
  7. The statement "The boolean reorder must take the same value on all processes in the call." reads oddly. No similar statement appears for the other topology communicators. If you insist on retaining it, I suggest you rewrite it as "It is erroneous to specify different values of reorder for matching calls of MPI_DIST_GRAPH_CREATE." fixed
  8. Please change "topology which visits" to "topology that visits" since the clause is required. done
  9. The 'H' in "Hamiltonian" should be upper case. done
  10. Something seems wrong in this C++ prototype:

std::pair<int,int> MPI::Graphcomm::Get_dist_neighbors_count(int rank) const

Shouldn't it just be:

void MPI::Graphcomm::Get_dist_neighbors_count(int indegre[], int outdegree[]) const hmm, this is a philosophical discussion. I think we should ask Jeff (he's on CC and I hope he joins the fun). So std::pair<int,int> is an STL pair of ints, which would be the right way to return two scalars in my opinion. But I don't know what the rules for C++ in MPI are - Jeff can probably clarify as he concocted the C++ bindings.

  1. Please change "MPI_DIST_NEIGHBORS" to "MPI_DIST_GRAPH_NEIGHBORS" in the language neutral prototype. fixed
  2. Please change "array of non-negative integer" to "array of non-negative integers" (add an 's') four times in the language neutral description of MPI_DIST_GRAPH_NEIGHBORS. fixed
  3. Please change "targetweights" to "targetweights" (add an 's'). fixed
  4. Please change "processes that the calling process have as target" to "processes for which the calling process is a source" so it is symmetric with the description of sources. ok, I don't have a strong opinion on this -> changed
  5. In the statements "Since MPI_Graph_create is collective, it would be a small matter to compute the total number of edges. Thus, this functionality could be kept. Note that nnodes is equal to the number of processes in comm." I think you meant "MPI_DIST_GRAPH_CREATE". In any event, "this functionality is ambiguous. Did you mean "Since MPI_DIST_GRAPH_CREATE is collective, it would be a small matter to compute the total number of edges. Thus, MPI_GRAPHDIMS_GET could be kept. Note that nnodes is equal to the number of processes in comm." If we did keep it, wouldn't MPI_DIST_EDGE_COUNT be unnecessary? yes and yes. I'm personally against a function to query the edge count because this can simply be done/cached at the comm by the user without problems and overhead.

Thanks a lot for your review! Torsten

mpiforumbot commented 8 years ago

Originally by bronis on 2009-03-04 17:25:42 -0600


I think we are getting pretty close.

  1. I'm not sure that people will understand the current wording for when reorder is false. I don't have anything better to suggest and others still need to review this so I will let them take a hack at it.
  2. Please change "An application is free to pass MPI_UNWEIGHTED instead of the weight array which indicates that all edges have the same (effectively no) weight." to "An application can supply MPI_UNWEIGHTED sourceweights and targetweights to indicate that all edges have the same (effectively no) weight.It is erroneous to supply MPI_UNWEIGHTED for one and not the other."
  3. Please change "Since MPI_Graph_create is collective, it would be a small matter to compute the total number of edges. Thus, this functionality could be kept. Note that nnodes is equal to the number of processes in comm." to "Since MPI_DIST_GRAPH_CREATE is collective, it would be a small matter to compute the total number of edges. Thus, MPI_GRAPHDIMS_GET could be kept. Note that nnodes is equal to the number of processes in comm." I know it is not part of what will be added to the standard but it will make it clearer for others reading the ticket.
  4. I am fine with Jeff making the ruling on the C++ interface question. I did base my suggestion on other C++ functions. In particular, I think I looked at Get_neighbors_count or Get_neighbors...
mpiforumbot commented 8 years ago

Originally by htor on 2009-03-04 17:46:16 -0600


Replying to bronis:

I think we are getting pretty close.

  1. I'm not sure that people will understand the current wording for when reorder is false. I don't have anything better to suggest and others still need to review this so I will let them take a hack at it. yes, it's complicated, but I don't think that it's more complicated than what we have right now.
  2. Please change "An application is free to pass MPI_UNWEIGHTED instead of the weight array which indicates that all edges have the same (effectively no) weight." to "An application can supply MPI_UNWEIGHTED sourceweights and targetweights to indicate that all edges have the same (effectively no) weight.It is erroneous to supply MPI_UNWEIGHTED for one and not the other." hmm, there is only one weight array (source and target weights exist only in the query functions).
  3. Please change "Since MPI_Graph_create is collective, it would be a small matter to compute the total number of edges. Thus, this functionality could be kept. Note that nnodes is equal to the number of processes in comm." to "Since MPI_DIST_GRAPH_CREATE is collective, it would be a small matter to compute the total number of edges. Thus, MPI_GRAPHDIMS_GET could be kept. Note that nnodes is equal to the number of processes in comm." I know it is not part of what will be added to the standard but it will make it clearer for others reading the ticket. done
  4. I am fine with Jeff making the ruling on the C++ interface question. I did base my suggestion on other C++ functions. In particular, I think I looked at Get_neighbors_count or Get_neighbors... ah, that's exactly my point. MPI_Graph_neighbors_count returns an int and does not use call by ref. Now we have two ints, and we should return two ints (a pair) instead of changing the signature.
mpiforumbot commented 8 years ago

Originally by bronis on 2009-03-04 20:55:14 -0600


Replying to htor:

Replying to bronis:

[snip]

  1. Please change "An application is free to pass MPI_UNWEIGHTED instead of the weight array which indicates that all edges have the same (effectively no) weight." to "An application can supply MPI_UNWEIGHTED sourceweights and targetweights to indicate that all edges have the same (effectively no) weight.It is erroneous to supply MPI_UNWEIGHTED for one and not the other." hmm, there is only one weight array (source and target weights exist only in the query functions).

Sorry. I was trying to get my comments in before a 3:30PM meeting. That's what I get for rushing. Oh well. Let's try that again:

  1. Please change "An application is free to pass MPI_UNWEIGHTED instead of the weight array which indicates that all edges have the same (effectively no) weight." to "An application can supply MPI_UNWEIGHTED for weights to indicate that all edges have the same (effectively no) weight."

[snip]

  1. I am fine with Jeff making the ruling on the C++ interface question. I did base my suggestion on other C++ functions. In particular, I think I looked at Get_neighbors_count or Get_neighbors... ah, that's exactly my point. MPI_Graph_neighbors_count returns an int and does not use call by ref. Now we have two ints, and we should return two ints (a pair) instead of changing the signature.

OK, now I have more time so I went back and found what I looked at. It was this:

void MPI::Graphcomm::Get_dims(int nnodes[], int nedges[]) const

It is a function with two output arrays, just like the one we are discussing. It seems pretty clear that's what you should model on.

mpiforumbot commented 8 years ago

Originally by htor on 2009-03-04 21:53:09 -0600


Replying to bronis:

Replying to htor:

Replying to bronis:

[snip]

  1. Please change "An application is free to pass MPI_UNWEIGHTED instead of the weight array which indicates that all edges have the same (effectively no) weight." to "An application can supply MPI_UNWEIGHTED sourceweights and targetweights to indicate that all edges have the same (effectively no) weight.It is erroneous to supply MPI_UNWEIGHTED for one and not the other." hmm, there is only one weight array (source and target weights exist only in the query functions).

Sorry. I was trying to get my comments in before a 3:30PM meeting. That's what I get for rushing. Oh well. Let's try that again:

  1. Please change "An application is free to pass MPI_UNWEIGHTED instead of the weight array which indicates that all edges have the same (effectively no) weight." to "An application can supply MPI_UNWEIGHTED for weights to indicate that all edges have the same (effectively no) weight." no worries :). This one sounds good -> replaced

[snip]

  1. I am fine with Jeff making the ruling on the C++ interface question. I did base my suggestion on other C++ functions. In particular, I think I looked at Get_neighbors_count or Get_neighbors... ah, that's exactly my point. MPI_Graph_neighbors_count returns an int and does not use call by ref. Now we have two ints, and we should return two ints (a pair) instead of changing the signature.

OK, now I have more time so I went back and found what I looked at. It was this:

void MPI::Graphcomm::Get_dims(int nnodes[], int nedges[]) const

It is a function with two output arrays, just like the one we are discussing. It seems pretty clear that's what you should model on. Oh well, there are two things broken with this function:

1) it looks like output arrays ([]), but it's actually just a call by reference (which would rather be a * ... but yeah, everything is a pointer in C and C++)
2) this is inconsistent with the other interfaces that return scalars; I'll refer this question to Jeff, because I can't explain this inconsistency.

But you're right, as much as it sucks, we should be consistent with the (imho inconsistent) current document. -> I changed the interface (I even used the confusing [] notation).

Thanks!

mpiforumbot commented 8 years ago

Originally by bronis on 2009-03-04 22:12:58 -0600


OK. I am done for now. All seems good to me and this seems like a good idea that would be nice to have in place for 2.2. Hopefully others will review this so we can have its first reading next month.

mpiforumbot commented 8 years ago

Originally by smithbr on 2009-03-19 10:45:01 -0500


Add myself to cc list

mpiforumbot commented 8 years ago

Originally by htor on 2009-03-24 18:50:56 -0500


Changes as discussed during the last telecon:

Please review!

All the Best, Torsten

mpiforumbot commented 8 years ago

Originally by bronis on 2009-03-24 20:49:27 -0500


Please change "isolated processes are allowed" to "isolated processes (i.e., processes with no incident edges) are allowed"

In the example, shouldn't "y=rank/P" be "y=rank/Q"?

Please change "Example 7.4 A two-dimensional PxQ torus where all processes communicate along the dimensions and along the edges, which can not be modelled with cartesian topologies, could be expressed as shown in the following code (the communications along the dimensions is twice as heavy as the communication along the diagonals in this example): " to "Example 7.4 We can use the graph topologies to express a two-dimensional PxQ torus where all processes communicate to their immediate neighbors and to processes on the diagonals, which can not be modeled with cartesian topologies. In the following code, the communications along to immediate neighbors is expected to be twice that as the communication along the diagonals: "

mpiforumbot commented 8 years ago

Originally by rsthakur on 2009-03-25 11:18:03 -0500


I will review this in the next day or two, but we need to have more people looking at this.

mpiforumbot commented 8 years ago

Originally by rsthakur on 2009-03-25 21:47:37 -0500


I would like to know if there is any related work on which this proposal is based. It looks fairly complicated to me. I am not sure I even understand it. From the definition of the "sources" array, I would expect it to contain the list of processes to which this process is directly connected (simple adjacency info). But Example 7.3 shows it need not be so. The sources array could contain anything as long as the combination of sources and targets arrays on all processes represents the whole graph. This is unnecessarily complicated. Why not just have each process provide its own adjacency info, i.e., the list of neighbors it is directly connected to, the weights of those edges, and nothing else? That gives you the whole graph in a distributed fashion, and you don't need the degrees and targets arrays.
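
For comparison, here is a minimal sketch of the adjacency-only style described above, written against the MPI_Dist_graph_create_adjacent C binding that was eventually standardized in MPI-2.2; the directed ring and the unit weights are illustrative only, not taken from the ticket:

```c
#include <mpi.h>

/* Adjacency-only specification: each process passes just its own incoming
   and outgoing edges.  Here, a directed ring in which rank r receives from
   rank r-1 and sends to rank r+1. */
int build_ring_topology(MPI_Comm comm_old, MPI_Comm *comm_dist_graph)
{
    int rank, size;
    MPI_Comm_rank(comm_old, &rank);
    MPI_Comm_size(comm_old, &size);

    int sources[1]      = { (rank - 1 + size) % size };   /* who sends to me */
    int destinations[1] = { (rank + 1) % size };          /* whom I send to  */
    int weights[1]      = { 1 };

    return MPI_Dist_graph_create_adjacent(comm_old,
                                           1, sources, weights,        /* incoming edges */
                                           1, destinations, weights,   /* outgoing edges */
                                           MPI_INFO_NULL, 1 /* reorder */,
                                           comm_dist_graph);
}
```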

The current implementation on top of the existing API does not show whether this interface can be optimized or whether it contains the right information needed for optimization (or more information than needed).

IN degrees array specifying the number of targets for each source node in the nodes array (array of non-negative integers)
IN targets target nodes for the source nodes in the nodes array (array of non-negative integers)

What does nodes array mean in the above definition? Should it be "sources" array instead?

"The call is collective and it is erroneous to specify different values of reorder for matching calls of MPI_DIST_GRAPH_CREATE." could be worded as "The call is collective and the value for reorder must be the same for all processes in comm_old."

OUT indegree number of edges into this process (non-negative integer)some processes

The "some processes" above seems to be a typo

mpiforumbot commented 8 years ago

Originally by traff on 2009-03-26 07:33:54 -0500


I did some minor edits, took in the remarks (partly) from Bronis and Rajeev. Thanks

I think that this is a fairly important and not completely trivial addition for MPI 2.2. It solves (at least, proposes to solve) an important scalability problem with MPI 2.1, and provides the necessary freedom for users and implementors of MPI 2.2 to start promoting the scalable functionality. This will be a good preparation for MPI 3.0! I think (Torsten probably agrees) that these are good reasons to have it for MPI 2.2 - but Rajeev is right, this needs very CAREFUL reviewing and THINKING!

Let me ask another question: if we do not take things like this for MPI 2.2, then what is really left for MPI 2.2 that a user should care about? In general, why rush MPI 2.2 so much now? I think for MPI 2.2 we should a) be extremely careful with what we put in and not allow things that need an immediate fix in 2.3..., and b) include a few IMPORTANT fixes/extensions; otherwise, why should a user care about MPI 2.2? I'd mention RMA as another such point (which, admittedly, I apologize for not having had much time to care about in the past weeks).

As for the proposal being complicated: I don't really think so - it just gives more freedom. Probably in most use cases, the edges out of process i will be known by process i, and process i will just give these in the call; but there is freedom for the cases where process j knows that it'll communicate with i, but i is not aware of this (one-sided applications); or cases where process k knows that i and j will communicate. So there are good arguments for allowing this freedom. It would of course be good to back this up with case stories where somebody would have liked this functionality, but I don't think it is very contrived or very complicated.
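
To illustrate the flexibility described here, a sketch (with hypothetical ranks, not from the ticket) in which only the one-sided origin process contributes the edge it knows about, while every other process contributes an empty edge list:

```c
#include <mpi.h>
#include <stddef.h>

/* Sketch: rank 'origin' knows it will MPI_Put to rank 'target', but the
   target is unaware of this.  Only the origin contributes the edge; all
   other processes pass n = 0.  All processes pass MPI_UNWEIGHTED, so no
   weights are specified at all. */
void create_onesided_topology(MPI_Comm comm_old, int origin, int target,
                              MPI_Comm *comm_dist_graph)
{
    int rank;
    MPI_Comm_rank(comm_old, &rank);

    if (rank == origin) {
        int sources[1]      = { origin };
        int degrees[1]      = { 1 };
        int destinations[1] = { target };
        MPI_Dist_graph_create(comm_old, 1, sources, degrees, destinations,
                              MPI_UNWEIGHTED, MPI_INFO_NULL, 1, comm_dist_graph);
    } else {
        /* The call is collective, so everyone participates; with n = 0 the
           array arguments are not examined (NULL is used here for brevity). */
        MPI_Dist_graph_create(comm_old, 0, NULL, NULL, NULL,
                              MPI_UNWEIGHTED, MPI_INFO_NULL, 1, comm_dist_graph);
    }
}
```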

A specific technical thing along this line: I'd VERY MUCH be in favor of letting reorder be local. A process can then specify "yes", it is ok to reorder me to some other processor, or "no", for some reason (much state, cannot afford to ship this elsewhere; or special capabilities of my processor) I need to stay where I am. Let us allow this freedom!

Jesper

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2009-03-28 05:25:20 -0500


  1. There is a major problem with this proposal. Solution is provided later on.
    • MPI_TOP_TEST, MPI_GRAPHDIMS_GET and MPI_GRAPH_GET p.248-249 and the non-collectivity of these routines (p256.27-28) stay valid (and must stay valid) and are still in use with the new distributed graph topology interface.
    • The new interface already causes communication in MPI_DIST_GRAPH_CREATE to fulfil the non-collective functionality of MPI_DIST_GRAPH_NEIGHBORS_COUNT and MPI_DIST_GRAPH_NEIGHBORS.
    • This implies that there is a break-even-point in performance (time and space) of the old and the new interface, i.e., it would be a question whether the old interface should co-exist for applications with, e.g., medium number of processes, but many threads per MPI process.

I would solve these problems with the following modifications:

Therefore you should write:

Sorry for not having the time earlier for this review.

  1. You should add a rationale or some text in the intro stating that the most scalable interface is MPI_DIST_GRAPH_CREATE and that MPI_DIST_GRAPH_CREATE_GENERAL is for convenience, but will cause additional communication overhead to set up the appropriate information for the inquiry functions.
  2. Why are the weights not "nonnegative floats"? Integer is strange. And in large hybrid topologies, int may lead to overflow problems if each thread uses its "own edge", i.e., there are many parallel edges summing up to sizes larger than 2**31, which is only 2 GB. If there wasn't any vote from the Forum to use int, then I would recommend changing it into "floating-point number" with the language types "float" and "REAL".
mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2009-03-28 16:11:41 -0500


My review above was done based on (a printed version of) https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/33?version=42 i.e., without knowledge of the last two modifications of the description.

Therefore, my item 8 seems to be already done.

To Rajeev's latest comments:

Yes, I tried to find a performance-enabling solution to your "looks fairly complicated".

To Jesper's latest comments:

I'm not sure whether Torsten (and you) are able to modify the proposal fast enough to restart and finish full reviewing before the meeting.

I would like to see it in MPI-2.2, because it is technically simple. Please be very careful in integrating it correctly into current MPI-2.1.

About the non-global reorder argument: Please provide both texts so that the reviewers can already review both variants. To me, it sounds really good. Especially process 0 in MPI_COMM_WORLD has a special role, see p282.7-9, and one may keep this role at process 0 in comm_graph.

The decision on reorder may also include changes to MPI_CART_CREATE. If there is no fast consensus on reorder, it may be necessary to postpone reorder to MPI-3.0. Most important is to have finished text early and text that has been reviewed four times by the meeting.

mpiforumbot commented 8 years ago

Originally by htor on 2009-03-29 15:51:18 -0500


Replying to RolfRabenseifner:

First: thanks for your thorough review! I will address the things that I think have not been addressed yet:

  1. There is a major problem with this proposal. Solution is provided later on.
    • The new interface already causes communication in MPI_DIST_GRAPH_CREATE to fulfil the non-collective functionality of MPI_DIST_GRAPH_NEIGHBORS_COUNT and MPI_DIST_GRAPH_NEIGHBORS. yes, at least a single allreduce is needed
    • This implies that there is a break-even-point in performance (time and space) of the old and the new interface, i.e., it would be a question whether the old interface should co-exist for applications with, e.g., medium number of processes, but many threads per MPI process. I would not think so - the main reason to use this interface is to supply more knowledge about future communication paths to the MPI library. In order to apply any optimization, the MPI library has to communicate to get some local or global view of the network. One possibility would be reordering, but even if reordering is not allowed, the library could still map I/O intensive processes in an SMP to processors that have a fast connection to others. If no optimization is to be performed, then what is the use of this interface? I suppose the user can find much better techniques to store graphs, e.g., BGL (or they can be implemented as a library on top of MPI).

I would solve these problems with the following modifications:

  • This also changes the interface of the inquiry functions.
  • Use two interfaces that each one allow to define a Distributed Graph Topology
    • Rename your MPI_DIST_GRAPH_CREATE into MPI_DIST_GRAPH_CREATE_GENERAL
    • Start with a simple interface MPI_DIST_GRAPH_CREATE that has exactly the arguments that are needed to support MPI_DIST_GRAPH_NEIGHBORS_COUNT and MPI_DIST_GRAPH_NEIGHBORS.
  • In the description of the new simple MPI_DIST_GRAPH_CREATE, you should write:

    "The calling processes must ensure that each edge of the graph is described in the source and in the destination process with the same weights. If there are multiple edges for a given (source,dest) pair, then the sequence of the weights of these edges does not matter. The complete communication topology is the combination of all edges shown in the sources arrays, which must be identical to the combination of all edges shown in the dests arrays." I understand your (good) intention. However, I am hesitant to add another interface call because I would think that the current interface is a) not too hard to understand (especially if you look at the examples - I can provide more if necessary) and b) does not have huge performance implications because: 1) graph creation requires to create a new communicator anyway, so it requires an allreduce, packing a bit more data on this allreduce does very likely not matter at all 2) graph creation should be used "carefully" (as well as any other communicator creation routine) 3) any useful implementation of graph_create performs optimizations that require other (potentially much heavier) computations/communications

    1. In MPI, we have two word-pairs: (source, dest[ination]) and (origin, target).
      • (origin,target) is reserved for 1-sided communication.
      • The un-paired word target is also used on page 209 as synonym for source and destination, see p209.43-44.
      • It is also used similarly in I/O, see p383.44 and p384.
      • (source, target) can be found only on p389.23. This should be treated as an inconsistency.
      • You used (source,target).

    Please change all "target" in "destination" (in text) or "dest" (in variable names). Don't hesitate to use "dests" for destination arrays. Very good point, however, I can not come up with a better name right now and destinations doesn't sound right for me. Destinations are only used in relation to send operations where a graph does not necessarily mean that anything is sent. I would propose to rename "target" to "sink" which is consistent with graph theory but might even be more confusing (Jesper didn't like my Landau notation either). What do you think?

    1. Make MPI_UNWEIGHTED a special constant like MPI_BOTTOM.
      • Don't forget to add it after p14.5
      • Add after your sentence on MPI_UNWEIGHTED the following: Page 14.5? Are you sure?

    It is erroneous to supply MPI_UNWEIGHTED in one process and not in all other processes of comm_old. Note that MPI_UNWEIGHTED is not a special weight value; rather it is a special value for the total array argument. In C, one would expect it to be a NULL. In Fortran, MPI_UNWEIGHTED is an object like MPI_BOTTOM (not usable for initialization or assignment). See Section 2.5.4. fixed

    • In MPI_DIST_GRAPH_NEIGHBOR_COUNT, please add a

      OUT weighted boolean flag (logical) yes, makes sense (saves a constant) - done

    • In MPI_DIST_GRAPH_NEIGHBOR, formulate:

      If MPI_UNWEIGHTED is supplied for sourceweights or destweights or both, or if MPI_UNWEIGHTED was supplied during the construction of the graph, then no weight information is returned in that array or those arrays. The only requirement on the sequence of the values in sources and dests is that two calls to that routine with the same input argument comm will return the same sequence of edges.
      If maxindegree or maxoutdegree is smaller than the numbers returned by MPI_DIST_GRAPH_NEIGHBOR_COUNT, then only the first part of the full list is returned. Note that the sequence need not be identical to the sequence that was provided in the creation of comm for the case that MPI_DIST_GRAPH_CREATE was used. ok, applied with minor changes

      1. You may be the only one who deprecates some routines in MPI-2.2.

Therefore you should write:

  • On p441.14, change section number from "15.1" into "15.2".
  • Before p441.14 add section

    15.1 Deprecated since MPI-2.2

    The following general (graph) topology constructor function is deprecated and is superseded by MPI_DIST_GRAPH_CREATE and MPI_DIST_GRAPH_CREATE_GENERAL in MPI-2.2.

  • Remove 246.1 "7.5.3 General (Graph) Constructor"
  • Move 246.2-247.48 to here.

    And so on with all the routines. There must not be any freedom for the chapter author. The ticket must specify all changes! I agree; however, I did not say that this ticket must deprecate functions. We have no official way to deprecate functions yet, and remember how the Forum broke down when it came to a versioning strategy for MPI-3. I do not want such pointless discussions to hold this important fix back; thus, the deprecation is fully optional, and we need to decide whether we want to do this and, if yes, how we want to do this.

    1. You should add a rationale or some text in the intro stating that the most scalable interface is MPI_DIST_GRAPH_CREATE and that MPI_DIST_GRAPH_CREATE_GENERAL is for convenience, but will cause additional communication overhead to set up the appropriate information for the inquiry functions. I would not say this because the additional communication overhead might be negligible. Stating this in the standard makes assumptions about implementations that we don't know yet (and should not make).
    2. Why are the weights not "nonnegative floats"? Integer is strange. there are many graph-theoretical reasons for that. My short answer is that it simplifies partitioning algorithms and reasoning about them (cf. Radix algorithms for SSSP problems). However, I am not sure if those simplifications are applicable to our domain (but I think so). I have no strong opinion, but integer values give us the possibility to distinguish 2^63 weights on 64 bit architectures which should be enough (the weights are all relative anyway and normalization can be applied without changing the result). Thus, we should not prevent such optimizations.

And in large hybrid topologies, int may lead to overflow problems if each thread uses its "own edge", i.e., there are many parallel edges summing up to sizes larger than 2**31, which is only 2 GB. If there wasn't any vote from the Forum to use int, then I would recommend changing it into "floating-point number" with the language types "float" and "REAL". Do you still run 32-bit architectures? Are they still relevant? I also don't fully understand why this problem cannot be solved with normalization into the range 0..2^31. Do you have a particular example?

Thanks for your review!

mpiforumbot commented 8 years ago

Originally by htor on 2009-03-29 15:58:57 -0500


Replying to RolfRabenseifner:

My review above was done based on (a printed version of) https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/33?version=42 i.e., without knowledge of the last two modifications of the description.

Therefore, my item 8 seems to be already done. partially - I addressed the others above (I hope that I caught all).

To Jesper's latest comments:

I'm not sure whether Torsten (and you) are able to modify the proposal fast enough to restart and finish full reviewing before the meeting. I think we are, it's important.

I would like to see it in MPI-2.2, because it is technically simple. Please be very careful in integrating it correctly into current MPI-2.1. yes

About the non-global reorder argument: Please provide both texts so that the reviewers can already review both variants. To me, it sounds really good. Especially process 0 in MPI_COMM_WORLD has a special role, see p282.7-9, and one may keep this role at process 0 in comm_graph. I agree with your comments and think that having reorder locally could improve usability without too many performance implications.

The decision on reorder may also include changes to MPI_CART_CREATE. If there is no fast consensus on reorder, it may be necessary to postpone reorder to MPI-3.0. yes

Most important is to have finished text early and text that has been reviewed four times by the meeting. working on it - please review all changes!

mpiforumbot commented 8 years ago

Originally by bronis on 2009-03-29 16:32:24 -0500


I don't quite see how you can choose to reorder at some tasks and not others. Is the implication that one of the choices dominates? If not, how do you rectify any discrepancies?

mpiforumbot commented 8 years ago

Originally by htor on 2009-03-29 16:39:50 -0500


Replying to bronis:

I don't quite see how you can choose to reorder at some tasks and not others. Is the implication that one of the choices dominates? If not, how do you rectify any discrepancies? Just imagine that if at least two processes specify reorder=true, then the graph mapping problem can be solved under the constraint that all processes with reorder=false are "pinned".

Another (maybe simpler) way to look at it is that the call defines two groups, one group that can be reordered, and another group that can't. The graph mapping problem can be solved for the processes that can be reordered (one needs to consider edges to the other group here).

This can be necessary if, for example, you run a quantum chemistry code that does fancy OpenGL visualization on rank 0, but you only have one visualization node in your cluster; thus, you need to pin process 0 to this node (reorder=false) while all other processes can be reordered under this constraint. This seems mostly useful in heterogeneous environments.
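
A sketch of the per-process reorder idea in this scenario; whether reorder may legally differ between processes is exactly what is being debated in this ticket, so this shows the proposed semantics rather than settled standard text:

```c
#include <mpi.h>

/* Proposed semantics under discussion: rank 0 is pinned to its node
   (reorder = 0) because only that node has the visualization hardware;
   all other ranks may be remapped (reorder = 1). */
void create_with_pinned_rank0(MPI_Comm comm_old, int n, int sources[],
                              int degrees[], int destinations[], int weights[],
                              MPI_Comm *comm_dist_graph)
{
    int rank;
    MPI_Comm_rank(comm_old, &rank);
    int reorder = (rank == 0) ? 0 : 1;

    MPI_Dist_graph_create(comm_old, n, sources, degrees, destinations,
                          weights, MPI_INFO_NULL, reorder, comm_dist_graph);
}
```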

Best, Torsten

mpiforumbot commented 8 years ago

Originally by bronis on 2009-03-29 20:39:02 -0500


Replying to htor:

Replying to bronis:

I don't quite see how you can choose to reorder at some tasks and not others. Is the implication that one of the choices dominates? If not, how do you rectify any discrepancies? Just imagine that if at least two processes specify reorder=true, then the graph mapping problem can be solved under the constraint that all processes with reorder=false are "pinned".

Another (maybe simpler) way to look at it is that the call defines two groups, one group that can be reordered, and another group that can't. The graph mapping problem can be solved for the processes that can be reordered (one needs to consider edges to the other group here).

This can be necessary if, for example, you run a quantum chemistry code that does fancy OpenGL visualization on rank 0, but you only have one visualization node in your cluster; thus, you need to pin process 0 to this node (reorder=false) while all other processes can be reordered under this constraint. This seems mostly useful in heterogeneous environments.

This is a very nice explanation; if we allow this functionality then it needs to be the basis of an advice to users. Part of it is probably also appropriate for an advice to implementors.

I suggest you draft these so they can be reviewed before the meeting.

mpiforumbot commented 8 years ago

Originally by traff on 2009-03-30 12:04:58 -0500


I have started an update on this ticket based on the recent discussion, and will finish tomorrow morning (European time).

Don't start your review now...

Jesper

mpiforumbot commented 8 years ago

Originally by traff on 2009-03-31 06:54:16 -0500


I'm still updating, don't review yet

Jesper

mpiforumbot commented 8 years ago

Originally by traff on 2009-03-31 09:00:22 -0500


Thanks a lot to Rolf (and Bronis) for the detailed review and comments. I've finished a revision and tried to take this into account.

1. I modified the proposal to allow reorder to be selected locally; processes with reorder = true may be remapped, others must stay. I added (based on Torsten's and Bronis' discussion) advice to users and implementors

2. I changed "target" to the more MPI fitting "destination", throughout

3. MPI_DISTRIBUTED is too generic, changed to more specific DIST_GRAPH

4. I say that MPI_UNWEIGHTED is a special value; thus there is no need to say that in C it is likely to be NULL or whatever (in fact, I think at this point it should be left to somewhere else to say what these values are, if that is necessary at all)

5. For C++ I introduced the derived class Distgraphcomm, and (hopefully) made the necessary updates throughout the standard (COMM_DUP and so forth)

6. Most controversially, I decided not to split the distributed creation function into two. The argument is that the saving in interface complexity is, after all, not that large, and it adds another, almost similar function to the standard. Moreover, the more restricted function that Rolf (and Rajeev) proposes will, I think, also require communication (which Rolf wants to save), namely to collect the "ingoing" edges; otherwise, we would have to require for this interface that each rank specifies all of its ingoing and outgoing edges, and I don't think this is reasonable (e.g., in one-sided communication, the origin processes typically have some information; the target processes do not). The idea of the new interface is also to relieve the user of collecting information he or she does not naturally have.

Jesper

mpiforumbot commented 8 years ago

Originally by traff on 2009-03-31 10:45:23 -0500


Added an important change in the general description. Updated change log entry.

Jesper

mpiforumbot commented 8 years ago

Originally by bronis on 2009-03-31 12:29:33 -0500


  1. Please change "For MPI_DIST_GRAPH_CREATE the input communication graph is distributed across the calling processes which therefore provide different values for the arguments specifying the graph." to "For MPI_DIST_GRAPH_CREATE the input communication graph is distributed across the calling processes. Therefore the processes provide different values for the arguments specifying the graph." The current sentence is long and hard to parse.
  2. Please change this sentence: "All processes must give the same value for the info argument, though." to "However, all processes must give the same value for the info argument.
  3. Please change this sentence: "The general graph constructor assumes that each process pass the full (global) communication graph to the call." to "The general graph constructor assumes that each process passes the full (global) communication graph to the call."
  4. Please change this sentence: "Typically, this could be the set of processes for which the process knows that it will eventually receive or get data, or the set of processes to which the process will send or put data, or some combination of such edges." to "Typically, this could be the set of processes from which the process will eventually receive or get data, or the set of processes to which the process will send or put data, or some combination of such edges." Don't anthropomorphize. Similarly change "The interface, however, provides full flexibility, and processes that knows that communication between other pairs of processes will take place can contribute this." to "The interface, however, provides full flexibility, and processes can indicate that communication will occur between other pairs of processes."
  5. Please change "hints on optimization and interpretation of weights" to "hints on optimization and interpretation of weights (handle)" -- add "(handle)"
  6. Please change "Both the sources and the destinations array" to "Both the sources and the destinations arrays" (add 's')
  7. Should this "The call creates a new communicator comm_graph of distributed topology type " be "The call creates a new communicator comm_graph of distributed graph topology type" (i.e., is "graph" needed)?
  8. Please change "(i.e. processes with no outgoing" to "(i.e., processes with no outgoing" (add comma)
  9. Please change this paragraph:

Advice to users. Imagine an application that does visualization or heavy I/O on rank 0, and that only this processor has the corresponding visualization or I/O capabilities. If this rank is to be part of a graph topology it is desirable to be able to specify that this rank will not be remapped. Another scenario is an application where some ranks have very heavy state information that is (too) expensive to migrate. By setting reorder = false for such ranks, it is guaranteed that these processes will preserve their rank in the distributed graph communicator. (End of advice to users.)

to:

Advice to users. Specifying reorder = false on only some nodes can support heterogeneous application to hardware mapping. For example, suppose an application does visualization or heavy I/O on rank 0, and that only this processor has the corresponding visualization or I/O capabilities. If this rank is part of a graph topology, it is useful to specify that this rank will not be remapped. Allowing remapping of other ranks can then exploit features of the interconnection network topology. Another scenario is an application where some ranks have large state information that would be (too) expensive to migrate. Setting reorder = false for these ranks guarantees that migration will not be necessary since these processes will preserve their rank in the distributed graph communicator. (End of advice to users.)

  10. Please change this paragraph:

Advice to implementors. The reordering optimization problem to be solved must take the extra constraints that some nodes must remain fixed into account. An easy way to do this would be to eliminate the fixed nodes and all their edges, and solve the optimization problem on the reduced graph of remapabale nodes. Better results may be possible by considering the whole problem, though. (End of advice to implementors.)

to:

Advice to implementors. The reordering optimization problem must account for the constraint that nodes with reorder = false must remain fixed. An easy way to account for this constraint is to eliminate the fixed nodes and all their edges, and solve the optimization problem on the reduced graph. However, better results may be possible by considering the entire graph. (End of advice to implementors.)

  11. Please change these sentences "An implementation need not explicitly construct the topology from its distributed parts. All processes can however easily construct the full topology from the distributed specification and use this in a call to MPI_GRAPH_CREATE to create the topology." to "An implementation does not explicitly need to construct the topology from its distributed parts. However, all processes can easily construct the full topology from the distributed specification and use this in a call to MPI_GRAPH_CREATE to create the topology."
  12. Before "Example 7.4", please add "In both cases above, the user could supply MPI_UNWEIGHTED instead of explicitly providing identical weights."
  13. Please change "The only requirement on the sequence of the values" to "The only requirement on the sequence of values" (delete "the").
mpiforumbot commented 8 years ago

Originally by traff on 2009-03-31 12:51:05 -0500


Hi Bronis,

thanks for the comments - I adopted all of them (with only very minimal changes), hope I didn't goof

Jesper

mpiforumbot commented 8 years ago

Originally by bronis on 2009-03-31 17:46:50 -0500


Looks good. Consider this one review.

mpiforumbot commented 8 years ago

Originally by traff on 2009-04-01 02:22:25 -0500


A few minor edits:

mpiforumbot commented 8 years ago

Originally by bronis on 2009-04-01 03:56:39 -0500


Changes OK.

mpiforumbot commented 8 years ago

Originally by smithbr on 2009-04-02 12:21:42 -0500


destinations[4] = P*((y+1)%Q)+(x+1)%P; weights[5]=1;
destinations[5] = P*((Q+y-1)%Q)+(x+1)%P; weights[6]=1;
destinations[6] = P*((y+1)%Q)+(P+x-1)%P; weights[7]=1;
destinations[7] = P*((Q+y-1)%Q)+(P+x-1)%P; weights[8]=2;

In the example, why is the weight 2 for the "backwards" diagonal? Is that a typo?

mpiforumbot commented 8 years ago

Originally by htor on 2009-04-02 13:08:19 -0500


Replying to smithbr:

In the example, why is the weight 2 for the "backwards" diagonal? Is that a typo? yes, that's indeed a typo - fixed!

Best, Torsten

mpiforumbot commented 8 years ago

Originally by bronis on 2009-04-02 13:27:11 -0500


Hmm. Sorry I missed that typo. The correction looks good. Re-reviewed...

mpiforumbot commented 8 years ago

Originally by traff on 2009-04-03 04:41:29 -0500


I added a clarifying sentence (after Distr_graph_create) that it is allowed for different processes to contribute the same (s,d) edge; multiple edges are in general allowed, and their interpretation is left to the implementation. I also clarified that all contributed edges are returned by the query calls.

We could consider defining standard info (key,value) pairs already now to tell the MPI implementation what to do with multiple edges, e.g., (edgeweight,sum), (edgeweight,max), (edgeweight,min), (edgeweight,pick_some_at_random), ...?
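
If such keys were ever defined, usage would presumably look like the following sketch; "edgeweight"/"sum" are only the hypothetical key and value suggested above, not info keys defined by the standard:

```c
#include <mpi.h>

/* Hypothetical: ask the implementation to sum the weights of multiple
   (source,destination) edges.  The key "edgeweight" and value "sum" are
   suggestions from this ticket only, not standardized info keys. */
MPI_Info make_multi_edge_info(void)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "edgeweight", "sum");
    return info;  /* pass as the info argument of the graph constructor,
                     then free it with MPI_Info_free after the call */
}
```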

Jesper

mpiforumbot commented 8 years ago

Originally by bronis on 2009-04-03 05:21:40 -0500


I'm using the diff so I will specify these one at a time if I have more than one comment:

Please change "It is likewise allowed that different processes specify (s,d) edges with the same s and d." to "Similarly, different processes may specify edges with the same source and destination." The extra formalism does not help but rather makes the text confusing.

mpiforumbot commented 8 years ago

Originally by bronis on 2009-04-03 05:22:52 -0500


Other than that one sentence the other changes are OK.

mpiforumbot commented 8 years ago

Originally by traff on 2009-04-03 05:45:18 -0500


Thanks, done