Closed mpiforumbot closed 8 years ago
Originally by bronis on 2009-04-03 06:16:21 -0500
Reviews OK, again.
Originally by smithbr on 2009-04-03 08:19:26 -0500
I didn't see anything else beyond that typo that hasn't already been addressed.
Originally by RolfRabenseifner on 2009-04-03 11:19:26 -0500
Thank you for ticket #147 .
Here a few minor bugs:
IN destinations target nodes...
"[[BR]]
should read[[BR]]
"IN destinations destination nodes...
"INTEGER COMM_OLD, N, SOURCES(*), DEGREES(*), TARGETS(*) WEIGHTS(*) INFO, COMM_GRAPH, IERROR
[[BR]]
should read (TARGETS->DESTINATIONS, and commas added)[[BR]]
INTEGER COMM_OLD, N, SOURCES(*), DEGREES(*), DESTINATIONS(*), WEIGHTS(*), INFO, COMM_GRAPH, IERROR
The sentence
"Isolated processes (i.e., processes with no outgoing or incoming edges, that is, processes that do not occur as source or destination node in the graph specification) are allowed."
should be moved from its current location (telling something about the new new communicator) to the end of the previous paragraph, i.e., after
"This allows a fully distributed specification of the communication graph."
Reason: With the current location of the sentence, it looks like that some nodes may be split off, i.e., having a comm-self communicator, which does not fit to the sentence before about the size of comm_graph.
Please write (otherwise the position is not absolutely clear):
Page 248, line 21, add (i.e., after MPI_GRAPH):
On the "OUT weighted ..." line, please add at the end
(logical)
Originally by RolfRabenseifner on 2009-04-03 11:31:41 -0500
The following item is still open:
Current text reads:
It is erroneous to supply MPI_UNWEIGHTED in one process and not in all other processes of comm_old.
but should read:
It is erroneous to supply MPI_UNWEIGHTED in one process and not in all other processes of comm_old. Note that MPI_UNWEIGHTED is not a special weight value; rather it is a special value for the total array argument. In C, one would expect it to be a NULL. In Fortran, MPI_UNWEIGHTED is an object like MPI_BOTTOM (not usable for initilaization or assignment). See Section 2.5.4.
Normally this type of text is added for all these special constants. You replied "fixed", but you lost the additional sentences.
Originally by htor on 2009-04-03 12:11:51 -0500
Replying to RolfRabenseifner:
Thank you for ticket #147 . please feel free to edit it!
Here a few minor bugs:
- "
IN destinations target nodes...
"[[BR]] should read[[BR]] "IN destinations destination nodes...
" fixed - thanksINTEGER COMM_OLD, N, SOURCES(*), DEGREES(*), TARGETS(*) WEIGHTS(*) INFO, COMM_GRAPH, IERROR
[[BR]] should read (TARGETS->DESTINATIONS, and commas added)[[BR]]INTEGER COMM_OLD, N, SOURCES(*), DEGREES(*), DESTINATIONS(*), WEIGHTS(*), INFO, COMM_GRAPH, IERROR
fixedThe sentence
"Isolated processes (i.e., processes with no outgoing or incoming edges, that is, processes that do not occur as source or destination node in the graph specification) are allowed."
should be moved from its current location (telling something about the new new communicator) to the end of the previous paragraph, i.e., after
"This allows a fully distributed specification of the communication graph."
Reason: With the current location of the sentence, it looks like that some nodes may be split off, i.e., having a comm-self communicator, which does not fit to the sentence before about the size of comm_graph. yes, that's true - fixed
Please write (otherwise the position is not absolutely clear):
Page 248, line 21, add (i.e., after MPI_GRAPH): fixed
On the "OUT weighted ..." line, please add at the end
(logical) fixed
To your other comment: I re-added the text as it is consistent with the MPI-2.1 standard.
Thanks, Torsten
Originally by RolfRabenseifner on 2009-04-03 12:28:42 -0500
I fixed a small bug: MPI_DIST_GRAPH_NEIGHBORS(..., MAXOUTDEGREE, OUTDEGREE, should read MPI_DIST_GRAPH_NEIGHBORS(..., MAXOUTDEGREE,
Technical review: All looks fine now.
My opinion: Additional interface proposal in ticket #147 should be also available. (I'm currently editing it).
Originally by RolfRabenseifner on 2009-04-03 12:34:16 -0500
Current text: void MPI::Distgraphcomm::Get_dist_neighbors(int rank, int maxindegree, should read void MPI::Distgraphcomm::Get_dist_neighbors(int maxindegree,
I modified this.
Originally by bronis on 2009-04-03 12:38:26 -0500
I made a very small change -- it had read:
a non-negative number of destination nodes are specified
but the verb should agree with "number" so it should read:
a non-negative number of destination nodes is specified
so I changed "are" to "is"
Originally by htor on 2009-04-03 16:20:18 -0500
fixed minor spelling mistakes.
Originally by bronis on 2009-04-03 16:33:11 -0500
Are we there yet? I hope so. These changes review fine.
Originally by gropp on 2009-04-05 16:38:07 -0500
Reviewed and fine.
Originally by rsthakur on 2009-04-05 19:24:52 -0500
IN sources array containing the n source nodes for which this
process contributes edges (array of non-negative integers)
The calling process doesn't "contribute" edges to the source nodes. The calling process may not even be connected to any of the source nodes. So "specifies edges" may be a better wording than "contributes edges".
IN reorder the process may be reordered (true) or not (false) (logical)
Instead of "the process may be reordered" say "the calling process's rank may be reordered"
Allowing some processes to say reorder and some to say don't reorder makes it complicated for the implementation. The implementation may need to do an all-to-all to figure out what's going on. (Therefore, the current simple example implementation on top of existing graph topology functions doesn't illustrate performance benefits or pitfalls.)
Advice to implementors. The reordering optimization problem must account for the constraint that processes with reorder = false must remain fixed. An easy way to do this is to eliminate the fixed nodes and all their edges, and solve the optimization problem on the reduced graph.
I don't think it is easy. To know which nodes have passed reorder=false you have to communicate. And the edges for those nodes may not be specified by those nodes. They may be specified by a completely different set of nodes.
These calls are local. The number of edges into and out of the process returned by MPI_Dist_graph_neighbors_count are the total number of such edges given in the call to MPI_Dist_graph_create.
I would add at the end: "(potentially by processes other than the calling process)."
It would be good to specify whether or not the old graph functions (MPI_Graph_get and others listed under deprecate above) can still be used on a graph created with MPI_Dist_graph_create. One option is to say they cannot be used. Because to implement MPI_Graph_get, the implementation would have to collect the entire graph onto each process in MPI_Dist_graph_create itself, defeating the purpose.
Originally by htor on 2009-04-05 19:42:13 -0500
I would say our review technique has serious convergence issues :).
Replying to rsthakur:
IN sources array containing the n source nodes for which this process contributes edges (array of non-negative integers)
The calling process doesn't "contribute" edges to the source nodes. The calling process may not even be connected to any of the source nodes. So "specifies edges" may be a better wording than "contributes edges". I agree - fixed
IN reorder the process may be reordered (true) or not (false) (logical)
Instead of "the process may be reordered" say "the calling process's rank may be reordered"
Allowing some processes to say reorder and some to say don't reorder makes it complicated for the implementation. The implementation may need to do an all-to-all to figure out what's going on. yes, I'm in favor of specifying an Info argument like "all_reorder" or such for this case. I added it to my slides for tomorrow!
(Therefore, the current simple example implementation on top of existing graph topology functions doesn't illustrate performance benefits or pitfalls.) Ha! That would lift the barrier for a reference implementation to a whole new level!
Advice to implementors. The reordering optimization problem must account for the constraint that processes with reorder = false must remain fixed. An easy way to do this is to eliminate the fixed nodes and all their edges, and solve the optimization problem on the reduced graph.
I don't think it is easy. To know which nodes have passed reorder=false you have to communicate. And the edges for those nodes may not be specified by those nodes. They may be specified by a completely different set of nodes. yes, distributed graphs are not simple.
These calls are local. The number of edges into and out of the process returned by MPI_Dist_graph_neighbors_count are the total number of such edges given in the call to MPI_Dist_graph_create.
I would add at the end: "(potentially by processes other than the calling process)." added
It would be good to specify whether or not the old graph functions (MPI_Graph_get and others listed under deprecate above) can still be used on a graph created with MPI_Dist_graph_create. One option is to say they cannot be used. Because to implement MPI_Graph_get, the implementation would have to collect the entire graph onto each process in MPI_Dist_graph_create itself, defeating the purpose. they are not supported (i.e., MPI-2.2 will only allow objects of type MPI_GRAPH in them, everything else will throw the usual topology error). Do we need to specify this explicitly?
Thanks, Torsten
Originally by rsthakur on 2009-04-05 19:55:50 -0500
Replying to htor:
Advice to implementors. The reordering optimization problem must account for the constraint that processes with reorder = false must remain fixed. An easy way to do this is to eliminate the fixed nodes and all their edges, and solve the optimization problem on the reduced graph.
I don't think it is easy. To know which nodes have passed reorder=false you have to communicate. And the edges for those nodes may not be specified by those nodes. They may be specified by a completely different set of nodes. yes, distributed graphs are not simple.
Therefore, the text shouldn't say "An easy way to do this is" :-).
It would be good to specify whether or not the old graph functions (MPI_Graph_get and others listed under deprecate above) can still be used on a graph created with MPI_Dist_graph_create. One option is to say they cannot be used. Because to implement MPI_Graph_get, the implementation would have to collect the entire graph onto each process in MPI_Dist_graph_create itself, defeating the purpose. they are not supported (i.e., MPI-2.2 will only allow objects of type MPI_GRAPH in them, everything else will throw the usual topology error). Do we need to specify this explicitly?
Rolf's comment above (first bullet in https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/33#comment:46) indicates he expects those functions to be supported. Hence I asked.
Originally by htor on 2009-04-05 20:01:30 -0500
Replying to rsthakur:
Replying to htor:
Advice to implementors. The reordering optimization problem must account for the constraint that processes with reorder = false must remain fixed. An easy way to do this is to eliminate the fixed nodes and all their edges, and solve the optimization problem on the reduced graph.
I don't think it is easy. To know which nodes have passed reorder=false you have to communicate. And the edges for those nodes may not be specified by those nodes. They may be specified by a completely different set of nodes. yes, distributed graphs are not simple.
Therefore, the text shouldn't say "An easy way to do this is" :-). yeah, what we meant with this is that the two partitions are easy to figure out (on alltoall or allreduce with P elements). But you're right - the wording is suboptimal (we should not rate any method in the standard) - I changed it.
It would be good to specify whether or not the old graph functions (MPI_Graph_get and others listed under deprecate above) can still be used on a graph created with MPI_Dist_graph_create. One option is to say they cannot be used. Because to implement MPI_Graph_get, the implementation would have to collect the entire graph onto each process in MPI_Dist_graph_create itself, defeating the purpose. they are not supported (i.e., MPI-2.2 will only allow objects of type MPI_GRAPH in them, everything else will throw the usual topology error). Do we need to specify this explicitly?
Rolf's comment above (first bullet in https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/33#comment:46) indicates he expects those functions to be supported. Hence I asked. hmm, ok - we will talk about this tomorrow (unless you have a specific proposal where to add which text :).
Thanks, Torsten
Originally by rsthakur on 2009-04-05 20:08:26 -0500
I am ok with saying the old graph functions cannot be used with MPI_DIST_GRAPH topologies. But it can be added after tomorrow's discussion.
Originally by bronis on 2009-04-05 20:36:33 -0500
Recent changes reviewed OK.
Originally by rsthakur on 2009-04-05 22:55:51 -0500
- There is a major problem with this proposal. Solution is provided later on.
- The new interface causes already communication in MPI_DIST_GRAPH_CREATE to fulfil the non-collective functionality of MPI_DIST_GRAPH_NEIGHBORS_COUNT and MPI_DIST_GRAPH_NEIGHBORS. yes, at least a single allreduce is needed
Isn't it more like an all-to-all? Each process needs to talk to every other process to find out whom it is connected to. If you do an allreduce, it would be over a very large vector of size equal to the total number of edges in the graph.
Originally by traff on 2009-04-06 01:54:27 -0500
Replying to RolfRabenseifner:
The following item is still open:
Current text reads:
It is erroneous to supply MPI_UNWEIGHTED in one process and not in all other processes of comm_old.
but should read:
It is erroneous to supply MPI_UNWEIGHTED in one process and not in all other processes of comm_old. Note that MPI_UNWEIGHTED is not a special weight value; rather it is a special value for the total array argument. In C, one would expect it to be a NULL. In Fortran, MPI_UNWEIGHTED is an object like MPI_BOTTOM (not usable for initilaization or assignment). See Section 2.5.4.
Normally this type of text is added for all these special constants. You replied "fixed", but you lost the additional sentences.
Rolf, I replied that I found the rest superfluous - and I still do: I don't think this is the place to say how MPI_UNWEIGHTED may be implemented. It's already said that MPI_UNWEIGHTED is a special value and in my opinion this suffices
Jesper
Originally by traff on 2009-04-06 02:38:10 -0500
Replying to rsthakur:
I am ok with saying the old graph functions cannot be used with MPI_DIST_GRAPH topologies. But it can be added after tomorrow's discussion.
It could be added, but is not strictly necessary: For these functions it is alreay in the prototype said that the input communicator is "with graph topology". But maybe the word "general" should be dropped on Page 251, line 35. (or add the sentence here: "These functions are not applicable to distributed graph topologies.")
Other than that (and that I'd like to drop the overspecification of MPI_UNWEIGHTED), I think I'm fine with the proposal as it is now.
Jesper
Originally by chsi on 2009-04-06 07:28:50 -0500
Hi all,
I'm sorry that this idea pops up so late. Nevertheless, I'd propose to replace the logical 'reorder' argument (e.g., 'bool reorder' in the C++ binding) with a numerical 'color' argument (e.g., 'int color').
Reason: The {true,false} possibilities of 'reorder' are too limited.
More details:
One "Advice to users" already takes about heterogeneous applications but gives an example with only ONE special rank (rank 0 as I/O). Now imagine you have several nodes with this "virtualization capabilities" or even further special nodes, but don't want to restrict the ordering within these subgroups. The 'color' argument would be almost identical to the argument in MPI_Comm_split(), i.e., you partition the communicator into disjoint subgroups, one for each value of color. So each subgroup contains all those processes that supplied the same color value. Reordering is now only allowed within each subgroup. This approach is much more flexible (e.g., all virtualization nodes provide the same color in the example) but can easily "emulate" the old semantics (i.e., the 'bool reorder') as follows: All participants with 'reorder=false' supply their own rank number as 'color' argument and all other processes (i.e., with 'reorder=true' provide the size of the communicator as 'color' value.
Christian
Originally by htor on 2009-04-06 18:08:05 -0500
This is a merger of #147 and #33 as requested by Rolf, please don't review yet!
Originally by htor on 2009-04-06 18:33:21 -0500
First version of the merge with both changes after the April'09 straw-vote. Up for review (use the diff feature to look at the differences to #147 which has been merged without changes in the previous version of this ticket)!
Originally by bronis on 2009-04-06 18:43:11 -0500
Looks good.
Originally by htor on 2009-04-06 18:53:36 -0500
(minor modifications - added function interfaces where necessary)
Originally by htor on 2009-04-06 19:25:15 -0500
remove duplicated function definitions.
Originally by rsthakur on 2009-04-06 21:57:39 -0500
In the 2nd para of defn of the adjacent version, it should be "The call to MPI_DIST_GRAPH_CREATE_ADJACENT is collective." (instead of MPI_DIST_GRAPH_CREATE)
The whole paragraph beginning with "Weights are specified as non-negative integers" is repeated in the two routines, whereas for reorder and info, it says see the next routine. You could keep even the weights description in one place and add it to the see the next routine list.
process "contributes" edges has crept back in in the defn of MPI_Dist_graph_create in both places. you could change it to "specifies" edges.
Since there are now two functions before "(potentially by processes other than the calling process)", I would change it to "(potentially by processes other than the calling process in the case of MPI_DIST_GRAPH_CREATE)" just to be clear.
Otherwise looks good.
Originally by htor on 2009-04-06 22:18:46 -0500
Replying to rsthakur:
In the 2nd para of defn of the adjacent version, it should be "The call to MPI_DIST_GRAPH_CREATE_ADJACENT is collective." (instead of MPI_DIST_GRAPH_CREATE) fixed
The whole paragraph beginning with "Weights are specified as non-negative integers" is repeated in the two routines, whereas for reorder and info, it says see the next routine. You could keep even the weights description in one place and add it to the see the next routine list. it's actually slightly different in both cases.
process "contributes" edges has crept back in in the defn of MPI_Dist_graph_create in both places. you could change it to "specifies" edges. fixed :)
Since there are now two functions before "(potentially by processes other than the calling process)", I would change it to "(potentially by processes other than the calling process in the case of MPI_DIST_GRAPH_CREATE)" just to be clear. yes, fixed
Rolf, Hubert (thanks a lot!) and me did a thorough review of the ticket tonight. We applied some more minor changes (mostly constants in all the constant tables etc.). Otherwise, it looks good. Btw., please deprecate the C++ bindings asap!
Please take a quick look at the changes (use the diff at the ticket).
Thanks & all the Best, Torsten
Originally by jsquyres on 2009-04-06 22:27:58 -0500
I do believe it's a bug in Trac that there was an un-terminated "superscript" much earlier in this ticket that was wonking the font. Let's try to terminate it here: ^foo.
Originally by jsquyres on 2009-04-06 22:30:19 -0500
Maybe I should have tried an unterminated ,,subscript.
Originally by jsquyres on 2009-04-06 22:31:16 -0500
Maybe one more unterminated subscript to match the 2 unterminated ,,superscripts?
Let's see...
Originally by htor on 2009-04-06 22:35:13 -0500
some more line breaks in the code sections to make it printable for Rolf :) (no real other changes - stupid trac)
Originally by RolfRabenseifner on 2009-04-06 22:43:51 -0500
looks fine.
Originally by traff on 2009-04-07 02:19:59 -0500
Reviewed, ok, I hope (some significant changes has happened, not happy with all of them, but I appreciate that the ticket has been so thoroughly discussed)
Jesper
Originally by jsquyres on 2009-04-07 07:36:47 -0500
Sorry for jumping in so late; it was impossible to keep up with all the changes on this ticket.
New section 7.5.4 has this sentence:
MPI_DIST_GRAPH_CREATE_ADJACENT creates a distributed graph communicator with each process specifying all incoming and outgoing (adjacent) edges in the logical communication graph and thus requires minimal communication during creation.
Why is "(adjacent)" in parentheses? Is it required that the edges be connected to the node representing this process in the graph? If so, then be explicit.
#!c
void MPI::Distgraphcomm::Get_dist_neighbors_count(int rank, int indegree[],
int outdegree[], bool *weighted) const
should be (change bool * to bool &):
#!c
void MPI::Distgraphcomm::Get_dist_neighbors_count(int rank, int indegree[],
int outdegree[], bool &weighted) const
Originally by bronis on 2009-04-07 08:06:44 -0500
Replying to jsquyres:
Sorry for jumping in so late; it was impossible to keep up with all the changes on this ticket.
[snip]
Good comments Jeff even if they continue the convergence problem for this ticket. I would assert that your suggestions should not reset reviews on this ticket. Anyway, I disagree with the following:
- Missing space: "The number of edges in
to and out of the process returned by ..."
The word is "into" -- see:
Originally by htor on 2009-04-07 08:33:57 -0500
Replying to jsquyres:
Sorry for jumping in so late; it was impossible to keep up with all the changes on this ticket.
- p243:8 sentence is missing a comma: "... For MPI_DIST_GRAPH_CREATE_ADJACENT and MPI_DIST_GRAPH_CREATE, the input communication graph is distributed..."
New section 7.5.4 has this sentence: fixed
MPI_DIST_GRAPH_CREATE_ADJACENT creates a distributed graph communicator with each process specifying all incoming and outgoing (adjacent) edges in the logical communication graph and thus requires minimal communication during creation.
Why is "(adjacent)" in parentheses? Is it required that the edges be connected to the node representing this process in the graph? If so, then be explicit. "all incoming and outgoing edges" \equiv "all adjacent edges" the first term is for CS people, the second for mathematicians but they say the same thing.
- New section 7.5.4 -- this is one gigantor sentence: To provide better possibilities for optimization by the MPI library, the distributed graph constructors permit weighted communication edges, and take an info argument that can further influence process reordering or other optimizations performed by the MPI library, for instance by hinting on how the edge weights are to be interpreted, on the quality of the reordering, and/or the time permitted for the MPI library to process the graph.[[BR]] I suggest breaking it up thusly:[[BR]] To provide better possibilities for optimization by the MPI library, the distributed graph constructors permit weighted communication edges
,and take an info argument that can further influence process reordering or other optimizations performed by the MPI library. ~~for instance by hinting ~~ For example, hints can be provided on how the edge weights are to be interpreted,onthe quality of the reordering, and/or the time permitted for the MPI library to process the graph. fixed- Do isolated processes pass NULL arrays / 0 values for the length? yes, of course. Like a send with a zero count, i.e., input arrays are not significant.
- In new section 7.5.4: "The size of comm_dist_graph is identical to the size of comm_old." That sounds weird -- it almost sounds like we're assigning the size of the data structure. Instead, should you say "The number of processes in comm_dist_graph is identical to the number of processes in comm_old."? Same for both new graph constructors. changed
- Missing word and extra comma: "However, the exact meaning of edge weights is not specified by the MPI standard
,and is left to the implementation." Same for both new graph constructors. changed- This statement is not accurate because there is no C++ equivalent: "It is erroneous to supply MPI_UNWEIGHTED for some but not all processes' sourceweights and destweights arrays in comm_old." Perhaps the following would be better "Either zero or all processes must specify no sourceweights and destweights; it is erroneous for only some processes to specify no sourceweights and destweights." fixed
- Typos: "Info arguments can be used to guide the mapping
,; possible optionsare for exampleinclude minimizing the maximum number of edges between processes on different SMP nodes(for SMP systems), or minimizing the sum of all such edges." changed- Why are there 2 ATI's in a row? They should be merged into 1 ATI, even if that one ATI addresses multiple issues. fixed
- "However, all processes can
easilyconstruct the full topology from the distributed..." ok- "In both cases above, the
userapplication could supply MPI_UNWEIGHTED instead of explicitly providing identical weights. " fixed- C++ binding mistake:
#!c void MPI::Distgraphcomm::Get_dist_neighbors_count(int rank, int indegree[], int outdegree[], bool *weighted) const
should be (change bool * to bool &):
#!c void MPI::Distgraphcomm::Get_dist_neighbors_count(int rank, int indegree[], int outdegree[], bool &weighted) const
please deprecate them - they're inconsistent and useless
- Missing space: "The number of edges in
to and out of the process returned by ..." see Bronis' comment - Comma should be a period: "Communication is required at the collective MPI_DIST_GRAPH_CREATE call in order to compute the neighbor lists for each process from the distributed graph specification
,. " fixed- Missed addition: Change p456:1-2: There are
fivesix different types of communicators: MPI::Comm, MPI::Intercomm, MPI::Intracomm, MPI::Cartcomm,andMPI::Graphcomm, and MPI::Distgraphcomm." added - good catch- There's a "Replace..." with no text specified to replace (...actually, it looks like some text got cut-n-pasted out of order in the proposal -- the following text is better because there is a missing comma after "Graphcomm" in the above text): Page 457, line 34, replace:[[BR]] // Cartcomm and Graphcomm are similarly defined[[BR]] with[[BR]] // Cartcomm, Graphcomm, and Distgraphcomm are similarly defined fixed!
- The old graph interface should definitely be deprecated so that it can be removed in 3.0. yes, but that's not my decision :)
Updated proposal - Thanks, Torsten
Originally by htor on 2009-04-07 08:52:03 -0500
Attachment added: virtual_graph_adjacent.c
(0.7 KiB)
adjacency implementation
Originally by moody20 on 2009-04-07 09:02:32 -0500
This statement sounds ambiguous -- it sounds like each process has to specify all edges in the graph:
"MPI_DIST_GRAPH_CREATE_ADJACENT creates a distributed graph communicator with each process specifying all incoming and outgoing ..."
maybe just insert "of its" in there?
"... specifying all of its incoming and outgoing ..."
Originally by htor on 2009-04-07 09:12:00 -0500
added "of its" - review counter not reset!
Originally by hubertritzdorf on 2009-04-07 09:13:09 -0500
Reviewed. OK for me
Originally by htor on 2009-04-07 09:43:05 -0500
changed to reviewed after multiple reviews (counter was not reset by small changes as proposed by Bronis).
Originally by moody20 on 2009-04-07 10:01:30 -0500
I know we've changed "legal" to "valid" in other places... we should do that here I suppose, as well.
"An MPI implementation is not obliged to follow specific hints, and it is legal for an MPI implementation not to do any reordering."
Originally by jsquyres on 2009-04-07 10:04:03 -0500
I would highly encourage cutting-n-pasting the "deprecate this" portions of this ticket as a new ticket and getting it reviewed today. It's very easy to do; 9/10 of the work is already done.
I would do it myself, but I have a dozen other tickets I'm trying to shepherd through...
Originally by moody20 on 2009-04-07 10:21:22 -0500
In the adjacent case, it's clear that MPI_UNWEIGHTED can be used for both weight arrays, but it's not clear that it must be used for both arrays (can I use MPI_UNWEIGHTED for just the sources). I think this detail can be cleared up with some simple rephrasing.
Current text:
"In C or Fortran, an application can supply the special value MPI_UNWEIGHTED for both weight arrays to indicate that all edges have the same (effectively no) weight. In C++, this constant does not exist and the weight arguments may be omitted from the argument list. It is erroneous to supply MPI_UNWEIGHTED, or in C++ omit the weight arrays, for some but not all processes' sourceweights and destweights arrays in comm_old."
New suggestion:
"In C or Fortran, an application can indicate that all edges have the same (effectively no) weight by specifying the special value MPI_UNWEIGHTED for both weight arrays at all processes. In C++, this can be accomplished by omitting both weight arrays from the call at all processes. It is erroneous to supply MPI_UNWEIGHTED, or in C++ to omit the weight arrays, at some but not all processes in comm_old."
Originally by rsthakur on 2009-04-07 10:37:17 -0500
Just noticed a couple of typos. In the new Example 7.3, MPI_GRAPH_CREATE and MPI_GRAPH_CREATE_ADJACENT should be MPI_DIST_GRAPH_CREATE and MPI_DIST_GRAPH_CREATE_ADJACENT
Also 3rd line below defn of MPI_Dist_graph_neighbors, MPI_GRAPH_CREATE_ADJACENT should be MPI_DIST_GRAPH_CREATE_ADJACENT.
Maybe it is worth doing a search to make sure there are no more.
Originally by traff on 2009-04-07 11:26:53 -0500
Typos that Rajeev's detected should be fixed; I'm not doing it now for fear of conflict. Otherwise, ok with me (hard to read this text anymore)
Originally by htor on 2009-06-01 17:41:47 -0500
Fixed the typos mentioned by Rajeev (see diff)! Thanks and sorry for the delay.
Originally by traff on 2009-06-01 17:57:22 -0500
Missed entry for change log: MPI_DIST_GRAPH_CREATE_ADJACENT
Originally by RolfRabenseifner on 2009-06-07 16:18:38 -0500
All Change-Log entries must be of the form:
Location.NEWLINE[[BR]]
[[BR]]
Text.
Therefore, I propose the following formal change (with same content as yours)
Section 7.5.3 on page 247.[[BR]]
[[BR]]
New functions for a scalable distributed graph topology interface have been added.
In this section, the functions MPI_DIST_GRAPH_CREATE_ADJACENT and MPI_DIST_GRAPH_CREATE, the constant MPI_UNWEIGHTED, and the derived C++ class Distgraphcomm were added.
Section 7.5.4 on page 252.[[BR]]
[[BR]]
For the scalable distributed graph topology interface, the functions MPI_DIST_NEIGHBORS_COUNT and MPI_DIST_NEIGHBORS and the constant MPI_DIST_GRAPH were added.
Originally by traff on 2008-10-10 08:27:53 -0500
Fix Scalability Issues in Graph Topology InterfaceAuthor: Torsten Hoefler, Jesper Larsson Traeff with comments from Rolf
Rabenseifner, Hubert Ritzdorf, and Rajeev Thakur
Status
Sep 8 2008 Straw votes
For a scalable description
|| Yes || Maybe || No for 2.2 || Abstain || || 7 || 20 ||0 || 3 ||
For adding info to weight of edges
|| Yes || Maybe || No for 2.2 || Abstain || || 4 || 22 || 1 || 6 ||
October 2008 Straw Vote
|| Yes || Maybe || No for 2.2 || Abstain || || 23 || 0 || 0 || 5 ||
April 09 Straw Votes
Fully flexible version vs. only adjacent version.
|| only flexible || only adjacent || both || || 9 || 2 || 8 ||
Using MPI_Info arguments for any assertions (that change the semantics) is not in the spirit of MPI, thus, we can't use Info for neither asserting that each process specifies the whole graph or all processes specified reorder or not.
Reorder collective or not?
|| collective || not collective || || 5 || 6 ||
It's close, but not collective is also more consistent with the current standard. We can change this in MPI-3 easily (while keeping downward compatibility).
Description
The graph interface in MPI-1 has some serious (and unnecessary) scalability issues that prohibit efficient usage on large-scale machines.
The new scalable interface is essentially a single function, that is complemented by two inquiry functions. The concrete proposal introduces two new constants, and entails a few other changes related to the topology functionality.
See additional proposal/discussion in Ticket #147.
History
This interface was introduced in MPI-1 designed based on some older parallel library (Alexander Supalov can provide details). However, the interface was never changed. Right now, each process has to know the whole topology during graph creation and every process can query the neighbors of every other process in the communicator with a local operation. This makes it necessary to store the whole topology at every process locally which might be a prohibitive waste of memory resources.
The proposal is to introduce new functionality for specifying topologies by distributed graphs, and deprecate the current MPI 2.1 graph topology functionality.
Proposed Solution
Page 15, line 5, add:
MPI_UNWEIGHTED
Page 195, line 27, add:
MPI::Distgraphcomm MPI::Distgraphcomm::Dup() const
Page 195, line 35, add:
MPI::Distgraphcomm MPI::Distgraphcomm::Clone() const
Page 242:15-22 - remove sentence
"Edges in the communication graph are not weighted, so that processes are either simply connected or not connected at all."
and (outdated) rationale.
Page 243, line 3: Replace
"The functions
MPI_GRAPH_CREATE
andMPI_CART_CREATE
are used"by
"The functions
MPI_GRAPH_CREATE
,MPI_DIST_GRAPH_CREATE_ADJACENT
,MPI_DIST_GRAPH_CREATE
andMPI_CART_CREATE
are used"Page 243, line 8, replace:
"All input arguments must have identical values on all processes of the group of comm_old. A new communicator ..."
by:
"For
MPI_GRAPH_CREATE
andMPI_CART_CREATE
, all input arguments must have identical values on all processes of the group of comm_old. ForMPI_DIST_GRAPH_CREATE_ADJACENT
andMPI_DIST_GRAPH_CREATE
the input communication graph is distributed across the calling processes. Therefore the processes provide different values for the arguments specifying the graph. However, all processes must give the same value for reorder and the info argument. In all cases, a new communicator ..."Page 243, line 33 (before last sentence), add:
For distributed graphs, the functions
MPI_DIST_NEIGHBORS_COUNT
andMPI_DIST_NEIGHBORS
can be used to extract the neighbors of the calling node.Page 247, after line 48, add new section:
7.5.4 Distributed (Graph) Constructor
The general graph constructor assumes that each process passes the full (global) communication graph to the call. This limits the scalability of this constructor. With the distributed graph interface, the communication graph is specified in a fully distributed fashion. Each process specifies only the part of the communication graph of which it is aware. Typically, this could be the set of processes from which the process will eventually receive or get data, or the set of processes to which the process will send or put data, or some combination of such edges. Two different interfaces can be used to create a distributed graph topology.
MPI_DIST_GRAPH_CREATE_ADJACENT
creates a distributed graph communicator with each process specifying all of its incoming and outgoing (adjacent) edges in the logical communication graph and thus requires minimal communication during creation.MPI_DIST_GRAPH_CREATE
provides full flexibility, and processes can indicate that communication will occur between other pairs of processes.To provide better possibilities for optimization by the MPI library, the distributed graph constructors permit weighted communication edges and take an info argument that can further influence process reordering or other optimizations performed by the MPI library. For example, hints can be provided on how edge weights are to be interpreted, the quality of the reordering, and/or the time permitted for the MPI library to process the graph.
MPI_DIST_GRAPH_CREATE_ADJACENT
returns a handle to a new communicator to which the distributed graph topology information is attached.Each process passes all information about the edges to its neighbors in the virtual distributed graph topology. The calling processes must ensure that each edge of the graph is described in the source and in the destination process with the same weights. If there are multiple edges for a given (source,dest) pair, then the sequence of the weights of these edges does not matter. The complete communication topology is the combination of all edges shown in the
sources
arrays of all processes incomm_old
, which must be identical to the combination of all edges shown in thedestinations
arrays. Source and destination ranks must be process ranks ofcomm_old
. This allows a fully distributed specification of the communication graph. Isolated processes (i.e., processes with no outgoing or incoming edges, that is, processes that have specifiedindegree
andoutdegree
as zero and that thus do not occur as source or destination rank in the graph specification) are allowed.The call creates a new communicator
comm_dist_graph
of distributed graph topology type to which topology information has been attached. The number of processes incomm_dist_graph
is identical to the number of processes incomm_old
. The call toMPI_DIST_GRAPH_CREATE_ADJACENT
is collective.Weights are specified as non-negative integers and can be used to influence the process remapping strategy and other internal MPI optimizations. For instance, approximate count arguments of later communication calls along specific edges could be used as their edge weights. Multiplicity of edges can likewise indicate more intense communication between pairs of processes. However, the exact meaning of edge weights is not specified by the MPI standard and is left to the implementation. In C or Fortran, an application can supply the special value
MPI_UNWEIGHTED
for both weight arrays to indicate that all edges have the same (effectively no) weight. In C++, this constant does not exist and the weight arguments may be omitted from the argument list. It is erroneous to supplyMPI_UNWEIGHTED
, or in C++ omit the weight arrays, for some but not all processes'sourceweights
anddestweights
arrays incomm_old
. Note thatMPI_UNWEIGHTED
is not a special weight value; rather it is a special value for the total array argument. In C, one would expect it to be aNULL
. In Fortran,MPI_UNWEIGHTED
is an object likeMPI_BOTTOM
(not usable for initialization or assignment). See Section 2.5.4.The meaning of the
info
andreorder
arguments is defined in the description of the following routine.MPI_DIST_GRAPH_CREATE
returns a handle to a new communicator to which the distributed graph topology information is attached. Concretely, each process calls the constructor with a set of directed(source,destination)
communication edges as described below. Every process passes an array ofn
source nodes in thesources
array. For each source node, a non-negative number of destination nodes is specified in thedegrees
array. The destination nodes are stored in the corresponding consecutive segment of thedestinations
array. More precisely if thei
-th node insources
iss
, this specifiesdegrees[i]
edges(s,d)
withd
of thej
-th such edge stored indestinations[degrees[0]+...+degrees[i-1]+j]
. The weight of this edge is stored inweights[degrees[0]+...+degrees[i-1]+j]
. Both thesources
and thedestinations
arrays may contain the same node more than once, and the order in which nodes are listed as destinations or sources is not significant. Similarly, different processes may specify edges with the same source and destination nodes. Source and destination nodes must be process ranks ofcomm_old
. Different processes may specify different numbers of source and destination nodes, as well as different source to destination edges. This allows a fully distributed specification of the communication graph. Isolated processes (i.e., processes with no outgoing or incoming edges, that is, processes that do not occur as source or destination node in the graph specification) are allowed.The call creates a new communicator
comm_dist_graph
of distributed graph topology type to which topology information has been attached. The number of processes incomm_dist_graph
is identical to the number of processes incomm_old
. The call toMPI_Dist_graph_create
is collective.If
reorder
=false
, all processes will have the same rank incomm_dist_graph
as incomm_old
. Ifreorder
=true
then the MPI library is free to remap to other processes (ofcomm_old
) in order to improve communication on the edges of the communication graph. The weight associated with each edge is a hint to the MPI library about the amount or intensity of communication on that edge, and may be used to compute a "best" reordering.Weights are specified as non-negative integers and can be used to influence the process remapping strategy and other internal MPI optimizations. For instance, approximate count arguments of later communication calls along specific edges could be used as their edge weights. Multiplicity of edges can likewise indicate more intense communication between pairs of processes. However, the exact meaning of edge weights is not specified by the MPI standard and is left to the implementation. In C or Fortran, an application can supply the special value
MPI_UNWEIGHTED
for the weight array to indicate that all edges have the same (effectively no) weight. In C++, this constant does not exist and the weights argument may be omitted from the argument list. It is erroneous to supplyMPI_UNWEIGHTED
, or in C++ omit the weight arrays, for some but not all processes ofcomm_old
. Note thatMPI_UNWEIGHTED
is not a special weight value; rather it is a special value for the total array argument. In C, one would expect it to be aNULL
. In Fortran,MPI_UNWEIGHTED
is an object likeMPI_BOTTOM
(not usable for initialization or assignment). See Section 2.5.4.The meaning of the
weights
argument can be influenced by theinfo
argument. Info arguments can be used to guide the mapping; possible options include minimizing the maximum number of edges between processes on different SMP nodes, or minimizing the sum of all such edges. An MPI implementation is not obliged to follow specific hints, and it is legal for an MPI implementation not to do any reordering. An MPI implementation may specify moreinfo
key-value pairs. All processes must specify the same set of key-valueinfo
pairs.-Advice to implementors.* MPI implementations must document any additionally supported key-value
info
pairs.MPI_INFO_NULL
is always valid, and may indicate the default creation of the distributed graph topology to the MPI library.An implementation does not explicitly need to construct the topology from its distributed parts. However, all processes can construct the full topology from the distributed specification and use this in a call to
MPI_GRAPH_CREATE
to create the topology. This may serve as a reference implementation of the functionality, and may be acceptable for small communicators. However, a scalable high-quality implementation would save the topology graph in a distributed way. (End of advice to implementors.)Example 7.3
As for Example 7.2., assume there are four processes 0, 1, 2, 3 with the following adjacency matrix and unit edge weights:
|| process || neighbors || || 0 || 1, 3 || || 1 || 0 || || 2 || 3 || || 3 || 0, 2 ||
With
MPI_DIST_GRAPH_CREATE
, this graph could be constructed in many different ways. One way would be that each process specifies its outgoing edges. The arguments per process would be:|| process || n || sources || degrees || destinations || weights || || 0 || 1 || 0 || 2 || 1,3 || 1,1 || || 1 || 1 || 1 || 1 || 0 || 1 || || 2 || 1 || 2 || 1 || 3 || 1 || || 3 || 1 || 3 || 2 || 0,2 || 1,1 ||
Another way would be to pass the whole graph on process 0, which could be done with the following arguments per process: || process || n || sources || degrees || destinations || weights || || 0 || 4 || 0,1,2,3 || 2,1,1,2 || 1,3,0,3,0,2 || 1,1,1,1,1,1 || || 1 || 0 || - || - || - || - || || 2 || 0 || - || - || - || - || || 3 || 0 || - || - || - || ||
In both cases above, the application could supply
MPI_UNWEIGHTED
instead of explicitly providing identical weights.MPI_DIST_GRAPH_CREATE_ADJACENT
could be used to specify this graph using the following arguments: || process || indegree || sources || sourceweights || outdegree || destinations || destweights || || 0 || 2 || 1,3 || 1,1 || 2 || 1,3 || 1,1 || || 1 || 1 || 0 || 1 || 1 || 0 || 1 || || 2 || 1 || 3 || 1 || 1 || 3 || 1 || || 3 || 2 || 0,2 || 1,1 || 2 || 0,2 || 1,1 ||Example 7.4
A two-dimensional PxQ torus where all processes communicate along the dimensions and along the diagonal edges. This cannot be modelled with Cartesian topologies, but can easily be captured with
MPI_DIST_GRAPH_CREATE
as shown in the following code. In this example, the communication along the dimensions is twice as heavy as the communication along the diagonals:Page 248, line 21, add (i.e., after MPI_GRAPH):
MPI_DIST_GRAPH
distributed graph topologyPage 252, line 6, add:
MPI_DIST_GRAPH_NEIGHBORS_COUNT and MPI_DIST_GRAPH_NEIGHBORS provide adjacency information for a distributed graph topology.
These calls are local. The number of edges into and out of the process returned by
MPI_DIST_GRAPH_NEIGHBORS_COUNT
are the total number of such edges given in the call toMPI_DIST_GRAPH_CREATE_ADJACENT
orMPI_DIST_GRAPH_CREATE
(potentially by processes other than the calling process in the case ofMPI_DIST_GRAPH_CREATE
). Multiply defined edges are all counted and returned byMPI_DIST_GRAPH_NEIGHBORS
in some order. IfMPI_UNWEIGHTED
is supplied forsourceweights
ordestweights
or both, or ifMPI_UNWEIGHTED
was supplied during the construction of the graph then no weight information is returned in that array or those arrays. The only requirement on the order of values insources
anddestinations
is that two calls to the routine with same input argumentcomm
will return the same sequence of edges. Ifmaxindegree
ormaxoutdegree
is smaller than the numbers returned by MPI_DIST_GRAPH_NEIGHBOR_COUNT, then only the first part of the full list is returned. Note, that the order of returned edges does need not to be identical to the order that was provided in the creation of comm for the case thatMPI_DIST_GRAPH_CREATE_ADJACENT
was used.-Advice to implementors:* Since the query calls are defined to be local, each process needs to store the list of its neighbors with incoming and outgoing edges. Communication is required at the collective
MPI_DIST_GRAPH_CREATE
call in order to compute the neighbor lists for each process from the distributed graph specification. (End of advice to implementors.)Page 450, line 13, add:
class Distgraphcomm : public Intracomm {...};
Page 456, line 1:
replace: "five" with "six"
Page 456, line 5:
replace "MPI::Cartcomm and MPI::Graphcomm are derived from MPI::Intracomm."
with
MPI::Cartcomm, MPI::Graphcomm, and MPI::Distgraphcomm are derived from MPI::Intracomm.
Page 457, line 11, add:
Distgraphcomm& Distgraphcomm::Clone() const
Page 457, line 34, replace:
// Cartcomm and Graphcomm are similarly defined
with:
// Cartcomm, Graphcomm, and Distgraphcomm are similarly defined
Page 462, line 36, replace: "MPI_STATUS_IGNORE, MPI_STATUSES_IGNORE, MPI_ERRCODES_IGNORE,"
with
"MPI_STATUS_IGNORE, MPI_STATUSES_IGNORE, MPI_ERRCODES_IGNORE, MPI_UNWEIGHTED,"
Page 465, line 14, 15: addressed in #104
by:
// Cartcomm, Graphcomm and Distgraphcomm are similarly defined
Page 497 (Appendix A), line 9, add:
MPI_DIST_GRAPH MPI::DIST_GRAPH
Page 499, after line 12:
add "
MPI_UNWEIGHTED Not defined in C++
"Page 500, line 6, add:
MPI::Distgraphcomm
Page 550, line 27, add:
Distgraphcomm& Distgraphcomm::Clone() const
Page 551, line 13, add:
Distgraphcomm& Distgraphcomm::Dup() const
Functions to Deprecate (not included in this ticket but kept for reference!)
The following functions may be deprecated by this proposal (for MPI 2.2 or MPI 3.0). The proposal is not dependent on whether this happens or not, and the discussion on this should be kept separate.
Since MPI_DIST_GRAPH_CREATE is collective, it would be a small matter to compute the total number of edges. Thus, MPI_GRAPHDIMS_GET could be kept. Note that nnodes is equal to the number of processes in comm.
Possible additional functionality (not included in this ticket but kept for future reference)
To be discussed:
Impact on Implementations
Implementations have to add the new functions, which can easily be implemented on top of the existing, general graph interface. This trivial implementation requires little work.
Impact on Applications / Users
Users who want to use the new functionality will have to change calls to graph creation and neighbor query functions. They have to introduce explicit communication of neighbor lists if they query other processes neighbors (this is unlikely).
Alternative Solutions
For more discussion/connection to topological collectives, see [wiki:TopoColl]
Keep for later (maybe introduction in MPI-3): Advice to users. Specifying
reorder
#false
on only some processes can support heterogeneous application to hardware mapping. For example, suppose an application does visualization or heavy I/O on rank 0, and that only this processor has the corresponding visualization or I/O capabilities. If this rank should be part of a distributed graph topology, it is useful to specify that this rank will not be remapped. Allowing remapping of other ranks can then exploit features of the interconnection network topology. Another scenario is an application where some ranks have large state information that would be (too) expensive to migrate. Settingreorder``false
for these ranks guarantees that migration will not be necessary since these processes will preserve their rank in the distributed graph communicator. (End of advice to users.)-Advice to implementors.* The reordering optimization problem must account for the constraint that processes with
reorder
=false
must remain fixed. This could be done by eliminating the fixed nodes and all their edges, and solving the optimization problem on the reduced graph. However, better results may be possible by considering the entire graph. (End of advice to implementors.)Entry for the Change Log
Section 7.5.3 on page 247. New functions for a scalable distributed graph topology interface has been added. In this section, the functions MPI_DIST_GRAPH_CREATE_ADJACENT and MPI_DIST_GRAPH_CREATE, the constants MPI_UNWEIGHTED, and the derived C++ class Distgraphcomm were added.
Section 7.5.4, page 252. For the scalable distributed graph topology interface, the functions MPI_DIST_NEIGHBORS_COUNT and MPI_DIST_NEIGHBORS and the constant MPI_DIST_GRAPH were added.