Topology awareness in MPI_Dims_create

mpiforumbot commented 8 years ago

Originally by balaji on 2009-10-09 16:45:51 -0500

Description

MPI_Dims_create does not take any information with respect to the processes in its parameter (e.g., there is no communicator parameter). Without this information the function cannot really use any information about the topology of the physical network to optimize the Cartesian layout. (This problem is answered by Alternative A)

Additionally, MPI_Dims_create is not aware of the application topology. E.g., if the application uses a grid of 1200 x 500 elements and 60 MPI processes (as in ticket #195) then the expected Cartesian process topology should be 12 x 5 and not 10 x 6 as returned by existing MPI_Dims_create. (Both problems are answered by solution B)

Extended Scope

None.

History

Proposed Solution

-MPI-2.2, Chapter 7 (Topologies), 7.4 (Overview), page 247, line 9-17 read*

MPI_CART_CREATE can be used to describe Cartesian structures of arbitrary dimension. For each coordinate direction one specifies whether the process structure is periodic or not. Note that an n-dimensional hypercube is an n-dimensional torus with 2 processes per coordinate direction. Thus, special support for hypercube structures is not necessary. The local auxiliary function MPI_DIMS_CREATE can be used to compute a balanced distribution of processes among a given number of dimensions.

Rationale. Similar functions are contained in EXPRESS [12] and PARMACS. (End of rationale.)

-but should read*

MPI_CART_CREATE can be used to describe Cartesian structures of arbitrary dimension. For each coordinate direction one specifies whether the process structure is periodic or not. Note that an n-dimensional hypercube is an n-dimensional torus with 2 processes per coordinate direction. Thus, special support for hypercube structures is not necessary. The local auxiliary function MPI_DIMS_CREATE and the collective function MPI_DIMS_CREATEV can be used to compute ~~a balanced~~ an optimized distribution of processes among a given number of dimensions.

Rationale. Similar functions are contained in EXPRESS [12] and PARMACS. (End of rationale.)

-MPI-2.2, Chapter 7 (Topologies), 7.5.2, page 248, line 38-44 read*

7.5.2 Cartesian Convenience Function: MPI_DIMS_CREATE

For Cartesian topologies, the function MPI_DIMS_CREATE helps the user select a balanced distribution of processes per coordinate direction, depending on the number of processes in the group to be balanced and optional constraints that can be specified by the user. One use is to partition all the processes (the size of MPI_COMM_WORLD's group) into an n-dimensional topology.

-but should read*

7.5.2 Cartesian Convenience Functions~~: MPI_DIMS_CREATE~~

For Cartesian topologies, the functions MPI_DIMS_CREATE and MPI_DIMS_CREATEV helps the user select ~~a balanced~~ an optimized distribution of processes per coordinate direction, depending on the number of processes in the group to be ~~balanced~~ optimized and optional constraints that can be specified by the user. One use is to partition all the processes (the size of MPI_COMM_WORLD's group) into an n-dimensional topology.

-_Add after MPI-2.2, Chapter 7 (Topologies), 7.5.2, page 249, line 35, i.e. after MPI_DIMSCREATE:*

MPI_DIMS_CREATEV(comm,ndims,gridsizes,adapt,dims)
 IN     comm      communicator that defines the number 
                  nodes and the basis for the 
                  hardware adaption (handle)
 IN     ndims     number of Cartesian dimensions 
                  (positive integer)
 IN     gridsizes integer array of size ndims specifying
                  the application size in each dimension
 IN     info      hints on optimization if adapt=true (handle)
 INOUT  dims      integer array of size ndims specifying
                  the number of nodes in each dimension

int MPI_Dims_createv(MPI_Comm comm, int ndims, 
             int gridsizes[], MPI_Info info, int dims[])

MPI_DIMS_CREATEV(COMM, NDIMS, GRIDSIZES, INFO, DIMS, IERROR)
    INTEGER COMM, NDIMS, GRIDSIZES(*), INFO, DIMS(*), IERROR

The entries in the array dims are set to describe a Cartesian grid with ndims dimensions and a total number of nodes identical to the size of the group of comm. The caller may constrain the operation of this routine by specifying elements of array dims. If dims[i] is set to a positive number, the routine will not modify the number of nodes in dimension i; only those entries where dims[i] # 0 will be modified. Those dimensions where dims[i]0 are set such that __the returned values will allow MPI_CART_CREATE to return a communicator where neighbor communication is efficient ~~all ratios gridsizes[i]/dims[i] (using floating point divide) are as close to each other as possible, using an appropriate divisibility algorithm. The largest ratio divided by the smallest ratio should be minimal. If there are different possibilities for calculating these dims[i], then the solution set with minimal difference between largest and smallest dims[i] should be used~~. If i<j, gridsize[i]=gridsize[j], dims[i]=0, and dims[j]=0 at input, then the output dims shall satisfy this property: dims[i]>=dims[j].

Negative input values of dims[i] are erroneous. An error will occur if the size of the group of comm is not a multiple of the product of all dims[i] with dims[i]>0.

The array dims is suitable for use as input to MPI_CART_CREATE. MPI_DIMS_CREATEV is a collective operation. The arguments comm, ndims, gridsizes~~, adapt~~ and the (key,value)-pairs stored in info must have identical values on all processes of the group of comm. MPI_DIMS_CREATEV will return in dims the same results on all proccesses.

The ~~meaning of adapt=true~~ returned result can be additionally influenced by the info argument. Valid info values are implementation-dependent. The constant MPI_INFO_NULL can be used when no hints should be specified. An MPI implementation is not obliged to follow specific hints, and it is valid for an MPI implementation to ignore the hints.

Example 7.2 MPI_DIMS_CREATEV on a system with 12 cores per SMP node and 1 MPI process per core, and the following input values: nnodes=2400, ndims=3, gridsizes=(120,80,250), dims=(0,0,0). Possible output values and their implication for MPI_CART_CREATE and the underlying communication.

|| dims after the call || possible cores x nodes || gridpoints per process || gridpoints per node || on 2-dim boundary^1)^ || sum of communicated points || || (12,8, 25) || (12x1,1x8, 1x25) || (10,10,10) || (120,10,10) || (100^3)^,1200,1200) || 2400 || || (12,8, 25) || (6x2, 2x4, 1x25) || (10,10,10) || (60,20,10) || ( 200, 600,1200) || 2000 || || (12,8, 25) || (3x4, 4x2, 1x25) || (10,10,10) || (30,40,10) || ( 400, 300,1200) || 1900 || || (10,10,24) || (2x5, 2x5, 3x8 ) || (12,8,11^2)^) || (24,16,33) || ( 528, 792, 384) || 1704 || || (10,10,24) || (1x10,2x5, 6x4 ) || (12,8,11^2)^) || (12,16,66) || (1056, 792, 192) || 2040 || || (10,10,24) || (2x5, 1x10,6x4 ) || (12,8,11^2)^) || (24,8, 66) || ( 528,1584, 192) || 2304 || || (10,10,24) || (1x10,1x10,12x2) || (12,8,11^2)^) || (12,8,132) || (1056,1584, 96) || 2736 ||

^1)^ In each dimension, calculated as product of the "gridpoints per node" in the other dimensions.

^2)^ 1.0*250/24 rounded up.

^3)^ no node-to-node communication in this direction because the number of nodes is only 1.

Impact on Implementations

The new routine must be implemented.

Impact on Applications / Users

The proposal is source-code and binary (ABI) backward compatible with MPI-2.2. Application that currently use an own algorithm for calculating MPI process dims based on application grid dimensions may benefit from this new routine.

Alternative Solutions

None.

Entry for the Change Log

Section 7.5.2 on page 248.[[BR]] New routine MPI_DIMS_CREATEV.

mpiforumbot commented 8 years ago

Originally by gropp on 2009-10-11 14:12:40 -0500

Minor changes in wording and two spelling fixes.

mpiforumbot commented 8 years ago

Originally by gropp on 2009-10-11 14:17:00 -0500

Added more rationale.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2009-11-12 15:04:09 -0600

First step, make it readable on a normal screen size.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2009-11-12 16:27:48 -0600

As proposed in the discussion at Portland.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2009-11-12 16:39:53 -0600

New name for this new routine

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2009-11-12 17:53:07 -0600

Corrected and with example

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-01-17 11:20:18 -0600

After the discussion on ticket #314, I propose

add before Alternative A: See Alternative B = Proposed Solution
change headline "Alternative A" into "Alternative A (Withdrawn)"
change headline "Alternative B" into "Alternative B = Proposed Solution"
add after "Alternative B = Proposed Solution" the following line

**MPI-2.2, Chapter 7 (Topologies), 7.5.2, page 248, line 38-44 read**

7.5.2 Cartesian Convenience Function: MPI_DIMS_CREATE

For Cartesian topologies, the function MPI_DIMS_CREATE helps the user select a balanced distribution of processes per coordinate direction, depending on the number of processes in the group to be balanced and optional constraints that can be specified by the user. One use is to partition all the processes (the size of MPI_COMM_WORLD's group) into an n-dimensional topology.

**but should read**

7.5.2 Cartesian Convenience Function`s**: MPI_DIMS_CREATE**(V)`

For Cartesian topologies, the function MPI_DIMS_CREATE `and MPI_DIMS_CREATEV` help~~s~~ the user select a balanced distribution of processes per coordinate direction, depending on the number of processes in the group to be balanced and optional constraints that can be specified by the user. One use is to partition all the processes (the size of MPI_COMM_WORLD's group) into an n-dimensional topology.

**Add after MPI-2.2, Chapter 7 (Topologies), 7.5.2, page 249, line 35, i.e. after MPI_DIMS_CREATE:**

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-01-18 04:12:58 -0600

Pavan,

I propose to add an additional example with adapt=true:

Example 7.3 With adapt=true on a system with 12 cores per SMP node and 1 MPI process per core, and the following input values: nnodes=2400, ndims=3, gridsizes=(120,80,250), dims=(0,0,0). Possible output values and there implication for MPI_CART_CREATE and the underlying communication.

||dims after the call || possible cores x nodes || gridpoints per process || gridpoints per node || on 2-dim boundary^1)^ || sum of communicated points || ||(12,8, 25) ||(12x1,1x8, 1x25) ||(10,10,10) ||(120,10,10) ||(100^3)^,1200,1200) ||2400 || ||(12,8, 25) ||(3x4, 4x2, 1x25) ||(10,10,10) ||(30,40,10) ||( 400, 300,1200) ||1900 || ||(10,10,24) ||(2x5, 2x5, 3x8 ) ||(12,8,11^2)^)||(24,16,33) ||( 528, 792, 384) ||1704 || ||(10,10,24) ||(1x10,2x5, 6x4 ) ||(12,8,11^2)^)||(12,16,66) ||(1056, 792, 192) ||2040 || ||(10,10,24) ||(2x5, 1x10,6x4 ) ||(12,8,11^2)^)||(24,8, 66) ||( 528,1584, 192) ||2304 || ||(10,10,24) ||(1x10,1x10,12x2) ||(12,8,11^2)^)||(12,8,132) ||(1056,1584, 96) ||2736 ||

^1)^ In each dimension, calculated as product of the "gridpoints per node" in the other dimensions.

^2)^ 1.0*250/24 rounded up.

^3)^ no node-to-node communication in this direction because the number of nodes is only 1.

Although the decision about the factorization of each dims[i] into cores[i] x nodes[i] is part of MPI_CART_CREATE, the factorization within MPI_DIMS_CREATEV with adapt=true restricts already the possibilities for MPI_CART_CREATE. Although the first two results have both the perfect ratios (10.0,10.0,10.0), the third solution with the ratios (12.0, 8.0, 10.4) may have minimal communication time because the sum of the node-to-node communications in all 3 directions is proportional to only 1704 grid points.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-01-23 05:48:45 -0600

Update done, based on previous comments, and as discussed with Pavan.

I additional added an info argument.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-01-23 07:08:24 -0600

C and Fortran interface added.

Example 7.3. a bit extended and clearer wording.

mpiforumbot commented 8 years ago

Originally by gropp on 2012-01-23 08:03:32 -0600

The idea is good but there are some problems with the details.

The info argument is missing from the definitions
The language needs editing
What exactly does "as close to each other as possible" mean? Could I use different norms for evaluating this (e.g., 1-norm or infinity-norm)? Could those norms have weights?

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-01-23 09:37:27 -0600

Fixed the missing info in C and Fortran. My English should be fixed by native speakers. "close to" must be fixed. I'm not sure whether the second condition about the ratio of largest to smallest ratio is already enough.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-01-23 10:35:24 -0600

Replying to gropp:

The idea is good but there are some problems with the details.

The info argument is missing from the definitions Done.

The language needs editing Needs native speakers.

What exactly does "as close to each other as possible" mean? Could I use different norms for evaluating this (e.g., 1-norm or infinity-norm)?

This sentence is stolen from MPI_DIMS_CREATE:

MPI-2.2 page 249, lines 16-17: "The dimensions are set to be as close to each other as possible, using an appropriate divisibility algorithm."

Example: Input is nnodes=15360, ndims=4.

Possible results may be

(16,12,10,8) and
(16,15,8,8), i.e. both with largestsmallest = 2

The neighbor ratios are

(1.333, 1.20, 1.25) with avg=1.261 and max=1.333 (best)
(1.067, 1.875, 1.0) with avg=1.314 and max=1.875

More complicated nnodes=1843200, ndims=6:

(8 10 10 12 12 16) -> (1.25 1 1.2 1 1.3333) -> avg=1.1567 max=1.33
(8 8 10 12 15 16) -> (1 1.25 1.2 1.25 1.0667) -> avg=1.1533 max=1.25 (best)[[BR]] Remark: although nnodes=2**13 * 3**2 * 5**2, the best result does not contain any "2" in the 5th dims (=15).

Our Cray MPI (mpich2 based) returns the wrong result. The algorithm in OpenMPI seems to do the same.

I'm not sure, whether there may be an example where both norms produce different results for MPI_DIMS_CREATE.

But it may be, that this is possible for the V-version MPI_DIMS_CREATEV.

Another question for adapt=false is, whether we allow explicitely that an implementation chooses a worse solution (compared to an optimum based on the wording in the MPI standard) to allow fast (heuristic) algorithms.

mpiforumbot commented 8 years ago

Originally by goodell on 2012-01-23 14:16:08 -0600

editing pass over the language

mpiforumbot commented 8 years ago

Originally by goodell on 2012-01-23 14:22:09 -0600

I've made an editing pass over the text, but I'm still not sure what was intended by this sentence:

"All `dims[i]` modified by this call are returned in the same order as the `gridsize[i]` values, and if several `gridsize[i]` values are identical and then the in non-increasing order in the sequence of the indices."

I understand it's meant to disambiguate the correct choice among several otherwise valid options, but the first part still doesn't make sense to me.

Also, is the following sentence too restrictive?

The array dims is suitable for use as input to MPI_CART_CREATE. MPI_DIMS_CREATEV is local and will return the same results when called with the same input arguments on all processes of the group of comm.

This would seem to preclude any optimizations that involve any amount of randomization, since it's probably too impossible to make sure that the random number generator is used exactly the same way on all processes. This may not matter too much for the default adapt=false case, but the adapt=true case is probably excessively hampered this way. OTOH, if we remove this restriction then we are obligating the user to broadcast the resulting information themselves in order to have a consistent arguments at all processes for the call to MPI_Cart_create.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-01-24 03:57:25 -0600

Replying to goodell:

I've made an editing pass over the text, but I'm still not sure what was intended by this sentence:
"All `dims[i]` modified by this call are returned in the same order as the `gridsize[i]` values, and if several `gridsize[i]` values are identical and then the in non-increasing order in the sequence of the indices."
I understand it's meant to disambiguate the correct choice among several otherwise valid options, but the first part still doesn't make sense to me.

The first part may want to tell that the result in dims must not be ordered as in non-V MPI_DIMS_CREATE. This is a nice goal but the result may be not good:

This first part of the sentence implies that after finding the optimal dims result (i.e., with optimal ratios...), the routine has to reorder these values according to the sequence in gridsizes.

My understanding is that this would imply that the ratio rules are broken, i.e., the final result is worse than defined by the ratio rules (because the set of ratios is changed by such a reordering when gridsize-values are not identical).

There are two possibilities:

Such a reordering can never happen, because the ratio rules already produce such a sequence. Then, the sentence is not needed and may cause only additional implementation overhead because nobody wants too mathematically prove that this cannot happen. This means, the first part may be removed.
Such a reordering can happen and the result would be therefore worse. In this case, this first part should definitly removed.

Both possibilities imply that the first part may or must be removed. Therefore I recommend to substitute the sentence by

"If several `gridsize[i]` values are identical then for those indeces `i`, the `dims[i]` modified by this call are returned in non-increasing order in the sequence of the indices."

Okay?

Also, is the following sentence too restrictive?
The array dims is suitable for use as input to MPI_CART_CREATE. MPI_DIMS_CREATEV is local and will return the same results when called with the same input arguments on all processes of the group of comm. 
This would seem to preclude any optimizations that involve any amount of randomization, since it's probably too impossible to make sure that the random number generator is used exactly the same way on all processes. This may not matter too much for the default adapt=false case, but the adapt=true case is probably excessively hampered this way. OTOH, if we remove this restriction then we are obligating the user to broadcast the resulting information themselves in order to have a consistent arguments at all processes for the call to MPI_Cart_create.

We should not change this sentence, because this routine has one major goal, returning in dims the input for MPI_CART_CREATE. It is up to the application, whether MPI_DIMS_CREATE(V) is called only by one process and then dims being broadcast to all others, or whether MPI_DIMS_CREATE(V) is called collectively. For both routines, the same rule should apply to make it easy to move from MPI_DIMS_CREATE to MPI_DIMS_CREATEV if this is more optimal for the application.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-01-24 05:53:59 -0600

More worse: Classical MPI_DIMS_CREATE(240,2,dims) returns

Cray: 16*15 (correct)
OpenMPI 1.5.4: 20*12 (bad result - not as close as possible)
MVAPICH 1.6: 20*12 (bad result - not as close as possible)

I do not report this to tell bugs that should be corrected. I tell this, because we may want to allow heuristics, as already obviously used in all these implementations.

Proposal for both MPI_DIMS_CREATE and MPI_DIMS_CREATEV:

'''The returned result in the array dims may differ from the optimum because the implementation may use a faster implementation based on a heuristics.'''

This would also finish the discussion whether the text unambiguously defines the optimum.

mpiforumbot commented 8 years ago

Originally by goodell on 2012-01-24 13:08:43 -0600

Replying to RolfRabenseifner:

Both possibilities imply that the first part may or must be removed. Therefore I recommend to substitute the sentence by
"If several `gridsize[i]` values are identical then for those indeces `i`, the `dims[i]` modified by this call are returned in non-increasing order in the sequence of the indices."
Okay?

Not okay. I think I agree with eliminating the first part. But I still don't entirely understand what you mean even in the new version. Here's my attempt at cleaning it up into what I think you mean; you tell me if I got it right:

"If `i<j`, `gridsize[i]=gridsize[j]`, `dims[i]=0`, and `dims[j]=0` at input, then the output `dims` shall satisfy this property: `dims[i]>=dims[j]`.

Why decreasing versus increasing order?

Also, is the following sentence too restrictive?
The array dims is suitable for use as input to MPI_CART_CREATE. MPI_DIMS_CREATEV is local and will return the same results when called with the same input arguments on all processes of the group of comm. 
This would seem to preclude any optimizations that involve any amount of randomization, since it's probably too impossible to make sure that the random number generator is used exactly the same way on all processes. This may not matter too much for the default adapt=false case, but the adapt=true case is probably excessively hampered this way. OTOH, if we remove this restriction then we are obligating the user to broadcast the resulting information themselves in order to have a consistent arguments at all processes for the call to MPI_Cart_create.
We should not change this sentence, because this routine has one major goal, returning in dims the input for MPI_CART_CREATE. It is up to the application, whether MPI_DIMS_CREATE(V) is called only by one process and then dims being broadcast to all others, or whether MPI_DIMS_CREATE(V) is called collectively. For both routines, the same rule should apply to make it easy to move from MPI_DIMS_CREATE to MPI_DIMS_CREATEV if this is more optimal for the application.

I don't think I agree with that assessment, but I don't feel strongly enough to fight over it. Your interpretation means that calling MPI_DIMS_CREATEV must be very deterministic, even across processes. I think that is too strong of a restriction on the implementation. Let's see what the other chapter committee members think.

mpiforumbot commented 8 years ago

Originally by gropp on 2012-01-27 05:42:48 -0600

Its clearly the case that the intended use of MPI_DIMS_CREATE(V) is a input to MPI_CART_CREATE. That establishes the requirements on the the determinacy of the outputs based on the inputs. (Asking the user to run in one process an broadcast the results is both ugly and introduces unnecessary communication in almost all cases).

An alternative would be to give MPI_DIMS_CREATEV collective semantics. That is, require all processes in the input communicator to call it. In the typical use, this is not a problem. Further, the collective semantics does not require any communication - a deterministic implementation could execute purely locally. Should some ambitious implementation want to do something more, then it would have the ability to communicate to ensure that the output was usable to MPI_CART_CREATE.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-01-29 06:27:22 -0600

I did the following changes:

Two sub-proposals:
- B = local call
- C = collective call (as proposed by Bill Gropp)
Correction also to MPI-2.2 page 247 lines 9-17
Correction of

All dims[i] modified by this call are returned in the same order as the gridsize[i] values, and if several gridsize[i] values are identical and then the in non-increasing order in the sequence of the indices.

into (as proposed by Dave Goodell)

If i<j, gridsize[i]=gridsize[j], dims[i]=0, and dims[j]=0 at input, then the output dims shall satisfy this property: dims[i]>=dims[j].
ndims as "positive integer" instead of "integer"

I personally would prefere proposal C, because it allows both,

a determistic (non-parallelized and non-communicating) implementation, and
a parallelized implementation to achieve an optimum although the search space may be huge on a huge number of processes.

mpiforumbot commented 8 years ago

Originally by htor on 2012-02-01 22:07:15 -0600

I think this is generally a good idea but it needs some work and discussion. I read through it and have some questions:

I disagree to the word "balanced" in the general description, the returned dimensions are not optimized for balance
"1.0*gridsizes[i]/dims[i]" should really be "gridsizes[i]/dims[i]"
why is the constraint "If i<j, gridsize[i]=gridsize[j], dims[i]=0, and dims[j]=0 at input, then the output dims shall satisfy this property: dims[i]>=dims[j]." defined?
why would we ever say adapt = 0? It does not change program semantics and is "good" :-)
"An error will occur if the size of the group of comm is not a multiple of the product of all dims[i] with dims[i]>0." -- this doesn't seem to work because if this were true, then dims[i]==0 would need to remain 0.

Thus, I don't think we should rush it through 3.0 since we have enough other things to push. This needs certainly a discussion at the Forum.

Thanks, Torsten

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-02-05 06:56:08 -0600

Replying to htor:

I think this is generally a good idea but it needs some work and discussion. I read through it and have some questions:

I disagree to the word "balanced" in the general description, the returned dimensions are not optimized for balance

I would keep the text because in with the new routine, the values of gridsizes[i]/dims[i] are balanced. We need not to be better than the current wording that uses "balanced" in combination with "optional constraints". In MPI-2.2, these optional constraints have been fixed dims[i] values. With MPI_DIMS_CREATEV in MPI-3.0, it is additionally the gridsizes[i]/dims[i] rule.

"1.0*gridsizes[i]/dims[i]" should really be "gridsizes[i]/dims[i]"

This is to make clear, that the ratio is a float and not a rounded or cut integer division. To stay as precise as possible, I would keep it until a better proposal is done. A term float(gridsizes[i])/float(dims[i]) would be worse, because it is not language neutral.

why is the constraint "If i<j, gridsize[i]=gridsize[j], dims[i]=0, and dims[j]=0 at input, then the output dims shall satisfy this property: dims[i]>=dims[j]." defined?

To get the same behavior with gridsizes == 1 as with old MPI_DIMS_CREATE, and to achieve deterministic results in most cases.

why would we ever say adapt = 0? It does not change program semantics and is "good" :-)

To be able to substitute MPI_DIMS_CREATE by MPI_DIMS_CREATEV with gridsizes == 1, as long as the nnodes is identical to the size of a communicator and all processes in the communicator are calling old MPI_CREATE.

"An error will occur if the size of the group of comm is not a multiple of the product of all dims[i] with dims[i]>0." -- this doesn't seem to work because if this were true, then dims[i]==0 would need to remain 0.

I do not see a logical difference to the text of MPI_DIMS_CREATE, MPI-2.2 page 249 lines 21-22. Therefore, I would not make it more complicate. It would be possible to substitute the current sentence by

An error will occur if one ore more indexes i with dims[i]>0 exist and the size of the group of comm is not a multiple of the product of all dims[i] with dims[i]>0.

I do not include the bold portion into the sentence, because nobody complained with the existing sentence in MPI_DIMS_CREATE of MPI-1.1 - MPI-2.2.

Thus, I don't think we should rush it through 3.0 since we have enough other things to push. This needs certainly a discussion at the Forum.

I try to fix all these small things now, that we can read it in Chicago. There is nothing complicated in.

Thanks, Torsten

Thank you for your detailed and very helpful review.

I'll add with my next steps the still needed text for handling non-optimal results based on heuristics. This will be the more complicated wording and content. I'll do it as an additional option D.

Rolf

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-02-05 09:29:23 -0600

Several current implementations of MPI_DIMS_CREATE are wrong, i.e., they definitely do not follow the standard: "The dimensions are set to be as close to each other as possible" (MPI-2.2, page 249, line 16).

They use fast heuristics instead of really calculating the optimum.

There are two options:

We keep the standard as it is and all these MPI implementations are buggy. Correcting these libraries will be a non-backward compatible change.
We change the standard and allow the use of fast heuristics. Then the implementers have the freedom to choose:
- Changing the current heuristics to produce better results, or
- keeping it as it is to be backward compatible.

I'll integrate an Option D into the proposal to allow the second choice. Current text is along the first choice.

Details about current implementations of MPI_DIMS_CREATE:

I checked existing MPI_DIMS_CREATE and learned that they work with a heuristics:

Calculating a prime factorization
Filling all empty dims with 1
Starting with the largest primes and filling them into these dims enties, always into one with currentlysmallest value
Sorting the entries

Example: MPI_DIMS_CREATE(16_16_15, 3, dims)

16_16_15 = 51 * 31 * 2**8
dims = 1 x 1 x 1
Starting with the largest prime: 5, filling in to (first) smallest dims[i]
- 5 --> 5 x 1 x 1
- 3 --> 5 x 3 x 1
- 2 --> 5 x 3 x 2
- 2 --> 5 x 3 x 4
- 2 --> 5 x 6 x 4
- 2 --> 5 x 6 x 8
- 2 --> 10 x 6 x 8
- 2 --> 10 x 12 x 8
- 2 --> 10 x 12 x 16
- 2 --> 20 x 12 x 16
Sorting: --> 20 x 16 x 12.

This result is of course wrong according to the definition of MPI_DIMS_CREATE.

The correct result would be 16 x 16 x 15.

I checked MVAPICH2, OpenMPI, and CrayMPI with following results:

compiler/intel/12.0.4 mpi/openmpi/1.5.4-intel-12.0.4 and
compiler/intel/12.0 mpi/mvapich2/1.6-intel-12.0
PrgEnv-cray/4.0.36

Results:

dt= 0.000008 nnodes= 240=16*15 best ndims= 2 dims= 20 12
dt= 0.000545 nnodes= 3840=16_16_15 best ndims= 3 dims= 20 16 12
dt= 0.069598 nnodes= 61440=16_16_16*15 best ndims= 4 dims= 20 16 16 12

In all 3 cases, the results (20 12), (20 16 12) and (20 16 16 12) are worse than the optimum which would be (16 15), (16 16 15) and (16 16 16 15)

Only OpenMPI and CrayMPI (mvapich2 seems to be in an endless loop): dt=34.429274 nnodes= 1843200=16_15_12_10_8*8 ndims= 6 dims= 16 12 12 10 10 8

This result ( 16 12 12 10 10 8) is worse than[[BR]] the optimum (16 15 12 10 8 8).

Reason why the second one is really the optimum:

||# ||=dims =||||||||||=Ratios between dims[i] and dims[i+1] =||=max =||=avg. =|| ||=Result =||(16 12 12 10 10 8) ||16/12 ||12/12 ||12/10 ||10/10 ||10/8 || || || ||# || ||1.3333 ||1.000 ||1.200 ||1.000 ||1.250 ||1.333||1.1567|| ||=Optimum=||(16 15 12 10 8 8)||16/15 ||15/12 ||12/10 ||10/8 ||8/8 || || || ||# || ||1.0667 ||1.250 ||1.200 ||1.250 ||1.000 ||1.250 ||1.1533 ||

About the timing:[[BR]]

dt values are reported by OpenMPI.
dt by mvapich2 and CrayMPI: between 0.000000 and 0.000004 sec.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-02-05 10:15:03 -0600

With the last change of the Description, I only added Option D and defined C+D as proposed solution.

mpiforumbot commented 8 years ago

Originally by htor on 2012-02-05 11:47:31 -0600

Replying to RolfRabenseifner:

Replying to htor:

I think this is generally a good idea but it needs some work and discussion. I read through it and have some questions:

I disagree to the word "balanced" in the general description, the returned dimensions are not optimized for balance

I would keep the text because in with the new routine, the values of gridsizes[i]/dims[i] are balanced. We need not to be better than the current wording that uses "balanced" in combination with "optional constraints". In MPI-2.2, these optional constraints have been fixed dims[i] values. With MPI_DIMS_CREATEV in MPI-3.0, it is additionally the gridsizes[i]/dims[i] rule. This is unclear and confusing. I'd go with "optimized" because it does not imply what the optimization criteria is.

"1.0*gridsizes[i]/dims[i]" should really be "gridsizes[i]/dims[i]"

This is to make clear, that the ratio is a float and not a rounded or cut integer division. To stay as precise as possible, I would keep it until a better proposal is done. A term float(gridsizes[i])/float(dims[i]) would be worse, because it is not language neutral. Something like "the exact" or so seems more appropriate to me.

why is the constraint "If i<j, gridsize[i]=gridsize[j], dims[i]=0, and dims[j]=0 at input, then the output dims shall satisfy this property: dims[i]>=dims[j]." defined?

To get the same behavior with gridsizes == 1 as with old MPI_DIMS_CREATE, and to achieve deterministic results in most cases. ok

why would we ever say adapt = 0? It does not change program semantics and is "good" :-)

To be able to substitute MPI_DIMS_CREATE by MPI_DIMS_CREATEV with gridsizes == 1, as long as the nnodes is identical to the size of a communicator and all processes in the communicator are calling old MPI_CREATE. Why would we want to do this?

"An error will occur if the size of the group of comm is not a multiple of the product of all dims[i] with dims[i]>0." -- this doesn't seem to work because if this were true, then dims[i]==0 would need to remain 0.

I do not see a logical difference to the text of MPI_DIMS_CREATE, MPI-2.2 page 249 lines 21-22. Therefore, I would not make it more complicate. It would be possible to substitute the current sentence by

An error will occur if one ore more indexes i with dims[i]>0 exist and the size of the group of comm is not a multiple of the product of all dims[i] with dims[i]>0.

I do not include the bold portion into the sentence, because nobody complained with the existing sentence in MPI_DIMS_CREATE of MPI-1.1 - MPI-2.2.
I see, the first version seems best (together with what we have already in dims_create!

Thus, I don't think we should rush it through 3.0 since we have enough other things to push. This needs certainly a discussion at the Forum.

I try to fix all these small things now, that we can read it in Chicago. There is nothing complicated in. I think this exceeds the threshold for a proposal to be passed without a discussion and a demonstration etc.. I'm not saying we should not go ahead, but we need to discuss it in the broader audience (and not just read it!).

I'll add with my next steps the still needed text for handling non-optimal results based on heuristics. This will be the more complicated wording and content. I'll do it as an additional option D. see ....

Torsten

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-02-06 04:25:41 -0600

Replying to htor:

Replying to RolfRabenseifner:

Replying to htor:

I think this is generally a good idea but it needs some work and discussion. I read through it and have some questions:

I disagree to the word "balanced" in the general description, the returned dimensions are not optimized for balance

I would keep the text because in with the new routine, the values of gridsizes[i]/dims[i] are balanced. We need not to be better than the current wording that uses "balanced" in combination with "optional constraints". In MPI-2.2, these optional constraints have been fixed dims[i] values. With MPI_DIMS_CREATEV in MPI-3.0, it is additionally the gridsizes[i]/dims[i] rule. This is unclear and confusing. I'd go with "optimized" because it does not imply what the optimization criteria is.

Yes, this is a good proposal. I'll change the four locations of "balanced" (all in the overview paragraphs) into "optimized".

"1.0*gridsizes[i]/dims[i]" should really be "gridsizes[i]/dims[i]"

This is to make clear, that the ratio is a float and not a rounded or cut integer division. To stay as precise as possible, I would keep it until a better proposal is done. A term float(gridsizes[i])/float(dims[i]) would be worse, because it is not language neutral. Something like "the exact" or so seems more appropriate to me.

I'll change "1.0*gridsizes[i]/dims[i]" into "gridsizes[i]/dims[i](using floating point divide)"

... (topic is resolved)

why would we ever say adapt = 0? It does not change program semantics and is "good" :-)

To be able to substitute MPI_DIMS_CREATE by MPI_DIMS_CREATEV with gridsizes == 1, as long as the nnodes is identical to the size of a communicator and all processes in the communicator are calling old MPI_CREATE. Why would we want to do this?

As long as we go with Alternative C (collective MPI_DIMS_CREATEV), the new routine can calculate the optimum faster (i.e., in parallel) than the existing MPI_DIMS_CREATE. Therefore user of MPI_DIMS_CREATE may want to switch to MI_DIMS_CREATEV without loosing the semantics. With gridsizes == 1, they have the same semantics.

I know, this is not a very strong argument, but for the implementation of MPI_DIMS_CREATEV, the adapt=0 is not a real problem, because it can be internally handled as adapt=1 with a CPU size equal to one core.

... (topic is resolved)

Thus, I don't think we should rush it through 3.0 since we have enough other things to push. This needs certainly a discussion at the Forum.

I try to fix all these small things now, that we can read it in Chicago. There is nothing complicated in. I think this exceeds the threshold for a proposal to be passed without a discussion and a demonstration etc.. I'm not saying we should not go ahead, but we need to discuss it in the broader audience (and not just read it!).

This topic has a 3h slot, Wednesday March 7, 9am-12pm.

... (topic is resolved)

Rolf

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-02-06 04:34:46 -0600

Changed "balanced" into "optimized" and [[BR]] "1.0*gridsizes[i]/dims[i]" into "gridsizes[i]/dims[i] (using floating point divide)".

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-02-06 10:05:55 -0600

Attachment added: dims_create_test.c (2.4 KiB) Testing MPI_Dims_create and its factorizations

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-02-06 10:06:45 -0600

Attachment added: dims_create_test_log.txt (10.6 KiB) Protocol of test of MPI_Dims_create with three MPI libraries.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-03-06 16:46:27 -0600

In the Chicago meeting, March 2012, we decided:

Use Alternative C
remove the argument adapt

   IN     adapt     the result is independent of the 
                    underlying hardware (false) or may be adapted
                    to the underlying hardware (true) (logical)

substitute Option D by the text used for MPI_DIMS_CREATE in #194:

Advice to Implementors: MPI_Dims_create should return, as much as possible, values that will allow MPI_Cart_create to return a communicator so that neighbor communication is efficient.
Remove "MPI_DIMS_CREATE(V)" from the sectiuon title
Use the collective interface (Alternative C).
Remove Example 7.2:

Example 7.2. MPI_DIMS_CREATEV with adapt=false

|| size of comm || ndims || gridsizes || dims before the call || dims after the call || || 60 || 2 || (1200,500) || (0,0) || (12,5) || || 60 || 2 || (500,1200) || (0,0) || (5,12) || || 60 || 3 || (50,100,60) || (0,2,0) || (5,2,6) || || 60 || 3 || (1,1,1) || (0,0,0) || (5,4,3) || || 60 || 3 || (10,15,10) || (0,0,0) || (4,5,3) || || 60 || 3 || (6,15,10) || (0,0,0) || (3,5,4) || || 60 || 3 || (5,15,10) || (0,0,0) || (2,6,5) ||

Note that for the second to last example the dims values (3,5,4) and (2,6,5) would imply same set of ratios, i.e., (2.0, 3.0, 2.5) and (3.0, 2.5, 2.0), with a ratio of 1.5 between the largest and smallest. The result is (3,5,4) because of its smaller difference between largest and smallest dims[i].
Remove the explanations to Example 7.3 (now renumbered to 7.2)

Although the decision about the factorization of each dims[i] into cores[i] x nodes[i] is part of MPI_CART_CREATE, the factorization within MPI_DIMS_CREATEV with adapt=true restricts the possibilities for MPI_CART_CREATE. The first three results both have perfect ratios of (10.0,10.0,10.0), i.e., dims=(12,8,25) would be returned with adapt=false, although the values shown in the last column are not minimal. The fourth solution with dims=(10,10,12) and a core-distribution (2,2,3) has worse ratios (12.0, 8.0, 10.4), but may have minimal communication time because the sum of the node-to-node communications in all 3 directions is proportional to only 1704 grid points. Specifically, with adapt=true the routine may return dims=(10,10,24) if the group of comm is executed with 12 MPI processes per SMP node.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-03-06 19:11:16 -0600

Based on a discussion with Bill, I further changed

Those dimensions where dims[i] = 0 are set such that all ratios gridsizes[i]/dims[i] (using floating point divide) are as close to each other as possible, using an appropriate divisibility algorithm. The largest ratio divided by the smallest ratio should be minimal. If there are different possibilities for calculating these dims[i], then the solution set with minimal difference between largest and smallest dims[i] should be used.

into

Those dimensions where dims[i] = 0 are set such that they are good as input for MPI_CART_CREATE.

It needs re-reading of these parts.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-03-07 13:48:15 -0600

Latest changes as discussed in the formal re-reading.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-03-07 13:51:56 -0600

Had formal reading in the Chicago meeting, March 2012.

mpiforumbot commented 8 years ago

Originally by gropp on 2012-05-29 19:39:28 -0500

There are still too many errors in the description, starting with the fact that the language-independent form has errors (there is an info parameter in the argument description but no info parameter in the argument list, conversely, there is no adapt in the argument descriptions but an adapt in the parameter list).

mpiforumbot commented 8 years ago

Originally by jsquyres on 2012-06-20 09:48:16 -0500

1st vote failed in Japan Forum meeting, May, 2012. Moved to "author rework".

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2012-06-21 03:07:01 -0500

Vote has been yes=2, no=10, abstain=4, missed=0.

Did it fail because

we should not proceed with this ticket,
or because it is not ready (bugs in the description)?

Corrections are needed in the block with the language independent definition:

substitute "adapt" by "info"
remove "if adapt=true" in the description of the "info" argument The language bindings were correct.

The full text description was also correct, but we should remove the canceled text portions.

mpi-forum / mpi-forum-historic