Open mpiforumbot opened 8 years ago
Originally by RolfRabenseifner on 2009-01-15 14:27:11 -0600
Is the extension proposed
Originally by bosilca on 2009-02-09 18:55:19 -0600
This extension already exists for the one-sided operations, as stated in the proposal (One-sided Communications chapter, page 332, line 9). This ticket is for adding the same feature for the reduction operations.
Originally by rsthakur on 2009-04-08 10:04:40 -0500
I would prefer this be made explicit at 161:9-11 instead of 162:27. Also include wording from MPI_Accumulate 332:11-12 to say "In the case of derived datatypes, the operation op applies to elements of the predefined type contained within the derived datatype."
Originally by asupalov on 2009-04-08 10:13:41 -0500
Agree with Rajeev. Otherwise OK for first reading.
Just to put into writing the reservation I expressed at the meeting: operating on noncontiguous data may prevent some CPU-specific optimizations that rely on fetching several contiguous data items in one memory operation, and parallel processing of the resulting chunk through a wide arithmetic unit. Think fetching 4 32-bit floats in one 128-bit memory op and adding them in one CPU op with a subsequent write-back to memory.
George indicated that the reductions will most likely be done on internal packed buffers anyway, so this reservation may have been addressed. It still appears that the availability of built-in gather/scatter hardware may eliminate the need for additional packing in this case. Let me think about this offline.
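For illustration only, a minimal sketch of the kind of CPU optimization described above, assuming contiguous single-precision data and a length that is a multiple of 4 (plain SSE intrinsics; the function name is hypothetical):

```c
#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Add two contiguous float arrays: four 32-bit floats are fetched in one
   128-bit load and combined with one CPU add, then written back to memory.
   With a noncontiguous derived datatype this pattern breaks unless the data
   is packed first or gather/scatter hardware is available. n must be a
   multiple of 4 (assumption of this sketch). */
void sum_floats_sse(float *inout, const float *in, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 a = _mm_loadu_ps(&inout[i]);
        __m128 b = _mm_loadu_ps(&in[i]);
        _mm_storeu_ps(&inout[i], _mm_add_ps(a, b));
    }
}

int main(void)
{
    float x[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    float y[8] = { 8, 7, 6, 5, 4, 3, 2, 1 };
    sum_floats_sse(x, y, 8);
    printf("%g %g\n", x[0], x[7]);  /* prints "9 9" */
    return 0;
}
```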
Originally by moody20 on 2009-04-08 10:21:07 -0500
And how does this affect MPI_MINLOC and MPI_MAXLOC? Maybe we just keep 164:47-48 and say this extension doesn't apply to MPI_MINLOC/MPI_MAXLOC?
Originally by rsthakur on 2009-04-08 10:38:22 -0500
Well, MPI_Accumulate does not disallow minloc/maxloc.
Originally by moody20 on 2009-04-08 10:41:55 -0500
If we want to apply this extension to minloc/maxloc, then the lines at 164:47-48 have to be relaxed. Maybe other lines in this subsection?
Originally by herault on 2009-04-08 11:04:56 -0500
Agree with new formulation. Ok for first reading.
The text should probably specify what the count argument means when passing in derived datatypes.
Originally by bosilca on 2009-04-08 11:05:47 -0500
Author: George Bosilca
Currently the predefined MPI_Op operations are limited to the predefined MPI types. This proposal tries to extend their usage to user-derived datatypes constructed from the same predefined datatype. Such datatypes can be created with any MPI_Type function, and may or may not contain gaps.
The intent is to allow the MPI library to apply internal optimizations (such as pipelining) on user-derived types, leading to a large performance improvement for some classes of parallel applications. Basically, all mathematical algorithms using trapezoidal matrices fall under this category.
This change has to be integrated with the "New predefined datatypes" from ticket #18.
Several users have complained about this in the past. Examples where such an extension would be beneficial include most of the math algorithms working on trapezoidal matrices. Global communications with predefined operations cannot be applied to such datatypes because they are not predefined. A user-defined MPI_Op with the same mathematical meaning can be used instead of a predefined MPI_Op. While this is the current solution most users rely on, from the MPI implementors' perspective it limits the use of highly optimized collective algorithms. As an example, it is difficult to implement a pipelined reduction algorithm with such datatypes, because the MPI_Op is required to work only on a whole number of elements of the corresponding type.
Extend MPI_Op so that global reductions can use the same flexible definition of MPI_Op as MPI_Accumulate.
In the chapter "Collective Communication" a compulsory dependency between predefined datatypes and the MPI_Op is defined. The section defining this dependency starts at page 162, line 8:
For the other predefined operations, we enumerate below the allowed combinations of op and datatype arguments.
and then later on the same page line 27:
Now, the valid datatypes for each option is specified below.
|| Op || Allowed Types ||
|| MPI_MAX, MPI_MIN || C integer, Fortran integer, Floating point ||
|| MPI_SUM, MPI_PROD || C integer, Fortran integer, Floating point, Complex ||
|| MPI_LAND, MPI_LOR, MPI_LXOR || C integer, Logical ||
|| MPI_BAND, MPI_BOR, MPI_BXOR || C integer, Fortran integer, Byte ||
However, in the discussion of the MPI_Op in the One-sided Communications chapter (page 332, line 9) this strong binding is relaxed:
Each datatype argument must be a predefined datatype or a derived datatype, where all basic components are of the same predefined datatype. Both datatype arguments must be constructed from the same predefined datatype.
The proposed solution is to replace the following paragraph on page 161, lines 9-11: "Predefined operators work only with the MPI types listed in Section 5.9.2 and Section 5.9.4" with: "Predefined operators can be applied on a predefined datatype or a derived datatype, where all basic components are of the same predefined datatype. In other words, these operations can be applied on any datatype having a type signature containing only one single predefined datatype, if this predefined datatype match the operation as defined in Section 5.9.2 and Section 5.9.4. In the case of derived datatypes, the predefined operator applies to elements of the predefined type contained within the derived datatype."
Then on page 162, line 27, I propose to change from "Now, the valid datatypes for each option is specified below." to "Now, the valid predefined datatypes for each option is specified below."
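For illustration, a minimal sketch of what the proposed text would permit, assuming the extension is adopted: a reduction with the predefined MPI_SUM on a derived datatype whose type signature contains only MPI_DOUBLE (here a strided column of a row-major matrix; names and sizes are made up for the sketch).

```c
#include <mpi.h>

/* Sum one column of an N x N row-major matrix across all ranks.
   Under the current standard MPI_SUM may not be used with the derived
   "column" type; under the proposed extension it may, because the type
   signature contains only MPI_DOUBLE. */
#define N 8

int main(int argc, char **argv)
{
    double a[N][N], result[N][N];
    MPI_Datatype column;

    MPI_Init(&argc, &argv);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = (double)(i + j);

    /* N doubles, blocklength 1, stride N: one column of the matrix */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* one element of type "column" = the whole strided column */
    MPI_Reduce(a, result, 1, column, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```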
Implementations need to modify the behavior of their global reduction functions. As most of the collective algorithms pack the data upfront (and unpack it at the end) and work on temporary buffers, the change should be minimal. In an implementation of a global reduction based on point-to-point messages, a pack operation might be necessary before the first operation is computed, in case the derived datatype is not contiguous.
As an example, in the context of Open MPI, the changes corresponding to this extension have been implemented for all 7 algorithms of the MPI_Reduce collective in about 4 hours, including the testing time.
This will make some code simpler, as users will not have to define their own MPI_Op if the expected mathematical operation belongs to the MPI predefined operations and the datatype on which the operation is applied is composed only of identical predefined datatypes. Additionally, this extension will allow such codes to run faster, as the MPI implementation can use optimized algorithms not only for the operations themselves but also for the global communications where they apply.
Leave the standard as it is, and keep on the users the burden of creating an MPI_Op for user-derived datatypes constructed from the same predefined datatype.
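For comparison, a minimal sketch of the burden this alternative places on users, assuming a contiguous derived datatype of BLOCK doubles: the predefined MPI_SUM must effectively be reimplemented as a user-defined MPI_Op (BLOCK, block_sum, and the usage names are hypothetical).

```c
#include <mpi.h>

#define BLOCK 4  /* doubles per derived-type element (assumption) */

/* User-supplied reduction: reimplements MPI_SUM for a contiguous derived
   datatype made of BLOCK doubles. The user, not the MPI library, must know
   the layout described by the datatype. */
void block_sum(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
{
    (void)datatype;  /* layout assumed known: BLOCK contiguous doubles */
    double *in = (double *)invec;
    double *inout = (double *)inoutvec;
    for (int i = 0; i < (*len) * BLOCK; i++)
        inout[i] += in[i];
}

/* Usage sketch:
     MPI_Op op;
     MPI_Datatype blocktype;
     MPI_Type_contiguous(BLOCK, MPI_DOUBLE, &blocktype);
     MPI_Type_commit(&blocktype);
     MPI_Op_create(block_sum, 1, &op);
     MPI_Reduce(sendbuf, recvbuf, count, blocktype, op, 0, comm);
*/
```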
Allow global reductions with predefined operations on user-derived datatypes having a type signature composed of a single predefined datatype.
Originally by bosilca on 2009-04-08 11:09:48 -0500
Replying to moody20:
If we want to apply this extension to minloc/maxloc, then the lines at 164:47-48 have to be relaxed. Maybe other lines in this subsection?
How about changing 164:47-48 from: "In order to use MPI_MINLOC and MPI_MAXLOC in a reduce operation, one must provide a datatype argument that represents a pair (value and index)." to "In order to use MPI_MINLOC and MPI_MAXLOC in a reduce operation, one must provide a datatype argument built upon pairs (value and index)."
Originally by rsthakur on 2009-04-08 11:12:06 -0500
4th line of new text, "if this predefined datatype match" should be "if this predefined datatype matches"
Originally by moody20 on 2009-04-08 11:12:52 -0500
Change "option" to "operation" (might as well fix this word while we're changing this line).
Then on page 162 lines 27, I propose to change from
Now, the valid datatypes for each option is specified below.
to
Now, the valid predefined datatypes for each operation is specified below.
Originally by rsthakur on 2009-04-08 11:13:52 -0500
"built upon pairs" should be "built out of pairs"
Originally by moody20 on 2009-04-08 11:15:56 -0500
164:47-48
I like the simplicity of a single pair before throwing the more general option at the user: In order to use MPI_MINLOC and MPI_MAXLOC in a reduce operation, one must provide a datatype argument that represents a pair (value and index), or a derived data type whose type signature consists of only such pairs.
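For illustration, a minimal sketch of what this relaxed wording would allow, assuming the extension also covers MPI_MINLOC: a derived datatype whose type signature consists only of (value, index) pairs (sizes and names are made up for the sketch).

```c
#include <mpi.h>

#define NPAIRS 4

int main(int argc, char **argv)
{
    /* C layout corresponding to the MPI_DOUBLE_INT pair type */
    struct { double val; int rank; } in[NPAIRS], out[NPAIRS];
    MPI_Datatype pairs;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < NPAIRS; i++) {
        in[i].val = (double)(rank + i);
        in[i].rank = rank;
    }

    /* Type signature consists only of (value, index) pairs, so under the
       proposed wording MPI_MINLOC would be allowed on it. */
    MPI_Type_contiguous(NPAIRS, MPI_DOUBLE_INT, &pairs);
    MPI_Type_commit(&pairs);

    MPI_Reduce(in, out, 1, pairs, MPI_MINLOC, 0, MPI_COMM_WORLD);

    MPI_Type_free(&pairs);
    MPI_Finalize();
    return 0;
}
```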
Originally by rsthakur on 2009-04-08 11:20:51 -0500
"Now, the valid predefined datatypes for each option is specified below." should be "are" specified below.
Originally by rlgraham on 2010-09-18 04:13:41 -0500
What do you want to do with this ticket? Rich
Originally by gropp on 2010-09-27 09:50:14 -0500
It would be valuable to have an example code, even if that example is not included in the standard.
Originally by jhammond on 2013-12-16 12:41:35 -0600
Replying to gropp:
It would be valuable to have an example code, even if that example is not included in the standard.
I am curious how code that shows e.g. an MPI_Allreduce call using an MPI_Type_vector or MPI_Type_subarray will help this ticket. Is that usage not well understood by the MPI Forum?
Originally by gropp on 2013-12-16 12:48:07 -0600
I believe the Forum understands MPI Datatypes. What we did not get with this ticket was any specific example of use (though there were hints). There are a zillion possible extensions to MPI; why is this one so important? A good example makes it easier to see why.
Originally by jhammond on 2013-12-16 13:01:00 -0600
In terms of motivation, the closely related ticket #338 states "A primary motivation comes from the PETSc library, where derived datatypes are used to perform one-sided reductions with MPI_Accumulate. The same datatypes currently cannot be used in calls to MPI_Reduce without defining a separate MPI_Op to enable MPI_Reduce to handle data with the given derived type. This results in significant complexity within the library to match datatypes and ops when performing a given operation."
Another motivation, for which I can provide the code if it is not obvious, is performance. Some implementations, e.g. Blue Gene, provide optimized implementations of MPI_SUM and other built-in ops. If I use MPI_Type_contiguous with MPI_Allreduce instead of N MPI_DOUBLE, I'm going to take a performance hit unless I write my MPI_Op in vector intrinsics. In the case of contiguous alone, I can use the count-whatevers implementation; every case where I would need datatypes must take this performance hit. A very reasonable implementation of Hartree-Fock could use MPI_Type_struct to represent a block-sparse matrix where the blocks are not mapped to a strictly contiguous memory allocation.
Should I write up the case where I'd want to use a noncontiguous collection of doubles in an MPI_Allreduce call?
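For what it's worth, a minimal sketch of that case, assuming the extension: an MPI_Allreduce with MPI_SUM over a noncontiguous collection of doubles described by MPI_Type_indexed (the block offsets are made up for illustration, roughly standing in for the nonzero blocks of a block-sparse matrix).

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    /* Noncontiguous blocks of doubles inside a larger buffer, e.g. the
       nonzero blocks of a block-sparse matrix (offsets are made up). */
    enum { BUFLEN = 100, NBLOCKS = 3 };
    double buf[BUFLEN];
    int blocklens[NBLOCKS] = { 4, 4, 4 };
    int displs[NBLOCKS]    = { 0, 20, 60 };
    MPI_Datatype sparse;

    MPI_Init(&argc, &argv);
    for (int i = 0; i < BUFLEN; i++)
        buf[i] = 1.0;

    MPI_Type_indexed(NBLOCKS, blocklens, displs, MPI_DOUBLE, &sparse);
    MPI_Type_commit(&sparse);

    /* The type signature contains only MPI_DOUBLE, so under the proposed
       extension the predefined MPI_SUM could be used directly. */
    MPI_Allreduce(MPI_IN_PLACE, buf, 1, sparse, MPI_SUM, MPI_COMM_WORLD);

    MPI_Type_free(&sparse);
    MPI_Finalize();
    return 0;
}
```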
Originally by jhammond on 2014-02-04 12:25:13 -0600
I should also add that the use of datatypes is the Forum-prescribed way of getting around the limitations of integer counts. Many Forum members have told users time and time again that the way to send e.g. 10G doubles is to use a count of 10 with a contiguous datatype of 1G doubles. Not adding this feature to MPI demands that users reimplement all the built-in reduction operations in order to be able to use some of the most common MPI features - MPI_Reduce and MPI_Allreduce - in the context of large counts. Or we could ask them to split these ops into N calls, but this is exactly the type of inelegant thing one would like to avoid, particularly since this strategy is not always semantically equivalent in the p2p case, and thus we would be asking users to apply one workaround for p2p and a different one for reductions (non-reducing collectives can, of course, go either way).
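A hedged sketch of this argument: the Forum-recommended large-count workaround combined with a predefined op only works if this ticket is adopted. CHUNK is kept small here so the sketch actually runs; in the real use case it would be on the order of 1G doubles.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* In the real use case CHUNK would be ~1<<30 (1G doubles); it is kept
       small here so the sketch is runnable. */
    const int CHUNK = 1 << 20;
    const int NCHUNKS = 10;
    MPI_Datatype bigchunk;
    double *buf;

    MPI_Init(&argc, &argv);
    buf = malloc((size_t)CHUNK * NCHUNKS * sizeof(double));
    for (long i = 0; i < (long)CHUNK * NCHUNKS; i++)
        buf[i] = 1.0;

    MPI_Type_contiguous(CHUNK, MPI_DOUBLE, &bigchunk);
    MPI_Type_commit(&bigchunk);

    /* "Send 10G doubles as 10 elements of a 1G-double contiguous type":
       with MPI_SUM this call is only legal if predefined ops are allowed
       on derived datatypes, i.e. if this ticket is adopted. */
    MPI_Allreduce(MPI_IN_PLACE, buf, NCHUNKS, bigchunk, MPI_SUM,
                  MPI_COMM_WORLD);

    MPI_Type_free(&bigchunk);
    free(buf);
    MPI_Finalize();
    return 0;
}
```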
Originally by traff on 2008-10-10 08:33:35 -0500
Author: George Bosilca
Description
Currently the predefined MPI_Op operations are limited to the predefined MPI types. This proposal tries to extend the usage of MPI_Op to user-derived datatypes constructed from the same predefined datatype (using the same flexible definition for MPI_Op as MPI_Accumulate). Such datatypes can be created with any MPI_Type function, and may or may not contain gaps.
The intent is to allow the MPI library to apply internal optimizations (such as pipelining) on user-derived types, leading to a large performance improvement for some classes of parallel applications. Basically, all mathematical algorithms using trapezoidal matrices fall under this category.
This change has to be integrated with the "New predefined datatypes" from ticket #18.
In the chapter "Collective Communication" a compulsory dependency between predefined datatypes and the MPI_Op is defined. The section defining this dependency starts at page 162, line 8: "For the other predefined operations, we enumerate below the allowed combinations of op and datatype arguments." and then later on the same page, line 27: "Now, the valid datatypes for each option is specified below."
|| Op || Allowed Types ||
|| MPI_MAX, MPI_MIN || C integer, Fortran integer, Floating point ||
|| MPI_SUM, MPI_PROD || C integer, Fortran integer, Floating point, Complex ||
|| MPI_LAND, MPI_LOR, MPI_LXOR || C integer, Logical ||
|| MPI_BAND, MPI_BOR, MPI_BXOR || C integer, Fortran integer, Byte ||
However, in the discussion of the MPI_Op in the One-sided Communications chapter (page 332, line 9) this strong binding is relaxed: "Each datatype argument must be a predefined datatype or a derived datatype, where all basic components are of the same predefined datatype. Both datatype arguments must be constructed from the same predefined datatype."
History
Several users have complained about this in the past. Examples where such an extension would be beneficial include most of the math algorithms working on trapezoidal matrices. Global communications with predefined operations cannot be applied to such datatypes because they are not predefined. A user-defined MPI_Op with the same mathematical meaning can be used instead of a predefined MPI_Op. While this is the current solution most users rely on, from the MPI implementors' perspective it limits the use of highly optimized collective algorithms. As an example, it is difficult to implement a pipelined reduction algorithm with such datatypes, because the MPI_Op is required to work only on a whole number of elements of the corresponding type.
Proposed Solution
The proposed solution is to replace the following paragraph on page 161, lines 9-11: "Predefined operators work only with the MPI types listed in Section 5.9.2 and Section 5.9.4" with: "Predefined operators can be applied on a predefined datatype or a derived datatype, where all basic components are of the same predefined datatype. In other words, these operations can be applied on any datatype having a type signature containing only one single predefined datatype, if this predefined datatype matches the operation as defined in Section 5.9.2 and Section 5.9.4. In the case of derived datatypes, the predefined operator applies to elements of the predefined type contained within the derived datatype."
Then on page 162, line 27, I propose to change from "Now, the valid datatypes for each option is specified below." to "Now, the valid predefined datatypes for each operation are specified below."
Finally, on page 164, lines 47-48, change from "In order to use MPI_MINLOC and MPI_MAXLOC in a reduce operation, one must provide a datatype argument that represents a pair (value and index)." to "In order to use MPI_MINLOC and MPI_MAXLOC in a reduce operation, one must provide a datatype argument that represents a pair (value and index), or a derived data type whose type signature consists of only such pairs."
Impact on Implementations
Implementations need to modify the behavior of their global reduction functions. As most of the collective algorithms pack the data upfront (and unpack it at the end) and work on temporary buffers, the change should be minimal. In an implementation of a global reduction based on point-to-point messages, a pack operation might be necessary before the first operation is computed, in case the derived datatype is not contiguous.
As an example, in the context of Open MPI, the changes corresponding to this extension have been implemented for all 7 algorithms of the MPI_Reduce collective in about 4 hours, including the testing time.
Impact on Applications / Users
This will make some code simpler, as users will not have to define their own MPI_Op if the expected mathematical operation belongs to the MPI predefined operations and the datatype on which the operation is applied is composed only of identical predefined datatypes. Additionally, this extension will allow such codes to run faster, as the MPI implementation can use optimized algorithms not only for the operations themselves but also for the global communications where they apply.
Alternative Solutions
Leave the standard as it is, and keep on the users the burden of creating an MPI_Op for user-derived datatypes constructed from the same predefined datatype.
Entry for the Change Log
Allow global reductions with predefined operations on user-derived datatypes having a type signature composed of a single predefined datatype.