Open mpiforumbot opened 8 years ago
Originally by RolfRabenseifner on 2009-01-15 14:27:11 -0600
Is the extension proposed
Originally by bosilca on 2009-02-09 18:55:19 -0600
This extension already exists for the one-sided operations, as stated in the proposal (One-sided Communications chapter, page 332, line 9). This ticket is for adding the same feature for the reduction operations.
Originally by rsthakur on 2009-04-08 10:04:40 -0500
I would prefer this be made explicit at 161:9-11 instead of 162:27. Also include wording from MPI_Accumulate 332:11-12 to say "In the case of derived datatypes, the operation op applies to elements of the predefined type contained within the derived datatype."
Originally by asupalov on 2009-04-08 10:13:41 -0500
Agree with Rajeev. Otherwise OK for first reading.
Just to put into writing the reservation I expressed at the meeting: operating on noncontiguous data may prevent some CPU-specific optimizations that rely on fetching several contiguous data items in one memory operation, and parallel processing of the resulting chunk through a wide arithmetic unit. Think fetching 4 32-bit floats in one 128-bit memory op and adding them in one CPU op with a subsequent write-back to memory.
George indicated that the reductions will most likely be done on internal packed buffers anyway, so this reservation may have been addressed. It still appears that the availability of built-in gather/scatter hardware may eliminate the need for additional packing in this case. Let me think about this offline.
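For illustration only, a minimal sketch of the kind of CPU optimization described above, assuming contiguous single-precision data and a length that is a multiple of 4 (plain SSE intrinsics; the function name is hypothetical):

```c
#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Add two contiguous float arrays: four 32-bit floats are fetched in one
   128-bit load and combined with one CPU add, then written back to memory.
   With a noncontiguous derived datatype this pattern breaks unless the data
   is packed first or gather/scatter hardware is available. n must be a
   multiple of 4 (assumption of this sketch). */
void sum_floats_sse(float *inout, const float *in, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 a = _mm_loadu_ps(&inout[i]);
        __m128 b = _mm_loadu_ps(&in[i]);
        _mm_storeu_ps(&inout[i], _mm_add_ps(a, b));
    }
}

int main(void)
{
    float x[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    float y[8] = { 8, 7, 6, 5, 4, 3, 2, 1 };
    sum_floats_sse(x, y, 8);
    printf("%g %g\n", x[0], x[7]);  /* prints "9 9" */
    return 0;
}
```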
Originally by moody20 on 2009-04-08 10:21:07 -0500
And how does this affect MPI_MINLOC and MPI_MAXLOC? Maybe we just keep 164:47-48 and say this extension doesn't apply to MPI_MINLOC/MPI_MAXLOC?
Originally by rsthakur on 2009-04-08 10:38:22 -0500
Well, MPI_Accumulate does not disallow minloc/maxloc.
Originally by moody20 on 2009-04-08 10:41:55 -0500
If we want to apply this extension to minloc/maxloc, then the lines at 164:47-48 have to be relaxed. Maybe other lines in this subsection?
Originally by herault on 2009-04-08 11:04:56 -0500
Agree with new formulation. Ok for first reading.
The text should probably specify what the count argument means when passing in derived datatypes.
Originally by bosilca on 2009-04-08 11:05:47 -0500
Author: George Bosilca
Currently the predefined MPI_Op operations are limited to the predefined MPI types. This proposal tries to extend their usage to user-derived datatypes constructed from the same predefined datatype. Such datatypes can be created with any MPI_Type function, and may or may not contain gaps.
The intent is to allow the MPI library to apply internal optimizations (such as pipelining) on user-derived types, leading to a large performance improvement for some classes of parallel applications. Basically, all mathematical algorithms using trapezoidal matrices fall under this category.
This change has to be integrated with the "New predefined datatypes" from ticket #18.
Several users have complained about this in the past. Examples where such an extension would be beneficial include most of the math algorithms working on trapezoidal matrices. Global communications with predefined operations cannot be applied to such datatypes because they are not predefined. A user-defined MPI_Op with the same mathematical meaning can be used instead of a predefined MPI_Op. While this is the current solution most users rely on, from the MPI implementors' perspective it limits the use of highly optimized collective algorithms. As an example, it is difficult to implement a pipelined reduction algorithm with such datatypes, because the MPI_Op is required to work only on a whole number of elements of the corresponding type.
Extend MPI_Op so that global reductions can use the same flexible definition of MPI_Op as MPI_Accumulate.
In the chapter "Collective Communication" a compulsory dependency between predefined datatypes and the MPI_Op is defined. The section defining this dependency starts at page 162, line 8:
For the other predefined operations, we enumerate below the allowed combinations of op and datatype arguments.
and then later on the same page line 27:
Now, the valid datatypes for each option is specified below.
|| Op || Allowed Types ||
|| MPI_MAX, MPI_MIN || C integer, Fortran integer, Floating point ||
|| MPI_SUM, MPI_PROD || C integer, Fortran integer, Floating point, Complex ||
|| MPI_LAND, MPI_LOR, MPI_LXOR || C integer, Logical ||
|| MPI_BAND, MPI_BOR, MPI_BXOR || C integer, Fortran integer, Byte ||
However, in the discussion of the MPI_Op in the One-sided Communications chapter (page 332, line 9) this strong binding is relaxed:
Each datatype argument must be a predefined datatype or a derived datatype, where all basic components are of the same predefined datatype. Both datatype arguments must be constructed from the same predefined datatype.
The proposed solution is to replace the following paragraph on page 161, lines 9-11: "Predefined operators work only with the MPI types listed in Section 5.9.2 and Section 5.9.4" with: "Predefined operators can be applied on a predefined datatype or a derived datatype, where all basic components are of the same predefined datatype. In other words, these operations can be applied on any datatype having a type signature containing only one single predefined datatype, if this predefined datatype match the operation as defined in Section 5.9.2 and Section 5.9.4. In the case of derived datatypes, the predefined operator applies to elements of the predefined type contained within the derived datatype."
Then on page 162, line 27, I propose to change from "Now, the valid datatypes for each option is specified below." to "Now, the valid predefined datatypes for each option is specified below."
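For illustration, a minimal sketch of what the proposed text would permit, assuming the extension is adopted: a reduction with the predefined MPI_SUM on a derived datatype whose type signature contains only MPI_DOUBLE (here a strided column of a row-major matrix; names and sizes are made up for the sketch).

```c
#include <mpi.h>

/* Sum one column of an N x N row-major matrix across all ranks.
   Under the current standard MPI_SUM may not be used with the derived
   "column" type; under the proposed extension it may, because the type
   signature contains only MPI_DOUBLE. */
#define N 8

int main(int argc, char **argv)
{
    double a[N][N], result[N][N];
    MPI_Datatype column;

    MPI_Init(&argc, &argv);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = (double)(i + j);

    /* N doubles, blocklength 1, stride N: one column of the matrix */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* one element of type "column" = the whole strided column */
    MPI_Reduce(a, result, 1, column, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```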
Implementations need to modify the behavior of their global reduction functions. As most of the collective algorithms pack the data upfront (and unpack it at the end) and work on temporary buffers, the change should be minimal. In an implementation of a global reduction based on point-to-point messages, a pack operation might be necessary before the first operation is computed, in case the derived datatype is not contiguous.
As an example, in the context of Open MPI, the changes corresponding to this extension have been implemented for all 7 algorithms of the MPI_Reduce collective in about 4 hours, including the testing time.
This will make some code simpler, as users will not have to define their own MPI_Op if the expected mathematical operation belongs to the MPI predefined operations and the datatype on which the operation is applied is composed only of identical predefined datatypes. Additionally, this extension will allow such codes to run faster, as the MPI implementation can use optimized algorithms not only for the operations themselves but also for the global communications where they apply.
Leave the standard as it is, and keep on the users the burden of creating an MPI_Op for user-derived datatypes constructed from the same predefined datatype.
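For comparison, a minimal sketch of the burden this alternative places on users, assuming a contiguous derived datatype of BLOCK doubles: the predefined MPI_SUM must effectively be reimplemented as a user-defined MPI_Op (BLOCK, block_sum, and the usage names are hypothetical).

```c
#include <mpi.h>

#define BLOCK 4  /* doubles per derived-type element (assumption) */

/* User-supplied reduction: reimplements MPI_SUM for a contiguous derived
   datatype made of BLOCK doubles. The user, not the MPI library, must know
   the layout described by the datatype. */
void block_sum(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
{
    (void)datatype;  /* layout assumed known: BLOCK contiguous doubles */
    double *in = (double *)invec;
    double *inout = (double *)inoutvec;
    for (int i = 0; i < (*len) * BLOCK; i++)
        inout[i] += in[i];
}

/* Usage sketch:
     MPI_Op op;
     MPI_Datatype blocktype;
     MPI_Type_contiguous(BLOCK, MPI_DOUBLE, &blocktype);
     MPI_Type_commit(&blocktype);
     MPI_Op_create(block_sum, 1, &op);
     MPI_Reduce(sendbuf, recvbuf, count, blocktype, op, 0, comm);
*/
```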
Allow global reductions with predefined operations on user-derived datatypes having a type signature composed of a single predefined datatype.
Originally by bosilca on 2009-04-08 11:09:48 -0500
Replying to moody20:
If we want to apply this extension to minloc/maxloc, then the lines at 164:47-48 have to be relaxed. Maybe other lines in this subsection?
How about changing 164:47-48 from: "In order to use MPI_MINLOC and MPI_MAXLOC in a reduce operation, one must provide a datatype argument that represents a pair (value and index)." to "In order to use MPI_MINLOC and MPI_MAXLOC in a reduce operation, one must provide a datatype argument built upon pairs (value and index)."
Originally by rsthakur on 2009-04-08 11:12:06 -0500
4th line of new text, "if this predefined datatype match" should be "if this predefined datatype matches"
Originally by moody20 on 2009-04-08 11:12:52 -0500
Change "option" to "operation" (might as well fix this word while we're changing this line).
Then on page 162 lines 27, I propose to change from
Now, the valid datatypes for each option is specified below.
to
Now, the valid predefined datatypes for each operation is specified below.
Originally by rsthakur on 2009-04-08 11:13:52 -0500
"built upon pairs" should be "built out of pairs"
Originally by moody20 on 2009-04-08 11:15:56 -0500
164:47-48
I like the simplicity of a single pair before throwing the more general option at the user: In order to use MPI_MINLOC and MPI_MAXLOC in a reduce operation, one must provide a datatype argument that represents a pair (value and index), or a derived data type whose type signature consists of only such pairs.
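For illustration, a minimal sketch of what this relaxed wording would allow, assuming the extension also covers MPI_MINLOC: a derived datatype whose type signature consists only of (value, index) pairs (sizes and names are made up for the sketch).

```c
#include <mpi.h>

#define NPAIRS 4

int main(int argc, char **argv)
{
    /* C layout corresponding to the MPI_DOUBLE_INT pair type */
    struct { double val; int rank; } in[NPAIRS], out[NPAIRS];
    MPI_Datatype pairs;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < NPAIRS; i++) {
        in[i].val = (double)(rank + i);
        in[i].rank = rank;
    }

    /* Type signature consists only of (value, index) pairs, so under the
       proposed wording MPI_MINLOC would be allowed on it. */
    MPI_Type_contiguous(NPAIRS, MPI_DOUBLE_INT, &pairs);
    MPI_Type_commit(&pairs);

    MPI_Reduce(in, out, 1, pairs, MPI_MINLOC, 0, MPI_COMM_WORLD);

    MPI_Type_free(&pairs);
    MPI_Finalize();
    return 0;
}
```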
Originally by rsthakur on 2009-04-08 11:20:51 -0500
"Now, the valid predefined datatypes for each option is specified below." should be "are" specified below.
Originally by rlgraham on 2010-09-18 04:13:41 -0500
What do you want to do with this ticket? Rich
Originally by gropp on 2010-09-27 09:50:14 -0500
It would be valuable to have an example code, even if that example is not included in the standard.
Originally by jhammond on 2013-12-16 12:41:35 -0600
Replying to gropp:
It would be valuable to have an example code, even if that example is not included in the standard.
I am curious how code that shows e.g. an MPI_Allreduce call using an MPI_Type_vector or MPI_Type_subarray will help this ticket. Is that usage not well understood by the MPI Forum?
Originally by gropp on 2013-12-16 12:48:07 -0600
I believe the Forum understands MPI Datatypes. What we did not get with this ticket was any specific example of use (though there were hints). There are a zillion possible extensions to MPI; why is this one so important? A good example makes it easier to see why.
Originally by jhammond on 2013-12-16 13:01:00 -0600
In terms of motivation, the closely related ticket #338 states "A primary motivation comes from the PETSc library, where derived datatypes are used to perform one-sided reductions with MPI_Accumulate. The same datatypes currently cannot be used in calls to MPI_Reduce without defining a separate MPI_Op to enable MPI_Reduce to handle data with the given derived type. This results in significant complexity within the library to match datatypes and ops when performing a given operation."
Another motivation, for which I can provide the code if it is not obvious, is performance. Some implementations, e.g. Blue Gene, provide optimized implementations of MPI_SUM and other built-in ops. If I use MPI_Type_contiguous with MPI_Allreduce instead of N MPI_DOUBLE, I'm going to take a performance hit unless I write my MPI_Op in vector intrinsics. In the case of contiguous alone, I can use the count-whatevers implementation; every case where I would need datatypes must take this performance hit. A very reasonable implementation of Hartree-Fock could use MPI_Type_struct to represent a block-sparse matrix where the blocks are not mapped to a strictly contiguous memory allocation.
Should I write up the case where I'd want to use a noncontiguous collection of doubles in an MPI_Allreduce call?
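For what it's worth, a minimal sketch of that case, assuming the extension: an MPI_Allreduce with MPI_SUM over a noncontiguous collection of doubles described by MPI_Type_indexed (the block offsets are made up for illustration, roughly standing in for the nonzero blocks of a block-sparse matrix).

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    /* Noncontiguous blocks of doubles inside a larger buffer, e.g. the
       nonzero blocks of a block-sparse matrix (offsets are made up). */
    enum { BUFLEN = 100, NBLOCKS = 3 };
    double buf[BUFLEN];
    int blocklens[NBLOCKS] = { 4, 4, 4 };
    int displs[NBLOCKS]    = { 0, 20, 60 };
    MPI_Datatype sparse;

    MPI_Init(&argc, &argv);
    for (int i = 0; i < BUFLEN; i++)
        buf[i] = 1.0;

    MPI_Type_indexed(NBLOCKS, blocklens, displs, MPI_DOUBLE, &sparse);
    MPI_Type_commit(&sparse);

    /* The type signature contains only MPI_DOUBLE, so under the proposed
       extension the predefined MPI_SUM could be used directly. */
    MPI_Allreduce(MPI_IN_PLACE, buf, 1, sparse, MPI_SUM, MPI_COMM_WORLD);

    MPI_Type_free(&sparse);
    MPI_Finalize();
    return 0;
}
```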
Originally by jhammond on 2014-02-04 12:25:13 -0600
I should also add that the use of datatypes is the Forum-prescribed way of getting around the limitations of integer counts. Many Forum members have told users time and time again that the way to send e.g. 10G doubles is to use a count of 10 with a contiguous datatype of 1G doubles. Not adding this feature to MPI demands that users reimplement all the built-in reduction operations in order to be able to use some of the most common MPI features - MPI_Reduce and MPI_Allreduce - in the context of large counts. Or we could ask them to split these ops into N calls, but this is exactly the type of inelegant thing one would like to avoid, particularly since this strategy is not always semantically equivalent in the p2p case, and thus we would be asking users to apply one workaround for p2p and a different one for reductions (non-reducing collectives can, of course, go either way).
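A hedged sketch of this argument: the Forum-recommended large-count workaround combined with a predefined op only works if this ticket is adopted. CHUNK is kept small here so the sketch actually runs; in the real use case it would be on the order of 1G doubles.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* In the real use case CHUNK would be ~1<<30 (1G doubles); it is kept
       small here so the sketch is runnable. */
    const int CHUNK = 1 << 20;
    const int NCHUNKS = 10;
    MPI_Datatype bigchunk;
    double *buf;

    MPI_Init(&argc, &argv);
    buf = malloc((size_t)CHUNK * NCHUNKS * sizeof(double));
    for (long i = 0; i < (long)CHUNK * NCHUNKS; i++)
        buf[i] = 1.0;

    MPI_Type_contiguous(CHUNK, MPI_DOUBLE, &bigchunk);
    MPI_Type_commit(&bigchunk);

    /* "Send 10G doubles as 10 elements of a 1G-double contiguous type":
       with MPI_SUM this call is only legal if predefined ops are allowed
       on derived datatypes, i.e. if this ticket is adopted. */
    MPI_Allreduce(MPI_IN_PLACE, buf, NCHUNKS, bigchunk, MPI_SUM,
                  MPI_COMM_WORLD);

    MPI_Type_free(&bigchunk);
    free(buf);
    MPI_Finalize();
    return 0;
}
```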
Originally by traff on 2008-10-10 08:33:35 -0500
Author: George Bosilca
Description
Currently the predefined MPI_Op operations are limited to the predefined MPI types. This proposal tries to extend the usage of MPI_Op to user-derived datatypes constructed from the same predefined datatype (using the same flexible definition for MPI_Op as MPI_Accumulate). Such datatypes can be created with any MPI_Type function, and may or may not contain gaps.
The intent is to allow the MPI library to apply internal optimizations (such as pipelining) on user-derived types, leading to a large performance improvement for some classes of parallel applications. Basically, all mathematical algorithms using trapezoidal matrices fall under this category.
This change has to be integrated with the "New predefined datatypes" from ticket #18.
In the chapter "Collective Communication" a compulsory dependency between predefined datatypes and the MPI_Op is defined. The section defining this dependency starts at page 162, line 8: "For the other predefined operations, we enumerate below the allowed combinations of op and datatype arguments." and then later on the same page, line 27: "Now, the valid datatypes for each option is specified below."
|| Op || Allowed Types ||
|| MPI_MAX, MPI_MIN || C integer, Fortran integer, Floating point ||
|| MPI_SUM, MPI_PROD || C integer, Fortran integer, Floating point, Complex ||
|| MPI_LAND, MPI_LOR, MPI_LXOR || C integer, Logical ||
|| MPI_BAND, MPI_BOR, MPI_BXOR || C integer, Fortran integer, Byte ||
However, in the discussion of the MPI_Op in the One-sided Communications chapter (page 332, line 9) this strong binding is relaxed: "Each datatype argument must be a predefined datatype or a derived datatype, where all basic components are of the same predefined datatype. Both datatype arguments must be constructed from the same predefined datatype."
History
Several users have complained about this in the past. Examples where such an extension would be beneficial include most of the math algorithms working on trapezoidal matrices. Global communications with predefined operations cannot be applied to such datatypes because they are not predefined. A user-defined MPI_Op with the same mathematical meaning can be used instead of a predefined MPI_Op. While this is the current solution most users rely on, from the MPI implementors' perspective it limits the use of highly optimized collective algorithms. As an example, it is difficult to implement a pipelined reduction algorithm with such datatypes, because the MPI_Op is required to work only on a whole number of elements of the corresponding type.
Proposed Solution
The proposed solution is to replace the following paragraph on page 161, lines 9-11: "Predefined operators work only with the MPI types listed in Section 5.9.2 and Section 5.9.4" with: "Predefined operators can be applied on a predefined datatype or a derived datatype, where all basic components are of the same predefined datatype. In other words, these operations can be applied on any datatype having a type signature containing only one single predefined datatype, if this predefined datatype matches the operation as defined in Section 5.9.2 and Section 5.9.4. In the case of derived datatypes, the predefined operator applies to elements of the predefined type contained within the derived datatype."
Then on page 162, line 27, I propose to change from "Now, the valid datatypes for each option is specified below." to "Now, the valid predefined datatypes for each operation are specified below."
Finally, on page 164, lines 47-48, change from "In order to use MPI_MINLOC and MPI_MAXLOC in a reduce operation, one must provide a datatype argument that represents a pair (value and index)." to "In order to use MPI_MINLOC and MPI_MAXLOC in a reduce operation, one must provide a datatype argument that represents a pair (value and index), or a derived data type whose type signature consists of only such pairs."
Impact on Implementations
Implementations need to modify the behavior of their global reduction functions. As most of the collective algorithms pack the data upfront (and unpack it at the end) and work on temporary buffers, the change should be minimal. In an implementation of a global reduction based on point-to-point messages, a pack operation might be necessary before the first operation is computed, in case the derived datatype is not contiguous.
As an example, in the context of Open MPI, the changes corresponding to this extension have been implemented for all 7 algorithms of the MPI_Reduce collective in about 4 hours, including the testing time.
Impact on Applications / Users
This will make some code simpler, as users will not have to define their own MPI_Op if the expected mathematical operation belongs to the MPI predefined operations and the datatype on which the operation is applied is composed only of identical predefined datatypes. Additionally, this extension will allow such codes to run faster, as the MPI implementation can use optimized algorithms not only for the operations themselves but also for the global communications where they apply.
Alternative Solutions
Leave the standard as it is, and keep on the users the burden of creating an MPI_Op for user-derived datatypes constructed from the same predefined datatype.
Entry for the Change Log
Allow global reductions with predefined operations on user-derived datatypes having a type signature composed of a single predefined datatype.