mpi-forum / mpi-issues

Tickets for the MPI Forum
http://www.mpi-forum.org/

More flexible MPI_Reduce_locals needed #530

Open softwaretraff opened 2 years ago

softwaretraff commented 2 years ago

Problem

The current MPI_Reduce_local operation (Section 6.9.7 of MPI-4.0) has severely restricted functionality: it "adds" an in-argument to an inout-argument, in that order. It is thus not possible to directly "add" two different in-arguments with the result stored in a separate out-argument, nor is it possible to add the two arguments in the order inout-argument then in-argument. This limits the usefulness of MPI_Reduce_local for implementing one's own collective reduction operations, since it often makes it necessary to copy arguments around.
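
To make the limitation concrete, here is a minimal sketch of the forced copy, with illustrative buffer names A, X, Y: to compute A = X op Y with the existing 2-argument interface, Y must first be copied into A.

```c
#include <string.h>
#include <mpi.h>

/* Computing A = X op Y with the existing 2-argument MPI_Reduce_local:
 * the operation always combines inbuf into inoutbuf (in that order),
 * so Y must first be copied into A before the reduction. */
void reduce_assign(const double *X, const double *Y, double *A,
                   int count, MPI_Op op)
{
    memcpy(A, Y, count * sizeof(double));           /* the extra copy */
    MPI_Reduce_local(X, A, count, MPI_DOUBLE, op);  /* A = X op A */
}
```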

Proposal

It is proposed to add a 3-argument MPI_Reduce_locals to the standard which, by permitting the use of MPI_IN_PLACE, provides the full flexibility desirable for implementing one's own collective reduction operations.
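
For illustration, a hedged usage sketch of the proposed function, assuming the 3-argument signature given below under "Changes to the Text"; MPI_Reduce_locals does not exist in any current MPI standard or implementation, and MPI_SUM merely stands in for an arbitrary operation:

```c
#include <mpi.h>

/* Illustrative only: MPI_Reduce_locals is the *proposed* function and
 * does not exist in current MPI implementations. A is the inout buffer,
 * X and Y are the two inputs, as in the examples further below. */
void usage_sketch(double *A, const double *X, const double *Y, int n)
{
    /* A = X op Y: no preliminary copy of Y into A is needed */
    MPI_Reduce_locals(X, Y, A, n, MPI_DOUBLE, MPI_SUM);

    /* A = A op Y: the first input is taken in place from A */
    MPI_Reduce_locals(MPI_IN_PLACE, Y, A, n, MPI_DOUBLE, MPI_SUM);

    /* A = X op A: the second input is taken in place from A */
    MPI_Reduce_locals(X, MPI_IN_PLACE, A, n, MPI_DOUBLE, MPI_SUM);
}
```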

Changes to the Text

MPI_REDUCE_LOCALS(inbuf, argbuf, inoutbuf, count, datatype, op)

* IN inbuf: input buffer (choice)
* IN argbuf: input buffer (choice)
* INOUT inoutbuf: combined input and output buffer (choice)
* IN count: number of elements in the inbuf, argbuf and inoutbuf buffers (nonnegative integer)
* IN datatype: data type of elements of the inbuf, argbuf and inoutbuf buffers (handle)
* IN op: operation (handle)

    int MPI_Reduce_locals(const void *inbuf, const void *argbuf, void *inoutbuf, int count, MPI_Datatype datatype, MPI_Op op)

    MPI_Reduce_locals(inbuf, argbuf, inoutbuf, count, datatype, op, ierror)
        TYPE(*), DIMENSION(..), INTENT(IN) :: inbuf
        TYPE(*), DIMENSION(..), INTENT(IN) :: argbuf
        TYPE(*), DIMENSION(..) :: inoutbuf
        INTEGER, INTENT(IN) :: count
        TYPE(MPI_Datatype), INTENT(IN) :: datatype
        TYPE(MPI_Op), INTENT(IN) :: op
        INTEGER, OPTIONAL, INTENT(OUT) :: ierror

    MPI_REDUCE_LOCALS(INBUF, ARGBUF, INOUTBUF, COUNT, DATATYPE, OP, IERROR)
        <type> INBUF(*), ARGBUF(*), INOUTBUF(*)
        INTEGER COUNT, DATATYPE, OP, IERROR

The function applies the operation given by op element-wise to the elements of inbuf and argbuf, in that order, with the result stored element-wise in inoutbuf, as explained for user-defined operations in Section 6.9.5. The inbuf, argbuf and inoutbuf buffers (the two inputs as well as the result) have the same number of elements, given by count, and the same datatype, given by datatype. If MPI_IN_PLACE is given for either inbuf or argbuf (or both), the corresponding input is taken from inoutbuf. MPI_IN_PLACE is not allowed for the inoutbuf argument. The inbuf and argbuf buffers are not required to be distinct, but both must be distinct from inoutbuf.

Rationale: In applications (typically libraries) applying local reductions with MPI predefined operators, a specific order of the arguments may be required (for non-commutative, user-defined operators), and an argument may not already reside in the required input or output buffer; with a restricted, 2-argument local reduction function this entails extra, local copying of either or both arguments. The 3-argument function alleviates such extra copying. A call to MPI_Reduce_local can always be replaced by a call to MPI_Reduce_locals with MPI_IN_PLACE as the second input argument.

Examples: Let A, X, and Y be the buffers provided for inoutbuf, inbuf, and argbuf, respectively. The following reduction-assignments can readily be implemented with MPI_Reduce_locals:

* A = X op Y (with MPI_Reduce_local it would be required to first copy Y into A)
* A = A op Y (if X is MPI_IN_PLACE; with MPI_Reduce_local this would require first reducing into Y destructively, and then copying the result Y into A)
* A = X op A (if Y is MPI_IN_PLACE)

and even

* A = X op X (if Y is the same as X)
* A = A op A (if both X and Y are MPI_IN_PLACE)

# Impact on Implementations

Implementation should be straightforward in MPI libraries.

# Impact on Users

None for existing users. New users will be happy.

# References and Pull Requests
bosilca commented 2 years ago

Actually, OMPI does provide a similar capability at a lower level than the MPI API. I mentioned this in our discussion there; you can find the link above.

softwaretraff commented 2 years ago

Hi George,

that's what I assumed. However, this doesn't help the application programmer. Supporting the 3-argument MPI_Reduce_locals operation should come at very little implementation effort.

Jesper

softwaretraff commented 2 years ago

Hi again,

actually, the discussion in your link is great, and seems to support having a 3-argument MPI_Reduce_locals?

Jesper

jeffhammond commented 2 years ago

Do we need a new MPI_Op_create for this to comply with the C function type?

softwaretraff commented 2 years ago

Good point. It would be possible to do without one, but that would then cost extra for user-defined Op's, so perhaps yes.
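
For concreteness, one hypothetical shape such an extension could take; the names MPI_User_function_3 and MPI_Op_create_3 are invented here for illustration and are not part of the proposal text:

```c
#include <mpi.h>

/* Hypothetical (invented names): a 3-argument user function type that
 * combines invec and argvec element-wise into outvec, plus a matching
 * creation call mirroring the existing MPI_Op_create. */
typedef void MPI_User_function_3(void *invec, void *argvec, void *outvec,
                                 int *len, MPI_Datatype *datatype);

int MPI_Op_create_3(MPI_User_function_3 *user_fn, int commute, MPI_Op *op);
```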

RolfRabenseifner commented 2 years ago

The current MPI_OP_CREATE defines callbacks only for (invec, inoutvec), with element-wise inoutvec[i] = invec[i] o inoutvec[i]. For user-defined operations, the proposed MPI_REDUCE_LOCALS without MPI_IN_PLACE therefore always requires the sequence inoutbuf = argbuf; user_defined_operation(inbuf, inoutbuf), which is in contradiction to the performance goals of MPI (see the sketch below). This may be resolved by providing the new API only for predefined operations, but then a new inquiry function would be needed: MPI_OP_USERDEFINED(IN op, OUT userdefined).

> This limits the usefulness of MPI_Reduce_local for implementing one's own collective reduction operations, since it often makes it necessary to copy arguments around.

The user can then implement two different algorithms for predefined and user-defined operations, the first one using the new MPI_REDUCE_LOCALS and the second one using the old MPI_REDUCE_LOCAL.
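
A sketch of the emulation described above, assuming the existing 2-argument MPI_User_function callback and a contiguous datatype; the memcpy is exactly the copy the proposal wants to avoid:

```c
#include <string.h>
#include <mpi.h>

/* Emulating the proposed 3-argument local reduction for a user-defined
 * operation with the existing (invec, inoutvec) callback: argbuf must
 * first be copied into inoutbuf, then inoutbuf = inbuf o inoutbuf.
 * The extent-based memcpy assumes a contiguous datatype. */
void emulate_reduce_locals(const void *inbuf, const void *argbuf,
                           void *inoutbuf, int count,
                           MPI_Datatype datatype, MPI_User_function *fn)
{
    MPI_Aint lb, extent;
    MPI_Type_get_extent(datatype, &lb, &extent);
    memcpy(inoutbuf, argbuf, (size_t)count * extent);   /* forced copy */
    fn((void *)inbuf, inoutbuf, &count, &datatype);
}
```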

softwaretraff commented 1 year ago

Dear Forum,

I still think a three-argument MPI_Reduce_locals as outlined would be tremendously useful for those writing their own library reduction-like functions/collectives: the 2-argument MPI_Reduce_local in many cases forces unnecessary copying, especially if commutativity is not given or cannot be exploited. The proposal above should be extended with an MPI_Op_create for 3-argument user functions as well. I can provide a proposal/text if there is interest in taking this to MPI 4.1.

Jesper