User-defined op with derived datatypes yields space-inefficient reduce

mpiforumbot commented 8 years ago

Originally by jdinan on 2012-06-06 13:44:07 -0500

Description

Currently, using a derived datatype with an MPI reduction operation also requires the use of a user-defined MPI_Op. The function that implements an MPI_Op has the following C prototype:

void op_fcn(void *in, void *inout, int *count, MPI_Datatype *dtype)

Note that user-defined operations accept two buffers but only one count and one datatype. Because of this, both buffers must have the layout described by that count and datatype.

Consider a reduction on a column of a large row-major array. We can easily do a reduce operation directly on the column using an MPI vector datatype. Because this is not a built-in datatype, we must also provide a user-defined op to the reduction operation. The user-defined op expects all data to have the same layout because it takes only one datatype/count. Thus, MPI must reconstruct the sender's entire array before invoking the user-defined op, resulting in severe space inefficiency for this operation.
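To make the overhead concrete, here is a minimal sketch of the arithmetic, assuming an array of doubles and a vector layout with a positive stride. The helper names `vector_span_bytes` and `vector_payload_bytes` are hypothetical and not part of MPI. They compare the memory spanned by one column laid out in place with the memory that actually carries data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes spanned by a vector layout of `count` blocks,
 * each `blocklen` doubles long, with block starts `stride` doubles apart.
 * Assumes a positive stride that is at least the block length. */
size_t vector_span_bytes(size_t count, size_t blocklen, size_t stride)
{
    assert(count >= 1 && blocklen >= 1 && stride >= blocklen);
    return ((count - 1) * stride + blocklen) * sizeof(double);
}

/* Hypothetical helper: bytes that actually carry data in the same layout. */
size_t vector_payload_bytes(size_t count, size_t blocklen)
{
    return count * blocklen * sizeof(double);
}
```

For one column of a 1000 x 1000 row-major array (count 1000, block length 1, stride 1000), the span is 999001 doubles while the payload is only 1000 doubles, so a temporary buffer that mirrors the sender's layout would be roughly a thousand times larger than the data being reduced.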

A test case is attached to the ticket that demonstrates the memory consumption issue.

Extended Scope

none.

History

none.

Proposed Solution

Define an MPI_Op that accepts one datatype for each buffer:

void op_fcn(void *in, int count_in, MPI_Datatype dtype_in, void *inout, int count_inout, MPI_Datatype dtype_inout)

This would allow MPI to pass one buffer in its packed form rather than recreating its layout at the source.
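As an illustration only, the idea can be modeled in plain C without MPI. In the sketch below, `layout_t` and `sum_two_layouts` are hypothetical names: the struct is a simplified stand-in for a count plus datatype (a run of doubles a fixed stride apart), not the interface proposed by this ticket. The point is that the incoming buffer can stay packed while the local buffer keeps its strided layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a (count, datatype) pair: `count` doubles,
 * each `stride` doubles apart. stride == 1 describes a packed buffer. */
typedef struct {
    size_t count;
    size_t stride;
} layout_t;

/* Hypothetical two-layout sum: inout[i] += in[i] for every element,
 * with each buffer walked according to its own layout. */
void sum_two_layouts(const double *in, layout_t in_l,
                     double *inout, layout_t inout_l)
{
    assert(in_l.count == inout_l.count);
    assert(in_l.stride >= 1 && inout_l.stride >= 1);
    for (size_t i = 0; i < in_l.count; i++)
        inout[i * inout_l.stride] += in[i * in_l.stride];
}
```

With this shape, a packed contribution of N doubles can be combined directly into a column of a row-major array by passing a stride of 1 for `in` and the row length for `inout`, so no temporary copy of the full array layout is needed.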

This op could become challenging for a user to implement, so it is necessary to investigate mechanisms to simplify this task. One possibility would be defining an op that takes two datatypes and one count. The MPI implementation would have to transform one or both datatypes to make individual units congruent. This seems doable for reductions since all processes must pass the same datatype.

Impact on Implementations

Impact on Applications and Users

Currently, reductions with derived datatypes are extremely space-inefficient. Fixing this issue would provide a significant performance enhancement for such reductions.

Alternative Solutions

Several alternative solutions are possible:

Users can pack data before calling MPI_Reduce to avoid this problem.
An MPI implementation could pack both the in and inout buffers and pass both packed buffers to the user-defined operation. When packed, both should share the same datatype and count. However, this approach still has significant space overhead.
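The first alternative, packing on the user side, might look like the following sketch for a column of a row-major array of doubles. The helpers `pack_column` and `unpack_column` are hypothetical names, and the MPI_Reduce call is shown only in a comment, with `root` and `comm` as placeholders.

```c
#include <assert.h>
#include <stddef.h>

/* Copy column `col` of a row-major nrows x ncols array into a
 * contiguous buffer of nrows doubles. */
void pack_column(const double *array, size_t nrows, size_t ncols,
                 size_t col, double *packed)
{
    assert(col < ncols);
    for (size_t r = 0; r < nrows; r++)
        packed[r] = array[r * ncols + col];
}

/* Copy a contiguous buffer of nrows doubles back into column `col`. */
void unpack_column(const double *packed, double *array, size_t nrows,
                   size_t ncols, size_t col)
{
    assert(col < ncols);
    for (size_t r = 0; r < nrows; r++)
        array[r * ncols + col] = packed[r];
}

/* Between the two calls, the packed buffer can be reduced with a
 * built-in datatype and op, for example:
 *   MPI_Reduce(packed, result, (int)nrows, MPI_DOUBLE, MPI_SUM, root, comm);
 * and the root then unpacks `result` into its own array. */
```

This avoids the user-defined op entirely, at the cost of one extra contiguous buffer per process and two explicit copies in user code.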

mpiforumbot commented 8 years ago

Originally by jdinan on 2012-06-06 13:44:48 -0500

Attachment added: reduce_user_dt_and_op.c (1.3 KiB) Test case, which demonstrates memory consumption problem.

mpiforumbot commented 8 years ago

Originally by jhammond on 2014-09-09 04:57:17 -0500

We should also try to support MPI_IN_PLACE in user-defined reductions with this ticket. I'll add the text later.
