mpi-forum / mpi-issues

Tickets for the MPI Forum
http://www.mpi-forum.org/

Communicator Hints for upscaled reduction ops #101

Open ahori opened 6 years ago

ahori commented 6 years ago

Problem

During the arithmetic calculations in a global reduction, some users may want the intermediate computation to use upscaled datatypes; e.g., the operands and results of the global reduction functions are specified as fp16, but the intermediate computation is done in fp32 or fp64.

Proposal

To let MPI implementations know this, I propose two new predefined communicator info keys:

mpi_assert_upscaled_reduction and mpi_assert_upperscaled_reduction

In high-quality implementations, mpi_assert_upperscaled_reduction would result in intermediate computation in higher-precision (or wider bit-width) numbers than mpi_assert_upscaled_reduction.

Changes to the Text

Add the following two items to Section 6.4.4 (Communicator Info).

mpi_assert_upscaled_reduction (boolean, default: false): If set to true, then the implementation may carry out the intermediate computations of global reduction operations in a numerical format of higher precision (or wider bit-width) than, or equal to, the datatypes specified in the arguments of those operations.

mpi_assert_upperscaled_reduction (boolean, default: false): If set to true, then the implementation may carry out the intermediate computations of global reduction operations in a numerical format of higher precision (or wider bit-width) than, or equal to, the intermediate format used for mpi_assert_upscaled_reduction above.
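
For concreteness, here is a minimal sketch of how an application could opt in, assuming the proposed keys were adopted; the key names are the ones proposed above and do not exist in any MPI standard or implementation today.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm reduce_comm;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    /* Proposed (not yet existing) key: allow the implementation to use a
     * wider intermediate format (e.g., fp32 for fp16 operands) in global
     * reductions on this communicator. */
    MPI_Info_set(info, "mpi_assert_upscaled_reduction", "true");

    /* Attach the hint by duplicating the communicator with the info object. */
    MPI_Comm_dup_with_info(MPI_COMM_WORLD, info, &reduce_comm);
    MPI_Info_free(&info);

    /* ... global reductions on reduce_comm may now use upscaled
     * intermediate computations ... */

    MPI_Comm_free(&reduce_comm);
    MPI_Finalize();
    return 0;
}
```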

Impact on Implementations

These are just hints, so there should be no significant impact on current MPI implementations.

Impact on Users

There should be no impact on users either.

References

No reference.

bosilca commented 6 years ago

A more direct and possibly simpler approach would be for a user in need of such support to define their own upscaled MPI operators.
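
A minimal sketch of one reading of this suggestion, using MPI_FLOAT operands with double-precision local accumulation (MPI has no predefined fp16 datatype, so float -> double stands in for fp16 -> fp32). Note that the result is narrowed back to float at every combine step, which is exactly the limitation raised below.

```c
#include <mpi.h>

/* User-defined combine step matching the MPI_User_function signature:
 * widen both operands to double, add, then narrow back to float. */
static void upscaled_sum(void *invec, void *inoutvec, int *len,
                         MPI_Datatype *datatype)
{
    const float *in = (const float *)invec;
    float *inout = (float *)inoutvec;
    (void)datatype; /* always MPI_FLOAT in this sketch */
    for (int i = 0; i < *len; i++)
        inout[i] = (float)((double)in[i] + (double)inout[i]);
}

void allreduce_upscaled(const float *send, float *recv, int n, MPI_Comm comm)
{
    MPI_Op op;
    MPI_Op_create(upscaled_sum, 1, &op); /* 1 = commutative */
    MPI_Allreduce(send, recv, n, MPI_FLOAT, op, comm);
    MPI_Op_free(&op);
}
```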

jeffhammond commented 6 years ago

The performance of user-defined reductions is ~half that of the built-in cases (I measured this in the BigMPI paper).

dholmes-epcc-ed-ac-uk commented 6 years ago

AFAIK, the problem is not solvable with user-defined operations because all intermediate values are of the low-precision/narrow-width type (by definition, in MPI, of the operation function signature).

Consider a reduction of A+B+C+D where A == -D and B == -C (so the reduction result should be 0), but where A > 0, B > 0, and A + B > MAX_FP16.

In this case, only the associativity property (assumed by MPI) can prevent overflow, but this requires MPI to order the execution of the operations very carefully, i.e. A+(B+C)+D. This is not something a user-defined operation function can influence. However, an upscaled intermediate result would not suffer overflow, even if the operations were executed in some other order. This requires fn(FP16 left, FP16 right, FP32 result), which does not match the MPI function signature.

Further, consider what happens if (B+C) is near zero rather than equal to zero. Now there is no way to avoid underflow without triggering overflow, unless MPI can assume commutativity and has a mechanism to order the values before executing any operations. Again, a user-defined operation cannot influence this and so does not help, but upscaled intermediate results do solve all of these issues.
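
To make the overflow argument concrete, here is a hypothetical set of values (not from the comment above) chosen so that A == -D, B == -C, and A + B exceeds the FP16 maximum of 65504. The sketch assumes a compiler with _Float16 support (recent GCC or Clang on x86-64 or AArch64).

```c
#include <stdio.h>

int main(void)
{
    _Float16 A = 40000, B = 30000, C = -30000, D = -40000;

    /* Left-to-right in FP16: A + B == 70000 > 65504 overflows to +inf,
     * and the infinity never cancels out again. */
    _Float16 naive = ((A + B) + C) + D;

    /* Careful ordering exploits associativity: B + C == 0, so no step
     * ever leaves the FP16 range. */
    _Float16 ordered = (A + (B + C)) + D;

    /* Upscaled intermediates (here float) are safe in ANY order:
     * 70000 is well inside the FP32 range. */
    float upscaled = (((float)A + (float)B) + (float)C) + (float)D;

    /* Prints: naive=inf ordered=0.000000 upscaled=0.000000 */
    printf("naive=%f ordered=%f upscaled=%f\n",
           (double)naive, (double)ordered, (double)upscaled);
    return 0;
}
```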

dholmes-epcc-ed-ac-uk commented 6 years ago

A more general solution would be to permit reductions of values of any homogeneous type that produce a result of a different user-specified type. These reductions would need a new API, e.g. MPI_[ALL]REDUCE_HETEROGENEOUS, that takes two MPI_DATATYPE parameters, and a new user-defined operation function signature, MPI_OP_HETEROGENEOUS, like fn(IN_TYPE left, IN_TYPE right, OUT_TYPE result), that can only be used with the new reduction API. Of course, that immediately requires even more generality, because MPI would internally need additional user-defined functions as the reduction proceeded, such as fn(IN_TYPE left, OUT_TYPE right, OUT_TYPE result), fn(OUT_TYPE left, IN_TYPE right, OUT_TYPE result), and fn(OUT_TYPE left, OUT_TYPE right, OUT_TYPE result).
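
As a purely hypothetical illustration, the suggestion could look like the following C prototypes; none of these names exist in the MPI standard, they only restate the idea above in code form.

```c
#include <mpi.h>

/* Hypothetical user combine step, fn(IN_TYPE left, IN_TYPE right,
 * OUT_TYPE result), in the style of MPI_User_function.  An implementation
 * would also need the three mixed variants listed above once partial
 * results are already in the output type. */
typedef void MPI_User_function_heterogeneous(void *invec, void *inoutvec,
                                             int *len, MPI_Datatype *intype,
                                             MPI_Datatype *outtype);

int MPI_Op_create_heterogeneous(MPI_User_function_heterogeneous *fn,
                                int commute, MPI_Op *op);

int MPI_Reduce_heterogeneous(const void *sendbuf, void *recvbuf, int count,
                             MPI_Datatype intype, MPI_Datatype outtype,
                             MPI_Op op, int root, MPI_Comm comm);

int MPI_Allreduce_heterogeneous(const void *sendbuf, void *recvbuf, int count,
                                MPI_Datatype intype, MPI_Datatype outtype,
                                MPI_Op op, MPI_Comm comm);
```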

I can already hear the wailing and gnashing of teeth that this suggestion will provoke --- I'll get my coat.

Addendum: the more direct and possibly simpler approach would be for the user to convert all their values to the upscaled type before calling MPI and then use a normal MPI reduction with that upscaled type.
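
A sketch of this addendum, again with float -> double standing in for fp16 -> fp32 since MPI has no predefined fp16 datatype: widen on the way in, reduce entirely in the wide type, narrow once at the end. The obvious cost is that both the buffers and the bytes on the wire double in size.

```c
#include <stdlib.h>
#include <mpi.h>

void allreduce_via_upscaled_copy(const float *send, float *recv, int n,
                                 MPI_Comm comm)
{
    double *wide_send = malloc((size_t)n * sizeof *wide_send);
    double *wide_recv = malloc((size_t)n * sizeof *wide_recv);

    for (int i = 0; i < n; i++)
        wide_send[i] = send[i];              /* upscale once, on the way in */

    /* Both the communication and every intermediate sum use double. */
    MPI_Allreduce(wide_send, wide_recv, n, MPI_DOUBLE, MPI_SUM, comm);

    for (int i = 0; i < n; i++)
        recv[i] = (float)wide_recv[i];       /* downscale once, at the end */

    free(wide_send);
    free(wide_recv);
}
```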

bosilca commented 6 years ago

Performance issues aside, it is not clear what this proposal tries to achieve. I interpret it as doing the local operation on the upscaled type but keeping the communication type as the original type (similar to what some current processors do). Clearly, @dholmes-epcc-ed-ac-uk has a different view, where the entire collective (operations and communications) must be done in an upscaled type. From an implementor's point of view this hardly matches how MPI does things currently, and it would require a significant overhaul of most implementations (changing the signature of MPI ops, a dynamically changing amount of data exchanged during the collective).

I now see the updated message from @dholmes-epcc-ed-ac-uk and I confirm the gnashing. Moreover, it is not clear what MPI can do better than the solution where the user upscales the data into whatever format they expect to provide the best accuracy for their operation.

Addendum: one really needs to go to exascale with poorly implemented MPI reductions to see a benefit from an upscale operation from 16 to 64 bits.

ahori commented 6 years ago

Performance issues aside, I believe the real issue is the difference in reduction results between implementations if the standard allows implementors to decide to use internally upscaled reduction ops. Of course, the reduction results may vary, mostly depending on the reduction order (which depends on the protocol). However, the difference between results computed on upscaled values and on non-upscaled values would be more divergent than the differences due to ordering. I believe most users expect that the results (not the performance) obtained with different MPI implementations are "similar," and I believe this must be clearly documented in the standard.

jdinan commented 6 years ago

This doesn't look to me like an assert (i.e. assertion about the application's usage of MPI). It sounds like a hint requesting MPI to minimize rounding errors in reductions. A key name like "mpi_reduce_rounding_errors" might capture the desired semantic more clearly.

ahori commented 6 years ago

@jdinan For me, the other info keys do not look like asserts either :-O Anyway, I am not particular about the info key names. My point is that there should be (at least) two levels of upscaled internal computation.

ahori commented 6 years ago

P2P WG, Sep. 19, 2018, MPI Forum at BSC

Ticket101-P2P-WG-Barcelona.pdf -- updated version to reflect the comments in the discussion

NO: 4/10 YES: 0/10 ABSTAIN: 0/10

tonyskjellum commented 6 years ago

Hi, this should be attached with a label to the Collectives WG if it is not already ... we are managing Chapter 6 too ... thank you :-)


mhoemmen commented 6 years ago

@jdinan wrote:

This doesn't look to me like an assert (i.e. assertion about the application's usage of MPI). It sounds like a hint requesting MPI to minimize rounding errors in reductions. A key name like "mpi_reduce_rounding_errors" might capture the desired semantic more clearly.

It's a bit finer-grained than that. I might want bitwise reproducible sums, that cost more and may not have full hardware support, or I might just want "more accurate sums." Bitwise reproducibility is a correctness promise, not just a hint.

ahori commented 6 years ago

@mhoemmen

It's a bit finer-grained than that. I might want bitwise reproducible sums, that cost more and may not have full hardware support, or I might just want "more accurate sums." Bitwise reproducibility is a correctness promise, not just a hint.

I understand what you want to have. A new mechanism to tune the behavior of global reduction operations as a guarantee, not as a hint, might end up being the same as the upscaled reduction, but it is out of the scope of this ticket; you may create another one for it.