Here is a proposal for a simplified info key that allows applications to express their accumulate usage preference. Some applications require high throughput for a large number of elements in a single MPI_Accumulate call and can tolerate a dependence on progress at the target, while others rely on single-element (or small-count) accumulates, focus on latency, and prefer not to depend on progress at the target. The boundary between the two choices is fluid, but few applications are likely to operate close to that threshold.
The "mpi_accumulate_preference” key has two options:
"latency": request that the implementation optimize for latency of accumulating small numbers of elements at once, potentially at the cost of lower throughput for larger element counts.
"throughput": request that the implementation optimize for throughput of large numbers of elements to accumulate, potentially at the cost of higher latency for small-count accumulates and the dependence of progress at the target. This is the default (to not hurt performance of large accumulate operations).
The meaning of "small" and "large" is implementation- (and likely platform-)specific where the threshold is. Many applications using "latency" will likely update up to a handful of elements in a call to MPI_Accumulate (or just a single element).