nonblocking handles with RMA requests

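Nonblocking handles backed by RMA requests were a long-standing omission in the implementation. ARMCI nonblocking handles are similar to MPI RMA requests, but the mapping is not 1:1: an aggregate handle covers N operations, so a single ARMCI handle may correspond to N MPI requests.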

This PR implements request handles on top of MPI RMA requests. It replaces the prior implementation, which completed operations with `MPI_Win_flush` or `MPI_Win_flush_all` instead of completing each handle individually. The old implementation is preserved behind a preprocessor switch.

This also adds an option to use `MPI_Rget_accumulate` for atomics (all of which are blocking). Waiting on that single request avoids a flush in this code path, and a flush might be slowed down by having to complete other, more expensive and potentially non-hardware, operations.

This has not been tested thoroughly. It will be merged after sufficient testing.

Tested with:

- [x] MPICH 4.2 Ch4 OFI in shared memory
- [x] MPICH 4.2 Ch3 in shared memory
- [x] Open MPI 4.x in shared memory
- [x] Cray MPI on LUMI
- [ ] HPC-X (Open MPI 4) on Mellanox IB
- [ ] Open MPI 5 on Mellanox IB
- [ ] MVAPICH on Mellanox IB
- [ ] MPICH UCX on Mellanox IB
- [ ] MPICH OFI on Mellanox IB
