mpi-forum / mpi-forum-historic

Migration of old MPI Forum Trac Tickets to GitHub. New issues belong on mpi-forum/mpi-issues.
http://www.mpi-forum.org

RMA sync ops with vector of windows #459

Open mpiforumbot opened 8 years ago

mpiforumbot commented 8 years ago

Originally by jhammond on 2014-10-05 20:56:14 -0500


A number of synchronization operations on the critical path of PGAS-style programming models that wish to target MPI-3 RMA would be greatly optimized by functions that take a vector of windows as arguments.

The reason is that many networks (and shared memory) handle synchronization at a different granularity than window objects, so synchronizing window by window introduces unnecessary overhead. The memory barrier in MPI_WIN_SYNC is a good example. Another example is when internode operations happen on M contexts (M is often, but not necessarily, 1), where M may be much smaller than N, the number of windows; in that case these routines may save N-M synchronization operations internally.

int MPI_Win_nsync(int count, MPI_Win wins[])
int MPI_Win_nflush(int rank, int count, MPI_Win wins[])
int MPI_Win_nflush_all(int count, MPI_Win wins[])
int MPI_Win_nflush_local(int rank, int count, MPI_Win wins[])
int MPI_Win_nflush_local_all(int count, MPI_Win wins[])

The meaning of these functions should be clear from their signatures. For example, a functional but unoptimized implementation of the first could be:

int MPI_Win_nsync(int count, MPI_Win wins[])
{
  int rc;
  for (int i=0; i<count; i++) {
    rc = MPI_Win_sync(wins[i]);
    if (rc!=MPI_SUCCESS) return rc;
  }
  return MPI_SUCCESS;
}

where the function called count times might be equivalent to:

int MPI_Win_sync(MPI_Win win)
{
  /* GCC, XLC, and probably other compilers support this intrinsic */
  __sync_synchronize();
  return MPI_SUCCESS;
}

The optimized implementation could potentially look like the following:

int MPI_Win_nsync(int count, MPI_Win wins[])
{
  /* GCC, XLC, and probably other compilers support this intrinsic */
  __sync_synchronize();
  return MPI_SUCCESS;
}

This optimized implementation, which issues one memory barrier instead of count of them, would be significantly faster than the naive one, especially on platforms where a full memory barrier is relatively expensive (Blue Gene/Q is one such platform).

For the flush operations, the cross-window optimization arises whenever the software or hardware implementation does not separate traffic between any two ranks on a per-window basis. For example, if MPI_Win_flush_all is implemented for Cray Aries using dmapp_gsync, all traffic to all remote PEs (equivalent to MPI processes) is quiesced at that moment, so it is superfluous to call this operation repeatedly for multiple windows. At a higher level, MPICH's CH3 is an ordered channel and I believe there is one queue for all RMA packets, so MPI_Win_flush on one window will have the same effect on all other windows for RMA operations issued prior to that invocation.
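
To make the coalescing concrete, here is a hedged sketch of an optimized MPI_Win_nflush_all under the assumption that every window maps onto a single shared network context; impl_quiesce_context() is a placeholder for a platform-specific call such as dmapp_gsync, not a real API:

int MPI_Win_nflush_all(int count, MPI_Win wins[])
{
  /* Sketch only: assumes all windows share one network context, as in the
     Cray Aries example above. impl_quiesce_context() stands in for the
     platform-specific quiesce and is not a real function. */
  (void)count; (void)wins;   /* per-window state is irrelevant in this case */
  impl_quiesce_context();    /* one quiesce completes RMA on every window */
  return MPI_SUCCESS;
}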

Per MPI Forum convention, the function names should not be a blocking issue until the underlying feature set is decided upon, but I elected to go with "nfoo" instead of something else because I do not want to confuse the user with a trailing "v" (as in MPI_Alltoallv), since that has a different meaning. And yes, I find the possibility of an "N'Sync" operation in the MPI standard amusing.

mpiforumbot commented 8 years ago

Originally by gropp on 2014-12-10 13:11:31 -0600


The WG found this interesting but noted that there are alternatives that may provide the same capability, including a nonblocking flush. In a straw vote, iflush received 11 votes and nflush received 3; in contrast, nsync received 9 and isync received 4.
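
As a rough illustration of the nonblocking-flush alternative (names and signatures here are hypothetical, not part of any MPI standard), the same coalescing opportunity could be exposed by returning a request per flush and completing them together:

#include <mpi.h>

/* Hypothetical nonblocking flush, for illustration only. */
int MPIX_Win_iflush(int rank, MPI_Win win, MPI_Request *request);

/* Flush 'count' windows toward 'rank' and complete them in one call, giving
   the implementation the same chance to coalesce the underlying quiesce that
   an MPI_Win_nflush would. */
void flush_many(int rank, int count, MPI_Win wins[])
{
  MPI_Request reqs[count];   /* C99 VLA, as in the test program below */
  for (int i = 0; i < count; i++)
    MPIX_Win_iflush(rank, wins[i], &reqs[i]);
  MPI_Waitall(count, reqs, MPI_STATUSES_IGNORE);
}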

mpiforumbot commented 8 years ago

Originally by rsthakur on 2015-06-03 15:32:11 -0500


From the June 2015 Forum meeting: more evidence is needed for the claimed performance issues (there was some disagreement), and it should be considered whether a global sync/flush_all across all created windows could be proposed instead.
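
For reference, the global alternative raised at the meeting might look something like the following; these names and signatures are purely hypothetical and are shown only to clarify the shape of that proposal:

/* Hypothetical signatures, illustration only: operate on every window the
   process has created, so no window vector is passed at all. */
int MPIX_Win_sync_all_windows(void);       /* memory barrier covering all windows */
int MPIX_Win_flush_all_windows(int rank);  /* flush all windows toward 'rank' */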

mpiforumbot commented 8 years ago

Originally by jhammond on 2015-06-04 15:16:20 -0500


On my dual-core x86 laptop, MPI_Win_nsync shows a small performance benefit with 100 windows and a relatively large advantage with 1000 windows. I expect the gap between per-window MPI_Win_sync (argv[2]=argv[1]) and the emulated MPIX_Win_nsync (argv[2]=1) to be larger on other platforms, particularly multi-socket and non-x86 ones. See the source below for details.

Data:

jrhammon-mac01:ticket459 jrhammon$ mpiexec -n 2 ./test_win_sync.x 1 1
1 windows, 1 syncs
avg = 0.000006
avg = 0.000006
jrhammon-mac01:ticket459 jrhammon$ mpiexec -n 2 ./test_win_sync.x 10 10
10 windows, 10 syncs
avg = 0.000008
avg = 0.000008
jrhammon-mac01:ticket459 jrhammon$ mpiexec -n 2 ./test_win_sync.x 10 1
10 windows, 1 syncs
avg = 0.000008
avg = 0.000008
jrhammon-mac01:ticket459 jrhammon$ mpiexec -n 2 ./test_win_sync.x 100 100
100 windows, 100 syncs
avg = 0.000013
avg = 0.000013
jrhammon-mac01:ticket459 jrhammon$ mpiexec -n 2 ./test_win_sync.x 100 1
100 windows, 1 syncs
avg = 0.000008
avg = 0.000008
jrhammon-mac01:ticket459 jrhammon$ mpiexec -n 2 ./test_win_sync.x 1000 1000
1000 windows, 1000 syncs
avg = 0.000092
avg = 0.000092
jrhammon-mac01:ticket459 jrhammon$ mpiexec -n 2 ./test_win_sync.x 1000 1
1000 windows, 1 syncs
avg = 0.000008
avg = 0.000008

test_win_sync.c:

#include <stdio.h>
#include <stdlib.h>

#include <mpi.h>

int main(int argc, char * argv[])
{
    MPI_Init(&argc, &argv);

    int n = (argc>1) ? atoi(argv[1]) : 1000;
    int m = (argc>2) ? atoi(argv[2]) : n;

    int size, rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank==0) printf("%d windows, %d syncs\n", n, m);

    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);

    int* baseptrs[n];
    MPI_Win win[n];
    for (int i=0; i<n; i++) {
        MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL, node_comm, &(baseptrs[i]), &(win[i]));
        MPI_Win_lock_all(0,win[i]);
    }

    MPI_Barrier(MPI_COMM_WORLD);

    if (rank==0) {
        for (int i=0; i<n; i++) {
            *(baseptrs[i]) = i;
        }
    }

    double t0 = MPI_Wtime();
    for (int i=0; i<m; i++) {
        MPI_Win_sync(win[i]);
    }
    double t1 = MPI_Wtime();

    double dt = t1-t0, avg;
    MPI_Allreduce(&dt, &avg, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    avg /= size;

    if (rank==1) {
        for (int i=0; i<n; i++) {
            MPI_Aint size;
            int disp_unit;
            int * ptr;
            MPI_Win_shared_query(win[i], 0, &size, &disp_unit, &ptr);
            int tmp = *ptr;
            if (tmp!=i) printf("bad %d\n", i);
        }
    }

    printf("avg = %lf\n", avg);

    for (int i=0; i<n; i++) {
        MPI_Win_unlock_all(win[i]);
        MPI_Win_free(&(win[i]));
    }

    MPI_Comm_free(&node_comm);

    MPI_Finalize();
    return 0;
}

Makefile:

CC     := mpicc
CFLAGS := -std=c99

all: test_win_sync.x

test_win_sync.x: test_win_sync.c
    $(CC) $(CFLAGS) $< -o $@

clean:
    -rm -f *.o
    -rm -f *.x

mpiforumbot commented 8 years ago

Originally by gropp on 2015-09-25 08:52:07 -0500


The Sept 2015 WG discussion concluded that this could be optimized within an implementation if this usage model were common.

mpiforumbot commented 8 years ago

Originally by jhammond on 2015-09-25 11:00:35 -0500


1) This usage model is quite common. It is the usage model implied by Global Arrays, which right now is almost certainly the basis for most MPI-3 RMA-aware compute cycles (see the ga_sync sketch below). The only way to avoid O(n_globalarrays) sync ops in a call to ga_sync() is to use MPI_Win_create_dynamic with one window for everything, but then we have an O(n) metadata problem (all the vectors of offsets), cannot use shared memory, and cannot use array-specific info keys.

2) Those who believe this can be optimized within an implementation should describe how in sufficient detail on this ticket to convince others; I do not believe it is possible. Since I have prototyped the optimization this ticket allows, it is very easy for someone to show how the same degree of optimization could be achieved without semantic changes to MPI RMA.
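
To make point 1 concrete, here is a rough sketch (not actual Global Arrays source; names are illustrative) of why a ga_sync()-style operation over one window per array implies O(n_globalarrays) synchronization calls today:

#include <mpi.h>

/* Sketch only, not Global Arrays source: one window per global array. */
void ga_sync_sketch(int n_arrays, MPI_Win wins[], MPI_Comm comm)
{
  for (int i = 0; i < n_arrays; i++) {
    MPI_Win_flush_all(wins[i]);   /* complete outstanding RMA, per window */
    MPI_Win_sync(wins[i]);        /* memory barrier, per window */
  }
  MPI_Barrier(comm);
  /* With this proposal the loop would collapse to one MPI_Win_nflush_all
     plus one MPI_Win_nsync before the barrier. */
}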