mpiforumbot opened 8 years ago
Originally by gropp on 2014-12-10 13:11:31 -0600
The WG found this interesting, but notes that there are alternatives that may provide the same capability. These include nonblocking flush. In a straw vote, iflush received 11 votes and nflush received 3; in contrast, nsync received 9 and isync received 4.
Originally by rsthakur on 2015-06-03 15:32:11 -0500
From the June 2015 Forum meeting: Need more evidence for the performance issues (there was some disagreement), need to consider whether a global sync/flush_all across all created windows could be proposed instead.
Originally by jhammond on 2015-06-04 15:16:20 -0500
On my dual-core x86 laptop, `MPI_Win_nsync` has a small performance benefit with 100 windows and a relatively large advantage with 1000 windows. I expect that the gap between `MPI_Win_sync` (`argv[2]=argv[1]`) and `MPIX_Win_nsync` (`argv[2]=1`) will be larger on other platforms, particularly multi-socket and non-x86 ones. See source for details.
Data:
```
jrhammon-mac01:ticket459 jrhammon$ mpiexec -n 2 ./test_win_sync.x 1 1
1 windows, 1 syncs
avg = 0.000006
avg = 0.000006
jrhammon-mac01:ticket459 jrhammon$ mpiexec -n 2 ./test_win_sync.x 10 10
10 windows, 10 syncs
avg = 0.000008
avg = 0.000008
jrhammon-mac01:ticket459 jrhammon$ mpiexec -n 2 ./test_win_sync.x 10 1
10 windows, 1 syncs
avg = 0.000008
avg = 0.000008
jrhammon-mac01:ticket459 jrhammon$ mpiexec -n 2 ./test_win_sync.x 100 100
100 windows, 100 syncs
avg = 0.000013
avg = 0.000013
jrhammon-mac01:ticket459 jrhammon$ mpiexec -n 2 ./test_win_sync.x 100 1
100 windows, 1 syncs
avg = 0.000008
avg = 0.000008
jrhammon-mac01:ticket459 jrhammon$ mpiexec -n 2 ./test_win_sync.x 1000 1000
1000 windows, 1000 syncs
avg = 0.000092
avg = 0.000092
jrhammon-mac01:ticket459 jrhammon$ mpiexec -n 2 ./test_win_sync.x 1000 1
1000 windows, 1 syncs
avg = 0.000008
avg = 0.000008
```
test_win_sync.c:
```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char * argv[])
{
    MPI_Init(&argc, &argv);

    /* n = number of windows, m = number of MPI_Win_sync calls timed */
    int n = (argc>1) ? atoi(argv[1]) : 1000;
    int m = (argc>2) ? atoi(argv[2]) : n;

    int size, rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank==0) printf("%d windows, %d syncs\n", n, m);

    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int * baseptrs[n];
    MPI_Win win[n];
    for (int i=0; i<n; i++) {
        MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                                node_comm, &(baseptrs[i]), &(win[i]));
        MPI_Win_lock_all(0, win[i]);
    }

    MPI_Barrier(MPI_COMM_WORLD);

    if (rank==0) {
        for (int i=0; i<n; i++) {
            *(baseptrs[i]) = i;
        }
    }

    /* Time m calls to MPI_Win_sync */
    double t0 = MPI_Wtime();
    for (int i=0; i<m; i++) {
        MPI_Win_sync(win[i]);
    }
    double t1 = MPI_Wtime();

    double dt = t1-t0, avg;
    MPI_Allreduce(&dt, &avg, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    avg /= size;

    /* Verify that rank 1 sees the values written by rank 0 */
    if (rank==1) {
        for (int i=0; i<n; i++) {
            MPI_Aint wsize;
            int disp_unit;
            int * ptr;
            MPI_Win_shared_query(win[i], 0, &wsize, &disp_unit, &ptr);
            int tmp = *ptr;
            if (tmp!=i) printf("bad %d\n", i);
        }
    }

    printf("avg = %lf\n", avg);

    for (int i=0; i<n; i++) {
        MPI_Win_unlock_all(win[i]);
        MPI_Win_free(&(win[i]));
    }
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```
Makefile:
```make
CC := mpicc
CFLAGS := -std=c99

all: test_win_sync.x

test_win_sync.x: test_win_sync.c
	$(CC) $(CFLAGS) $< -o $@

clean:
	-rm -f *.o
	-rm -f *.x
```
Originally by gropp on 2015-09-25 08:52:07 -0500
In its September 2015 discussion, the WG concluded that this could be optimized within an implementation if this usage model were common.
Originally by jhammond on 2015-09-25 11:00:35 -0500
1) This usage model is quite common. It is the usage model implied by Global Arrays, which right now is almost certainly the basis for most of the MPI-3 RMA-aware compute cycles. The only way to not have O(n_globalarrays) sync ops in a call to ga_sync() is to use MPI_Win_create_dynamic and use one window for everything, but then we have an O(n) metadata problem (all the vectors of offsets), cannot use shared-memory, and cannot use array-specific info keys.
2) Those who believe this can be optimized in an implementation should describe that in sufficient detail on this ticket to convince others. I do not believe it is true. Since I have prototyped the optimization this ticket allows, it is very easy for someone to show how the same degree of optimization can be achieved without semantic changes to MPI RMA.
Originally by jhammond on 2014-10-05 20:56:14 -0500
A number of synchronization operations on the critical path of PGAS-style programming models that wish to target MPI-3 RMA would be greatly optimized by functions that take a vector of windows as arguments.
The reason for this is that many networks (including shared memory) handle synchronization at a different granularity than window objects, so per-window synchronization introduces unnecessary overhead. The memory barrier in `MPI_WIN_SYNC` is a good example. Another example is when internode operations happen on M contexts (M is often, but not necessarily, 1), where M may be much less than N, the number of windows. In this case, these routines may save N-M synchronization operations internally.

The meaning of these functions is rather obvious from their signatures. A functional but unoptimized implementation of the first would simply call `MPI_Win_sync` on each of the `count` windows, where each such call is essentially equivalent to a memory barrier. An optimized implementation would instead issue a single memory barrier for the entire vector of windows. That optimized implementation would be significantly faster than the naive one, and there are platforms where a full memory barrier is relatively expensive (Blue Gene/Q is such a platform).
In the case of the flush operations, the optimization across windows applies whenever, due to the software or hardware implementation, there is no separation of traffic between any two ranks associated with a window. For example, if `MPI_Win_flush_all` is implemented for Cray Aries using `dmapp_gsync`, all traffic to all remote PEs (equivalent to MPI processes) is quiesced at that moment, hence it is superfluous to call this operation repeatedly for multiple windows. At a higher level, MPICH's CH3 is an ordered channel, and I believe there is one queue for all RMA packets, hence `MPI_Win_flush` on one window will have that effect on all other windows for RMA operations issued prior to that invocation.

Per MPI Forum convention, the function names should not be a blocking issue until the underlying feature set is decided upon, but I elected to go with "nfoo" instead of something else because I do not want to confuse the user with a trailing "v" (as in `MPI_Alltoallv`), since that has a different meaning. And yes, I find the possibility of an "N'Sync" operation in the MPI standard amusing.