devreal opened 7 months ago
Note that "simultaneous" is possible only in Newtonian space time. In this universe, it is not possible in general. Of course, a single system that defines a clear frame of reference can define "simultaneous" WRT that rest frame. But an MPI system made of communicating processes without a clear frame of reference (e.g., satellite networks) won't have a precise definition of what it means to be "at the same time".
I am concerned that this API creates a false promise to the user. I added one comment line below. If outflag was already set to 1, the user thinks they know something, but it's not true.
We already have MPI_WTIME_IS_GLOBAL, which can employ recent advances such as PTP and addresses tracing use cases.
In short, because MPI doesn't control OS noise or scheduling, it's not in a position to guarantee any sort of harmony between processes.
/* check if we are within the time epoch */
if (MPITS_Clocksync_get_time(&cs) > barrier_stamp) {
    *outflag = 0;
    data->sync_failed = 1;
} else {
    *outflag = 1;
    data->sync_failed = 0;
}
/* wait for the epoch to end */
while (MPITS_Clocksync_get_time(&cs) <= barrier_stamp)
    ;
/* OS noise goes crazy here */
return MPI_SUCCESS;
The problem with MPI_WTIME_IS_GLOBAL is that without hardware support it is impossible to guarantee anything for the entire duration of an application. The proposal here comes as an extension to MPI_WTIME_IS_GLOBAL, allowing the user to regularly update a base timeline if they need to. Not perfect, but significantly better than an MPI_Barrier.
Moreover, some networks already have a similar capability built in; this could be a way to expose it to users.
We already have MPI_WTIME_IS_GLOBAL, which can employ recent advances such as PTP and addresses tracing use cases.
MPI_WTIME_IS_GLOBAL is nice to have for implementing process harmonization but is a) not guaranteed to be available, and b) does not provide what applications need. Users still have to implement a way to actually harmonize processes, a procedure similar to what is proposed here.
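For illustration, here is a rough sketch of the kind of harmonization code applications end up writing themselves on top of a global clock. All names are mine, and the MPI exchange is only indicated in comments so the deadline logic can be run stand-alone:

```c
/* Hand-rolled harmonization sketch, as applications do it today on top
 * of a global clock (e.g. when MPI_WTIME_IS_GLOBAL is true). The clock
 * is injected as a function pointer so the logic runs without MPI; in a
 * real program it would be MPI_Wtime. */

/* Spin until `now` passes `deadline`. Returns 1 if we arrived before
 * the deadline, 0 if it had already passed (harmonization failed). */
static int wait_until(double (*now)(void), double deadline)
{
    if (now() > deadline)
        return 0;
    while (now() <= deadline)
        ; /* busy-wait: sleeping would reintroduce scheduler jitter */
    return 1;
}

/* Stand-in clock so this sketch can be exercised without MPI:
 * advances one tick per call. */
static double fake_time = 0.0;
static double fake_clock(void) { return fake_time += 1.0; }

/* With MPI, the deadline is agreed on collectively, e.g.:
 *   double now = MPI_Wtime(), latest;
 *   MPI_Allreduce(&now, &latest, 1, MPI_DOUBLE, MPI_MAX, comm);
 *   int ok = wait_until(MPI_Wtime, latest + slack);
 * where `slack` must cover the allreduce latency plus the clock error. */
```

Picking `slack` is exactly the hard part: too small and harmonization fails spuriously, too large and every process wastes time spinning.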
But an MPI system made of communicating processes without a clear frame of reference (e.g., satellite networks) won't have a precise definition of what it means to be "at the same time".
and
In short, because MPI doesn't control OS noise or scheduling, it's not in a position to guarantee any sort of harmony between processes.
I agree that there is no guaranteed perfect harmonization. Implementations can only make a best effort.
MPI provides abstractions for operations that are commonly used by applications and/or require some engineering to do efficiently. Yes, applications could implement their own broadcast, but there are a handful of different ways to do that, so MPI implementations take on the burden of implementing them and hide the complexity behind a well-defined API. Do implementations always select the fastest algorithm? Probably not, but they make a best effort to provide the best performance. The quality of process harmonization (i.e., the resulting skew between processes) is a performance property and can range from what a barrier provides today (many microseconds) down to less than a microsecond if the network provides clock synchronization capabilities, and probably close to a microsecond with proper software clock synchronization.
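To make "software clock synchronization" concrete: it usually reduces to estimating each process's clock offset from a reference via ping/pong exchanges. A minimal sketch (the function name and the symmetric-delay assumption are mine, not from the proposal):

```c
/* One ping/pong sample: the local side records t_send when the request
 * leaves and t_recv when the reply returns; the remote side stamps the
 * reply with its own clock, t_remote. Assuming symmetric one-way delays,
 * t_remote corresponds to the local midpoint (t_send + t_recv) / 2, so
 * the remote clock's offset from the local one is: */
static double estimate_offset(double t_send, double t_remote, double t_recv)
{
    return t_remote - 0.5 * (t_send + t_recv);
}
```

Taking many samples and keeping the one with the smallest round trip (t_recv - t_send) filters out OS and network noise; schemes like NTP and PTP build on this idea.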
The application is able to describe its intentions to the MPI implementation: either process synchronization without regard to timing (MPI_Barrier) or process harmonization (MPI_Harmonize) if timing is important.
Problem
MPI distinguishes between synchronizing and non-synchronizing collective operations.
MPI_Barrier is a commonly used synchronization primitive in codes that want to synchronize processes before entering a subsequent program region. It is used in most of the commonly used benchmark suites (OSU, IMB, mpiBench) in an attempt to ensure a uniform start of the collective operation under test. However, the MPI standard does not provide any time synchronization guarantees for MPI_Barrier, leading to arbitrary arrival patterns into the collective under test and thus skewing the results of the benchmark. This issue has been discussed in the literature, but it is likely that the complexities of proper time synchronization have kept benchmark implementors from ensuring it.
Proposal
MPI should provide a mechanism for time synchronization of processes. This is different from exposing clock-synchronization functionality in that MPI will not directly expose a synchronized clock. Instead, MPI will provide a procedure from which processes return at the "same" physical time. Due to the nature of distributed systems, the "same time" can only be an approximation, so implementations will have to make a best effort. We call this mechanism harmonization and the proposed function MPI_Harmonize.

MPI is in a unique place as it has knowledge of the underlying hardware, including the node-local clocks and the network. For example, implementations can employ globally synchronized hardware clocks if available.
Changes to the Text
Introduce MPI_Harmonize that takes the following arguments:

A call to MPI_Harmonize acts like a barrier on comm, with the extended requirement that processes return from the call at the same time based on an internal synchronized virtual clock (without synchronizing the system clock itself). Applications will be able to use this functionality to harmonize process execution and approximate a uniform arrival pattern into program regions.

The flag parameter will be set to 1 upon return if the local process found that its execution was successfully harmonized with the other processes, and 0 otherwise. The value does not represent a global state and thus might differ between processes. It is up to the application to ensure that all processes are harmonized by checking the returned flag at a convenient point in time (ideally without introducing additional skew between processes before entering the region under test). Harmonization may fail spuriously, e.g., due to OS noise, network jitter, and clock drift, so applications must be able to handle these cases.
Impact on Implementations
Implementations should provide MPI_Harmonize, potentially based on the reference implementation linked below.
Impact on Users
Applications are provided with a mechanism for approximating a uniform execution of processes. Users of benchmarks can rely on benchmark results that are not skewed by the underlying barrier implementation.
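To sketch how a benchmark might consume the flag, here is an illustrative retry pattern. All names are mine, and the proposed call is replaced by a stub so the control flow runs stand-alone:

```c
/* Benchmark-style usage pattern for the proposed call. `harmonize`
 * stands in for MPI_Harmonize(comm, &flag); a real program would also
 * MPI_Allreduce the flags (with MPI_MIN) afterwards to learn whether
 * *every* process was harmonized before trusting the measurement. */
static int measure_harmonized(int (*harmonize)(int *),
                              void (*region)(void), int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        int flag = 0;
        harmonize(&flag); /* all processes aim to return here together */
        region();         /* the code whose timing we care about */
        if (flag)
            return 1;     /* measurement valid on this process */
        /* flag == 0: harmonization failed spuriously; retry */
    }
    return 0;             /* give up after max_tries attempts */
}

/* Stubs so the pattern can be exercised without MPI. */
static int stub_ok(int *flag)   { *flag = 1; return 0; }
static int stub_fail(int *flag) { *flag = 0; return 0; }
static void noop_region(void)   { }
```

Note that the flag is checked only after the region has run, as the proposal suggests, so that the check itself does not introduce skew before the region under test.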
References and Pull Requests
Reference implementation: https://github.com/devreal/mpix-harmonize
EuroMPI'23 paper: https://dl.acm.org/doi/10.1145/3615318.3615325
PR: https://github.com/mpi-forum/mpi-standard/pull/965
This work was a collaboration with Sascha Hunold (UWien), who does not regularly attend the Forum meetings but should be mentioned here.