devreal opened 7 months ago
Note that "simultaneous" is possible only in Newtonian space time. In this universe, it is not possible in general. Of course, a single system that defines a clear frame of reference can define "simultaneous" WRT that rest frame. But an MPI system made of communicating processes without a clear frame of reference (e.g., satellite networks) won't have a precise definition of what it means to be "at the same time".
I am concerned that this API creates a false promise to the user. I added one comment line below. If outflag was already set to 1, the user thinks they know something, but it's not true.
We already have MPI_WTIME_IS_GLOBAL, which can employ recent advances such as PTP and addresses tracing use cases.
In short, because MPI doesn't control OS noise or scheduling, it's not in a position to guarantee any sort of harmony between processes.
/* check if we are within the time epoch */
if (MPITS_Clocksync_get_time(&cs) > barrier_stamp) {
    *outflag = 0;
    data->sync_failed = 1;
} else {
    *outflag = 1;
    data->sync_failed = 0;
}
/* wait for the epoch to end */
while (MPITS_Clocksync_get_time(&cs) <= barrier_stamp)
    ;
/* OS noise goes crazy here */
return MPI_SUCCESS;
The problem with MPI_WTIME_IS_GLOBAL is that without hardware support it is impossible to guarantee anything for the entire duration of an application. The proposal here comes as an extension to MPI_WTIME_IS_GLOBAL, allowing the user to regularly update a base timeline if they need to. Not perfect, but significantly better than an MPI_Barrier.
Moreover, some networks already have a similar capability built in; this could be a way to expose it to users.
We already have MPI_WTIME_IS_GLOBAL, which can employ recent advances such as PTP and addresses tracing use cases.
MPI_WTIME_IS_GLOBAL is nice to have for implementing process harmonization but is a) not guaranteed to be available, and b) does not provide what applications need. Users still have to implement a way to actually harmonize processes, a procedure similar to what is proposed here.
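For illustration, here is a rough sketch of the kind of harmonization code applications end up writing themselves on top of a global clock. All names are mine, and the MPI exchange is only indicated in comments so the deadline logic can be run stand-alone:

```c
/* Hand-rolled harmonization sketch, as applications do it today on top
 * of a global clock (e.g. when MPI_WTIME_IS_GLOBAL is true). The clock
 * is injected as a function pointer so the logic runs without MPI; in a
 * real program it would be MPI_Wtime. */

/* Spin until `now` passes `deadline`. Returns 1 if we arrived before
 * the deadline, 0 if it had already passed (harmonization failed). */
static int wait_until(double (*now)(void), double deadline)
{
    if (now() > deadline)
        return 0;
    while (now() <= deadline)
        ; /* busy-wait: sleeping would reintroduce scheduler jitter */
    return 1;
}

/* Stand-in clock so this sketch can be exercised without MPI:
 * advances one tick per call. */
static double fake_time = 0.0;
static double fake_clock(void) { return fake_time += 1.0; }

/* With MPI, the deadline is agreed on collectively, e.g.:
 *   double now = MPI_Wtime(), latest;
 *   MPI_Allreduce(&now, &latest, 1, MPI_DOUBLE, MPI_MAX, comm);
 *   int ok = wait_until(MPI_Wtime, latest + slack);
 * where `slack` must cover the allreduce latency plus the clock error. */
```

Picking `slack` is exactly the hard part: too small and harmonization fails spuriously, too large and every process wastes time spinning.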
But an MPI system made of communicating processes without a clear frame of reference (e.g., satellite networks) won't have a precise definition of what it means to be "at the same time".
and
In short, because MPI doesn't control OS noise or scheduling, it's not in a position to guarantee any sort of harmony between processes.
I agree that there is no guaranteed perfect harmonization. Implementations can only make a best effort.
MPI provides abstractions for operations that are commonly used by applications and/or require some engineering to do efficiently. Yes, applications could implement their own broadcast, but there are a handful of different ways to do that, so MPI implementations take on the burden of implementing them and hide the complexity behind a well-defined API. Do implementations always select the fastest algorithm? Probably not, but they make a best effort to provide the best performance. The quality of process harmonization (i.e., the resulting skew between processes) is a performance property and can range from what a barrier provides today (many microseconds) down to less than a microsecond if the network provides clock synchronization capabilities, and probably close to a microsecond with proper software clock synchronization.
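To make "software clock synchronization" concrete: it usually reduces to estimating each process's clock offset from a reference via ping/pong exchanges. A minimal sketch (the function name and the symmetric-delay assumption are mine, not from the proposal):

```c
/* One ping/pong sample: the local side records t_send when the request
 * leaves and t_recv when the reply returns; the remote side stamps the
 * reply with its own clock, t_remote. Assuming symmetric one-way delays,
 * t_remote corresponds to the local midpoint (t_send + t_recv) / 2, so
 * the remote clock's offset from the local one is: */
static double estimate_offset(double t_send, double t_remote, double t_recv)
{
    return t_remote - 0.5 * (t_send + t_recv);
}
```

Taking many samples and keeping the one with the smallest round trip (t_recv - t_send) filters out OS and network noise; schemes like NTP and PTP build on this idea.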
The application is able to describe its intentions to the MPI implementation: either process synchronization without regard to timing (MPI_Barrier) or process harmonization (MPI_Harmonize) if timing is important.
Problem
MPI distinguishes between synchronizing and non-synchronizing collective operations.
MPI_Barrier is a commonly used synchronization primitive in codes that want to synchronize processes before entering a subsequent program region. It is used in most of the commonly used benchmark suites (OSU, IMB, mpiBench) in an attempt to ensure a uniform start of the collective operation under test. However, the MPI standard does not provide any time synchronization guarantees for MPI_Barrier, leading to arbitrary arrival patterns into the collective under test and thus skewing the results of the benchmark. This issue has been discussed in the literature, but it is likely that the complexities of proper time synchronization have kept benchmark implementors from ensuring it.
Proposal
MPI should provide a mechanism for time synchronization of processes. This is different from exposing clock-synchronization functionality in that MPI will not directly expose a synchronized clock. Instead, MPI will provide a procedure from which processes return at the "same" physical time. Due to the nature of distributed systems, the "same time" can only be an approximation, so implementations will have to make a best effort. We call this mechanism harmonization and the proposed function MPI_Harmonize.

MPI is in a unique place as it has knowledge of the underlying hardware, including the node-local clocks and the network. For example, implementations can employ globally synchronized hardware clocks if available.
Changes to the Text
Introduce MPI_Harmonize that takes the following arguments:

A call to MPI_Harmonize acts like a barrier on comm, with the extended requirement that processes return from the call at the same time based on an internal synchronized virtual clock (without synchronizing the system clock itself). Applications will be able to use this functionality to harmonize process execution and approximate a uniform arrival pattern into program regions.

The flag parameter will be set to 1 upon return if the local process found that its execution was successfully harmonized with the other processes, and 0 otherwise. The value does not represent a global state and thus might differ between processes. It is up to the application to ensure that all processes are harmonized by checking the returned flag at a convenient point in time (ideally without introducing additional skew between processes before entering the region under test). Harmonization may fail spuriously, e.g., due to OS noise, network jitter, and clock drift, so applications must be able to handle these cases.
Impact on Implementations
Implementations should provide MPI_Harmonize, potentially based on the reference implementation linked below.
Impact on Users
Applications are provided with a mechanism for approximating a uniform execution of processes. Users of benchmarks can rely on benchmark results that are not skewed by the underlying barrier implementation.
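To sketch how a benchmark might consume the flag, here is an illustrative retry pattern. All names are mine, and the proposed call is replaced by a stub so the control flow runs stand-alone:

```c
/* Benchmark-style usage pattern for the proposed call. `harmonize`
 * stands in for MPI_Harmonize(comm, &flag); a real program would also
 * MPI_Allreduce the flags (with MPI_MIN) afterwards to learn whether
 * *every* process was harmonized before trusting the measurement. */
static int measure_harmonized(int (*harmonize)(int *),
                              void (*region)(void), int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        int flag = 0;
        harmonize(&flag); /* all processes aim to return here together */
        region();         /* the code whose timing we care about */
        if (flag)
            return 1;     /* measurement valid on this process */
        /* flag == 0: harmonization failed spuriously; retry */
    }
    return 0;             /* give up after max_tries attempts */
}

/* Stubs so the pattern can be exercised without MPI. */
static int stub_ok(int *flag)   { *flag = 1; return 0; }
static int stub_fail(int *flag) { *flag = 0; return 0; }
static void noop_region(void)   { }
```

Note that the flag is checked only after the region has run, as the proposal suggests, so that the check itself does not introduce skew before the region under test.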
References and Pull Requests
Reference implementation: https://github.com/devreal/mpix-harmonize
EuroMPI'23 paper: https://dl.acm.org/doi/10.1145/3615318.3615325
PR: https://github.com/mpi-forum/mpi-standard/pull/965
This work was a collaboration with Sascha Hunold (UWien), who does not regularly attend the Forum meetings but should be mentioned here.