mpiwg-rma / rma-issues

Repository to discuss internal RMA working group issues

MPI 3.1/4 – how can progress be ensured with passive-target synchronization? #28

Open csubich opened 10 months ago

csubich commented 10 months ago

Greetings,

I initially interpreted the issue below as a bug in the Intel MPI implementation, but after I posted on their community forums^1, Intel confirmed that this behaviour is allowable under their interpretation. Either I'm misinterpreting the specification (and thus doing something undefined), Intel is wrong, or the specification is ambiguous.

Problem

Somewhat recently, I was flummoxed by a deadlock in MPI code that used passive-target synchronization. A local process would spin-wait on a variable in a shared-memory window (using local load and MPI_Win_sync), and a remote process would (eventually) update that variable with MPI_Fetch_and_op.

The expected result was… well, progress. Instead, the Intel MPI implementation would reliably deadlock whenever the RMA operation involved communication over an interconnect (i.e. anything more than a local shared-memory access). Sample code is as follows:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char ** argv) {
    MPI_Init(&argc, &argv); // Initialize MPI
    int rank, nproc;
    // Get MPI rank and world size
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    MPI_Comm_size(MPI_COMM_WORLD,&nproc);

    int * rma_memory; // RMA memory (to be allocated)
    MPI_Win rma_window;
    MPI_Win_allocate(sizeof(int),1,MPI_INFO_NULL,MPI_COMM_WORLD,&rma_memory,&rma_window);

    // Get and display memory model for window
    int *memory_model, flag;
    MPI_Win_get_attr(rma_window, MPI_WIN_MODEL, &memory_model, &flag);
    if (*memory_model == MPI_WIN_UNIFIED) {
        printf("Rank %d created RMA window with the unified memory model\n",rank);
    } else if (*memory_model == MPI_WIN_SEPARATE) {
        printf("Rank %d created RMA window with the separate memory model\n",rank);
    } else {
        printf("Rank %d created RMA window with an unknown memory model(???)\n",rank);
    }

    *rma_memory = 0; // Initialize to zero

    // All processes will lock the window
    MPI_Win_lock_all(MPI_MODE_NOCHECK,rma_window);

    if (rank == 0) { 
        // Rank 0: wait for rank 1 to enter its spinlock, then use MPI_Fetch_and_op to increment
        // *rma_memory at rank 1

        // Receive zero-byte message indicating that rank 1 is ready to enter its spinlock
        MPI_Recv(0,0,MPI_BYTE,1,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);

        // Wait a further 0.1s so that rank 1 should have assuredly completed any progress-making
        // MPI calls
        double tic = MPI_Wtime(); 
        while (MPI_Wtime() - tic < 0.1); 

        tic = MPI_Wtime(); // Reset tic value to account for delay

        // Perform fetch-and-op
        int one = 1; 
        int result = -1;
        MPI_Fetch_and_op(&one, &result, MPI_INT, 1, 0, MPI_SUM, rma_window);

        // Flush the window to ensure completion
        MPI_Win_flush_all(rma_window); 

        printf("Rank 0: sent %d, received %d (should be 0) in %.2fms\n",one, result, (MPI_Wtime() - tic)*1e3);
    } else if (rank == 1) {
        // Rank 1: Send a message to rank 0 indicating readiness for Fetch_and_op
        MPI_Send(0,0,MPI_BYTE,0,0,MPI_COMM_WORLD);

        double tic = MPI_Wtime();

        // Spinlock waiting for '1' to be written to the RMA_Window
        while (*rma_memory != 1 && MPI_Wtime() - tic < 5) {
            // Separate memory model: synchronize remote and local copies of window
            // Unified memory model: memory barrier
            MPI_Win_sync(rma_window);
        }
        int old_value = *rma_memory;
        printf("Rank 1: Memory value %d (should be 1) in %.2fms\n",old_value,1e3*(MPI_Wtime()-tic-0.1));

        // Demonstrate forced progress
        MPI_Win_flush(1,rma_window); // Should be a no-op, since there are no pending RMA ops from this rank
        MPI_Win_sync(rma_window);
        if (old_value != *rma_memory) {
            printf("Rank 1: After flush, memory value is now %d (should be 1)\n",*rma_memory);
        }
    }

    MPI_Win_unlock_all(rma_window);
    MPI_Win_free(&rma_window);
    MPI_Finalize();
    return 0;
}

The problem is visible even on a single node of a cluster when the shared-memory interconnect is disabled:

$ mpirun -genv 'I_MPI_SHM=off' -np 2 ./a.out
Rank 0 created RMA window with the unified memory model
Rank 1 created RMA window with the unified memory model
Rank 1: Memory value 0 (should be 1) in 4900.00ms
Rank 1: After flush, memory value is now 1 (should be 1)
Rank 0: sent 1, received 0 (should be 0) in 4900.14ms

The root problem appears to be that rank 0 is waiting on the assistance of rank 1 to complete the fetch part of Fetch_and_op, but the MPI_Win_sync inside the spinlock on rank 1 does not engage the MPI progress engine.

I think this behaviour is surprising, if not spec-noncompliant. Per §12.7 of the 4.1 draft (atomicity isn't the problem here):

U2. Accessing a location in the window that is also the target of a remote update is valid (not erroneous) but the precise result will depend on the behavior of the implementation. Updates from an origin process will appear in the memory of the target, but there are no atomicity or ordering guarantees if more than one byte is updated. Updates are stable in the sense that once data appears in memory of the target, the data remains until replaced by another update. This permits polling on a location for a change from zero to nonzero or for a particular value, but not polling and comparing the relative magnitude of values.

Replacing the Fetch_and_op call on rank 0 with separate MPI_Get and MPI_Put calls does function properly, without deadlock, even if it has ambiguous correctness (I'm not sure about the combination of Get and Put with the same RMA target) and is absolutely erroneous in the general case of multiple writers.
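For concreteness, a sketch of that replacement on rank 0, reusing names from the sample code (the read-modify-write below is not atomic, which is exactly why it is erroneous with multiple writers):

// Non-atomic read-modify-write standing in for the fetch-and-op
int result = -1;
MPI_Get(&result, 1, MPI_INT, 1, 0, 1, MPI_INT, rma_window);
MPI_Win_flush(1, rma_window);  // complete the get before using 'result'
int incremented = result + 1;
MPI_Put(&incremented, 1, MPI_INT, 1, 0, 1, MPI_INT, rma_window);
MPI_Win_flush(1, rma_window);  // complete the put at the target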

Proposal

Prior to the MPI 4.1 draft, I would have asked that the call to Win_sync engage the progress engine even in the unified memory model, but that's now explicitly not required (p608). I'm not sure what change is now required, if this deadlock is not in fact an implementation bug.

The trouble seems to be twofold:

- Completing the fetch half of an operation like Fetch_and_op may require active participation by the target process, and
- no call that a pure target is obliged to make during a passive-target epoch (in particular MPI_Win_sync) is required to engage the progress engine.

Changes to the Text

Again presuming this behaviour is intended or allowable:

Impact on Implementations

Implementations will likely object to any specification-guaranteed background progress. In the Intel forum thread linked above, the Intel representative closed the issue (then presented as a question about progress during Win_sync) on the grounds that Win_sync is simply a memory barrier.

Impact on Users

At minimum, black-and-white documentation about participation requirements would have saved me, a user, a few headaches trying to distill the unexpected deadlock into a minimal example. (The original case involved Fetch_and_op to one location, then an MPI_Put of a status flag to a second; the latter was spinwaited upon, blocking completion of the original Fetch_and_op.)

References and pull requests:

jeffhammond commented 10 months ago

Replace MPI_Win_sync(rma_window); with MPI_Win_flush(0,rma_window); and it should work.
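A minimal sketch of the change against the sample code above (the MPI_Win_sync memory barrier is retained here so the local load observes the update):

// Rank 1's spin loop with the suggested change; rank 0 is the origin of the
// fetch-and-op, so it is used as the flush target.
while (*rma_memory != 1 && MPI_Wtime() - tic < 5) {
    MPI_Win_flush(0, rma_window); // nothing pending from this rank, but drives progress
    MPI_Win_sync(rma_window);     // memory barrier for the local load
}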

We've debated this topic for most of the years I've been in the MPI Forum, without making any progress on it (pun intentional).

https://pmodels.github.io/casper-www/ was the pragmatic response from Argonne when I was there.

jeffhammond commented 10 months ago

This is what most people use:

int MPIX_Engage_progress(void)
{
  return MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

csubich commented 10 months ago

Replace MPI_Win_sync(rma_window); with MPI_Win_flush(0,rma_window); and it should work.

It does, yes, but I get worried about a naive implementation of flush that has some communication overhead even when no RMA operations are pending.

We've debated this topic for most of the years I've been in the MPI Forum, without making any progress on it (pun intentional).

In that case, might I request a note in a future MPI spec to state that in a passive communication epoch, even processes that only expect to be the targets of communication may need to periodically ensure MPI progress?

The headline description of a passive target epoch (p588, v4.1RFC) states:

The MPI process that owns the target window may be distinct from the two communicating MPI processes, in which case it does not participate explicitly in the communication.

… which gives the vague impression that no matter the RMA calls executed by other processes, the process that owns the target window does not need to do anything to ensure their eventual completion.

It'd be better yet if the specification could split the RMA calls into ones that may require progress on the target and ones that definitely do not, but I suspect that would be controversial.

return MPI_Probe

Oh, that makes a lot of sense, and I can use an IProbe version in the RMA context. I tend to forget that the Probe calls exist.
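For instance, a nonblocking variant along those lines (hypothetical helper name):

int MPIX_Try_progress(void)
{
    int flag; // result deliberately ignored; the call only drives the progress engine
    return MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                      &flag, MPI_STATUS_IGNORE);
}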

jeffhammond commented 10 months ago

Replace MPI_Win_sync(rma_window); with MPI_Win_flush(0,rma_window); and it should work.

It does, yes, but I get worried about a naive implementation of flush that has some communication overhead even when no RMA operations are pending.

It shouldn't be expensive. Checking a queue for emptiness is trivial; it's going to be less expensive than error checking, probably by a lot.

We've debated this topic for most of the years I've been in the MPI Forum, without making any progress on it (pun intentional).

In that case, might I request a note in a future MPI spec to state that in a passive communication epoch, even processes that only expect to be the targets of communication may need to periodically ensure MPI progress?

I think it would be useful to add an example of the use case you have, which is not uncommon, and show how to do it correctly.

The headline description of a passive target epoch (p588, v4.1RFC) states:

The MPI process that owns the target window may be distinct from the two communicating MPI processes, in which case it does not participate explicitly in the communication.

… which gives the vague impression that no matter the RMA calls executed by other processes, the process that owns the target window does not need to do anything to ensure their eventual completion.

Yes, but eventually is a long time 😉

It'd be better yet if the specification could split the RMA calls into ones that may require progress on the target and ones that definitely do not, but I suspect that would be controversial.

I doubt it would be controversial. @Wee-Free-Scot and others have been looking at issues like this for a while.

return MPI_Probe

Oh, that makes a lot of sense, and I can use an IProbe version in the RMA context. I tend to forget that the Probe calls exist.

Wee-Free-Scot commented 10 months ago

The new (in MPI-4.1) subsection 2.9 in the Terms chapter defines "progress" in the MPI Standard for the first time. It only gives example procedures from point-to-point, but the rules stated there can be used to put RMA procedures into categories "must guarantee progress" and "others".

@csubich Once you've had a chance to read that and to apply it to RMA, any remaining confusion or unresolved conflicts would be really useful feedback.

csubich commented 7 months ago

@Wee-Free-Scot I apologize for having missed your request for feedback earlier. I've come back to this project and I've had the opportunity to look at the new language.

The extra language is nice, and having a black and white definition of 'progress' certainly helps. The definition of 'decoupled MPI activities' also helps explain the sense in which RMA calls might need help from a progress engine.

One subtlety, however, is that the text permits an MPI implementation not to make progress in calls that might block but happen not to. For example, a process that calls MPI_Sendrecv to send a message to itself does not force other MPI activities to make progress, because the implementation can effect the sendrecv without blocking. (The Intel MPI mentioned above does just this.)

This leads to a perverse interpretation of 'progress' that (I think) is permitted by 4.1 but might still cause the above code to deadlock:

A minor extension to the specification to include a "must make RMA progress" call would be to permit MPI_Win_test to be called on non-PSCW windows, defined to always return flag=false[^1] and thus engage progress.
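Under that (hypothetical) extension, the spin loop in the sample code could drop the point-to-point machinery entirely:

int flag; // always false for a non-PSCW window under this proposal
while (*rma_memory != 1) {
    MPI_Win_test(rma_window, &flag); // hypothetical: permitted here, engages progress
    MPI_Win_sync(rma_window);        // memory barrier for the local load
}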

[^1]: Maybe the flag could be given semantic meaning, such as "false if this window is / might be part of an access epoch," allowing MPI_Win_fence synchronization with MPI_MODE_NOSUCCEED to cause flag=true, or any passive-target exposure where the implementation can know that no other process has locked the window (perhaps only provably true with the separate memory model?). Otherwise, "always flag=false" retains the existing meaning of "MPI_Win_wait would have immediately returned," since calling Win_wait on a non-PSCW window is erroneous.

Wee-Free-Scot commented 7 months ago

The "perverse interpretations" that you mention are all intentional implementation flexibility.

The canonical "kick the progress engine" call is a call to MPI_Iprobe with parameters that select a message that is never sent. Each call must return false, and the repetition of such calls must be guaranteed to make progress (even though no individual call is required to make progress). The requirement to make progress is MPI-wide -- so it includes RMA operations, even though the canonical procedure is from the point-to-point chapter.

Given this, overloading the MPI_Win_test procedure in this way seems superfluous. Separately, it seems problematic because you can (although certainly shouldn't) mix the RMA synchronisation methods on the same window, so calling MPI_Win_test has an existing meaning.

csubich commented 7 months ago

The canonical "kick the progress engine" call is a call to MPI_Iprobe with parameters that select a message that is never sent.

First, to take a step back I'll reiterate that it is frustrating and deeply unintuitive that legal sequences of RMA calls might require non-RMA calls to avoid deadlock. I accept that's how the specification works, but the complete absence of "messages" from the 'RMA way' makes IProbe a weird tool to reach for.

It doesn't help that this issue seems to be limited to passive-target synchronization. Both fence and PSCW contain blocking calls that force progress, so the originating problem can't occur there. Here, the obvious way of writing the passive-target code did lead to the problem, and I didn't ignore any bright signposts in the specification. I would have preferred it if Win_sync were required to eventually make progress in the same way as a false-returning test, but again, that ship's sailed.

Finally, if "a call to MPI_Iprobe with parameters that select a message that is never sent" is to be the only way to guarantee progress in this case, then in my opinion the specification ought to include a canonical example of it. In particular, library code can't guarantee that any tuple of (communicator, source, tag) is not being used elsewhere, so to robustly match nothing[^1] the library must clone a communicator (MPI_COMM_SELF[^2] being the only one whose duplication doesn't involve a collective call across processes) to give a sterile environment for the check. Even that might have unintended consequences if the source communicator has a keyval attached with a nontrivial copy callback.
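A sketch of that contortion, assuming the world model so that MPI_COMM_SELF exists:

// Setup (once): a duplicated communicator on which no other code can post
// sends, so the probe below is guaranteed to match nothing.
MPI_Comm progress_comm;
MPI_Comm_dup(MPI_COMM_SELF, &progress_comm);

// In the busy-wait loop: flag is always 0, but repeated calls are
// guaranteed to make MPI-wide progress.
int flag;
MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, progress_comm, &flag, MPI_STATUS_IGNORE);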

This feels absurd. The semantic intent of "make progress" is fundamental to MPI itself, but I'm really contorting myself to find a way to just engage the progress engine (if necessary) without making additional assumptions about the MPI environment.

so calling MPI_Win_test has an existing meaning.

I don't truly like proposing such an extension, but it's the only way I can see to "test" using only MPI objects guaranteed to have meaning for the RMA code, without side-effects elsewhere.

[^1]: I can't even use MPI_Proc_null to "try to match nothing and fail at it!" §3.10 defines such an IProbe to return with flag=true.

[^2]: MPI_COMM_SELF is also not initialized under the "sessions" model, so even this approach might fail.

Wee-Free-Scot commented 7 months ago

I agree with much of your complaint and share much of your frustration.

IMHO, it was a major step forward to include a definition of strong progress vs. weak progress and to state that MPI only requires weak progress.

IMHO, the best fix would be to require strong progress for RMA passive target operations, even if the MPI library only supports weak progress everywhere else.

Tangential to that would be a query procedure that permitted the user to ask whether strong progress is supported (to permit the user to code around the limitations of weak progress only when absolutely needed).

Your proposal of adding MPI_Progress_do (or similar) has been suggested many times before and never gets enough support. For me, it should not be needed; the fact that it is needed indicates a deeper problem.


Have you tried (and measured the performance of) JeffH's suggestion of using MPI_Win_flush instead of MPI_Win_sync? We will need an incontrovertible failure mode for an undeniable use-case.

wgropp commented 7 months ago

I agree with this (mostly): passive target RMA really doesn't make much sense without strong progress. The exception is the lockall case - a subset of the uses here would work with weak progress. But that's rather ugly and probably not worth it.

csubich commented 7 months ago

Your proposal of adding MPI_Progress_do (or similar) has been suggested many times before and never gets enough support. For me, it should not be needed; the fact that it is needed indicates a deeper problem.

The problem is that the specification desperately wants to assume strong progress for passive-target synchronization, IMO to the point that it gives a misleading impression about progress. Consider this bit from §12.7 (4.1):

An update by a put or accumulate operation to a public window copy becomes visible in the private copy in MPI process memory at the latest when an ensuing call to MPI_WIN_WAIT, MPI_WIN_FENCE, MPI_WIN_LOCK, MPI_WIN_LOCK_ALL, or MPI_WIN_SYNC is executed on that window by the window owner. In the RMA unified memory model, an update by a put or accumulate operation to a public window copy eventually becomes visible in the private copy in MPI process memory without additional RMA calls.

… which describes the deadlock at issue here. The thin wedge of correctness comes in the "update to a public window copy" part, which is only guaranteed (earlier) by:

If an RMA operation is completed at the origin by a call to MPI_WIN_UNLOCK or MPI_WIN_FLUSH (with rank=target), MPI_WIN_UNLOCK_ALL, or MPI_WIN_FLUSH_ALL, then the operation is completed at the target by that same call.

… which, parsed very finely, does not guarantee that the origin's call to MPI_Win_flush must eventually return. (In contrast, U2 on p608 seems to contemplate a spinwait that polls a memory location via load, waiting for the value to change.)

The Rationale section on p616 starts to bring this problem to light, but on p617 it only warns about deadlock when using shared memory for loads:

The use of shared memory loads and/or stores for synchronizing purposes between MPI processes does not guarantee progress, and therefore a deadlock may occur if an MPI implementation does not provide strong progress, as shown in Example 12.13.

From the implementation's point of view, a local load/store from a window seems to be equivalent to a shared memory load/store, thus permitting the deadlock in this issue. For fence and PSCW window synchronizations, the synchronization primitives enforce completion-or-blocking, but passive-target synchronization doesn't have a natural way to do this.

The frustrating part is that I don't even really need proper, strong progress (although it'd be nice, certainly), just something to add to what would otherwise be busy-wait loops. The MPI_Iprobe call will serve me fine in practice because I'm not really dealing with a perversely adversarial outer program, but in turn I'm frustrated that, while the call is necessary, it's so ugly and fragile.

To summarize, my thesis statement is this: it is very surprising that a non-erroneous sequence of passive-target window operations, RMA calls, loads, and stores that always completes with strong progress can nonetheless deadlock with weak progress, absent explicit and seemingly extraneous calls to a progress-causing MPI routine from another chapter of the specification.


Have you tried (and measured performance of) JeffH's suggestion of using MPI_Win_flush instead of MPI_win_sync? We will need an incontrovertible failure mode for an undeniable use-case.

In my particular case with Intel MPI, MPI_Win_flush is sufficient to ensure progress. Perusing the source of MPICH (which I believe is the base for Intel MPI), Win_flush specifically polls for progress once the polling interval has passed, so any use of Win_flush breaks the deadlock, and as far as I can tell it has no negative performance implications.

For Open MPI (4.1.2a1), Win_flush(0,...) as suggested does break the deadlock, but using either the self rank (1) or another rank (2+, for suitable nproc) does not. From Open MPI's source, it seems the implementation doesn't (always) call the generic progress-doer, but rather checks for messages in specific queues. It seems I'd either need to call Win_flush_all or guess the rank that needs help with the fetch-and-op (trivial in the sample code, not necessarily possible in general).

At the moment, I can't give good performance measurements of the cost of the latter. On the cluster I'm using, Open MPI randomly and intermittently takes 10-40ms to resolve the deadlock even with non-flush progress-inducers, so the impact of flush_all isn't easy to see. The system administrators there prefer (and have tuned) Intel MPI, which is why I use it by default.

Besides that, however, Win_flush is not one of the functions that the 4.1 specification defines as "must make progress," since it is not a test function and might not block. From what I can tell, Open MPI treats the fetch-and-op as a request that the receiving process op-and-put, which coincidentally makes it something flush-able.

jeffhammond commented 7 months ago

I agree with this (mostly): passive target RMA really doesn't make much sense without strong progress. The exception is the lockall case - a subset of the uses here would work with weak progress. But that's rather ugly and probably not worth it.

NWChem has always worked without background progress. It just runs at half the speed because, on average, half the processes are waiting in Get while the other half are in DGEMM. Global Arrays in general does not deadlock in the absence of background progress although the performance can be terrible.

The Casper project provides a way to get background progress in RMA without changing an implementation. It has been challenging to add strong progress for some networks without intrusive changes that implementations aren't willing to do.

devreal commented 6 months ago

We talked about this at the January 18, 2024 WG meeting. The outcome was that it would be nice to have a procedure that ensures progress of incoming RMA operations in passive target without the overhead of iprobe (which triggers all other parts of MPI) or flush (which waits for outgoing operations to complete), and with the ability to become a no-op if no progress is needed. This could be a simple addition to the standard within the context of RMA and should be fairly straightforward to implement. We just need a name for it :)

Wee-Free-Scot commented 6 months ago

This is a new semantic in MPI -- currently everything that ensures progress is required to ensure all types of progress, that is, progress for all of the pending MPI operations (equivalently, for all of the MPI decoupled activities) irrespective of the "type".

That is not an argument against the proposed addition, but it is a note of caution regarding how tricky it might be to write a good definition of what the new procedure is required to do and what it is permitted not to do.

Wee-Free-Scot commented 6 months ago

Example wording: "Repeated calls to this procedure guarantee progress of MPI decoupled activities at the calling MPI process that are generated by enabled passive target operations targeting the calling MPI process from any origin process. In contrast to other MPI procedures that provide a guarantee of progress, this procedure provides no guarantee of progress for any other MPI operations."

Wee-Free-Scot commented 6 months ago

Name: MPI_PASSIVE_PROGRESS ?

devreal commented 6 months ago

We could name it MPI_WIN_PROGRESS and define that it only applies to windows used with passive target synchronization. That's already true for flush*.

Alternatives if we want to avoid progress in the name:

1) MPI_WIN_PROCESS (process incoming operations)
2) MPI_WIN_EXTRACT (extract operations from the network)
3) MPI_WIN_ADVANCE (advance incoming operations)
4) MPI_WIN_TRY (because MPI_WIN_TEST is already taken and we want to try to progress incoming operations)
5) MPI_WIN_CHECK (check for and process incoming operations)

Wee-Free-Scot commented 6 months ago

Question: will it be required that the user must ensure the program is correct even if this new procedure is implemented as a no-op? So, if the user places calls to this new procedure judiciously and the implementation chooses to do something, then the correct program might perform better, but it is still correct regardless of these conditions?

Question: will the new procedure indicate anything to the caller, such as "something happened!" vs. "nothing yet" or "the number of passive target operations completed since the last window synchronization is now X"? Procedures like MPI_TESTALL indicate whether all the referenced operations (I'm thinking MPI_RPUT, etc) are complete; MPI_WIN_TEST indicates whether all operations originating at the calling process are complete. Those are origin-side completion; this new procedure is target-side -- should it indicate completion at the target and, if so, of which operation(s)? Can I use it to ensure that a passive target operation has completed at the target (i.e., here) and then use the message that was put or re-use the memory from which a get was reading a message?

Summary: will this new procedure change user-visible semantics of RMA?

jeffhammond commented 6 months ago

I'd rather just extend MPI_Win_flush instead of adding a new function. We could define a new process sentinel like MPI_PROC_REMOTE and say that MPI_Win_flush(MPI_PROC_REMOTE,win) progresses all incoming operations, rather than the ones initiated by the calling process.
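In the sample code from the top of the thread, rank 1's loop would then read (hypothetical sentinel; MPI_PROC_REMOTE does not exist in any MPI standard):

while (*rma_memory != 1) {
    MPI_Win_flush(MPI_PROC_REMOTE, rma_window); // hypothetical: progress incoming operations
    MPI_Win_sync(rma_window);                   // memory barrier for the local load
}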

devreal commented 6 months ago

I'd rather just extend MPI_Win_flush instead of adding a new function. We could define a new process sentinel like MPI_PROC_REMOTE and say that MPI_Win_flush(MPI_PROC_REMOTE,win) progresses all incoming operations, rather than the ones initiated by the calling process.

I agree that that could be easier. My concern with extending the semantics of MPI_WIN_FLUSH is that currently the return from MPI_WIN_FLUSH guarantees completion of operations. There is no such guarantee with MPI_PROC_REMOTE, because we only see what has arrived at that specific time, not what was (or will be) sent by the remote process. In other words: can we define a clear completion criterion for flushing incoming operations? (Do we need to?)

Question: will it be required that the user must ensure the program is correct even if this new procedure is implemented as a no-op? So, if the user places calls to this new procedure judiciously and the implementation chooses to do something, then the correct program might perform better, but it is still correct regardless of these conditions?

I'm not sure I fully understand. A program is incorrect if it does not complete due to lack of progress. This procedure is a tool to help passive target RMA programs ensure correctness by giving them a chance to process outstanding operations. The use of this procedure is required if there are no other well-defined (and obvious) means to progress decoupled RMA activities (e.g., MPI_Test/Wait on a request that is part of the application-defined synchronization protocol). This is a portability tool to help with platforms that do not provide full asynchronous progress. If it is a no-op then there is no harm.

Question: will the new procedure indicate anything to the caller, such as "something happened!" vs. "nothing yet" or "the number of passive target operations completed since the last window synchronization is now X"?

I'm not sure this could be done reliably. Some incoming operations may have been handled by a previous progress invocation (test/wait on a P2P request). Should they be counted? What happens if multiple threads end up in the progress engine even though they do different things? Clearly the information returned by a call to this facility cannot be relied upon to accurately count the number of completed incoming operations so the value of this information is limited. So I'd rather not even try to count them. If the application must know then there are ways to count on the application side (using atomic ops).

jeffhammond commented 6 months ago

I'd rather just extend MPI_Win_flush instead of adding a new function. We could define a new process sentinel like MPI_PROC_REMOTE and say that MPI_Win_flush(MPI_PROC_REMOTE,win) progresses all incoming operations, rather than the ones initiated by the calling process.

I agree that that could be easier. My concerns with extending the semantics of MPI_WIN_FLUSH is that currently the return from MPI_WIN_FLUSH guarantees completion of operations. There is no such guarantee with MPI_PROC_REMOTE because we only see what has arrived at that specific time, not what was (or will be) sent by the remote process. So in other words: can we define a clear completion criteria for flushing incoming operations? (do we need to?)

A call to MPI_WIN_FLUSH(MPI_PROC_REMOTE,win) causes the calling process to participate in incoming RMA operations that require such activity to be completed remotely from the perspective of the initiating process. For example, an implementation of MPI_ACCUMULATE might require the target process to perform the reduction operation; a call to MPI_WIN_FLUSH(MPI_PROC_REMOTE,win) will cause this to occur if it is possible.

csubich commented 6 months ago

A call to MPI_WIN_FLUSH(MPI_PROC_REMOTE,win) causes the calling process to participate in incoming RMA operations that require such activity to be completed remotely from the perspective of the initiating process.

A subtle point: attaching this to MPI_Win_flush would logically require progress with that specific MPI call, with the flush blocking until the "local side" of any remote operation is complete.

The minimum function necessary to address my top-of-thread issue is not a "do it now" progress, but a "do it eventually" progress. Calling MPI_TBD on process local in a loop should result in (in no specific order):

- any enabled passive-target RMA operation targeting local eventually completing at the target, with its result visible to local loads (after MPI_Win_sync as appropriate), and
- that operation eventually completing at its origin (e.g., the origin's flush eventually returning).

However, the "in a loop" part might be an important restriction. An MPI implementation that offers background progress threads would be able to implement MPI_TBD as a no-op, since background progress already guarantees that these two conditions are met.

Wee-Free-Scot commented 5 months ago

This suggests we could look at expanding the duties of the MPI_WIN_SYNC procedure, rather than the MPI_WIN_FLUSH procedure.

MPI_WIN_SYNC is already a local procedure (indeed, an immediate procedure) and it is restricted to copying data that is already present within the calling MPI process.

It is, perhaps, a small step to include "the queue of incoming passive target actions" in the "public window copy" concept.

In the separate model, we currently have an implementation differentiation between "data contained in an incoming put instruction that will affect a chunk of memory the user cannot access directly" and "data already contained in that chunk of memory the user cannot access directly". The user cannot tell the difference; they know only that the data has not yet arrived in the private window copy. Strangely, MPI_WIN_SYNC is currently only required to copy data that has already arrived in the public window copy, but is permitted to leave the incoming put/accumulate instructions unexecuted, and so (arguably) it only does half a job.

In the unified model, the public and private window copies update each other eventually without user intervention, i.e. without user calls to procedures that ensure progress. This copy/update also eventually happens without user calls to MPI_WIN_SYNC (if the user is happy to wait for "eventually" to elapse) but can be hastened by calling MPI_WIN_SYNC. If "public copy" is expanded to include "incoming passive target instructions," then we could say either that the unified model can only be claimed when the implementation provides strong progress for passive target operations (either a capable NIC or a software agent), even if the rest of MPI only has weak progress, or that repeated calls to MPI_WIN_SYNC ensure progress of incoming passive target operations as well as making changes in either window copy visible in both window copies.

This pushes the problem of "my RMA message got stuck somewhere" back one step towards the origin.

csubich commented 5 months ago

This suggests we could look at expanding the duties of the MPI_WIN_SYNC procedure, rather than the MPI_WIN_FLUSH procedure.

I would love this; it would exactly match the false intuition I had when first uncovering this conceptual mismatch. However, the existing language is fairly clear that Win_sync with the unified memory model is at best a memory barrier, with no guarantee of any back-end machinations.

This copy/update also eventually happens without user calls to MPI_WIN_SYNC (if the user is happy to wait for eventually to elapse) but can be hastened by calling MPI_WIN_SYNC.

Is this guaranteed without another form of memory barrier, under common HPC memory models?

jeffhammond commented 5 months ago

MPI_WIN_SYNC was designed to address the shortcomings of a library-based implementation of shared-memory atomics. We would be breaking the backwards compatibility of its performance characteristics if we made it poke the progress engine. @jdinan

I don't hate your idea, but it's complicated.

Wee-Free-Scot commented 5 months ago

"make it poke the progress engine"

"make it poke a small part of the progress engine" -- it could be permitted to poke the whole ugly mess of the general progress engine (to maintain ease of implementation) or it could be prohibited from doing so (to maintain as much of the performance as possible).

csubich commented 5 months ago

As a spur of the moment suggestion, how about MPI_Win_yield for this semantic?

In an MPI implementation, background progress is analogous to preemptive multitasking in a time-sharing system. Without background progress, we only have cooperative multitasking. A cooperative multitasking system requires that processes periodically yield to others in order to multitask, and in MPI this is usually analogized as "engaging progress."

MPI_Win_yield could potentially even be meaningful for other window access regimes, acting as a hint to the MPI implementation that now would be a fine time to conduct work on pending RMA operations with respect to the provided window, even if non-passive-target access regimes contain mandatory blocking calls that ensure progress eventually.

An MPI implementation that ensures background progress MAY implement Win_yield as a no-op. An MPI implementation that does not ensure background progress MUST ensure that all pending RMA operations on a passive-target window complete when Win_yield is called in a loop, with changes visible to local loads/stores after MPI_Win_sync as otherwise appropriate. An implementation MAY use Win_yield as a performance hint when a window uses another synchronization approach (fence/PSCW), but this is not required and a no-op is also conformant.
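A sketch of the proposed (hypothetical) binding and its contract:

// Hypothetical procedure; not part of any MPI standard.
int MPI_Win_yield(MPI_Win win);
// - MAY be a no-op if the implementation ensures background progress.
// - Otherwise, repeated calls MUST eventually complete all pending
//   passive-target RMA operations targeting this process on win.
// - For fence/PSCW windows, at most a performance hint; a no-op is conformant.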

devreal commented 3 months ago

@jeffhammond and I had a quick chat yesterday about this. We came up with the following:

I will try to put that into a PR over the next couple weeks (I'll be on the road throughout most of April) so we can discuss the details.