mpi-forum / mpi-issues

Tickets for the MPI Forum
http://www.mpi-forum.org/

'Logically concurrent' isn't #117

Closed: jsquyres closed this issue 1 year ago

jsquyres commented 5 years ago

@dholmes-epcc-ed-ac-uk and I were talking about an issue the other night at dinner, and I wanted to record it because it's a serious issue that needs to get fixed in MPI next.

MPI-3.1 section 3.5 p41:10-17 states:

If a process has a single thread of execution, then any two communications executed by this process are ordered. On the other hand, if the process is multithreaded, then the semantics of thread execution may not define a relative order between two send operations executed by two distinct threads. The operations are logically concurrent, even if one physically precedes the other. In such a case, the two messages sent can be received in any order. Similarly, if two receive operations that are logically concurrent receive two successively sent messages, then the two messages can match the two receives in either order.

The problematic text states that any operations on different threads are "logically concurrent." Sometimes that is because the thread execution does not define an order. But even if there is a guaranteed order (which is perhaps what the phrase "physically precedes" means?), MPI still considers them to be "logically concurrent". For example, even if there is a thread synchronization between the operations, or perhaps an extremely long wall-clock time between the operations, MPI is still permitted to consider those operations "logically concurrent." This is bad because MPI is permitted to deliver "logically concurrent" messages in any order, which is going to astonish users (and implementors).

Here's an example:

MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);
assert(provided == MPI_THREAD_MULTIPLE);
char a = 111;
char b = 222;

// Thread 1                              // Thread 2
MPI_Send(&a, 1, MPI_CHAR, 1, 2, comm1);
sleep(60);
thread_barrier();                        thread_barrier();
                                         MPI_Send(&b, 1, MPI_CHAR, 1, 2, comm1);

According to MPI-3.1 3.5, these two sends are logically concurrent, and it is permitted for the b message to be received at the receiver before the a message.

Note: the sleep(60) is actually unnecessary in this example -- it's just insult-added-to-injury to drive home the point.
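For reference, a complete, compilable version of the same scenario might look like the sketch below. It is only an illustration of the fragment above: it assumes POSIX threads, uses a pthread_barrier_t in place of thread_barrier(), and takes comm1 to be a duplicate of MPI_COMM_WORLD (the fragment doesn't say what comm1 is).

#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

static MPI_Comm comm1;                 /* stand-in for comm1 in the sketch above */
static pthread_barrier_t barrier;      /* stand-in for thread_barrier() */

static void *thread1_fn(void *arg) {
    char a = 111;
    MPI_Send(&a, 1, MPI_CHAR, 1, 2, comm1);
    sleep(60);                         /* long wall-clock gap before thread 2 sends */
    pthread_barrier_wait(&barrier);
    return NULL;
}

static void *thread2_fn(void *arg) {
    char b = 222;
    pthread_barrier_wait(&barrier);    /* runs strictly after thread 1's send */
    MPI_Send(&b, 1, MPI_CHAR, 1, 2, comm1);
    return NULL;
}

int main(int argc, char **argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    assert(provided == MPI_THREAD_MULTIPLE);
    MPI_Comm_dup(MPI_COMM_WORLD, &comm1);
    MPI_Comm_rank(comm1, &rank);
    if (rank == 0) {
        pthread_t t1, t2;
        pthread_barrier_init(&barrier, NULL, 2);
        pthread_create(&t1, NULL, thread1_fn, NULL);
        pthread_create(&t2, NULL, thread2_fn, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        pthread_barrier_destroy(&barrier);
    } else if (rank == 1) {
        char first, second;
        MPI_Recv(&first, 1, MPI_CHAR, 0, 2, comm1, MPI_STATUS_IGNORE);
        MPI_Recv(&second, 1, MPI_CHAR, 0, 2, comm1, MPI_STATUS_IGNORE);
        /* Under the "logically concurrent" reading, either message may match first. */
        printf("first=%d second=%d\n", first, second);
    }
    MPI_Comm_free(&comm1);
    MPI_Finalize();
    return 0;
}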

Here's another example (that was sent across the point-to-point working group list):

void test(int rank) {
   int msg = 0;
   if (rank == 0) {
#pragma omp parallel num_threads(2)
#pragma omp critical
       {
           MPI_Send(&msg, 1, MPI_INT, 1, 42, MPI_COMM_WORLD);
           msg++;
       }
   } else if (rank == 1) {
       MPI_Recv(&msg, 1, MPI_INT, 0, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
       printf("Received %i\n", msg);
       MPI_Recv(&msg, 1, MPI_INT, 0, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
       printf("Received %i\n", msg);
   }
}

What does this print: 0, 1, or 1, 0?

According to MPI-3.1 section 3.5, both are possible. 😲

Resources

PR: https://github.com/mpi-forum/mpi-standard/pull/748

hzhou commented 5 years ago

This issue has a few layers of principles and definitions; without explicit consensus on them, discussions can easily devolve into cross-talking. May I suggest the following points for potential consensus?

  1. The principle of least astonishment. Concurrent programming is rarely intuitive; our minds don't work concurrently. The "astonishment score" is largely affected by personal experience, so least astonishment is a very unreliable metric. With clear enough definitions and explanations, we can turn astonishment into understanding, or even expectation.

  2. The need to define "logical concurrency" and "logical in-concurrency". The word "logical" is only implicitly defined, and at different levels it can bear different meanings. For example, the programmer may have no requirement on the ordering of certain events; those events are then logically concurrent in the programmer's mind even in serial code. On the other hand, at the library level, all events are enqueued and therefore appear in some order, so they are never logically concurrent -- that is an implementation-level definition. Finally, at the physical level, concurrency is a matter of your physical resolution, and I believe ultimately there is never physical concurrency. Therefore, just saying "logically concurrent" is very ambiguous.

  3. What does the current text say, and do we accept it? We need to define "logical concurrency" and its counterpart, "logical in-concurrency". The current text seems to read that two send operations executed by two distinct threads are "logically concurrent" by definition, and operations executed in the same thread are "logically in-concurrent" (maybe "logically serial"). Such definitions are fine in my opinion. The "astonishment" only arises before one accepts such definitions. We can add text to make this definition more explicit, and maybe add a rationale to explain why it is necessary.

  4. Is the mention of "logical concurrency" necessary at all? Let's agree on the point of the examples. Both examples hinge on one question: do we want to require implementations to atomically enqueue a send or recv at function invocation time, potentially with an implicit synchronization call? -- Is this interpretation correct? This is a rather clear requirement on the implementation, and we can specify it without any mention of "logical concurrency". With the current text, both examples are "logically concurrent" and either behavior should be accepted as "non-astonishing". We use such examples to train our concurrent programmers and to correct their "astonishment" metrics all the time. [The second example touches one more philosophy: #pragma is used for performance tuning; it should not alter the logical meaning of the code. Edit: withdrawn -- logical meaning ill-defined, off-topic.]

  5. Finally, we can discuss what the final text should be, assuming we have agreed on all of the above -- one way or the other.

Sorry that I mixed my own opinions into each point. Also, I edited heavily as I re-read my comments.

jsquyres commented 5 years ago

Defining that MPI operations separated by an infinite amount of wall clock time are "logically concurrent" (and therefore can be arbitrarily ordered in reality) adds no clarity to the standard, serves no practical purpose, and creates confusion about how app developers are supposed to write coherent multi-threaded MPI applications.

hzhou commented 5 years ago

There is no "infinite" amount of wall-clock time. "Infinite" only exists logically, and the phrase "operations separated by an infinite time" is the same as "logically in-concurrent", which still requires a definition -- back to square one. Do you want to refine your "infinite" to something finite: 1 second or 1 minute?

PS: of course the question is a trap. There is no logical distinction between 1 second, 1 minute, or 1 femtosecond. Logical concurrency, as its name says, is logical; physical time separation is irrelevant. We can't really reduce logical concurrency to physical concurrency, as that is ultimately impossible, which leaves only the choice of completely separating "logical concurrency" from the physical kind.

hzhou commented 5 years ago

A different and more relevant question is: can we come up with a realistic example in which the order of sends issued from distinct threads is critical (for the application's purposes) and there is no cleaner alternative way to express that ordering?

In the first example, since the code is willing to call thread_barrier, does it make more sense to move both sends into the same thread? Similarly, in the second example, since the code is willing to use #pragma omp critical, why not issue both sends from the same thread?

mhoemmen commented 5 years ago

#pragma is used for performance tuning; it should not alter the logical meaning of the code.

If the parallel region actually happens, then #pragma omp critical very definitely does alter the meaning of the code. In theory, one could remove all of the #pragma omp * directives and the code would still work (maybe-ish, if you ignore the OpenMP API and interactions with other pragmas), but you can't just remove some of the pragmas.

hzhou commented 5 years ago

@mhoemmen I agree that #pragma omp critical does alter the results of the code. I am asking whether the meaning should. Code that depends on #pragma for its intended meaning is, in my opinion, bad code. Should we encourage such bad code -- note that it is not bad code if its meaning does not depend on the actual realization, and bad code might well be practical code -- by guaranteeing implementation-level behaviors?

PS: on further thought, #pragma omp critical is colored with meaning rather than being purely about performance. I don't quite like it, but I guess it is similar to the explicit thread_barrier calls -- it might as well be compiled that way. I withdraw my earlier comment about this particular #pragma.

jeffhammond commented 5 years ago

@mhoemmen

#pragma omp parallel num_threads(1000)
{
    #pragma omp critical
    {
        printf("you are wrong\n");
    }
}

A more realistic example might involve num_threads(2) and a send-recv pair of 100MB.

jeffhammond commented 5 years ago

Sorry @mhoemmen, I think I was agreeing with you in regard to your disagreement with the previous post. I'm not very good at parsing GitHub comments through the email interface.

mhoemmen commented 5 years ago

@jeffhammond It's cool :D

jsquyres commented 5 years ago

@hzhou The current MPI specification text allows an unbounded / infinite amount of time to pass between MPI operations in different threads, and yet allows the implementation to re-order them. That is flat-out terrible.

It does not matter whether you agree with the style of the two examples that were provided. They are valid MPI applications, and demonstrate the issue clearly.

You can continue to argue that the text's current definition of "logically concurrent" is fine/good, but that's just academic. The definition in the standard adds no clarity, serves no practical purpose, and creates confusion about how app developers are supposed to write coherent multi-threaded MPI applications.

dholmes-epcc-ed-ac-uk commented 5 years ago

@jsquyres

serves no purpose

If the two threads were assigned completely independent communication resources (HW & SW), and their receivers were also independent, then enforcing any ordering that spans both threads adds overhead that could, in theory, be avoided. Messages from really/properly concurrent threads would need sequence numbers to record the order at the sending MPI process; then, after the messages take independent non-deterministic routes (permitted in either case), the receiver must enforce the original ordering, using the sequence numbers to re-create the sending process's order in the matching queue.

The current text permits that "re-create ordering" overhead to be avoided by allowing logically concurrent messages to be delivered in any order. It has the side-effect of removing any guarantee of ordering, even in pathological cases, such as the examples given. My understanding is that this was known and intended (both the potential optimisation and the side-effect) by the original authors of this text (but I might be wrong). Congestion control could add an arbitrary holding delay to one network route - why should other (concurrent/independent) messages be delayed, even if they left the sender later than the held message?

The API design decision here is a trade-off between easy-to-use but lower performance (enforce the order) and hard-to-use but better performance (don't enforce the order).

Easy/hard-to-use is subjective and based on a clear explanation of the intended/expected behaviour. Higher/lower performance ought to be objectively measurable.
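To make the "re-create ordering" overhead described above concrete, here is a rough conceptual sketch, not taken from any real MPI implementation (all names are invented): every sending thread takes a ticket from a shared atomic counter, and the receiver parks any envelope that arrives ahead of a missing ticket before inserting it into the matching queue.

#include <stdatomic.h>
#include <stdint.h>

/* Sender side: one shared counter; taking a ticket is the serialization
 * point that every sending thread must pass through. */
static atomic_uint_fast64_t next_seq;

static uint64_t stamp_outgoing(void) {
    return atomic_fetch_add(&next_seq, 1);
}

/* Receiver side: envelopes carry the sender's ticket; anything that arrives
 * out of order is parked until the gap is filled. */
typedef struct envelope {
    uint64_t seq;
    struct envelope *next;      /* link in the parked list, sorted by seq */
    /* ... source, tag, communicator, payload descriptor ... */
} envelope;

typedef struct {
    uint64_t next_expected;     /* next ticket to insert into the matching queue */
    envelope *parked;           /* arrived early, waiting for the gap to fill */
} recv_order_state;

/* Called per arriving envelope; returns the chain of envelopes (in sender
 * order) that may now be inserted into the matching queue. */
static envelope *deliver_in_order(recv_order_state *r, envelope *e) {
    envelope **p = &r->parked;          /* park e in seq order */
    while (*p && (*p)->seq < e->seq)
        p = &(*p)->next;
    e->next = *p;
    *p = e;

    envelope *head = NULL, **tail = &head;
    while (r->parked && r->parked->seq == r->next_expected) {
        envelope *d = r->parked;        /* release the in-order prefix */
        r->parked = d->next;
        d->next = NULL;
        *tail = d;
        tail = &d->next;
        r->next_expected++;
    }
    return head;
}

The point of the sketch is that both the ticket counter and the parking step are pure overhead whenever the application never relied on cross-thread ordering in the first place.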

dholmes-epcc-ed-ac-uk commented 5 years ago

In the face-to-face meeting, @schulzm and I decided that:

1) clarifying the text so that the permission to ignore ordering is more explicit/obvious (i.e. what some folk think it already says) would make it clear that existing programs that rely on ordering between threads are actually erroneous, even if they accidentally work with current MPI library implementations. No MPI library would need to change behaviour because it would be correct to honour the order or to ignore it. Some (erroneous) user programs may (technically) need to change.

2) modifying the text so that the MPI library is required to record and re-create the order of "logically concurrent" messages (i.e. what other folk think it already says) would make it clear that existing programs are fine irrespective of whether they rely on ordering or not but that some (possibly hypothetical) MPI implementations (those that deliver logically concurrent messages in arbitrary order) would become incorrect unless they introduced some mechanism to enforce this new extension of the ordering rule, probably at the expense of reducing performance.

Thus, I agree with two thirds of @jsquyres' summary:

The definition in the standard adds no clarity, serves no practical purpose, and creates confusion about how app developers are supposed to write coherent multi-threaded MPI applications.

hzhou commented 5 years ago

Despite completely different wording, I believe @dholmes-epcc-ed-ac-uk and I have the same understanding (of the problem). Given that understanding, my personal opinion leans heavily towards his first option, but my earlier post was mostly trying to clarify that understanding.

dholmes-epcc-ed-ac-uk commented 5 years ago

@hzhou

Both examples hinge on one question: do we want to require implementations to atomically enqueue a send or recv at function invocation time, potentially with an implicit synchronization call?

Is this interpretation correct?

No, it is not a correct interpretation. (Perhaps it is and I have simply misinterpreted your question.)

1) there is no enqueue because the communication subsystem is not assumed to be a queue - it might be multi-rail, it might be multi-route, it might be non-deterministic.

2) all MPI function calls are "atomic" in a thread-compliant MPI implementation, in as much as "the outcome will be as if the calls executed in some order" (see MPI-4.0-SC18-Draft, section 12.4.1, page 511, point 1)

3) despite these points, the ordering guarantee (if extended) would go further and require that the matching order at the receiver MPI process respect/enforce the function call ordering recorded by the sender MPI process, even if the messages were injected into the network in a different order (always permitted), took longer/shorter routes in the network (always permitted), and/or arrived at the receiver in a different order (always permitted).

hzhou commented 5 years ago

@dholmes-epcc-ed-ac-uk I meant the same thing -- your sentence may be clearer. I meant that the option for the MPI Standard text is to specify an implementation detail rather than a conceptual description (with logical concurrency or infinite wall time). By "enqueue" I meant recording the order at send time. It could be a literal queue, or simply a sequence number, or some other mechanism that ensures the ability to restore the order -- if there is ordering, there is a meta-queue. However, I don't see any way to get away from the synchronization. The best we can do is to make the global sequence number or queue atomic, which is stricter than simple thread safety. This is the first time I have heard that all MPI function calls are "atomic". I guess I have to accept it, or it is another discussion.

EDIT: I checked the referenced text and it says thread-safe, not atomic. Correct me if I am wrong: thread-safe calls can be interleaved; atomic ones strictly can't. This matters when we are talking about ordering: with atomic calls the ordering is well-defined, whereas in an interleaved situation the ordering still needs to be defined. Whether we need to define such an ordering, and if so how, is the current point of discussion.

dholmes-epcc-ed-ac-uk commented 5 years ago

@hzhou

I believe ultimately there is never physical concurrency.

This seems intuitively false: consider a multi-core socket, a multi-socket node, a multi-rail NIC, a multi-route fabric - it's turtles all the way down.

Also, even if the two sender threads are multiplexed through a single bottleneck at some point during transmission, this misses the point of the ordering rule, which talks about message matching order rather than injection, transmission, or delivery order.

dholmes-epcc-ed-ac-uk commented 5 years ago

if there is ordering, there is a meta-queue

Personally, I think the ordering rule leads inexorably towards channels or streams, which are unastonishingly ordered by definition. However, that is another story.

hzhou commented 5 years ago

I believe ultimately there is never physical concurrency.

This seems intuitively false, ... it's turtles all the way down.

"Turtles all the way down" leads to "ultimately there is never physical concurrency" right?

If we discuss at the abstract level -- such as turtles all the way down -- then the discussion will never end or reach agreement, depending on which turtle we pause at. If we drop the philosophical discussion altogether and simply define our terms at a technical level, it will be definite. Defining "concurrency" based on whether operations are issued in distinct threads is one such approach. Defining the behavior by requiring MPI to record the ordering at send invocation time -- the starting point of the function call -- is another option; it still leaves ambiguity about exactly which point, but it is enough for our example cases.

dholmes-epcc-ed-ac-uk commented 5 years ago

meta-queue

At the moment, point-to-point send and receive in MPI are half-channel operations. A channel is a FIFO queue, so send and receive operations contribute to meta-half-queues.

Each {send-thread, receive-thread} pairing constitutes a different meta-queue. For each thread, all of its meta-half-queues (its contributions to all meta-queues formed with all other threads) when taken together form a meta-queue.

dholmes-epcc-ed-ac-uk commented 5 years ago

"Turtles all the way down" leads to "ultimately there is never physical concurrency" right?

Quite the opposite?

Two different processes on two different hardware threads, on two different cores, in two different sockets, on two different nodes, in two different cities, ... eventually you must admit that these could actually execute at the same time, i.e. physically concurrently.

hjelmn commented 5 years ago

Honestly, any app that relies on the ordering of messages sent/received from different threads without explicitly enforcing the order is erroneous and should be rewritten. So, I would go with option 1. But then, I am explicitly against the current no-overtaking semantic in MPI anyway.

hzhou commented 5 years ago

"Turtles all the way down" leads to "ultimately there is never physical concurrency" right?

Quite the opposite?

Two different processes on two different hardware threads, on two different cores, in two different sockets, on two different nodes, in two different cities, ... eventually you must admit that these could actually execute at the same time, i.e. physically concurrently.

Because you will never reach the "eventual" case or the "last turtle", you'll never reach the ultimate "concurrency" -- which equates to ultimate in-concurrency. ... But this is a recurring philosophical discussion. Shall we agree that we should recognize its never-ending, pointless nature and avoid such philosophical discussions?

dholmes-epcc-ed-ac-uk commented 5 years ago

@hjelmn the problem is that there is no way to explicitly enforce the order - as shown by the examples. The only options the user has are to marshal all point-to-point calls onto a single thread and rely on the ordering rule, or to use different tags/ranks/communicators.

hjelmn commented 5 years ago

@dholmes-epcc-ed-ac-uk True. Though they could force the order using out-of-band methods (which would be extremely ugly).

dholmes-epcc-ed-ac-uk commented 5 years ago

@hzhou we have different meanings for the word concurrency.

I'm using this one: https://dictionary.cambridge.org/dictionary/english/concurrent

"at the same time"

What do you mean by the word?

dholmes-epcc-ed-ac-uk commented 5 years ago

@hjelmn

Though they could force the order using out-of-band methods (which would be extremely ugly).

How does that work? I'm not sure I agree - please provide an example.

hzhou commented 5 years ago

@hjelmn the problem is that there is no way to explicitly enforce the order - as shown by the examples. The only options the user has are to marshal all point-to-point calls onto a single thread and rely on the ordering rule, or to use different tags/ranks/communicators.

For both examples, the solution is to issue all calls (that require ordering) from a single thread. The code is simpler and explicit. For code that has to make calls in distinct threads yet still requires ordering, the system is already complex, and for clean code, an explicit scheme using tags is a good option. I mean, the tag itself can record a sequence number; it becomes an explicit application-level scheme rather than an implicit one that depends on ambiguous "logical" text.
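As a concrete illustration of the tag-encoding idea (purely a user-level sketch; BASE_TAG, NUM_MSGS, and the helper names are made up for this example): each sending thread takes a ticket from a shared atomic counter and folds it into the tag, and the receiver posts its receives in ticket order.

#include <stdatomic.h>
#include <stdio.h>
#include <mpi.h>

#define BASE_TAG 100
#define NUM_MSGS 2

static atomic_int next_ticket;   /* shared by all sending threads on rank 0 */

/* Called from any thread on rank 0: the ticket fixes the delivery order. */
static void ordered_send(int value) {
    int ticket = atomic_fetch_add(&next_ticket, 1);
    MPI_Send(&value, 1, MPI_INT, 1, BASE_TAG + ticket, MPI_COMM_WORLD);
}

/* Rank 1: drain messages in ticket order, independent of matching order. */
static void ordered_drain(void) {
    for (int t = 0; t < NUM_MSGS; t++) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, BASE_TAG + t, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Received %i\n", value);
    }
}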

hzhou commented 5 years ago

@hzhou we have different meanings for the word concurrency.

I guess you wanted to stop at one of the turtles and ignore all those below, but I kept seeing the next turtle :) Can we agree that a discussion based on turtles all the way down is not fruitful? (The same goes for the chicken-and-egg discussion. I may have confused myself about whether I was arguing for the chicken or the egg :))

raffenet commented 5 years ago

There's a clarification in the text regarding how "interthread synchronization" is required to order collective operations issued from concurrent threads.

Matching of collective calls on a communicator, window, or file handle is done according to the order in which the calls are issued at each process. If concurrent threads issue such calls on the same communicator, window or file handle, it is up to the user to make sure the calls are correctly ordered, using interthread synchronization.

It seems MPI implementations must respect the "issue order" enforced by thread synchronization for collectives on the same communicator. I cannot find similar text with regard to pt2pt operations using the same comm/rank/tag arguments. To me, it boils down to whether or not that was the intention of the authors. While the standard text appears to be unclear, implementations which claim to be thread-compliant do respect this issue order as far as I know. Maybe this was a conservative implementation approach, used since collectives decompose into pt2pt operations? Or was the intention all along for pt2pt to have this same issue-order property, despite users having the ability to use tags to avoid it?
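For what it's worth, the "interthread synchronization" in the quoted text amounts to something like the following user-level sketch (the names and the A-before-B convention are invented here): every process must issue the two broadcasts on the shared communicator in the same order, and a flag plus condition variable enforces that order between the two threads.

#include <pthread.h>
#include <mpi.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int a_issued = 0;

/* Thread A on every process: issues the first collective on the communicator. */
static void thread_a(int *buf) {
    pthread_mutex_lock(&lock);
    MPI_Bcast(buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
    a_issued = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

/* Thread B on every process: waits until A's collective has been issued,
 * so the issue order A-then-B is identical on all processes. */
static void thread_b(int *buf) {
    pthread_mutex_lock(&lock);
    while (!a_issued)
        pthread_cond_wait(&cond, &lock);
    MPI_Bcast(buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
    pthread_mutex_unlock(&lock);
}

This is stricter than strictly needed (it delays thread B until thread A's call has returned), but it keeps the per-communicator issue order identical on every process, which is what the quoted text demands.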

hzhou commented 5 years ago

@raffenet The quoted text on collective calls is quite clear that implementations need to maintain inter-thread ordering. The second option being discussed is essentially to extend that text to pt2pt calls, which seems to be what current implementations do anyway (though perhaps they are researching alternatives?).

However, the semantics of collective calls do have a synchronization flavor, and the function interface does not have additional means to control matching, like the tags in pt2pt calls, so I don't think the text on collectives automatically implies the same intention for pt2pt calls.

pavanbalaji commented 5 years ago

Sorry, I'm late to the party. As I mentioned on one of the email chains, I think the text is actually not ambiguous (although I don't mind clarifying it further):

On the other hand, if the process is multithreaded, then the semantics of thread execution may not define a relative order between two send operations executed by two distinct threads.

I think the critical part here is "may not define a relative order", i.e., unless the threading model does something to ensure ordering, they can occur in whatever order.

The operations are logically concurrent, even if one physically precedes the other.

Again, this is meant to say: "in cases where there is no order defined by the threading model, the operations are logically concurrent ..."

Without these guarantees, many MPI programs would fail and it would result in a disaster when writing MPI_THREAD_MULTIPLE programs, as others have already pointed out. I might be repeating what others have already said, but we should retain the intended semantics and just clarify it further in the wording (even though I personally think that the wording is already clear enough).

hzhou commented 5 years ago

@pavanbalaji Could you clarify when or how the threading model defines an order? In particular, does a thread_barrier or thread serialization define an order? I think you are implying they do, right?

The problem with "the threading model defines an order" is the same as with "logically concurrent": both are ambiguous in nature. The threading model is potentially observed differently at different levels. For example, unless the send is enqueued at the point of invocation -- a synchronization step -- the implementation may not retain any information that a thread_barrier was in place.

pavanbalaji commented 5 years ago

Could you clarify when or how the threading model defines an order? In particular, does a thread_barrier or thread serialization define an order? I think you are implying they do, right?

Correct.

The problem with "the threading model defines an order" is the same as with "logically concurrent": both are ambiguous in nature.

Neither of them is ambiguous. Logically concurrent is a very well understood concept in the literature.

unless the send is enqueued at the point of invocation

It does. In fact, the return of the send call guarantees local completion.

-- Pavan

jdinan commented 5 years ago

Sorry if this comment is a repeat. The following sentence seems wrong and the paragraph makes much more sense without it.

"The operations are logically concurrent, even if one physically precedes the other."

I had always interpreted the "physically precedes" bit to be an awkwardly worded statement that a nondeterministic ordering of MPI calls by threads (even if they are non-overlapping in time) will result in different match orders each time the program is run.

artpol84 commented 5 years ago

I was trying to summarize the discussion for myself, but I couldn't.

@dholmes-epcc-ed-ac-uk Are the 2 items in https://github.com/mpi-forum/mpi-issues/issues/117#issuecomment-445895201 considered as alternatives? Based on my interpretation and on the subsequent discussion context, it looks like they are, but I'd like to make sure.

I expressed my understanding of the text in the duplicate ticket https://github.com/mpiwg-p2p/p2p-issues/issues/3. I think it is aligned with @pavanbalaji's post https://github.com/mpi-forum/mpi-issues/issues/117#issuecomment-448138415.

While having the ability to send/receive in arbitrary order allows for better performance, and I've seen benchmarks that would highly benefit from this, it is counter-intuitive for application writers in my opinion.

Are there any plans to fix this? Were there any additional discussions at the f2f meetings?

dholmes-epcc-ed-ac-uk commented 5 years ago

@artpol84 I believe the most recent position taken by the Forum-at-large is the comment you reference: https://github.com/mpi-forum/mpi-issues/issues/117#issuecomment-445895201

The two options described there are, IMHO, alternatives - representing the two interpretations of the existing text.

My own opinion is that option (1) represents the de jure meaning - the strict letter-of-the-document meaning - whereas option (2) represents the de facto meaning - the commonly-held belief about what the text says, or should say.

As for plans to "fix" it: the Forum must first choose option (1) XOR option (2), then argue about the precise wording of replacement text that implements that choice. I think that means whichever "fix" is chosen will miss the current MPI-4.0 timeline, unfortunately.

artpol84 commented 5 years ago

Thank you,

I agree about de jure and de facto. Any plans to have this discussion at an f2f meeting? Even though it'll miss MPI 4, it is still good to get it going.

jsquyres commented 5 years ago

FWIW: I think that the focus should be on defining exactly what it is that we want, and then laying out that clearly in text (probably replacing the old text in the process).

Continued bickering over exactly what the current text means is not useful (case in point: if the members of the MPI Forum can't agree on what the current text means, then it's not good text). So forget the old text. Decide exactly what is wanted, and then write new text to say that. If the new text ends up being remarkably similar to the old text (e.g., just clarifying the old text) -- great. But don't limit yourself to the premise that all we can do is "clarify" the old text.

dholmes-epcc-ed-ac-uk commented 5 years ago

@jsquyres the options cited from earlier in this discussion were an attempt to do exactly what you suggest.

My understanding is that two distinct camps formed and expressed their opinions about what the text should say:

Camp 1) asserted that the new wording should say exactly what the current text says, i.e. MPI can ignore physical execution ordering between threads even when the user somehow enforces that ordering.

Camp 2) asserted that the new wording should say exactly what the current text says, i.e. MPI must respect the execution ordering of send operations at the sender when matching with receive operations at the receiver.

IMHO, this means we need another (face-to-face) discussion.

artpol84 commented 5 years ago

I agree with @jsquyres about agreeing on the intended semantics first and then doing a corresponding amendment of the spec. However ...

There is another item that I feel is related to this discussion; it comes later in section 3.5 and honestly sounds a little confusing to me.

Similarly, suppose that a receive was posted by a multithreaded process. Then it is possible that messages that match this receive are repeatedly received, yet the receive is never satisfied, because it is overtaken by other receives posted at this node (by other executing threads). It is the programmer’s responsibility to prevent starvation in such situations.

First, I'd like to clarify the scenario. Here is how I understand it:

If my understanding is correct, this sentence also belongs to this discussion. And note "It is the programmer's responsibility to prevent starvation in such situations", which assumes that the MPI user has the leverage to control this situation. However, if we proceed with the "Camp 1" option, there is no way a user can do this.

hzhou commented 5 years ago

However, if we proceed with the "Camp 1" option, there is no way a user can do this.

There is a way! The user can ensure the other thread does not aggressively issue an MPI_Irecv that would overtake another thread's MPI_Irecv. Or, the user can serialize the issuing of MPI_Irecvs onto one thread. Or, the user can express the parallelism explicitly by using a different communicator or a tag encoding to separate the threads.
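A sketch of the "different communicator" variant (thread_comm, the setup helper, and the fixed thread ids are invented for this illustration): each thread gets its own duplicated communicator, so the usual non-overtaking rule orders each thread's traffic, and a receive posted by one thread can never be overtaken by a receive posted by another.

#include <mpi.h>

/* One duplicated communicator per thread id: receives posted by thread t can
 * only match messages sent on thread_comm[t]. */
static MPI_Comm thread_comm[2];

static void setup_thread_comms(void) {
    MPI_Comm_dup(MPI_COMM_WORLD, &thread_comm[0]);
    MPI_Comm_dup(MPI_COMM_WORLD, &thread_comm[1]);
}

/* Sender side, thread tid on rank 0. */
static void thread_send(int tid, int *value) {
    MPI_Send(value, 1, MPI_INT, 1, 42, thread_comm[tid]);
}

/* Receiver side, thread tid on rank 1. */
static void thread_recv(int tid, int *value) {
    MPI_Recv(value, 1, MPI_INT, 0, 42, thread_comm[tid], MPI_STATUS_IGNORE);
}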

hzhou commented 5 years ago

I think this comes back to the interpretation of "logical concurrency". I would assert that all operations issued in different threads/processes are "logically concurrent". A barrier synchronization only asserts the order between operations and the "barrier" within each individual thread; propagating this synchronization across multiple threads needs to be carried out explicitly. For example, to honor the barrier synchronization for send queues, there needs to be explicit honoring of the order between a send operation, its enqueue operation, and the barrier operation. Let's imagine a loose implementation in which the enqueuing doesn't have to happen before the return of MPI_Send. Then, to honor the barrier, we need to make the enqueue operation aware of the barrier (or rather the other way around): the barrier operation needs to explicitly flush all outstanding enqueues. So while a synchronization barrier offers a mechanism to synchronize certain operations, it is not a given. By default, operations from different threads are "logically concurrent", with or without "barrier" synchronizations.

To an extent, one can even argue that no "true logical concurrency" exists. Fine, but do understand that the effort of honoring this true logical in-concurrency can be enormous. Now we can relax the extent of "logical concurrency" to around a barrier operation. Fine, but do note that there are costs associated with that, even if we may have become used to such costs. If we don't want to specify what kind of barrier operation it is, then we need to ensure the "atomicity" of these operations with regard to all potential user operations; "atomicity" excludes a whole class of optimizations. Or, we can relax the notion even further and assert that all operations in multiple threads are always "logically concurrent", regardless of barriers. That gives the implementation maximum flexibility to get performance. Note that the user always/still can express their need for "in-concurrency" by serializing/enqueuing in their user code, or they can explicitly express their parallelism with communicators and tags, so we are not taking any possibility away from the user -- maybe some convenience. I am in the camp that if the user is writing multi-threaded applications yet wants to assume "implicit in-concurrency", they are really doing something wrong.

artpol84 commented 5 years ago

There is a way! The user can ensure the other thread does not aggressively issue an MPI_Irecv that would overtake another thread's MPI_Irecv.

However, if the scenario described is possible, the user cannot just "not aggressively issue MPI_Irecv", because even then there is no guarantee that it won't be overtaken. No matter how long you wait in between MPI_Recvs, there is still the possibility that it won't be sufficient (see https://github.com/mpi-forum/mpi-issues/issues/117#issuecomment-445476618).

However, if we proceed with the "Camp 1" option, there is no way a user can do this.

Or, the user can serialize the issuing of MPI_Irecvs onto one thread. Or, the user can express the parallelism explicitly by using a different communicator or a tag encoding to separate the threads.

I slightly overstated and do agree that those options will work.

hzhou commented 5 years ago

However, if the scenario described is possible, the user cannot just "not aggressively issue MPI_Irecv", because even then there is no guarantee that it won't be overtaken. No matter how long you wait in between MPI_Recvs, there is still the possibility that it won't be sufficient (see #117 (comment)).

"how long you wait" is always the wrong question. Use atomic counters or locks if user want to coordinate between threads.

artpol84 commented 5 years ago

"how long you wait" is always the wrong question. Use atomic counters or locks if user want to coordinate between threads.

Following the previous discussion (https://github.com/mpi-forum/mpi-issues/issues/117#issuecomment-445476618), I encapsulated inter-thread synchronizations into the "wait" category as well. Sorry for not clarifying. According to the "Camp 1" standpoint, MPI ignores all inter-thread synchronization mechanisms.

What I was trying to say is: even if locks are used to arbitrate between MPI_Recv calls, it doesn't guarantee you anything, because internally the MPI implementation is not obligated to immediately "apply" the receive request.

hzhou commented 5 years ago

"how long you wait" is always the wrong question. Use atomic counters or locks if user want to coordinate between threads.

Following the previous discussion (#117 (comment)), I encapsulated inter-thread synchronizations into the "wait" category as well. Sorry for not clarifying. According to the "Camp 1" standpoint, MPI ignores all inter-thread synchronization mechanisms.

What I was trying to say is: even if locks are used to arbitrate between MPI_Recv calls, it doesn't guarantee you anything, because internally the MPI implementation is not obligated to immediately "apply" the receive request.

I was thinking about your argument the minute I posted my reply (that was too quick) :). Yes, I agree. If the MPI implementation does not provide an explicit acknowledgement point for when the request is enqueued, then the user code cannot synchronize the requests explicitly; the user code has to synchronize on the completion point -- MPI_Test or MPI_Wait. With counters, which is what I had in mind, the user can explicitly balance the completion of receives. Now, your scenario is each thread issuing multiple MPI_Irecvs with identical signatures, and you want to balance them before making progress (via MPI_Test or MPI_Wait)? I need to think about the validity of this scenario. If the use case is sufficiently justified, then it is an argument for requiring the implementation to provide an acknowledgement point for queue insertion.

artpol84 commented 5 years ago

There is an ongoing discussion of an explicit one-sided ordering primitive (#135). @hzhou I think that if we follow the "Camp 1" route, we may need to introduce an ordering primitive for user/implementation synchronization (which I think is similar to the queue-insertion acknowledgement point in your previous comment).

artpol84 commented 5 years ago

In light of #135: does "logical concurrency" apply to RMA in the current spec? My understanding is that it does.

hzhou commented 5 years ago

I think that if we follow the "Camp 1" route, we may need to introduce an ordering primitive for user/implementation synchronization (which I think is similar to the queue-insertion acknowledgement point in your previous comment).

For pt2pt at least, MPI_Test is semantically required to have the local send/recv inserted, at least for the local thread. So if users coordinate on their calls to MPI_Test, that is already much better than waiting forever. The argument for such delayed insertion is that the implementation doesn't have to take a lock on every MPI_Isend or MPI_Irecv; the implementation doesn't even have to take a lock on MPI_Test -- it can query the progress of its local request first, and only do global progress (which requires the lock) after a certain number of attempts. I am not saying this definitely performs better -- it depends on the lower-level transport as well as the higher-level application logic -- but it is valid flexibility that implementations like to have.

In light of #135: does "logical concurrency" apply to RMA in the current spec? My understanding is that it does.

On the RMA side, there isn't an MPI_Test. MPI_Win_flush is more like MPI_Waitall. I think the argument for an MPI_Win_order is a concession that there is "logical concurrency" in RMA operations -- a Camp 1 opinion.

artpol84 commented 5 years ago

On the RMA side, there isn't an MPI_Test. MPI_Win_flush is more like MPI_Waitall. I think the argument for an MPI_Win_order is a concession that there is "logical concurrency" in RMA operations -- a Camp 1 opinion.

No, if I understand correctly, #135 is not about threads at all. It is about ensuring the order of remote completions of RMA operations, to implement the data+flag scenario where setting the flag indicates that the data is ready.