Outstanding issues for federated execution

Soroosh129 commented 3 years ago

Here I will summarize an ongoing discussion I had with Edward about the distributed execution.

First, to clarify what it means to have a logical or physical connection, here are tentative definitions of each, applicable to both the distributed and centralized cases:

Logical Connections (indicated via a.out -> b.in after d): In-order message delivery between two federates where a message sent at logical time t by the sender with a delay of d imposed using after will be scheduled on the receiver exactly at logical time t+d. If the receiving federate is unable to assign timestamp t+d (its logical time is already greater than this), then we have a fault condition. With centralized coordination, such a fault condition should not arise. With decentralized coordination, it can arise if the assumed bounds (clock sync, network latency, WCET) are violated.

Physical Connections (indicated via a.out ~> b.in after d): Message delivery where a message sent at logical time t by the sender with a delay of d imposed using after will be scheduled on the receiver at a logical time that is the larger of t+d and T, the current physical time at the receiving end. The timestamp, therefore, can be arbitrarily larger than t+d without being treated as a fault condition.
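To make the two definitions concrete, here is a minimal sketch of how a receiver might assign timestamps under each kind of connection. The function names and the use of plain integers (nanoseconds) for time are assumptions made for illustration; this is not the actual runtime code.

```python
def logical_receive_time(t, d, receiver_logical_time):
    """Timestamp for a message on a logical connection (a.out -> b.in after d).

    Hypothetical helper: raises if the receiver has already advanced past
    t + d, which is the fault condition described above.
    """
    target = t + d
    if receiver_logical_time > target:
        raise RuntimeError("fault: receiver logical time is already past t + d")
    return target


def physical_receive_time(t, d, receiver_physical_time):
    """Timestamp for a message on a physical connection (a.out ~> b.in after d).

    The result is the larger of t + d and the receiver's physical time T,
    so it may exceed t + d by an arbitrary amount without being a fault.
    """
    return max(t + d, receiver_physical_time)
```

With t = 100 and d = 20, the logical connection always yields 120 (or a fault), while the physical connection yields 120 or the receiver's physical time, whichever is larger.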

The combination of logical/physical connections and decentralized/centralized execution modes creates four distinct implementation cases, each with its own nuances. I will flesh out my interpretation of each next (@edwardalee please feel free to change any part of this that is incorrect) along with the outstanding issues.

Logical connection in centralized mode: Out of the four possibilities, this one is the most fleshed out and tested case. Messages are stamped and sent through the Run-Time Infrastructure (RTI). Whenever there are no reactions in progress and no reactions on the reaction queue, the federate will attempt to advance time and if the federate has downstream federates, it will also inform the RTI of this attempt by calling __logical_time_complete(). However, for advancing time, in the case where the federate has upstream or downstream federates, it will wait for a TIME_ADVANCE_GRANT from the RTI.

Physical connection in centralized and decentralized mode: With the current implementation, federates can directly send messages to other federates via a physical connection. If a federate has no upstream logical connections, then it will not wait for a TIME_ADVANCE_GRANT from the RTI. Thus, logical time at the receiving federate can be larger than the logical time at the sending federate by an arbitrary amount. It could even reach its timeout time (stop_time in the code).

Logical connection in decentralized mode: In theory, this should behave similarly to the centralized mode at a high level. However, there are a few outstanding issues.

1- Advancing time: The federates need to figure out how long to wait before it is safe to advance time on their own in lieu of a TIME_ADVANCE_GRANT message from the RTI. This can be done by calculating a Safe-To-Process (STP) delay for each logical time advance based on clock synchronization error, physical delay imposed by the communication channel, and physical delay imposed by upstream reactions. The first issue is: how could we delineate STP in a Lingua Franca program?

The STP has three main components: 1- A bound on execution time (WCET) of upstream reactions (which depends on topology), 2- communication latency bound, and 3- clock synchronization bound.
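As a rough illustration of how the three components might combine, the sketch below assumes the STP is simply their sum and that a federate may advance to logical time t only once physical time T satisfies T > t + STP. Both the additive model and the helper names are assumptions for discussion, not a settled design.

```python
MSEC = 1_000_000  # one millisecond, expressed in nanoseconds


def stp_delay(upstream_wcet, network_latency_bound, clock_sync_bound):
    """Hypothetical STP: the sum of the three bounds listed above."""
    return upstream_wcet + network_latency_bound + clock_sync_bound


def safe_to_advance(next_logical_time, physical_time, stp):
    """True once physical time has passed next_logical_time + stp."""
    return physical_time > next_logical_time + stp


# Example: 5 msec WCET, 10 msec network latency, 5 msec clock sync error.
stp = stp_delay(5 * MSEC, 10 * MSEC, 5 * MSEC)
```

Here `stp` works out to 20 msec, so a federate wanting to advance to logical time 0 would wait until its physical clock reads more than 20 msec.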

Solutions: The current proposals include:

Having a target property such as:
```
target C {
latency_bound: 30 msec
};
```
Having an under keyword on connections: a.out -> b.in under 30 msec; where under is the combination of network communication delay + bound on clock synchronization error + WCET of the sending and receiving reactions. However, what would be the effect of imposing an after on this connection (e.g., a.out -> b.in after 30 msec under 30 msec)?
As @lhstrh suggested, the purpose of under can be achieved by having a bound input on the error condition which replaces the purpose of under (see below).
Use after in lieu of STP. Therefore, each connection should have an after clause.

2- Handling error conditions: Since STP is manually calculated, it is possible for federates to not wait long enough. This would create a situation where the logical time of the receiving federate has moved on but a message with an earlier timestamp then arrives from the sending federate, which is an error condition. Thus, the second issue is: where should the fault handler be?

As @lhstrh has suggested, this could go on the receiving reaction (e.g., the reaction that has b.in as a trigger using a lateness condition).

For under, this would be:
```
reaction(in) {=
=} lateness {=
=}
```
For a bound variable, this would be:
```
reaction(in) {=
=} lateness (bound) {=
=}
```
where bound is the combination of network communication latency + bound on clock synchronization + WCET of the sending and receiving reactions.

In both cases the deadline of the upstream reactions of the built-in sender reaction will be used as a substitute for the bound on the execution time of upstream reactions.

edwardalee commented 3 years ago

TLDR: Messages sent between federates using a physical connection may get dropped.

@Soroosh129 and I have discovered an interesting conundrum regarding physical connections and federated execution. Suppose you have a physical connection:

   y.out ~> x.in after a;

where y and x are distinct federates. Suppose a message is generated at logical time t at federate y. The received message will be assigned timestamp max(t+a, T), where T is the physical time at the receiving end. So far so good.

However, suppose the federation has a timeout parameter so that execution is supposed to end at logical time s (for stop time). First, the sending code should check and not send the message if t+a > s. I don't think this check is being done now.

However, there is no way to prevent the situation where T > s. Assume first that the target parameter fast is not set to true. This means that at each federate, it is always true that T >= t; that is, physical time is ahead of logical time. Hence, a message that is sent before the stop time may be received after the stop time. Should this message be dropped? Or should it be assigned a timestamp equal to the stop time? If the latter, then the assigned timestamp is min(s, max(t+a, T)).
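The two options can be written down as a small sketch. The helper names are hypothetical and times are plain integers; the sketch only restates the sender-side check and the clamped timestamp min(s, max(t+a, T)) discussed above.

```python
def should_send(t, a, s):
    """Sender-side check: do not launch a message whose tag would exceed s."""
    return t + a <= s


def clamped_receive_time(t, a, T, s):
    """Receiver-side timestamp if late messages are pinned to the stop time."""
    return min(s, max(t + a, T))
```

For example, with t = 90, a = 5, and s = 100, a message arriving at physical time T = 150 would get timestamp 100 under the clamping option, whereas one arriving at T = 97 would get 97.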

If we choose the semantics that sent messages are always received, then we have a second problem. How do we shut down federate x? Federate x will have a live thread that is listening for physical inputs on a socket. When can it stop listening? I think we need the sending federate to send an EOF on the socket when it reaches its stop time. This will be interpreted as a promise to send no more messages on that socket, so the receiving end can close the socket and terminate the thread. It then needs to delay entering its shutdown phase until it has received such an EOF from all (physically) upstream federates.

If the fast value is true, the problem is worse because the receiving federate will very quickly reach logical time s. Currently, the receiving federate most likely gets none of the sent messages (set fast to true in the DistributedCountPhysical.lf test).

lhstrh commented 3 years ago

I think it makes sense that if you use a physical connection with a logical delay that is not sufficient to mask the safe-to-process threshold, a message could get lost in the end-of-execution scenario that you describe. In the distributed coordination setting this would perhaps be more obvious, but it seems fundamental. That is, unless we ask the RTI when it is safe to shut down and implement a wait. Particularly in the centralized coordination scheme this seems easy to do. In fact, I think we concluded months ago that stop should be implemented this way, and it would not shut down the federate immediately, but instead do this at a later time when it is deemed to be safe (stop should return the time at which shutdown will be present, 0 if it will be present at the next microstep).

Regarding the issue being exacerbated by the fast flag: I fail to see the point of having "fast" federates with physical connections between them. Do we have a use case for this?

lhstrh commented 3 years ago

Also see #185.

Soroosh129 commented 3 years ago

First, I think it's important to note that we don't have a mechanism to impose a safe-to-process threshold at the receiving end at the moment. Imposing an STP equal to the delay in after can be a good preliminary way of getting around this problem until we have a more sophisticated way of calculating the STP.

Second, the RTI does not currently have a way of knowing when it is safe to stop if there are only physical connections. I think this is a good distinction to have between physical and logical connections. It is not critical if some physical messages get dropped, so there is no reason to get the RTI involved, in my opinion.

Finally, I think fast is especially helpful for physical connections because it means the receiving federate can process messages as soon as they arrive. If going by the mantra of physical connections being less critical (or not at all), it should be okay to process them as soon as possible if the developer explicitly allows this behavior.

lhstrh commented 3 years ago

AFAIK, the distinction between physical and logical connections has been that if you use the former you don't care about the timestamps, and when you use the latter you do care about the timestamps. It is new to me that it would be OK for physical connections to drop messages, and I'm not convinced that we really want that. The solution I outlined above and in #185 could prevent the loss of data.

> If going by the mantra of physical connections being less critical (or not at all)

To my understanding, the lessened criticality applies to the tags of messages, not their payload.

I still don't understand the point of using "fast." If you want messages coming through a physical connection to be handled immediately, then you'd simply set the delay on the connection to zero. That way, the resulting events should always acquire T as their tag, meaning they could be handled immediately even if the "fast" option wasn't enabled. Is there something I'm missing?

edwardalee commented 3 years ago

The "fast" issue arises even when the delay on the connection is zero. These are completely orthogonal. Setting "fast" to true simply allows logical time to get ahead of physical time.

lhstrh commented 3 years ago

I understand what happens when "fast" is enabled, but since you brought it up, I simply questioned the use of "fast" in a federated execution.

As to achieving an orderly shutdown sequence in the centralized case, how about the following: if a federate calls stop, it instructs the RTI to terminate the execution of all the other federates in topological order. The RTI must not shut down a downstream federate until all upstream federates have asserted they are done sending messages. Then there could still be a race between in-flight messages from terminated senders and the instruction from the RTI to terminate the next federate. This could be handled by requiring that the receiving federates confirm receipt of a special last message before letting the sender report to the RTI that it's done executing.
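A sketch of the ordering part of this idea, using Python's standard `graphlib` module (Python 3.9+). The `upstream` mapping and the function name are made up for illustration; the sketch only computes an order in which the RTI could terminate federates, and does not model the in-flight message handshake.

```python
from graphlib import TopologicalSorter


def shutdown_order(upstream):
    """Return federates in an order where every sender precedes its receivers.

    `upstream` maps each federate to the set of federates that send to it.
    TopologicalSorter raises graphlib.CycleError if the topology has a cycle.
    """
    return list(TopologicalSorter(upstream).static_order())


# Example chain: a -> b -> c
order = shutdown_order({"b": {"a"}, "c": {"b"}})
```

For the chain above the order is a, then b, then c, so b is not terminated until a has finished sending.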

Soroosh129 commented 3 years ago

For physical connections, the federate itself can send this stop directly to other federates by sending an EOF. The receiving federate will not stop until (1) all incoming physical connections are closed, and (2) it receives a stop request from the RTI in the case where there are logical connections. Both conditions must be met. My point is that we wouldn't need the RTI since closing a physical connection does not automatically result in shutdown of the downstream federate.

lhstrh commented 3 years ago

Do we have an implementation of distributed coordination yet? I was under the impression that we didn't, and therefore sketched a solution for the centralized case that uses an RTI. I agree that federates can also implement the solution among themselves. But before we discuss implementation details, does my proposal achieve the behavior we want?

> My point is that we wouldn't need the RTI since closing a physical connection does not automatically result in shutdown of the downstream federate.

Sorry, I didn't understand the premise (but I think I already agreed with the conclusion).

Soroosh129 commented 3 years ago

I agree with your proposal. However, physical connections don't have a say in initiating a stop or shutdown. They can only impose a physical delay at the time of shutdown once it is already decided.

Also, I think for physical connections, very little would change in the distributed coordination case.

edwardalee commented 3 years ago

My proposal:

A timeout value given as a target parameter will be treated like stop() in that it does not actually determine the logical time at which a federated execution stops in the case that there are physical connections. Instead, it will impose an upper bound on the timestamp of any physical message launched into the network. When a federate reaches its timeout, it will send an EOF on each outgoing physical message socket.

A federate that has incoming physical connections will keep advancing its logical time and processing events past the timeout value as long as any incoming physical connection is still open. When it receives an incoming EOF on such a connection, it will close it. When the last one is closed, it will stop.

This policy preserves all messages. Moreover, every message has timestamps assigned the same way (there are no special end-of-life timestamps bunched around the timeout time). The price is that the logical stop time is imprecise.
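The receiver side of this proposal could be sketched as follows. The class and method names are hypothetical and no real sockets are involved; the sketch only tracks which incoming physical connections are still open and whether the federate may stop.

```python
class PhysicalReceiver:
    """Tracks open incoming physical connections for one federate (illustrative)."""

    def __init__(self, incoming, timeout):
        self.open_connections = set(incoming)  # senders that have not sent EOF
        self.timeout = timeout                 # logical timeout value

    def on_eof(self, sender):
        """An EOF is a promise of no further messages; close that connection."""
        self.open_connections.discard(sender)

    def may_stop(self, logical_time):
        """Stop only past the timeout and once every incoming connection is closed."""
        return logical_time >= self.timeout and not self.open_connections


rx = PhysicalReceiver(incoming={"y", "z"}, timeout=100)
rx.on_eof("y")
still_running = not rx.may_stop(150)  # z has not sent EOF yet
rx.on_eof("z")
stopped = rx.may_stop(150)
```

Note how the federate keeps running at logical time 150, well past the timeout of 100, until the last EOF arrives. This is the imprecise stop time mentioned above.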

lhstrh commented 3 years ago

I'm not sure I agree that all messages are preserved with this solution. It appears to me that rather than messages being lost because the receiver stopped executing, the messages now get lost because the sender refuses to launch them into the network.

I think the only point(s) in the topology where it makes sense for a federate to have a predefined tag at which no more messages are being sent would be one(s) where there are no incoming physical connections. For all those that do have incoming connections, it seems odd to me that reactions would still be triggered but their effects be ignored. I think we should not ignore those effects and instead have a staggered kind of shutdown sequence like I proposed earlier.

edwardalee commented 3 years ago

In my proposal, the sender will refuse to launch the message only if its timestamp is greater than the timeout time.

Also, I'm not proposing that the effects are ignored at the receiving end. I'm proposing that the receiving end extend its stop time until it is sure it has received all incoming messages (it receives an EOF on each incoming socket).

lhstrh commented 3 years ago

> In my proposal, the sender will refuse to launch the message only if its timestamp is greater than the timeout time.

Then I understood the proposal correctly. I don't think the sending of messages should ever be refused. I mean, by the same token, messages also only get lost when the system is in the process of terminating. If this is not considered a problem, then we don't need a solution at all.

> I'm proposing that the receiving end extend its stop time until it is sure it has received all incoming messages (it receives an EOF on each incoming socket).

I think that a federate should not send an EOF before it has received an EOF from all upstream senders, or else it will have to ignore the effects of reactions triggered by messages from upstream federates. I don't see how this is any better than receivers shutting down while they are still being sent messages...

edwardalee commented 3 years ago

Revised proposal (and a resulting problem):

A timeout value given as a target parameter will be treated like stop() in that it does not actually determine the logical time at which a federated execution stops in the case that there are physical connections. For a federate with no incoming physical connections, it will impose an upper bound on the timestamp of any physical message launched into the network. When such a federate reaches its timeout, it will send an EOF on each outgoing physical message socket. (Note that it is already true in a non-federated execution that events that you try to schedule to occur after the timeout time do not get put on the event queue.)

A federate that has incoming physical connections will keep advancing its logical time and processing events past the timeout value as long as any incoming physical connection is still open. When it receives an incoming EOF on such a connection, it will close it. When the last one is closed, it will stop. At this point, it will send an EOF.

This policy preserves all messages that would have been sent if the program is not federated.

However, we now have a problem. If I have a cycle of physical connections, then no timeout parameter will cause the program to stop executing. It will never stop because neither federate can send EOF until it receives EOF.
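The deadlock can be seen with a small fixed-point computation. The sketch assumes every federate has reached its timeout and may send EOF only after receiving EOF from all of its physically upstream federates; the function name and the `upstream` mapping are made up for illustration.

```python
def federates_that_terminate(upstream):
    """Return the set of federates that eventually send EOF and stop.

    `upstream` maps every federate to the set of federates with a physical
    connection into it. A federate terminates once all of its upstream
    federates have terminated.
    """
    done = set()
    changed = True
    while changed:
        changed = False
        for fed, ups in upstream.items():
            if fed not in done and ups <= done:
                done.add(fed)
                changed = True
    return done


chain = federates_that_terminate({"a": set(), "b": {"a"}})
cycle = federates_that_terminate({"x": {"y"}, "y": {"x"}})
```

For the chain, both federates terminate. For the two-federate cycle the result is empty: neither x nor y can ever send EOF.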

lhstrh commented 3 years ago

This is an interesting problem, and it seems fundamental to the nature of physical connections.

Perhaps we should treat all connections as logical connections once termination has been set in motion?

Soroosh129 commented 3 years ago

Is there no cycle detection for physical connections? Forcing the user to impose an after on one of the connections in case there is a cycle will get rid of this problem.

lhstrh commented 3 years ago

We only reject zero-delay feedback loops, which can only exist through logical connections.

The problem is that messages that go through physical connections get re-timestamped. This prevents causality loops in a non-federated execution context, but may also prevent the program from shutting down if a federate can only send EOFs once it has received EOFs from all upstream federates.

I think we need to be careful here; whether putting a delay in a cycle of physical connections helps avoid the problem depends on physical execution times and network latency, I think.

Soroosh129 commented 3 years ago

Thinking about it a bit more, the solution to this problem might be to still send EOF on all outgoing physical connections with no regard to upstream federates at the conclusion of time initial_t + timeout where initial_t is the agreed upon starting logical time between federates. However, the corresponding physical action for incoming messages should be scheduled at max { t+d, min{T, initial_t + timeout + 1}} where t is the timestamp assigned by the sender and T is the physical time at the receiver.

Edit: I don't think it is necessary to advance logical time past initial_t + timeout + 1 in this scenario.
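Written out as a sketch, with a hypothetical function name and integer times, the proposed timestamp for an incoming physical message would be:

```python
def shutdown_receive_time(t, d, T, initial_t, timeout):
    """max{t + d, min{T, initial_t + timeout + 1}} as proposed above."""
    return max(t + d, min(T, initial_t + timeout + 1))
```

With initial_t = 0, timeout = 100, t = 90, and d = 5, a message arriving at physical time T = 200 would be scheduled at 101 rather than 200, while one arriving at T = 97 would be scheduled at 97 as usual.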

edwardalee commented 3 years ago

There are two problems with this, however. First, there may be more than one message with timestamp initial_t + timeout + 1. Also, instead of + 1, this should probably use a microstep. To account for multiple messages, we have to be willing to go past initial_t + timeout by an arbitrary number of microsteps. I don't really see how this would be any better than going past initial_t + timeout in metric time and keeping the semantics of physical connections uniform, even during shutdown.

The second problem is that a federate may send an EOF, and then receive an input message that triggers sending another output message on the channel where it just sent an EOF. But that socket will be closed. This is the sort of message dropping that @lhstrh is objecting to.

Soroosh129 commented 3 years ago

For the second problem, I think that sending a message to another federate on a physical connection is in fact scheduling an action on the remote machine at a logical time that is almost always strictly larger than t of the sender. However, sending a message is treated as an ordinary SET function on the sender, where it does not incur even a microstep delay. Using a SET function on a port on a physical connection leads to the conclusion that sending a message on a port should mean that the effect is always seen on the other end. I think the first issue is that we are hiding the nature of setting a physical port from the sender.

For the first problem, I think that initial_t + timeout + 1 should be a special timestamp that disallows microsteps. Therefore, downstream reactions on the federate can be triggered. However, schedule functions will have no effect (including sending messages over both physical and logical networks).

Soroosh129 commented 3 years ago

In the process of implementing logical connections in the decentralized case, @edwardalee and I found that some issues would still need to be resolved, some of which overlap with the issues we found with physical connections.

Semantics

To clarify these points, I will use two trivial examples of logical connections in the decentralized case. For the purpose of this discussion, assume t_s is the logical time of the sender at the time the message is sent, d_s is the delay assigned by the sender using after, t_d is the logical time and T_d is the physical time at the receiver when the message arrives.

Example 1: Assume the following logical connection is defined:

Foo.out -> Bar.in

Note that there is no after delay on the connection. For this example, imagine that an STP threshold of 20 msec is present.

At the receiver, the message will carry a timestamp of t_s. If the threshold is accurate, t_s > t_d and the corresponding reaction should be triggered at t_s. If t_s <= t_d, a fault condition has occurred (because STP is too small) and a new timestamp should be assigned.

As per our previous discussions, logical connections should be implemented using existing language features. Let us look at physical and logical actions and how they can manage the fault condition.

Using a physical action, the timestamp assigned to the message received will be max {t_s, T_d}. If T_d > t_s, the logical timestamp of the invoked reaction will be T_d. However, T_d > t_s does not immediately translate into a fault condition. As you might recall, the fault condition is t_s <= t_d. If the STP threshold is accurate, we could end up with a situation where t_s < T_d but t_s > t_d. This can happen because the STP threshold only ensures that t_d + 20 msec < T_d. Thus, the true fault condition would be to check if t_s + 20 msec < T_d. If so, assign T_d; if not, assign t_s. This cannot be easily realized using physical actions. Moreover, even if a true fault condition has occurred (i.e., t_s <= t_d), T_d would still not be the smallest possible allowable timestamp since T_d > t_d. The smallest possible allowable timestamp would be T_d - 20 msec. This would make physical actions unsuitable for this type of connection unless significant modifications were made.

Using a logical action, the timestamp assigned to the message received can be max {t_s, t_d}. This can be achieved by calling schedule on the corresponding action with a delay of max {0, t_s - t_d}. Since t_s > t_d holds in the fault-free case, the fault condition occurs when t_s - t_d <= 0, that is, when the computed delay is 0. This condition can be checked when the receiver gets the message, and a lateness property can be set to be equal to t_d - t_s before calling schedule with a 0 delay (although this would incur a microstep delay, which might be undesirable). Using logical actions in this manner has the benefit of tolerating certain violations of the STP threshold. Imagine that the event queue of the receiver was empty at time t_d' < t_s. The receiver would then keep the logical time t_d' for longer than the STP threshold. Even if a message takes more than 20 msec to arrive, the receiver can still schedule the corresponding action at time t_s.
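A minimal sketch of the receiver-side logic for the logical-action approach, assuming integer timestamps and a hypothetical helper that returns what would be passed to schedule. It is not the actual message-handling code.

```python
def schedule_incoming(t_s, t_d):
    """Compute the delay, lateness, and fault flag for an incoming message.

    t_s: timestamp carried by the message (sender's logical time).
    t_d: receiver's current logical time when the message arrives.
    """
    delay = max(0, t_s - t_d)      # delay handed to schedule on the logical action
    lateness = max(0, t_d - t_s)   # how far the receiver has already moved past t_s
    fault = t_s <= t_d             # the fault condition from Example 1
    return delay, lateness, fault
```

For instance, a message stamped 120 arriving while the receiver is at logical time 100 yields a delay of 20 and no fault, whereas a message stamped 100 arriving at logical time 130 yields a delay of 0, a lateness of 30, and a fault.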

Example 2: Assume the following logical connection is defined:

Foo.out -> Bar.in after 20 msec;

Note that there is a 20 msec delay imposed using after on the connection. In this case, the delay is large enough so that the STP can be 0. In this scenario, a physical action is sufficient since we can assume that t_s + d_s > T_d unless d_s is not large enough. However, this can also be achieved using a logical action. The receiver can check its logical time, which shall not exceed the physical time, to the same effect. Currently, the receiver calculates a delay as (t_s + d_s) - t_d and feeds that delay to a schedule function. In the case where the message is late, this delay will be negative, which means d_s is not large enough and a fault condition has occurred. This can be indicated by scheduling an action with 0 delay and setting the lateness property of the action to be |(t_s + d_s) - t_d| (this would still incur a microstep delay). This would have the same side-effect as before, where if logical time for some reason lags behind the physical time at the receiver, it is possible that a message that would have been late using a physical action is now on-time using a logical action. However, I don’t think this is an error condition. I think that we would have an error condition only if the receiver cannot preserve the timestamp of the message.

Further Issues

There are three issues on top of the discussion above that need further attention:

* **Fast mode**: As @edwardalee has pointed out, "fast mode, it seems, always violates the constraint that a federate can’t advance logical to t until T > t + STP. [Therefore], fast mode should be disallowed with decentralized coordination."
* **Starvation**: Currently, it is the case that if keepalive is set to false, a federate with no events on the reaction queue can go ahead and exit (absent the existence of timers). However, as @edwardalee has pointed out, starvation should not be a property of the federate, but a global property of the federation. In other words, the federates can exit only if the federation has no events on any of its reaction queues.
* **Timeout**: Similar to physical connections, it is a challenge to handle late messages that are sent near the timeout because the lateness property is not bounded.

Miscellaneous Issues

To be consistent with the concepts of real-time systems, we should use tardiness/tardy instead of lateness/late. This is because lateness is defined as release_time - deadline, which can be negative, while tardiness is defined as max { 0 , release_time - deadline }, which is never negative. Note that both are relative to deadline but tardiness is the lesser of two evils.
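The difference between the two quantities, using the definitions given above (the function names are just for illustration):

```python
def lateness(release_time, deadline):
    """release_time - deadline; negative when the event is early."""
    return release_time - deadline


def tardiness(release_time, deadline):
    """max{0, release_time - deadline}; never negative."""
    return max(0, release_time - deadline)
```

An event released 10 time units before its deadline has a lateness of -10 but a tardiness of 0, while one released 10 units after the deadline has 10 for both.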

Edit: To prevent incurring a microstep delay, it is possible to call schedule on the logical action with a delay of max {1, t_s - t_d} where 1 is in nanoseconds.

lhstrh commented 3 years ago

On this:

> To be consistent with the concepts of real-time systems, we should use tardiness/tardy instead of lateness/late. This is because lateness is defined to be as release_time - deadline which can be negative while tardiness is defined as max { 0 , release_time - deadline } which is always positive. Note that both are relative to deadline but tardiness is the lesser of two evils.

I think it's a great idea to use these terms when referring to the quantities involved in deadline violations, just as you describe. We can use these names for the struct members in which we store those values.

When it comes to the LF syntax for the handlers, my view is that both error conditions (i.e., a reaction being invoked too late with respect to physical time vs. a reaction observing events that have incorrect tags because a message between two federates arrived too late) can be described as deadline violations. The difference between the two situations is that the timelines involved are different. The original deadline construct compares logical time against physical time and could therefore be thought of as a "physical deadline," while the deadline observed in Ptides compares the logical time of receipt against the logical time of sending and hence could be thought of as a "logical deadline." Concretely, I propose to reuse the logical and physical keywords to make the distinction going forward. To maintain backward compatibility, the modifier should be optional and the default interpretation be physical (which also makes sense because it applies under all circumstances, whereas a logical deadline only applies in a distributed federation).

Moreover, because these different kinds of deadline violations are orthogonal (both can happen at the same time), we should invoke both handlers in case both violations happen simultaneously, and invoke them in declaration order.

Soroosh129 commented 3 years ago

Perhaps, but this is not what physical actions do. I think what you're trying to do is assign an alternative semantics to physical connections that is not based on physical actions. I'm not sure this is a good idea. It would make the semantics less "modular." That said, if we create a special schedule function, we might as well make this adjustment, too.

What I was trying to suggest was to use logical actions to implement logical connections in the decentralized case, where it seems to be more compatible. The only outstanding issue with that would be: who should be responsible for lateness/tardiness calculations? I propose that the receiver can check this fault condition and assign the lateness/tardiness field of the action before calling schedule with a delay of 1 nsec. Once we have the mechanism to send microsteps over the network, we can call schedule with a 0 delay.

> Concretely, I propose to reuse the logical and physical keywords to make the distinction going forward.

I do like the concept of logical vs physical deadlines.

lhstrh commented 3 years ago

On this:

> I think that only if the receiver cannot preserve the timestamp of the message we would have an error condition.

If by "preserve the timestamp" you mean "assign the tag that is implied by the logical time at the sender's end plus the specified logical delay," then I agree with this statement. However, I fail to see what the difficulty is in detecting this condition in schedule. Could you elaborate on this?

lhstrh commented 3 years ago

On this:

> There are three issues on top of the discussion above that needs further attention:
>
> * **Fast mode**: As @edwardalee have pointed out, "fast mode, it seems, always violates the constraint that a federate can’t advance logical to t until T > t + STP. [Therefore], fast mode should be disallowed with decentralized coordination."

I agree.

> * **Starvation**: Currently, it is the case that if `keepalive` is set to `false`, a federate with no events on the reaction queue can go ahead and exit (absent the existence of timers). However, as @edwardalee has pointed out, starvation should not be a property of the federate, but a global property of the federation. In other words, only if the federation has no events on any of its reaction queues, the federates can exit.

Yes. Good point. And now we have another consensus problem on our hands :-)

> * **Timeout**: Similar to physical connections, it is a challenge to handle late messages that are sent near the `timeout` because the `lateness` property is not bounded.

Yes, this seems covered by the discussion on the wiki.

Soroosh129 commented 3 years ago

Currently, calling schedule on a logical action can only take a delay greater than or equal to zero, which can be calculated as t_s - t_d when t_s > t_d. However, when t_s <= t_d, your only option at that point would be to call schedule with a 1 ns delay. Unless the schedule function is modified, there is currently no way for a schedule function acting on a logical action to detect this fault condition. This is in contrast to using a physical action for a logical connection, where T_d > t_s can technically be a fault condition at first glance. What I was describing was why handling this fault condition inside schedule acting on a physical action is insufficient.

Do you have a suggestion on how schedule acting on a logical action might detect the fault condition? Currently, my inclination is to put this error detection at the low-level function that handles incoming timed messages and calls schedule.
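
As a sketch of where that detection could live, the function below computes the delay that the message-handling code would pass to schedule and flags the fault case at the same time. The names `delay_result_t` and `delay_for_timed_message` are hypothetical, and times are assumed to be nanosecond counts; this is not the existing runtime code.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t instant_t;   // assumed: nanoseconds
typedef int64_t interval_t;  // assumed: nanoseconds

// Hypothetical result type: the delay to hand to schedule, plus a fault flag.
typedef struct {
    interval_t delay;  // delay to pass to schedule
    bool fault;        // true if the intended timestamp could not be preserved
} delay_result_t;

// t_s: intended timestamp carried by the incoming message.
// t_d: current logical time at the destination federate.
delay_result_t delay_for_timed_message(instant_t t_s, instant_t t_d) {
    delay_result_t r;
    if (t_s > t_d) {
        r.delay = t_s - t_d;  // the intended timestamp can be preserved
        r.fault = false;
    } else {
        r.delay = 1;          // fall back to the 1 ns delay described above
        r.fault = true;       // and record that the message arrived too late
    }
    return r;
}
```

Because the check happens before schedule is called, schedule itself would not need to change under this approach.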

lhstrh commented 3 years ago

Ah, OK, I understand the problem now. I agree, the tardiness needs to be flagged in the generated code upon calling schedule.

I'm thinking we probably want to log the tardiness of the event in the event_t that gets pushed onto the event queue. Is that right? One easy way to do this is to have schedule return a pointer to the event it has scheduled (and NULL if it didn't schedule anything, as per the minimum spacing requirement under a "drop" policy, if specified).
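
A rough sketch of that interface, under stated assumptions: `sketch_event_t` and `sketch_schedule` are hypothetical stand-ins for event_t and schedule, the event queue itself is omitted, and the caller owns the returned event.

```c
#include <stdint.h>
#include <stdlib.h>

typedef int64_t instant_t;   // assumed: nanoseconds
typedef int64_t interval_t;  // assumed: nanoseconds

// Hypothetical stand-in for event_t with a field for the logged tardiness.
typedef struct {
    instant_t time;        // logical time at which the event will occur
    interval_t tardiness;  // filled in by the caller when the message was late
} sketch_event_t;

// Returns the newly created event, or NULL if the event was dropped because
// it falls within min_spacing of the previous event ("drop" policy).
// NULL is also returned if allocation fails.
sketch_event_t *sketch_schedule(instant_t now, interval_t delay,
                                instant_t last_time, interval_t min_spacing) {
    instant_t when = now + delay;
    if (when - last_time < min_spacing) {
        return NULL;  // dropped: too close to the previous event
    }
    sketch_event_t *e = malloc(sizeof *e);
    if (e == NULL) {
        return NULL;
    }
    e->time = when;
    e->tardiness = 0;  // the caller can overwrite this after scheduling
    return e;
}
```

The code that handles the incoming message could then set `tardiness` on the returned pointer whenever it had to fall back to a later timestamp.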

edwardalee commented 3 years ago

On Oct 16, 2020, at 2:54 PM, Soroush Bateni notifications@github.com wrote:

I do like the concept of logical vs physical deadlines.

A downside of using this terminology is that we will have to explain the difference between the two before the user will be able to use the deadline keyword. But the deadline keyword is VASTLY simpler than the tardy keyword (or whatever we call it). So before we adopt this proposed language, I would like to see a draft of the documentation.

Edward

lhstrh commented 3 years ago

A downside of using this terminology is that we will have to explain the difference between the two before the user will be able to use the deadline keyword. But the deadline keyword is VASTLY simpler than the tardy keyword (or whatever we call it). So before we adopt this proposed language, I would like to see a draft of the documentation.

We actually don't have to explain it because the extra keyword can be omitted altogether, in which case it means the same as the existing deadline we have today. In other words, the current documentation will not have to change.

edwardalee commented 3 years ago

How will you explain that a physical deadline takes an argument and logical deadline does not?

lhstrh commented 3 years ago

Somewhere in the documentation of federated execution we get to introduce logical deadlines -- most users probably won't even get that far. We get to describe them as similar to the deadlines described in the section about deadlines, except they use the logical modifier and they take no argument. The reason we call them "logical" deadlines is that they relate two logical timelines (those of two federates) rather than the logical timeline of the reactor and the physical time reported by the platform it runs on.

edwardalee commented 3 years ago

OK, I'll try here to explain logical deadline. To truly mirror physical deadline (or just deadline), logical deadline needs to be able to take an argument. With the following description, using the keyword logical deadline (finally!) makes sense to me:

A physical deadline d or just deadline d specifies an alternative reaction body b2 that should be invoked if the physical time at which the normal reaction b1 would be invoked is greater than the logical time of the trigger by at least d.

A logical deadline d specifies an alternative reaction body b2 that should be invoked if the logical time at which the normal reaction b1 would be invoked would be greater than the logical time of the trigger by at least d. If d is not given, then it is assumed to be zero.

The Lingua Franca runtime normally assures that a reaction is always invoked exactly at the logical time of its trigger. So how could a logical deadline handler b2 ever be invoked? Barring some unforeseen bug in the implementation, there is only one circumstance: the trigger is an input port of a federate, the controller is set to decentralized, and one or more of the assumptions made by the decentralized controller has been violated.

lhstrh commented 3 years ago

Yes, this sounds right to me, except I think the target property we reserved was coordination (not controller).

Soroosh129 commented 3 years ago

A logical deadline d specifies an alternative reaction body b2 that should be invoked if the logical time at which the normal reaction b1 would be invoked would be greater than the logical time of the trigger by at least d. If d is not given, then it is assumed to be zero

I don't think this is correct for two reasons. First, in the case of decentralized logical connections, the reaction being invoked late is just an effect of a trigger being invoked late. Therefore, "the logical time at which the normal reaction b1 would be invoked would be greater than the logical time of the trigger" would be a critical bug in the runtime in all cases. Second, I think a d>0 in your definition is a critical failure no matter how you look at it. I originally thought that d could substitute the after on the connection(s) or be added to the after. To me, it makes sense to define the logical deadline of d to be:

A logical deadline d specifies an alternative reaction body b2 that should be invoked if the logical time at which an input trigger of the normal reaction b1 is triggered is later than the original intended logical time. This intended logical time is calculated as t_s + d_s + d.

To add to this discussion, one thing we haven't discussed yet is: In what order should the alternative reactions be invoked? Given a specific order such as logical deadline and then physical deadline, how should the effects be seen? Can the reaction handling a logical deadline violation trigger the original reaction like a physical deadline violation handler can? Should each be able to set outputs of the original reaction individually and cause side-effects, or should their effect be gathered (in this case, the physical deadline handler will generally have the dominant effect) and then acted upon?

lhstrh commented 3 years ago

I had suggested this:

Moreover, because these different kinds of deadline violations are orthogonal (both can happen at the same time), we should invoke both handlers in case both violations happen simultaneously, and invoke them in declaration order.
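
A minimal sketch of this dispatch rule follows. It is not how the runtime is implemented: `handler_t`, `invoke_violation_handlers`, and the two example handler bodies are hypothetical, and the example bodies only append a letter to a caller-supplied string so that the invocation order is observable.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { PHYSICAL_DEADLINE, LOGICAL_DEADLINE } violation_kind_t;

// Hypothetical handler record; array order is the declaration order.
typedef struct {
    violation_kind_t kind;
    void (*body)(void *ctx);
} handler_t;

// Invokes, in declaration order, every handler whose violation occurred.
// Returns the number of handlers invoked.
size_t invoke_violation_handlers(const handler_t *handlers, size_t n,
                                 bool physical_violated, bool logical_violated,
                                 void *ctx) {
    size_t invoked = 0;
    for (size_t i = 0; i < n; i++) {
        bool violated = (handlers[i].kind == PHYSICAL_DEADLINE)
                            ? physical_violated
                            : logical_violated;
        if (violated) {
            handlers[i].body(ctx);
            invoked++;
        }
    }
    return invoked;
}

// Example handler bodies: each appends one letter to the string in ctx,
// which must have room for the extra characters.
static void append_char(void *ctx, char c) {
    char *buf = ctx;
    size_t k = strlen(buf);
    buf[k] = c;
    buf[k + 1] = '\0';
}
void on_logical(void *ctx) { append_char(ctx, 'L'); }
void on_physical(void *ctx) { append_char(ctx, 'P'); }
```

With the logical handler declared first and both violations present, the logical handler runs before the physical one; swapping the declarations swaps the order.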

Soroosh129 commented 3 years ago

I had suggested this:

Moreover, because these different kinds of deadline violations are orthogonal (both can happen at the same time), we should invoke both handlers in case both violations happen simultaneously, and invoke them in declaration order.

Yes but what about the side effects? The following questions are still unanswered to me:

Can the reaction handling a logical deadline violation trigger the original reaction like a physical deadline violation handler can?

Should each be able to set outputs of the original reaction individually and cause side-effects, or should their effect be gathered (in this case, the physical deadline handler will generally have the dominant effect) and then acted upon?

lhstrh commented 3 years ago

Can the reaction handling a logical deadline violation trigger the original reaction like a physical deadline violation handler can?

Can it? How? I'm unaware of this capability.

Should each be able to set outputs of the original reaction individually and cause side-effects, or should their effect be gathered (in this case, the physical deadline handler will generally have the dominant effect) and then acted upon?

As per my understanding, the execution of these alternative reaction bodies is handled no differently than the execution of a regular reaction.

Soroosh129 commented 3 years ago

Can the reaction handling a logical deadline violation trigger the original reaction like a physical deadline violation handler can?

Can it? How? I'm unaware of this capability.

I think I didn't describe the problem clearly enough. Here is the comment for invoking the deadline reaction handler in reactor.c verbatim:

        // [...] Note that the violation reaction will be invoked
        // at most once per logical time value. If the violation reaction triggers the
        // same reaction at the current time value, even if at a future superdense time,
        // then the reaction will be invoked and the violation reaction will not be invoked again.

Here, the violation of the logical deadline is related to the input triggers of the reaction and not the reaction itself. If a reaction such as reaction(input_port, action) -> output is triggered again by schedule(action) in the body of the logical deadline handler, the handler should technically be invoked again since the trigger is still late. But I sort of answered my own question here. In any case, the logical deadline handler should be invoked only once for each reaction in the current logical time.
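
The "only once per reaction per logical time" rule could be enforced with a small per-reaction guard. This is a sketch under assumptions: `late_guard_t` and `should_invoke_handler` are hypothetical names, and the guard compares only the time value, not the microstep, so a re-trigger at a later superdense time with the same time value does not invoke the handler again.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical per-reaction state; initialize handled_before to false.
typedef struct {
    int64_t last_handled_time;  // logical time of the last handler invocation
    bool handled_before;        // false until the handler has run at least once
} late_guard_t;

// Returns true if the handler should run at logical time `now`, and records
// the invocation. Returns false if it already ran at this time value.
bool should_invoke_handler(late_guard_t *g, int64_t now) {
    if (g->handled_before && g->last_handled_time == now) {
        return false;
    }
    g->handled_before = true;
    g->last_handled_time = now;
    return true;
}
```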

edwardalee commented 3 years ago

On Oct 18, 2020, at 2:08 PM, Soroush Bateni notifications@github.com wrote:

A logical deadline d specifies an alternative reaction body b2 that should be invoked if the logical time at which an input trigger of the normal reaction b1 is triggered is later than the original intended logical time. This intended logical time is calculated as t_s + d_s + d.

Yes, this was my original interpretation as well. However, this interpretation is different enough from a deadline that I really don’t think we should use the same keyword for both if we take this interpretation. The interpretation I proposed is a (somewhat tortured) way to justify using the same keyword.

Edward

lhstrh commented 3 years ago

Another way to describe the error condition we're discussing is divergence because there is a disparity between the intended tag of an input and the logical time at which it triggers the reaction. I think it also makes sense that under normal circumstances divergence should not take place, and that in a non-federated setting it isn't even possible. I think the term divergence contrasts nicely with deadline.

edwardalee commented 3 years ago

Divergence to me implies going to infinity (in technical usage) and branching off (in non-technical usage). Neither one seems right to me.

lhstrh commented 3 years ago

How about discrepancy?

the state or quality of being discrepant or in disagreement, as by displaying an unexpected or unacceptable difference; inconsistency: The discrepancy between the evidence and his account of what happened led to his arrest.
an instance of difference or inconsistency: There are certain discrepancies between the two versions of the story.

edwardalee commented 3 years ago

Makes no mention of time... "time discrepancy" could work. Or "logical time discrepancy". But these are rather verbose.

lhstrh commented 3 years ago

Now, that seems like something that could be clarified in the programming manual... This situation is only applicable in a federated execution using Ptides, in which context it should be clear that inconsistencies arise from the violation of assumptions about time.

lhstrh commented 3 years ago

Paraphrasing the definition that @Soroosh129 wrote, we could say:

A discrepancy arises when a reaction is triggered by an input that originates from another federate and has an intended tag that does not match the current logical time. The intended tag is calculated as t_i = t_s + d_s + d, where t_s is the logical time at the source when the message was injected into the network, d_s is the deadline specified for the reaction that injected the message into the network, and d is the logical delay along the connection specified using after. Specifically, a discrepancy occurs when t_i <= t_d, where t_d is the logical time at the destination when the message is received, in which case the event is observed not at the intended tag, but at the earliest possible tag that is strictly greater than t_d. A discrepancy handler (similar to a deadline miss handler) specifies an alternative reaction body b2 to be invoked instead of the normal reaction body b1 when a discrepancy occurs.
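
The definition above can be sketched in a few lines of C. Several things here are assumptions made for illustration: the names `sketch_tag_t`, `arrival_t`, and `resolve_arrival` are hypothetical, times are nanosecond counts, and "the earliest possible tag strictly greater than t_d" is modeled as the same time value with the microstep incremented by one.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical superdense tag: a time value plus a microstep.
typedef struct {
    int64_t time;
    uint32_t microstep;
} sketch_tag_t;

typedef struct {
    bool discrepancy;       // true if the intended tag could not be used
    sketch_tag_t assigned;  // tag at which the event is actually observed
} arrival_t;

// t_s: logical time at the source when the message was sent.
// d_s: deadline of the sending reaction.
// d:   logical delay on the connection (after).
// now: tag at the destination when the message is received (t_d).
arrival_t resolve_arrival(int64_t t_s, int64_t d_s, int64_t d,
                          sketch_tag_t now) {
    int64_t t_i = t_s + d_s + d;  // intended logical time
    arrival_t a;
    if (t_i <= now.time) {
        // Discrepancy: use the next tag after the current one (assumption).
        a.discrepancy = true;
        a.assigned.time = now.time;
        a.assigned.microstep = now.microstep + 1;
    } else {
        a.discrepancy = false;
        a.assigned.time = t_i;
        a.assigned.microstep = 0;
    }
    return a;
}
```

For example, with t_s = 50, d_s = 10, d = 20 and a destination already at time 100, the intended time is 80, so the sketch reports a discrepancy and assigns the tag (100, 1).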

edwardalee commented 3 years ago

Yes, this documentation is fine. The code itself doesn't hint at what is involved, however, so I still prefer "tardy", which has equally clear documentation.

lhstrh commented 3 years ago

Except for the fact that if you see tardy next to deadline there is no way of telling what the difference is. Both seem to refer to time. OK, now what?

As @Soroosh129 mentioned, "tardiness" refers to the amount of time by which a deadline was missed, which is a relevant piece of information to have in an ordinary deadline handler. This quantity is not the same as the amount by which an intended tag differs from the one that it was assigned; let's call this quantity "disparity." When triggered, a reaction could both have missed a deadline and be witnessing a discrepancy. In that case, there are two distinct quantities that are relevant: tardiness and disparity. For this reason alone, I'm convinced that we should not use the term "tardy" to deal with inconsistent tags.
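
To show that the two quantities are computed from different inputs, here is a small sketch. The function names and signatures are hypothetical, times are assumed to be nanosecond counts, and tardiness is clamped to zero when the deadline was met.

```c
#include <stdint.h>

// Tardiness: how far physical time has run past the deadline, where the
// deadline is measured from the logical time of the trigger.
int64_t tardiness(int64_t physical_time, int64_t logical_time,
                  int64_t deadline) {
    int64_t late = physical_time - (logical_time + deadline);
    return late > 0 ? late : 0;
}

// Disparity: how far the assigned logical time is from the intended one.
// Physical time plays no role here.
int64_t disparity(int64_t intended_time, int64_t assigned_time) {
    return assigned_time - intended_time;
}
```

A single reaction invocation can have both values nonzero at once, which is the case the paragraph above describes.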

edwardalee commented 3 years ago

I don’t have a use case for a deadline on a reaction to incoming network messages. Do you? So this might not really be an issue. I think that any programmer who chooses to do this will have to understand the two mechanisms extremely well anyway, so it won’t matter much what they are called.

lhstrh commented 3 years ago

I don’t have a use case for a deadline on a reaction to incoming network messages. Do you?

Sure. Consider a reaction that depends on an upstream federate as well as a contained reactor; it can encounter a discrepancy because of the former and experience a deadline miss because of the latter.

I think that any programmer who chooses to do this will have to understand the two mechanisms extremely well anyway, so it won’t matter much what they are called.

As you pointed out, we don't even need to name it at all. We can use a symbol. But if we do use a human-readable name, then the term will inevitably appear in written text such as documentation and papers, in which case it matters a lot for the clarity of the exposition what words are being used. One of the joys of the terminology that we've established so far (which was at times a royal pain to figure out!) is that we can use it (e.g., "reactions", "triggers", "sources", "effects", etc.) easily in written text without all the time clarifying that we're talking about some reserved keyword with a separately defined meaning. Even without a precise technical definition, these terms make sense based on the unrefined preconceived understanding the average reader will have of them. I think this matters.
