Open byeonggiljun opened 11 months ago
I guess this is the federated/PingPongDistributed.lf
test case. In this case, I'm guessing that Pong
sends NET(FOREVER)
, which causes an earlier NDT(FOREVER)
to be sent to Ping
. This is why at (0,10)
it would choose to not send ABS
. Normally, this would not be a problem, but in this case, Ping
is also calling lf_request_stop()
. Assuming the stop time will be (0,10)
, this results in Pong
wanting to execute its shutdown reaction at (0,10)
. But it can't do this until it receives either an ABS
at (0,10)
from Ping
(or a TAG(0,10)
, but it won't get this because Ping
can't complete tag (0,10)
until it hears from Pong
at this tag).
The real problem here, I think, is that lf_request_stop()
invalidates the original contract of the NET(FOREVER)
message. Would it work to simply disable the optimization when lf_request_stop()
is called?
In a more general scenario, lf_request_stop()
might be called by some other reactor somewhere else in the system, which again will trigger this problem. But in this case, each federate should receive a message that stop is being requested, and it replies with a tag at which it is able to stop. Each such federate should disable this optimization at that tag.
Could a NET message later be invalidated by a physical action happening? Or would the presence of a physical action cause the NET not have a lower bound corresponding to the earliest possible occurrence of a physical action?
I think a physical action in an upstream federate prevents the NET(FOREVER) from being sent (it also results in a warning when using centralized control and a hint to set a parameter to control the frequency of interactions).
I think a physical action in an upstream federate doesn't cause the issue but a physical action in a downstream federate can invalidate a NET message.
I'm trying to figure out if there is another situation that causes invalidated NET messages.
A NET(g) message from reactor R is a promise that, absent any network inputs, R will not produce an output with a tag less than g.
I don't see how anything downstream of R, including any physical action, could possibly invalidate this.
Note that there is nothing special about g = FOREVER, so FOREVER should not be treated as a special case.
A message from upstream could result in a new NET(g') from R where g' < g. This is why the RTI calculates a transitive upstream NET before granting a TAG. The "transitive" aspect of this effectively calculates the least g' that might later be sent by R.
However, I discovered today that we are not calculating the transitive upstream NET correctly when there are after delays on the connections. The problem is the algorithm uses a "visited" flag to avoid examining an upstream reactor more than once. This is wrong because the least g' due to an upstream reactor R' depends not only on R's NET, but also on the path delays from R' to R. The current algorithm with the "visited" flag is assuming all paths have the delays.
I think this algorithm needs to be updated to use dynamic programming.
I think this is the cause for a bug report I recently submitted involving a fairly complex federated program where I couldn't find a way to trigger the bug with a simpler test.
A NET(g) message from reactor R is a promise that, absent any network inputs, R will not produce an output with a tag less than g.
In the example below, there is a physical action that triggers reaction_3 in Receiver. The user input invokes the physical action. And the trace below shows that NET can be invalidated in the current Lingua Franca.
The left is the RTI, the middle is Receiver, and the right is Sender. I used the 'master' and 'main' branches of LF and reactor-c to get this trace. After Receiver sends NET (2 sec, 0) (2 sec is the timeout time), the physical action schedules reaction_3 at (1,177,811,874 nsec, 0). This results in invalidating NET (2 sec, 0) by sending NET (1,177,811,874, 0).
Should NET (2 sec, 0) not be sent when there is a physical action?
Very Interesting. Technically, when Receiver sends NET(2 sec, 0), it does not violate the contract as stated because it produces no outputs at all. It may be that we need to strengthen the contract, but this could have other effects.
Does this actually create a problem for NDT? When Sender receives NDT(1.71 sec, 0), it will be in one of two states:
It's last completed time (time of the top-level enclave) is <= 1.71 sec. In this case, it can just update its threshold for skipping NET messages. It should also send a NET(1.8 sec, O) if it has completed the 1.7 sec tag (i.e., it is idle and the next event on its event queue has tag (1.8 sec, 0) >= 1.71 sec). I think the rule should be something like "send a NET(g1) upon receiving NDT(g2) if current tag g3 < g2 and g1 > g2." (with enclaves, g1 would have to be the transitive next event tag).
It's last completed time (time of the top-level enclave) is > 1.71 sec. In this case, it could just respond with an LTC(g), where g is its last completed tag. Alternatively, if it is idle, it could respond with NET(g'), where g' is the tag of the earliest event on its event queue (with enclaves, this would have to be the transitive next event tag). This should then suffice for the RTI to grant a TAG(1.71 sec, 0) to Receiver.
If there is an after delay D on the connection, then the threshold used at Sender should be 1.7 - D.
Note that this will slow down the reaction to the physical action in Receiver. And an after delay won't help in this case, thereby nullifying the CAL theorem advantage of relaxing consistency. Consider for example the ADAS case. Slowing down the reaction at Receiver could be a really bad thing to do. So maybe this optimization should be disabled by default when the downstream reactor has a physical action?
Note that all these corner cases suggest we need a good formalism for reasoning about them. Ideally, we could prove theorems. This will be tricky because of the multiplicity of timelines. I would suggest first assuming perfectly synchronized clocks. Then all logical time relations become constraints on physical time relations with just one physical timeline.
Also, I'm seeing real potential for a nice Ph.D. thesis here... :-)
Thank you for sharing your insights, professor!
Does this actually create a problem for NDT?
In this example, it won't cause any problems for NDT. However, when there is a loop, it needs to be analyzed. I'm doing that.
So maybe this optimization should be disabled by default when the downstream reactor has a physical action?
Yes, it should be disabled when the downstream federate has a physical action. After the discussion with @lhstrh and @hokeun, we decided that each federate reports the presence of physical action to the RTI whenever it wants to join the federation.
Relevant Issue: https://github.com/lf-lang/reactor-c/issues/264 Relevant PR: https://github.com/lf-lang/reactor-c/pull/176
There is a need for saving skipped
ABS
(port absent) messages when we useNDT
to not send redundantABS
messages.ABS
messages? Let's consider the examplePingPongDistributed.lf
. At the tag (0, 10), there is no event, andreaction_1
of 'Ping' should send anABS
message. The trace below is an intended way to handleNDT
. The left side is the RTI, the middle one isPong
, and the left side isPing
.NDT (0, 10)
is received beforePing
advances its tag to(0, 10)
. Thus, it does not skipABS (0, 10)
.However, it's possible to receive
NDT (0, 10)
afterPing
advancing its tag to(0, 10)
. In that case, theport_absent_reaction
ofreaction_1
ofPing
already decided not to send anABS
message. As a result, we have to keep a skippedABS (T)
message until a federate (Ping
) completes the tagT
.ABS
messages not straightforward? It's not difficult to saveABS
messages for the current tag. However, when it comes to a connection with a delay, we have to save the futureABS
messages too. Thus, we have to save the tag, the destination federate, and the destination port of theABS
message. In regards to scalability, it can be a waste of memory.Currently, every
ABS
message is not skipped in the NDT PR even if it's not needed. We can reduceABS
messages too when we solve the problem I explained.