lf-lang / reactor-c

A reactor runtime written in C

Safely skipping redundant port absent messages in the centralized federated execution #295

Open byeonggiljun opened 11 months ago

byeonggiljun commented 11 months ago

Relevant Issue: https://github.com/lf-lang/reactor-c/issues/264 Relevant PR: https://github.com/lf-lang/reactor-c/pull/176

To avoid sending redundant ABS (port absent) messages when we use NDT, we need to keep track of the ABS messages that were skipped.

However, it's possible to receive NDT (0, 10) after Ping has already advanced its tag to (0, 10). In that case, the port_absent_reaction of reaction_1 of Ping has already decided not to send an ABS message. As a result, we have to keep a skipped ABS(T) message until the federate (Ping) completes tag T.

Currently, the NDT PR does not skip any ABS messages, even those that are not needed. Once the problem described above is solved, we can reduce the number of ABS messages as well.
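
A minimal sketch of what keeping a skipped ABS message could look like (the type and function names here are hypothetical, not the actual reactor-c data structures): a per-port record holds the tag of the skipped ABS and is only cleared once the federate has completed that tag.

```c
#include <stdbool.h>
#include <stdint.h>

// Simplified stand-in for reactor-c's tag type (hypothetical, for illustration).
typedef struct {
    int64_t time;       // Logical time in nanoseconds.
    uint32_t microstep; // Microstep within the logical time.
} tag_t;

// Per-output-port record of an ABS message that was skipped because of an NDT.
typedef struct {
    bool has_pending_abs; // True if an ABS for `pending_abs_tag` was skipped.
    tag_t pending_abs_tag;
} skipped_abs_record_t;

static int compare_tags(tag_t a, tag_t b) {
    if (a.time != b.time) return (a.time < b.time) ? -1 : 1;
    if (a.microstep != b.microstep) return (a.microstep < b.microstep) ? -1 : 1;
    return 0;
}

// Called when the port-absent reaction decides to skip sending ABS at `tag`.
void record_skipped_abs(skipped_abs_record_t* rec, tag_t tag) {
    rec->has_pending_abs = true;
    rec->pending_abs_tag = tag;
}

// Called when the federate completes tag `completed`. A skipped ABS(T) must be
// kept around until tag T is completed, in case a later NDT makes it needed.
void on_tag_completed(skipped_abs_record_t* rec, tag_t completed) {
    if (rec->has_pending_abs && compare_tags(rec->pending_abs_tag, completed) <= 0) {
        // The tag of the skipped ABS has been completed; it can be safely discarded.
        rec->has_pending_abs = false;
    }
}
```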

edwardalee commented 11 months ago

I guess this is the federated/PingPongDistributed.lf test case. In this case, I'm guessing that Pong sends NET(FOREVER), which causes an earlier NDT(FOREVER) to be sent to Ping. This is why at (0,10) it would choose to not send ABS. Normally, this would not be a problem, but in this case, Ping is also calling lf_request_stop(). Assuming the stop time will be (0,10), this results in Pong wanting to execute its shutdown reaction at (0,10). But it can't do this until it receives either an ABS at (0,10) from Ping or a TAG(0,10) (which it won't get because Ping can't complete tag (0,10) until it hears from Pong at this tag).

The real problem here, I think, is that lf_request_stop() invalidates the original contract of the NET(FOREVER) message. Would it work to simply disable the optimization when lf_request_stop() is called?

In a more general scenario, lf_request_stop() might be called by some other reactor somewhere else in the system, which again will trigger this problem. But in this case, each federate should receive a message that stop is being requested, and it replies with a tag at which it is able to stop. Each such federate should disable this optimization at that tag.
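
As a sketch of the per-federate disabling suggested above (all names are hypothetical; the actual stop-request handshake in reactor-c is more involved), the federate could record the tag at which it agreed to stop and refuse to skip ABS messages at or beyond that tag:

```c
#include <stdbool.h>
#include <stdint.h>

// Simplified stand-in for reactor-c's tag type (hypothetical, for illustration).
typedef struct { int64_t time; uint32_t microstep; } tag_t;

static int compare_tags(tag_t a, tag_t b) {
    if (a.time != b.time) return (a.time < b.time) ? -1 : 1;
    if (a.microstep != b.microstep) return (a.microstep < b.microstep) ? -1 : 1;
    return 0;
}

// Hypothetical per-federate state for the ABS-skipping optimization.
typedef struct {
    bool stop_negotiated; // True once this federate has replied to a stop request.
    tag_t stop_tag;       // The tag at which it said it is able to stop.
} abs_skip_state_t;

// Called when the federate replies to a stop request with the tag at which it can stop.
void on_stop_reply_sent(abs_skip_state_t* state, tag_t stop_tag) {
    state->stop_negotiated = true;
    state->stop_tag = stop_tag;
}

// Decide whether an ABS message at `tag` may still be skipped. Once a stop has
// been negotiated, the earlier NET-based promise no longer holds at or beyond
// the stop tag, so the ABS must be sent.
bool may_skip_abs(const abs_skip_state_t* state, tag_t tag) {
    if (state->stop_negotiated && compare_tags(tag, state->stop_tag) >= 0) {
        return false;
    }
    return true; // Otherwise the usual NDT-based skipping rule applies.
}
```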

lhstrh commented 11 months ago

Could a NET message later be invalidated by a physical action happening? Or would the presence of a physical action cause the NET to be bounded by the earliest possible occurrence of that physical action?

edwardalee commented 11 months ago

I think a physical action in an upstream federate prevents the NET(FOREVER) from being sent (it also results in a warning when using centralized control and a hint to set a parameter to control the frequency of interactions).

byeonggiljun commented 11 months ago

I think a physical action in an upstream federate doesn't cause the issue, but a physical action in a downstream federate can invalidate a NET message.

I'm trying to figure out if there is another situation that causes invalidated NET messages.

edwardalee commented 11 months ago

A NET(g) message from reactor R is a promise that, absent any network inputs, R will not produce an output with a tag less than g.

I don't see how anything downstream of R, including any physical action, could possibly invalidate this.

Note that there is nothing special about g = FOREVER, so FOREVER should not be treated as a special case.

A message from upstream could result in a new NET(g') from R where g' < g. This is why the RTI calculates a transitive upstream NET before granting a TAG. The "transitive" aspect of this effectively calculates the least g' that might later be sent by R.
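
A possible way to state what needs to be computed (the notation here is introduced only for illustration, and it conservatively assumes an upstream reactor could forward an input to an output with zero internal delay): the earliest tag time at which a message could still arrive at R is the fixed point of

$$\mathit{earliest}(R) \;=\; \min_{R' \in \mathrm{upstream}(R)} \Big( \min\big(\mathrm{NET}(R'),\ \mathit{earliest}(R')\big) + d(R' \to R) \Big),$$

where $d(R' \to R)$ is the after delay on the connection from $R'$ to $R$.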

However, I discovered today that we are not calculating the transitive upstream NET correctly when there are after delays on the connections. The problem is that the algorithm uses a "visited" flag to avoid examining an upstream reactor more than once. This is wrong because the least g' due to an upstream reactor R' depends not only on the NET of R', but also on the path delays from R' to R. The current algorithm with the "visited" flag effectively assumes that all paths have the same delay.

I think this algorithm needs to be updated to use dynamic programming.
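
A minimal sketch of such a dynamic-programming (min-plus relaxation) computation, assuming a hypothetical adjacency-matrix representation of the federation in the RTI rather than the actual reactor-c data structures:

```c
#include <stdint.h>

#define MAX_FEDERATES 16

// Hypothetical RTI-side data (not the actual reactor-c structures):
// delay[i][j] is the after delay (in ns) on the connection from federate i to
// federate j, or -1 if there is no direct connection.
int64_t delay[MAX_FEDERATES][MAX_FEDERATES];
// Most recent NET time reported by each federate (logical time only, for simplicity).
int64_t net_time[MAX_FEDERATES];

// Saturating addition so FOREVER-like values do not overflow.
static int64_t add_sat(int64_t a, int64_t b) {
    return (a > INT64_MAX - b) ? INT64_MAX : a + b;
}

// For every federate, compute the earliest logical time at which a message could
// still arrive from any transitively upstream federate, accounting for the sum of
// after delays along each path. This is a Bellman-Ford-style min-plus relaxation,
// so a federate reachable along two paths with different delays is handled
// correctly, unlike a single "visited" flag. It is conservative in that it assumes
// an upstream federate can forward anything it receives with zero internal delay.
void compute_transitive_upstream_net(int n, int64_t earliest_incoming[]) {
    for (int i = 0; i < n; i++) earliest_incoming[i] = INT64_MAX; // No constraint yet.

    // With nonnegative delays, n-1 relaxation passes reach the fixed point.
    for (int pass = 0; pass < n - 1; pass++) {
        for (int src = 0; src < n; src++) {
            for (int dst = 0; dst < n; dst++) {
                if (delay[src][dst] < 0) continue; // No connection src -> dst.
                // Earliest time at which src could still produce an output:
                // its own NET, or anything it might forward from its own inputs.
                int64_t src_bound = net_time[src];
                if (earliest_incoming[src] < src_bound) src_bound = earliest_incoming[src];
                int64_t candidate = add_sat(src_bound, delay[src][dst]);
                if (candidate < earliest_incoming[dst]) earliest_incoming[dst] = candidate;
            }
        }
    }
}
```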

I think this is the cause for a bug report I recently submitted involving a fairly complex federated program where I couldn't find a way to trigger the bug with a simpler test.

byeonggiljun commented 11 months ago

A NET(g) message from reactor R is a promise that, absent any network inputs, R will not produce an output with a tag less than g.

In the example below (SparseSenderPhysical), there is a physical action that triggers reaction_3 in Receiver; a user input invokes the physical action. The trace below shows that a NET can be invalidated in the current Lingua Franca.

In the trace, the left column is the RTI, the middle is Receiver, and the right is Sender. I used the 'master' and 'main' branches of LF and reactor-c to get this trace. After Receiver sends NET (2 sec, 0) (2 sec is the timeout), the physical action schedules reaction_3 at (1,177,811,874 nsec, 0). This invalidates NET (2 sec, 0) by causing NET (1,177,811,874 nsec, 0) to be sent.

Should NET (2 sec, 0) not be sent when there is a physical action?

edwardalee commented 11 months ago

Very interesting. Technically, when Receiver sends NET(2 sec, 0), it does not violate the contract as stated, because it produces no outputs at all. It may be that we need to strengthen the contract, but this could have other effects.

Does this actually create a problem for NDT? When Sender receives NDT(1.71 sec, 0), it will be in one of two states:

  1. Its last completed time (the time of the top-level enclave) is <= 1.71 sec. In this case, it can just update its threshold for skipping NET messages. It should also send a NET(1.8 sec, 0) if it has completed the 1.7 sec tag (i.e., it is idle and the next event on its event queue has tag (1.8 sec, 0) >= 1.71 sec). I think the rule should be something like "send a NET(g1) upon receiving NDT(g2) if current tag g3 < g2 and g1 > g2" (with enclaves, g1 would have to be the transitive next event tag); see the sketch below.

  2. Its last completed time (the time of the top-level enclave) is > 1.71 sec. In this case, it could just respond with an LTC(g), where g is its last completed tag. Alternatively, if it is idle, it could respond with NET(g'), where g' is the tag of the earliest event on its event queue (with enclaves, this would have to be the transitive next event tag). This should then suffice for the RTI to grant a TAG(1.71 sec, 0) to Receiver.

If there is an after delay D on the connection, then the threshold used at Sender should be 1.7 - D.
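
A sketch of this NDT-handling rule at the Sender, folding in the after-delay adjustment (the tag type, helper names, and send stubs here are hypothetical simplifications, not the reactor-c API):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

// Simplified stand-in for reactor-c's tag type (hypothetical, for illustration).
typedef struct { int64_t time; uint32_t microstep; } tag_t;

static int compare_tags(tag_t a, tag_t b) {
    if (a.time != b.time) return (a.time < b.time) ? -1 : 1;
    if (a.microstep != b.microstep) return (a.microstep < b.microstep) ? -1 : 1;
    return 0;
}

// Stubs standing in for the actual network messages to the RTI.
static void send_net(tag_t g) { printf("NET(%lld, %u)\n", (long long)g.time, (unsigned)g.microstep); }
static void send_ltc(tag_t g) { printf("LTC(%lld, %u)\n", (long long)g.time, (unsigned)g.microstep); }

// Handle an incoming NDT with tag `ndt_tag` on a connection whose after delay is
// `after_delay_ns`. `completed` is the last completed tag of the top-level enclave;
// `next_event` is the tag of the earliest event on the event queue (the transitive
// next event tag if there are enclaves); `is_idle` indicates whether the federate
// is currently idle.
void handle_ndt(tag_t ndt_tag, int64_t after_delay_ns,
                tag_t completed, tag_t next_event, bool is_idle) {
    // The threshold at the Sender is the NDT tag shifted earlier by the after delay.
    tag_t threshold = ndt_tag;
    threshold.time -= after_delay_ns;

    if (compare_tags(completed, threshold) <= 0) {
        // Case 1: the threshold tag has not been completed yet. (Updating the stored
        // skipping threshold is omitted here.) If idle with a next event beyond the
        // threshold, report it so the RTI can eventually grant a TAG downstream.
        if (is_idle && compare_tags(next_event, threshold) > 0) {
            send_net(next_event);
        }
    } else {
        // Case 2: a tag beyond the threshold has already been completed. Reporting
        // the last completed tag (or, if idle, the next event tag) lets the RTI
        // grant the downstream federate its TAG.
        if (is_idle) {
            send_net(next_event);
        } else {
            send_ltc(completed);
        }
    }
}
```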

Note that this will slow down the reaction to the physical action in Receiver. And an after delay won't help in this case, thereby nullifying the CAL theorem advantage of relaxing consistency. Consider for example the ADAS case. Slowing down the reaction at Receiver could be a really bad thing to do. So maybe this optimization should be disabled by default when the downstream reactor has a physical action?

Note that all these corner cases suggest we need a good formalism for reasoning about them. Ideally, we could prove theorems. This will be tricky because of the multiplicity of timelines. I would suggest first assuming perfectly synchronized clocks. Then all logical time relations become constraints on physical time relations with just one physical timeline.

Also, I'm seeing real potential for a nice Ph.D. thesis here... :-)

byeonggiljun commented 11 months ago

Thank you for sharing your insights, professor!

Does this actually create a problem for NDT?

In this example, it won't cause any problems for NDT. However, when there is a loop, the situation needs further analysis; I'm working on that.

So maybe this optimization should be disabled by default when the downstream reactor has a physical action?

Yes, it should be disabled when the downstream federate has a physical action. After discussing with @lhstrh and @hokeun, we decided that each federate will report the presence of a physical action to the RTI when it joins the federation.
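
A minimal sketch of what that report might look like (the message layout and names here are hypothetical, not the actual reactor-c join handshake):

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_FEDERATES 16

// Hypothetical extension of the join/handshake information a federate sends
// to the RTI (not the actual reactor-c message format).
typedef struct {
    uint16_t federate_id;
    bool has_physical_action; // True if this federate contains a physical action.
} federate_join_info_t;

// RTI-side view of which federates reported a physical action.
static bool physical_action_present[MAX_FEDERATES];

// Record the flag when a federate joins.
void rti_handle_join(const federate_join_info_t* info) {
    physical_action_present[info->federate_id] = info->has_physical_action;
}

// The NDT-based ABS-skipping optimization is only used toward a downstream
// federate that has no physical action (otherwise its NET can move earlier).
bool rti_may_use_ndt_optimization(uint16_t downstream_id) {
    return !physical_action_present[downstream_id];
}
```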