edwardalee opened this issue 3 years ago
With centralized coordination, we have a potential problem: a federate could block for an unbounded amount of time on a network input after having been granted a PTAG. While we could use a deadline to detect this, the detection would occur only after the network has been repaired, which could be quite late. The first solution proposed here could solve this problem and provide much earlier fault detection. This application suggests that once the fault has been detected, the expected input should henceforth be assumed absent at tag `t`. If a message is later received with tag `t`, the message can be discarded because the fault will have been handled.
This suggests that federated execution provides a clean alternative to the `setjmp`/`longjmp` solution given above. A long-running computation can be allowed to complete without blocking other federates, and once it completes, its outputs will be ignored (though not its state updates). If it never completes, then, as long as all downstream federates use this deadline trick, the fault handler may be repeatedly invoked. If the network gets partitioned, we get the same result. The fault handler is repeatedly invoked until the network is repaired. This seems to me like a very practical solution.
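For instance, a downstream federate might use the deadline trick like this (a sketch; the reactor name, port name, and the 100 msec threshold are all illustrative):

```
reactor Downstream {
    input in: int;
    reaction(in) {=
        // Normal processing of the network input.
    =} deadline(100 msec) {=
        // Fault handler: the network input is excessively late in physical
        // time. Per the proposal, `in` would henceforth be assumed absent at
        // this tag, and a message later received with this tag is discarded.
    =}
}
```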
We met today (@hokeun, @lhstrh, and @edwardalee) and figured out a really nice enhancement to the deadline functionality that also plays beautifully with an extension to support LET. Consider this reactor (apologies, @Soroosh129):
```
reactor Soroush {
    input start: bool;
    output thesis: string;
    reaction(start) -> thesis after 8 weeks {=
        string draft = NULL;
        while (!check_deadline(self)) {
            draft = write_more(draft);  // Keep appending to the draft.
        }
        SET(thesis, draft);
    =} deadline(8 weeks) {=
        info_print("Thesis done!");
    =}
}
```
Notice the `after` keyword, which declares a LET of 8 weeks. The output of this reactor will be produced 8 weeks after the input (in logical time). In addition, there is a deadline of 8 weeks, which means, first, that if the reaction is invoked more than 8 weeks late (in physical time) compared to the logical time of `start`, then the deadline handler is invoked. But that's not the interesting part. The interesting part is the new function `check_deadline(self)`, which checks physical time, and if physical time exceeds logical time by more than the deadline, that function invokes the deadline miss handler and returns true. Otherwise, it returns false.
Hence, the above reactor implements an "anytime" computation. Soroush keeps writing until he runs out of time, then he files a thesis. :-)
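A rough sketch of how `check_deadline` could be implemented in the C runtime; the `deadline` and `deadline_violation_handler` fields on the self struct are assumptions for illustration, not existing reactor-c code:

```
// Hypothetical sketch, not the actual reactor-c implementation.
bool check_deadline(self_base_t* self) {
    // Compare the physical lag against this reaction's declared deadline.
    interval_t lag = get_physical_time() - get_logical_time();
    if (lag > self->deadline) {
        self->deadline_violation_handler(self);  // Invoke the miss handler...
        return true;                             // ...and report the miss.
    }
    return false;
}
```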
This is a very interesting proposal, but for some reason, reading it gives me anxiety :)
> The interesting part is the new function `check_deadline(self)`, which checks physical time, and if physical time exceeds logical time by more than the deadline, that function invokes the deadline miss handler and returns true. Otherwise, it returns false.
I have two questions about this proposal at the moment.

1. What is the logical time when `check_deadline` is called? Is it the logical time at the start of the reaction or whatever logical time the program happens to be at when the function is called?
2. `check_deadline` breaks the mutual exclusivity of a reaction and its deadline handler that we have now. Is this okay for the semantics? In the example above, the main reaction body does not return when `check_deadline` returns true. Instead, it will break out of the while loop and set the output. What if the deadline handler also sets the output?

Ah, yes, these questions remind me of the key difficulty in implementing the LET concept, namely that we need to ensure that any reaction that has a non-zero LET cannot observe the logical time advances occurring around it while it executes. Logical time should never advance during the execution of a reaction, so, conceptually, the `after` keyword applies only to the effects (the output) and is semantically equivalent to using `after` on any connection from those outputs.
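To illustrate the claimed equivalence (the `Consumer` reactor and the wiring below are illustrative, not part of the proposal), the LET could instead appear as an ordinary `after` delay on the connection:

```
reactor Consumer {
    input in: string;
    reaction(in) {=
        // Read the thesis.
    =}
}
main reactor {
    s = new Soroush();  // As above, but without `after` on the reaction.
    c = new Consumer();
    s.thesis -> c.in after 8 weeks;  // The LET realized as a connection delay.
}
```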
As a result, this is actually quite easy to implement such that the semantics is preserved. However, if we want the reaction with LET to execute in parallel with other reactions, and to allow logical time to advance around it during its execution, the implementation is harder. We have to figure out a way to freeze this one reaction's view of the current logical time while, for reactions in other reactors, logical time can advance.
One way to do this would be to change `get_logical_time()` so that it is a method of the reactor. In C, this would be simulated by changing the calling signature to `get_logical_time(self_base_t*)`, where the argument is a pointer to the self struct. The code generator could then set the logical time of the start of a reaction execution on the self struct in the preamble of the reaction body. It would then somehow signal the runtime that logical time can advance up to its LET completion time.
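A minimal sketch of what this could look like; the `frozen_logical_time` field and the preamble assignment are assumptions for illustration, not existing reactor-c code:

```
// The code generator would capture logical time in the reaction preamble:
//     self->frozen_logical_time = current_logical_time;
// and the time query would become a method of the reactor:
instant_t get_logical_time(self_base_t* self) {
    // Return the logical time captured when this reaction started executing,
    // not the runtime's current (possibly advanced) logical time.
    return self->frozen_logical_time;
}
```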
On the question of mutual exclusion, we do not propose to compromise on that. During the LET, this one reactor will be unable to react to any events (inputs, timers, or actions). @lhstrh suggested that the reactor have a property that specifies what to do with such events if they appear. By default, they would be deferred, reassigned a logical time equal to the completion time of the LET (or, if multiple instances of a single event appear, at an advanced microstep). Alternatively, the reactor could specify to drop such events. This is similar to the minimum spacing property of actions and could presumably be implemented in a similar way.
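For comparison, here is a sketch (to the best of my understanding of the C target) of the existing minimum-spacing mechanism on actions, which already distinguishes "defer" and "drop" policies:

```
reactor SpacingExample {
    // A logical action with a 10 msec minimum spacing between events;
    // events that violate the spacing are dropped rather than deferred
    // (the default policy is "defer").
    logical action a(0, 10 msec, "drop"): int;
    reaction(a) {=
        // React to the (possibly thinned) stream of events.
    =}
}
```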
> One way to do this would be to change `get_logical_time()` so that it is a method of the reactor. In C, this would be simulated by changing the calling signature to `get_logical_time(self_base_t*)`, where the argument is a pointer to the self struct.
This sounds like a reasonable solution to me. I think this solution, in combination with the property that the reactor will be unable to react to any events (inputs, timers, or actions) for the duration of the LET, strikes a good balance. Having the option to specify a "drop" or a "defer" policy also sounds good to me.
It seems like, with this proposed solution, removing the barrier synchronization on tag advancement would still be needed to allow for parallelism and preemptive scheduling across tags. I was thinking that, at least for reactor-c, the mechanism implemented in `_lf_global_tag_advancement_barrier` (which prevents the runtime from advancing the tag beyond a certain horizon) could be used for this. The tag advancement barrier mechanism has to be modified to keep a list of tag barriers (each element of the list containing a `horizon` tag and a number of `requestors` for that horizon). This list would have to be sorted by tags in ascending order (`pqueue_t` might be a suitable data structure for this).

Whenever a reaction with a LET is set to be executed, a tag advancement barrier would be raised for the current tag plus the LET of that reaction. Whenever the reaction is done executing, it would remove its barrier (which would decrement the `requestors` for that tag). If an element of the barrier list reaches zero requestors, it would be removed (if it is the head that is removed, the new head, if any, would become the next barrier). Tag advancement could then be attempted much more aggressively, for example after each reaction is done executing.
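A sketch of what this data structure could look like, assuming the `pqueue_t`-based design described above (`find_barrier` and the two barrier functions are hypothetical names):

```
#include <stdlib.h>
#include "pqueue.h"  // Priority queue vendored in reactor-c.

// Illustrative sketch of a multi-entry tag barrier list; the current
// _lf_global_tag_advancement_barrier tracks only a single horizon.
typedef struct tag_barrier_t {
    tag_t horizon;   // Tag beyond which the runtime must not advance.
    int requestors;  // Number of in-flight LET reactions holding this barrier.
} tag_barrier_t;

// Hypothetical helper: find an existing barrier entry with this exact tag.
tag_barrier_t* find_barrier(pqueue_t* barriers, tag_t horizon);

// Raise a barrier at (current tag + LET) before a LET reaction starts.
void raise_let_barrier(pqueue_t* barriers, tag_t horizon) {
    tag_barrier_t* b = find_barrier(barriers, horizon);
    if (b != NULL) {
        b->requestors++;  // Share the existing barrier for this tag.
    } else {
        b = (tag_barrier_t*)malloc(sizeof(tag_barrier_t));
        b->horizon = horizon;
        b->requestors = 1;
        pqueue_insert(barriers, b);  // The queue keeps entries sorted by tag.
    }
}

// Lower the barrier when the LET reaction completes. The effective horizon
// is always the smallest tag remaining at the head of the queue.
void lower_let_barrier(pqueue_t* barriers, tag_t horizon) {
    tag_barrier_t* b = find_barrier(barriers, horizon);
    if (b != NULL && --b->requestors == 0) {
        pqueue_remove(barriers, b);
        free(b);
    }
}
```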
Would it be appropriate to create a discussion to keep track of this conversation and the design related to LET?
> Would it be appropriate to create a discussion to keep track of this conversation and the design?
Absolutely. I think there are actually multiple discussion topics here.
I really like the syntax proposed by @edwardalee above! It's funny how we are going back to reactions consuming logical time, as this is something I implemented accidentally when I first implemented reactors in C++ and didn't fully understand the model yet.
I see the proposed syntax for LET tasks (reactions) as orthogonal to execution details and strategies for relaxing the barrier synchronization. Probably "traditional" LF programs could also benefit from the latter. I have also been thinking about such strategies in the back of my head lately, but nothing concrete has fallen out so far. So it would be great to get a discussion started.
Just trying to summarize the issue of possible confusion with the name of `check_deadline` for users, discussed during today's LF weekly meeting. So far, we discussed some potential options (there could be more) to address the issue:

1. Rename `check_deadline` to something like `check_and_invoke_deadline` (by @Soroosh129).
2. Add a boolean argument to `check_deadline` for whether to invoke the deadline handler, to make the user aware of the side effect (by @lhstrh).

Another shorter name that might help a bit would be `deadline_checkpoint`. If we combine that with the boolean argument, it would become abundantly clear. I like the idea of a boolean argument because then it becomes easy to have two distinct ways of reacting to the two ways the deadline might be violated:

1. The reaction is invoked too late, in which case the deadline handler is invoked.
2. The reaction runs too long, in which case the deadline handler is invoked only if the programmer requests it.
> Another shorter name that might help a bit would be `deadline_checkpoint`. If we combine that with the boolean argument, it would become abundantly clear.
I like the idea of combining renaming and adding the boolean argument!
I think `checkpoint` could perhaps be confused with the concept of application checkpointing, in which the state of the application is saved at a given point.
Another possibility could be to have two separate functions: `check_deadline()` and `check_and_invoke_deadline()`. But @lhstrh's idea of adding a second argument to `check_deadline()` is a lot more compact.
> 1. The reaction is invoked too late, in which case the deadline handler is invoked.
> 2. The reaction runs too long, in which case the deadline handler is invoked only if the programmer requests it.
I think it's important to note that with the current design, 1 and 2 are checked using the same numerical value, covering both a late release and a longer-than-expected execution time for a portion of the current reaction. I think these two concepts represent different things. This is not a problem if the goal of the deadline handler mechanism is to keep the total lag of the program in check (e.g., to keep it below 2 msec) and do something else if the lag exceeds that value. I think the design of `check_deadline()` is useful for this purpose. However, the programmer/designer might want to ensure that a reaction is not late by 2 msec, but they might also want to make sure that a portion of their reaction doesn't take more than 1 msec to execute. In this case, `check_deadline()` would not help as-is.
I don't immediately see a use case where two different deadlines would be needed, but if there is such a use case, it is easily implemented by just checking physical time in the body of the reaction directly against some arbitrary other deadline. So I don't think we really need to add anything to the infrastructure to support this.
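For instance, inside a reaction body (a sketch; `do_expensive_portion` is a hypothetical placeholder and the 1 msec budget is arbitrary):

```
// Check a section-local execution budget directly against physical time,
// independent of the reaction's declared deadline.
instant_t section_start = get_physical_time();
do_expensive_portion();
if (get_physical_time() - section_start > MSEC(1)) {
    // The section exceeded its 1 msec budget; handle the overrun here.
}
```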
> it is easily implemented by just checking physical time in the body of the reaction directly against some arbitrary other deadline
Yes, that makes sense. However, to play the devil's advocate: both the `check_deadline()` function and the deadline handler in LF continue to be a measure to curb the total end-to-end lag in the program, not necessarily what a real-time systems researcher would call a deadline. We've heard this from people in that community a few times.
I added a boolean argument `invoke_deadline_handler` to `check_deadline()` in https://github.com/lf-lang/lingua-franca/pull/976 and https://github.com/lf-lang/reactor-c/pull/46, as I discussed with Edward (@edwardalee). Please let me know if anyone has any suggestions!
The deadline construct in LF is capable of detecting deadline violations only after the fact. If a reaction runs longer than expected, this can be detected by a downstream reactor that receives an input from an output produced by that reaction, but the detection occurs only after the upstream reaction has completed execution. Suppose instead that we want to react as soon as the upstream reaction has exceeded its time budget. One way to do this might be with the following syntax:
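(A sketch; the reactor and port names are illustrative, and `trigger` and the Upstream input are assumed to be driven by the same source.)

```
reactor Upstream {
    input in: bool;
    output out: int;
    reaction(in) -> out {=
        // Long-running computation.
    =}
}
reactor Downstream {
    input trigger: bool;
    input expected: int;  // Connected to Upstream's output.
    reaction(trigger) {=
        // Ordinary reaction body.
    =} deadline(10 msec, expected) {=
        // PANIC: `expected` did not become known (present or absent) in time.
    =}
}
```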
Notice the additional argument to the `deadline` keyword, which is the input port `expected`. The meaning of this could be as follows: let t be the logical time and T be the physical time at which the `trigger` reaction fires. If T > t + 10 msec, then the PANIC body is invoked, as it would be for an ordinary deadline instruction. Otherwise, the ordinary reaction is invoked, but, in addition, a physical timer is set to expire after 10 msec - (T - t). When that physical timer expires, a reaction is put on the reaction queue at the highest priority. When that executes, it checks to see whether the `expected` input is known (absent or present), and if not, it invokes the PANIC reaction. It would also have to check whether logical time has advanced past the one at which the timer was started, in which case there is no need to PANIC.

This seems kind of complicated, but I think it would work, at least in a multithreaded environment that either has multiple cores or has an underlying preemptive priority-driven thread scheduler.
Why is it so complicated? First, I think the detection has to be done in a separate reactor from the one where the violation occurs. Suppose we were to try this:
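(An illustrative guess at the rejected variant: the new deadline form attached to the long-running reaction itself, monitoring its own output.)

```
reactor Upstream {
    input in: bool;
    output out: int;
    reaction(in) -> out {=
        // Long-running computation.
    =} deadline(10 msec, out) {=
        // PANIC: would fire while the reaction above is still running.
    =}
}
```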
In this case, the PANIC reaction body would be invoked concurrently with the Long-running computation, a violation of the mutual exclusion principle in Lingua Franca. To allow this violation, both the Long-running computation and the PANIC reaction bodies would have to use mutexes to access outputs and state variables, which is a lot to expect from programmers and is impossible to enforce.
Instead, in my proposed solution, the Upstream reactor has no deadline clause, and the input that triggers its Long-running computation is simultaneously provided to the `trigger` input of the downstream reactor. As a consequence, this solution is not adversely affected by the barrier synchronization we currently perform when advancing time on each federate. This would only work with multithreaded execution.
Although I think this solution solves some problems, it does not solve all of them. For example, notice that logical time cannot advance until the Long-running computation completes, regardless of how long it takes. To solve that problem requires a more drastic addition to the language, namely a mechanism for aborting reactions while they are executing (e.g., using `setjmp`/`longjmp`). This is drastic: it would require repair of data structures storing any state of the system, and it would still require disciplined programming from users (e.g., to not allocate memory in reactions). If we want to support that, I suggest the following syntax:
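(A sketch; names are illustrative.)

```
reactor Upstream {
    input in: bool;
    output out: int;
    reaction(in) -> out {=
        // Long-running computation.
    =} kill(10 msec) {=
        // PANIC: the computation was aborted via longjmp.
    =}
}
```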
This would use `setjmp`/`longjmp` to abort the Long-running computation after 10 msec and execute the PANIC body. This is dangerous, but with some user discipline, it could work. In this case, I would suggest that all state variables and output port states be cached before starting the Long-running computation and restored if and when the kill triggers. Any outputs that have been produced would be retracted before they trigger downstream reactions, and state updates would be reversed. The keyword `kill` is violent, but not as triggering as `abort` might be, and it's very descriptive of what will happen. But we could choose a less violent keyword, I suppose.

Note that in my suggestion, the `kill 10 msec` clause would still be relating logical time to physical time. That is, the Long-running computation would be killed when physical time hits t + 10 msec, not when it hits T + 10 msec. Thus, this could be viewed as a variant of the `deadline` keyword that references the completion of the reaction execution rather than the start of the execution body.

On reflection, the `kill` keyword seems rather easy to implement (easier than the first proposal), so perhaps we should start with that.