Message protocol enhancements for scheduled execution

glopesdev commented 6 months ago

[x] Proposed
[x] Prototype: https://github.com/harp-tech/atxmega/commit/fbe5f26e51bedeaa42d0473bf2168a78dd7f2d7e
[ ] Implementation: Not Started
[ ] Specification

Summary

Interpret timestamped messages from controller to device as scheduled commands, where the timestamp represents the absolute device time at which the command should be executed. Cancellation is implemented by sending a copy of the scheduled command with a new bit flag in the MessageType field indicating a cancellation command.

Motivation

Time is at the center of Harp protocol communication. All messages from device to controller are timestamped (including command replies) to allow precise logging of all device state changes. The protocol design was intended to be symmetrical, so that all messages from controller to device use the exact same message structure, but there is currently no interpretation of what timestamped commands from controller to device mean.

We propose that these messages should be interpreted as scheduled commands, i.e. commands to be stored and executed in the future, at the time specified in the message timestamp. This proposal aims to address all currently identified edge cases, including late arrival of scheduled messages, changes to the device clock introduced by the synchronization protocol, and support for cancellation.

Detailed Design

Our approach to design was to leverage the existing standards to the utmost, by minimizing the number of required additions to both the binary protocol and common registers. The current design requires no new registers and no changes to the binary protocol other than a new bit flag in the MessageType field to support cancellation.

We outline the design by way of examples, described in detail below, with a discussion of how they work and examples of interaction.

Scheduled Write

A scheduled write is simply a timestamped Write message. The timestamp is to be interpreted in the reference frame of the device, the same as all messages from device to controller.

Upon reception of the command, the device shall store the message and wait for the specified time to be reached before executing the command. If upon reception the timestamp is already behind the device clock (or not specified), the command shall be executed as fast as possible. Replies to scheduled write commands are the same as regular commands, echoing the payload written to the register at the time of command execution.

Scheduled Read

A scheduled read is also simply a timestamped Read message. Upon reception of the command, the device shall store the message and sample the specified register at the specified time. Replies shall be the same as regular reads , containing the register value payload and timestamp of when the register was sampled.

Cancellation

Cancellation is supported simply by sending the exact same scheduled message payload with a new bit flag in the MessageType field. We propose this to be a stored in the 5th least significant bit (0x10).

We assume the pair (address, timestamp) is unique, i.e. that at most one scheduled command is allowed for a single register for the same time slot, so the binary format of a scheduled write message would be enough to find and retrieve which scheduled command to cancel.

Design Meetings

https://github.com/orgs/harp-tech/discussions/25

Related Issues

bruno-f-cruz commented 6 months ago

We should probably also consider meta-infrastructure around this functionality to be added to the core. For instance:

Is the board able to schedule messages/how many
Track number of scheduled messages
Dump scheduled messages
Ability to cancel all scheduled messages

bruno-f-cruz commented 6 months ago

Also referencing @Poofjunior suggested spec. She might want to revise it, but dropping the link here for reference.

https://github.com/orgs/harp-tech/discussions/14

Poofjunior commented 6 months ago

Overall this looks super solid. The ability to cancel looks good too.

Questions:

How do we indicate if we cannot schedule a message? We need to:

discriminate which message we are replying in error with. This includes discriminating from multiple scheduled messages written to the same register
consider whether it's appropriate to block the PC-to-device transaction for an unbounded amount of time to wait for success/failure to schedule/execute the message.
A couple options:
- Immediately reply with a WriteError with the timestamp containing the scheduled timestamp.
- Immediately reply with a WriteError with the Cancelled bit set and with the timestamp containing the scheduled timestamp.
- raise an Event saying that we have run out of space.

Is there any way for the PC to check how many bytes remain for scheduled messages? Possible implementations:

an Event that issues when the bytes remaining is below a certain threshold but has not run out of space
a dedicated core register from the unused block (address: 16-31) that tracks how many remaining bytes we have left.

General Remarks

If we're reserving bits in the MessageType to "subtype" them, in general, I recommend we do them starting from the most-significant-bit first and working downwards. That way, we have plenty of bit space for more types of messages.

glopesdev commented 6 months ago

@bruno-f-cruz @Poofjunior thanks for the feedback. Indeed this proposal is an attempt to consolidate the original discussion thread and notes from our meeting a few months ago. I think if we polish this proposal until this next meeting we can have something ready for full review and potentially approved for inclusion, which would be amazing.

How do we indicate if we cannot schedule a message?

Immediately reply with a WriteError with the Cancelled bit set and with the timestamp containing the scheduled timestamp.

This is definitely my preferred option and I will update the specification with this. My only slight clarification is that I would return the exact message payload, which could be either a Write or Read message, i.e. if the command cannot be scheduled we would just set the Cancelled and Error flags and return the payload back to the sender.

The big advantage of this approach is that it gives us the possibility to easily introduce special operators to filter and handle scheduling errors separately from other errors.

For this to be unambiguous with the rest of the protocol semantics, we should also require that cancellation requests should never return an error, i.e. they are best-effort and idempotent (e.g. cancellation requests are ignored if the request has already been executed or cancelled). This is usually how it is implemented in most async frameworks anyway to prevent tying up complex async logic in knots and race conditions over cancellation. If we agree to make this assumption then we would know that an error message with the Cancelled flag HIGH always means a scheduling error.

I would add a final clarification that making a scheduled Read or Write to an invalid register must return a full error with the Cancelled flag LOW. My argument is that this failure mode constitutes a lower-level invalid register access error so it doesn't even get to be classed as a scheduled request. In fact, I would extend this to invalid payload types and in general to any validation checks on messages. It would be much more useful to fail these early and make sure we don't wait for commands that are doomed, e.g. I am imagining the frustration of waiting 1h for a register write that then just fails with the wrong numeric range. However, I am happy to discuss / relax this if early validation would be too costly or restrictive to implement.

Is there any way for the PC to check how many bytes remain for scheduled messages?

In the current proposal this is not possible. This was a deliberate design decision as I have a couple of reservations re. exposing any kind of capacity counters:

I can imagine implementations of scheduling that do not rely on fixed size queues, e.g. when creating a device in software we can easily keep a dynamic data structure that grows with requests up to RAM size. We could of course hack and return some kind of "infinity" value but it feels somehow that we are placing too many assumptions on how the schedulers are expected to work;
In addition to constraints on the device implementation, I am even more concerned that this might lead to misleading assumptions on how clients should work with Harp devices supporting scheduling, by having clients that try to "guess" ahead of time whether they can schedule messages or not. I actually feel this will lead not just to unnecessarily more complicated logic but also great potential for race conditions and unexpected behavior, since we will not be able to guarantee "atomicity", i.e. even if you checked there is free space now, that gives you no guarantees that there will be space when you actually make the request. I feel it is actually better in this case to just be reactive and ask for forgiveness when a scheduling error occurs (if necessary). If preemptive scheduling of multiple messages is required, then I would have client implementors simply consult the device datasheet and devise an appropriate throttling strategy.

Given the above my inclination would be to try first a couple of implementations using this purely abstract proposal and see whether in practice we really need to know about capacity. We can also always add device-specific registers with this functionality if necessary.

If we're reserving bits in the MessageType to "subtype" them, in general, I recommend we do them starting from the most-significant-bit first and working downwards. That way, we have plenty of bit space for more types of messages.

This is a good idea, and I am happy to discuss a general strategy for this, especially since there is another bit flag being considered for addition (i.e. the sync flag), and for this particular proposal there is no specific requirement as to where in the message type field the Cancelled flag is placed.

glopesdev commented 6 months ago

P.S.: This discussion somehow also made me think of a potential use case for scheduled Event messages. Basically this could be useful in the context of Harp hubs / routers, where events coming from other boards might be delayed / scheduled for broadcasting. A stretch, but potentially useful in the long-run.

glopesdev commented 6 months ago

Feedback from SRM meeting:

Application-specific write errors should be returned at the time of executing the command (e.g. invalid data)
Need to test all edge cases, think about escape hatches in the spec to allow for partial implementations
Consider devices that do not support cancellation
Emergency stop to cancel all scheduled messages
- Use a bit in OP_CTRL to do this:
- would allow clearing the queue while keeping device inactive or active

glopesdev commented 3 months ago

Emergency stop to cancel all scheduled messages

Use a bit in OP_CTRL to do this:

would allow clearing the queue while keeping device inactive or active

This suggestion is relevant to the discussion in #8.

glopesdev commented 3 months ago

@filcarv has developed a first prototype of what this enhancement might look like so we can get a better feel for whether the specs are workable. A few considerations which came out from this first attempt:

Scheduling failure

One clarification that was needed on the above specification was to decide which timestamp to send on the reply if there is a failure to schedule a message. To make the reply more informative for debugging, and for consistency with other message replies, we opted for returning the timestamp of the moment the reply is sent, but we can review this later if there are better options.

There was also a discussion on whether to set the Error flag when a message cannot be scheduled by the device. There are two possible interpretations:

If a message cannot be scheduled, this is an application-critical error at the same level as writing the wrong payload, or writing to the wrong register. In this case the Error flag should be set.
Alternatively, if a message cannot be scheduled, we can interpret this as automatic cancellation of the command by the device, and the application is notified as if the message had been cancelled. In this case the Error flag would not need to be set.

The problem with the former is that currently by default any unhandled errors immediately terminate the connection to the device and throw an application-level exception. This means that to be able to handle scheduling failures, we would need to set the IgnoreErrors flag in every device node dealing with scheduled message execution, and also ensure we perform error handling of all other errors coming from the device, which would be very impractical for the majority of applications.

For these reasons in the first prototype we decided to go for the latter interpretation and not set the Error flag if scheduling fails.

Cancellation bit

If we're reserving bits in the MessageType to "subtype" them, in general, I recommend we do them starting from the most-significant-bit first and working downwards. That way, we have plenty of bit space for more types of messages.

While this would have been a good idea in the early design of the protocol, we are unfortunately already compromised here since the Error flag itself is placed in the 4th least significant bit (0x08) so any hope of expanding the MessageType enum to adjacent bits would not be currently workable without a breaking change. For this reason we went for the original proposal and placed the Cancellation flag in the next bit (0x10). This way we could at least in principle store another short enum in the most significant bits.

For future reference, we do have one "free" type category since message type 0x0 is currently undefined.

High-level interface

To make working with both the cancellation and error flags easier in the high-level interface, a new enum MessageFlags was introduced, defined as follows:

[Flags]
public enum MessageFlags : byte
{
    Error = 0x08,
    Cancellation = 0x10
}

Access to the message flags can then be exposed as a property:

public MessageFlags MessageFlags => (MessageFlags)(MessageBytes[0] & 0x18);

Finally, overloads were added to Format and Harp message constructors to make it easy to set arbitrary message flags from existing message payloads, e.g.:

static HarpMessage FromPayload(int address, MessageType messageType, MessageFlags messageFlags, PayloadType payloadType, params byte[] payload)

glopesdev commented 3 months ago

A relevant edge-case raised by @bruno-f-cruz is to allow for possibly overriding a scheduled command payload, by resending the scheduled timestamped command after it is already scheduled. At the very least we need to decide what to do when the host sends the exact same scheduling key (address, timestamp) to the device. The options seem to be:

override existing payload (and possibly cancel existing one?)
cancel the new command by returning the payload with cancellation flag set
return an error

harp-tech / protocol