remote-observability problem on state update messages?

a112358132134 commented 2 years ago

With the current API, I understand there are two methods for obtaining the state of a device:

Wait for an appropriate mqtt message, i.e.:
- StateUpdate for DE1 operational states
- ConnectivityChangeNotification for DE1 and scale connectivity states
Request a state update via http GET, i.e.:
- de1/state/state/state
- de1/state/state/substate
- de1/connectivity/mode
- scale/connectivity/mode

In either case, I think there might be a possible race condition (from the consumer's perspective) between the mqtt and http alternatives above. As such I've raised this ticket for discussion.

Example of a problematic situation

The consumer application ('app') requests, via http, a status update (state 'A')
Very shortly afterwards, the DE1 changes states to state 'B', notifies pyDE1, which in turn broadcasts an mqtt message reflecting this
Due to the different stacks, protocols etc. involved with http and mqtt, these ultimately get to the app and are processed in the wrong order, i.e. 'B' is seen before 'A'.
This cannot be identified and corrected for by the app, and hence the app incorrectly believes the state is currently state 'A'

Why this race condition is important

Most of the basic app functions are critical to be in sync with the DE1 (and scale). If a misalignment were to occur, given functionality may not be presented to the user to even be able to trigger a subsequent change of state and have everything realign itself - at least not without restarting one or more software or hardware components involved.

In the worst case example, a dangerous or problematic function (e.g. flush, descale) is in operation, but the consumer app is unable to do anything about it due to the misalignment. Admittedly, this is more of a concern on DE1's without a group head controller.

Mitigation options

App end: In addition to both the expected mqtt processing and explicit interface-driven http requests, poll pyDE1 on a regular basis (e.g. 1 or 2 Hz frequency) for the current state(s) to make sure the app and the backend do not remain out of alignment.
API end: Have all http reponses, or at least those for states, also include the same generic identifying information that mqtt packets do, specifically the event time (side note: if you're doign this, also an opportuntity to include a version)
API end: Have an ad-hoc status request via http, rather than providing the result via the response, instead trigger an ad-hoc StatusUpdate mqtt message to be published to avoid the misalignment. This should be sent from pyDE1 in order, and even if something else happens after that, the mqtt message has the event time field to allow the app to determine if the message is still relevant.

One or more of these options could be implemented.

Thoughts?

jeffsf commented 2 years ago

This isn't a "race condition" but is a classic remote-observability problem when there are multiple channels of information.

Not a bug.

The primary source of state are the MQTT packets. They are all time stamped if you think it is valuable to confirm that they have been received in the order sent.

The queries over HTTP are mainly to be able to capture a good-enough approximation of current state to initialize or resynchronize a client.

Edit: Of the proposed mitigation approaches, sending a set of MQTT updates on request was previously considered and rejected. It was not chosen as it complicates all clients, requiring them to keep track of their current sense of state to determine if a change of state has been reported. Not only is this a burden on client design, but introduces yet another point of potential loss of synchronization.

If you can provide a concrete example where this is truly a problem that impacts the ability of a user to accomplish a meaningful use case, please do so. Otherwise I will close this shortly.

a112358132134 commented 2 years ago

This isn't a "race condition" but is a classic remote-observability problem when there are multiple channels of information.

Apologies for the poor language choice - I've been out of software for a few decades, and struggle to articulate the concepts given how much I've forgotten.

The primary source of state are the MQTT packets. They are all time stamped if you think it is valuable to confirm that they have been received in the order sent.

Yes, I do verify the timestamps on all incoming mqtt packets. On one occasion, during rapid and repeated state-change stress testing, I did observe two mqtt packets that were processed out-of-order due to the vagaries of thread scheduling. This was gracefully dealt with by the time check logic; the older data was inserted into the historic record at the appropriate point to flesh out the datasets for graphing, but the app's view of the current state was not changed.

The queries over HTTP are mainly to be able to capture a good-enough approximation of current state to initialize or resynchronize a client. If you can provide a concrete example where this is truly a problem that impacts the ability of a user to accomplish a meaningful use case, please do so. Otherwise I will close this shortly.

The specific case I'm worried about is an unsafe function (water, flush, espresso, steam) being activated physically, but the user being unable to stop it due to a desync. This is not a concern for a DE1 with a Group Head Controller, but I don't have one of those.

As an aside, I have had such misalignments with the official decent app on several occasions over the last couple of years, most recently with a Flush function this morning that I could not stop in the app.

I have seen one state misalignment in my app with pyDE1 through testing (app thought the DE1 was awake, but it was actually asleep and hence couldn't be woken without outside intervention). Unfortunately I was only logging exceptions or odd cases with the incoming mqtt packets, so I can't precisely tell if mqtt vs http packet timing misalignment was the root cause, although it was most likely an error in the state logic in my app. Regardless, it led me to review how I was utilising pyDE1's API which resulted in this concern being raised.

On the possible mitigations I listed:

Infrequent http polling by app - With further thought, I now think this is mandatory regardless if any other mitigations are done. This is because the mqtt QoS policy is set at 0 [aside: I agree with this QoS choice]. As such, there is a (very remote) possibility that a state update is missed, which for safety reasons, needs a backstop. However, technically this actually increases the possibility of an mqtt/http reponse misalignment, if still very rare.
Ad-hoc requests returned via mqtt - Agree with your comments that this is not a great solution, but I felt I should put it forward regardless as it looked lower impact to pyDE1.

I'll leave the rest to your judgement.

jeffsf commented 2 years ago

The DE1 appears to send out a StateUpdate on connection, so pyDE1 should have current information nearly immediately after connection.

Issues with the de1app should be filed with their GitHub repo.

A future commit will enable the checks for sending an Idle request to be overridden through config.

jeffsf / pyDE1