Best way to mix online and offline updates

cajun-rat commented 2 years ago

Offline updates (#8) will allow a single device to get update instructions from two sources:

- The Director
- Someone plugging a USB stick (or whatever) into the device

If only one of these two happens, the behaviour of the system is pretty clear. However, we need to consider what happens in scenarios where the two are interleaved. I assume this to be rare in normal operation, so I think our goal is to introduce this without breaking other things, rather than to support fancy use cases.

I wanted to open this ticket so we can have this "what happens" discussion in the open rather than via email. Here are the considerations I'm aware of so far:

- SotaUptaneClient splits the update into parts like 'Fetch Metadata', 'Download', etc. Those are probably the 'atomic' operations that we need to sequence, so this isn't about locking in SotaUptaneClient, but rather about whether it is OK to fetch the metadata offline, then download / install online. I think the answer to this question is 'no', but there are more marginal cases to consider.
- The Director currently assumes that once it has handed out a Correlation ID, Aktualizr will always report it back in a manifest with either success / failure. How does this work when the user plugs in a USB stick?
- If the user leaves a USB stick plugged in, does that block online updates?
- It seems reasonable that there should be no failure modes of online updates that can indefinitely block an offline update, and vice versa.
- Whatever the rules are, we need to define them succinctly in order to answer questions like "Is this even meant to work at all?"
  - A simple implementation would be less work and less maintenance.
  - We need clear invariants to avoid this being a source of security holes.

I hope we can work out a proposal for what these rules should be in this ticket!

pattivacek commented 2 years ago

Sounds like a good idea! I'll share my first reactions:

> whether it is OK to fetch the metadata offline, then download / install online? I think the answer to this question is 'no', but there are more marginal cases to consider.

I think we should prevent mixing and matching. That sounds like it could open up risky doors and create problems. I'm not sure how best to prevent that, though. I can also see a case for more or less interrupting an online update that's blocking or broken somehow with an offline update that might restore things to a better place.
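One way to picture "no mixing and matching" is a session object that pins an update attempt to whichever source fetched its metadata and refuses later steps from the other source. This is only a minimal sketch: `UpdateSession`, `Source`, and the step names are hypothetical and are not part of libaktualizr or SotaUptaneClient.

```python
from enum import Enum


class Source(Enum):
    ONLINE = "online"
    OFFLINE = "offline"


class MixedSourceError(Exception):
    """A step was attempted from a different source than the one pinned to the session."""


class UpdateSession:
    """Hypothetical guard (not a libaktualizr class): one session, one source."""

    # Assumed step names, loosely following 'Fetch Metadata', 'Download', etc.
    STEPS = ("fetch_metadata", "download", "install")

    def __init__(self):
        self.source = None    # pinned by the first successful step
        self.completed = []   # steps finished so far, in order

    def run_step(self, step, source):
        if len(self.completed) == len(self.STEPS):
            raise ValueError("session already finished")
        expected = self.STEPS[len(self.completed)]
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        if self.source is None:
            self.source = source
        elif source is not self.source:
            raise MixedSourceError(
                f"session started {self.source.value}, refusing {source.value} {step}"
            )
        self.completed.append(step)
```

Under this sketch, interrupting a broken online update with an offline one would mean discarding the session and starting a new one, rather than continuing the old one from a different source.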

> The Director currently assumes that once it has handed out a Correlation ID, Aktualizr will always report it back in a manifest with either success / failure. How does this work when the user plugs in a USB stick?

That I really don't know, but the offline updates don't need a correlation ID, right? What happens if we just leave that out of the equation? Will the Director get particularly upset that the image changed without informing it?

> If the user leaves a USB stick plugged in, does that block online updates?

Probably. How else do we know how to prioritize or trigger the offline updates? That's the deeper question, right?

> It seems reasonable that there should be no failure modes of online updates that can indefinitely block an offline update, and vice versa.

That seems like a good goal and might even be a given.

tkfu commented 2 years ago

The offline updates definitely don't need (or have) a correlation ID. As for director behaviour, I think we should get @simao in this thread (I think GitHub won't let me actually ping him via @ until he joins of his own accord). I think it's technically device registry that keeps track of device state, but state changes are triggered by events emitted by director. Here's the state diagram as I understand it:

```mermaid
stateDiagram-v2
  NotSeen --> UpToDate: device sends manifest
  UpToDate --> Outdated: user assigns an update
  Outdated --> UpdatePending: device downloads director metadata w/ correlationId
  Outdated --> UpToDate: user cancels assigned update
  UpdatePending --> UpToDate: device reports success w/ correlationId
  UpdatePending --> Error: device reports failure w/ correlationId
  Error --> Outdated: user assigns an update
```

(by the way: isn't it cool that github added support for inline mermaid diagrams in markdown??)

Note that both transitions out of UpdatePending can only happen when we get a manifest from aktualizr containing an InstallationReport with correlationId in its custom metadata. (Director throws an error if you attempt to assign an update to a device that is currently Outdated or UpdatePending.)
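For readers who prefer code to diagrams, the same transitions can be written as a lookup table. This is a rough sketch of the diagram as described in this comment, not the device registry's actual code; the event names are made up for illustration.

```python
from enum import Enum


class Status(Enum):
    NOT_SEEN = "NotSeen"
    UP_TO_DATE = "UpToDate"
    OUTDATED = "Outdated"
    UPDATE_PENDING = "UpdatePending"
    ERROR = "Error"


# (current status, event) -> next status, copied from the diagram above.
# Event names are hypothetical labels for the diagram's edge descriptions.
TRANSITIONS = {
    (Status.NOT_SEEN, "manifest_received"): Status.UP_TO_DATE,
    (Status.UP_TO_DATE, "update_assigned"): Status.OUTDATED,
    (Status.OUTDATED, "director_metadata_downloaded"): Status.UPDATE_PENDING,
    (Status.OUTDATED, "update_cancelled"): Status.UP_TO_DATE,
    (Status.UPDATE_PENDING, "success_reported"): Status.UP_TO_DATE,
    (Status.UPDATE_PENDING, "failure_reported"): Status.ERROR,
    (Status.ERROR, "update_assigned"): Status.OUTDATED,
}


def next_status(current, event):
    """Return the next status, or raise if the diagram has no such edge."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise ValueError(f"no transition from {current.value} on {event!r}") from None
```

Because there is no `update_assigned` edge out of `Outdated` or `UpdatePending`, `next_status` raises for those cases, which mirrors the error Director throws.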

On principle, I think that the decision about whether to interrupt an in-progress update with an update of another type should be left to the user of libaktualizr: the calling program should make decisions about the circumstances under which an update should be preempted. But that leaves us to decide how to construct the manifest (or specifically, the installation report) in the case where an online update is interrupted by an offline update.
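Leaving the preemption decision to the caller could look something like a policy callback that the calling program supplies. The sketch below is purely illustrative: `UpdateInfo`, `should_start`, and both example policies are hypothetical names, and libaktualizr does not expose this interface today.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class UpdateInfo:
    source: str  # "online" or "offline"
    stage: str   # assumed stage names, e.g. "fetch_metadata", "download", "install"


# A policy receives (in-progress update, incoming update) and returns
# True if the incoming update may preempt the one in progress.
PreemptPolicy = Callable[[UpdateInfo, UpdateInfo], bool]


def never_preempt(in_progress: UpdateInfo, incoming: UpdateInfo) -> bool:
    return False


def offline_preempts_before_install(in_progress: UpdateInfo, incoming: UpdateInfo) -> bool:
    # Example policy: a USB update wins over an online one that is not yet installing.
    return (
        incoming.source == "offline"
        and in_progress.source == "online"
        and in_progress.stage != "install"
    )


def should_start(in_progress: Optional[UpdateInfo], incoming: UpdateInfo,
                 policy: PreemptPolicy) -> bool:
    """With nothing in progress the update starts; otherwise the caller's policy decides."""
    if in_progress is None:
        return True
    return policy(in_progress, incoming)
```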

Today, an installation report always pertains to a correlationId. If we were to just include an installation report for the failed online update, there are two unsatisfactory options: call it a success, even though it was interrupted, or call it a failure and have the device be in an "Error" state on the server even though it's successfully installed an offline update.

I think in an ideal world, we would also be able to send installation reports for the successful install of an offline update, the next time the device sends a manifest to director (I'd imagine that the offline update installation report would use the name and version of the offline update metadata in lieu of correlationId). That does imply, though, that it would have to be possible to include more than one installation report per manifest. And if we can have multiple installation reports, then we also solve the problem of how to report an interrupted online update: it's just a failure, but there's an offline update installation report that comes after it. Of course, director would also have to learn to accept multiple installation reports in aktualizr manifests.
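To make the multiple-report idea concrete, here is a rough sketch of a manifest carrying an ordered list of installation reports. The field names (`correlation_id`, `offline_update`) and the JSON layout are assumptions for illustration only; they are not the actual aktualizr manifest format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class InstallationReport:
    success: bool
    correlation_id: Optional[str] = None  # set for online updates
    offline_update: Optional[str] = None  # hypothetical "name:version" of the offline update


@dataclass
class Manifest:
    # Reports in the order the installs were attempted (oldest first).
    reports: List[InstallationReport] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


def latest_outcome(manifest: Manifest) -> Optional[bool]:
    """Outcome of the most recent attempt, or None if nothing was attempted."""
    return manifest.reports[-1].success if manifest.reports else None
```

An interrupted online update followed by a successful offline install would then be two entries: a failure with a `correlation_id`, then a success with `offline_update` set.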

simao commented 2 years ago

Yes, we are not sending any correlation IDs in the lockboxes. I don't think that is needed, and we also don't have one at the time the user creates the lockbox.

Regarding the device reporting to director something that was installed offline, director will still process the manifest.

If the device already has some assignment in director (an online update), director will still process the manifest and just update the currently installed ecu target, but keep the device queue intact. If the offline installed update somehow matches the update in the queue (ecuId, filename, checksum), and that assignment in director has a correlation id, then the normal execution for online updates will be followed. If the offline update does not match the online update, then the processing will be the same as when the queue is empty, and the device queue will remain as before.

If the queue for the device is empty, director will just update the current known installed ecu target to the new target reported by device. If the manifest reports an error, this will be published in the bus, and device registry will update the status to error, otherwise the status will be moved to Up To Date.
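The behaviour described in the last two paragraphs can be summarised in a short sketch. This is a reading of the description in this comment, not director's real code: the types and `process_manifest` are hypothetical, and removing a matched assignment from the queue is an assumption about what "normal execution for online updates" entails.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Target:
    ecu_id: str
    filename: str
    checksum: str


@dataclass
class Assignment:
    target: Target
    correlation_id: Optional[str]


@dataclass
class DeviceRecord:
    installed: Dict[str, Target] = field(default_factory=dict)  # ecu_id -> installed target
    queue: List[Assignment] = field(default_factory=list)
    status: str = "UpToDate"


def process_manifest(device: DeviceRecord, reported: Target, success: bool) -> Optional[str]:
    """Apply one manifest; return the correlation id if it matched a queued assignment."""
    # A match means same (ecuId, filename, checksum) and an assignment with a correlation id.
    match = next(
        (a for a in device.queue
         if a.target == reported and a.correlation_id is not None),
        None,
    )
    if match is not None:
        # Assumption: the normal online flow completes the assignment and dequeues it.
        device.queue.remove(match)
    # In every case the known installed target follows what the device reported;
    # a non-matching (e.g. offline) install leaves the queue untouched.
    device.installed[reported.ecu_id] = reported
    device.status = "UpToDate" if success else "Error"
    return match.correlation_id if match is not None else None
```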

tkfu commented 2 years ago

Ok, I think that covers my concern about getting into a non-updateable state: director will just attempt to re-send whatever update was assigned before. Whether that's desirable or not is something that we can discuss separately; it's only a concern for director.

I'm also convinced, after talking to @simao about it, that we probably don't need aktualizr to send us an installation report for each of the updates it attempted to install since the last time it sent a manifest. The most recent one should be sufficient. I do think we should probably define some way for aktualizr to note that it was an offline update, so that we can display that info to the user (and allow director to make a decision about whether to cancel the pending update or not). Something like setting the correlationId to offlineUpdateName:offlineUpdateVersion, or just adding a new field to the installation report with that info.

simao commented 2 years ago

In the case of offline updates, the device will not have a correlation id, but it could build one like urn:here-ota:offline-update:<some-id> (all others are like urn:here-ota:auto-update:<uuid>, urn:here-ota:mtu:<uuid>, etc.), though this needs some backend changes to handle correctly.
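As a small illustration of that URN scheme, the device could build the id and the backend could tell the kinds apart roughly as follows. The helper names are hypothetical, and what goes into `<some-id>` is left open here, just as it is above.

```python
from typing import Optional

PREFIX = "urn:here-ota:"


def offline_correlation_id(update_id: str) -> str:
    """Build an offline-update correlation id from whatever id the device chooses."""
    if not update_id:
        raise ValueError("update_id must be non-empty")
    return f"{PREFIX}offline-update:{update_id}"


def correlation_kind(correlation_id: str) -> Optional[str]:
    """Return the kind segment (e.g. 'mtu', 'offline-update'), or None if malformed."""
    if not correlation_id.startswith(PREFIX):
        return None
    # Split on the first ':' only, so the id part may itself contain colons.
    kind, sep, ident = correlation_id[len(PREFIX):].partition(":")
    if not sep or not ident:
        return None
    return kind


def is_offline(correlation_id: str) -> bool:
    return correlation_kind(correlation_id) == "offline-update"
```

If the id were built from the offline update's name and version, as floated earlier in the thread, `offline_correlation_id("example-update:1.0")` would give `urn:here-ota:offline-update:example-update:1.0`.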

uptane / aktualizr

Best way to mix online and offline updates #74