Improve backend-firmware clock synchronization

ethanjli commented 2 years ago

Currently it's unclear whether physical (wall clock) time synchronization will work past 40 days of ventilator unit uptime without system restarts. Additionally, clock synchronization is only implemented for the event log, and it's not clear whether this implementation can be generalized so that clocks are synchronized consistently across different messages or message pairs. It may be a better design to have an explicit request/response pair for system-level clock synchronization rather than message-level clock synchronization.

The current clock synchronization implementation uses an improvised algorithm, which is probably good enough but not necessarily as accurate as it could be. It may be better to use a published (but still simple) clock synchronization algorithm, such as the one used in PTP (see further discussion here), with the firmware acting as the time server (since it needs to produce messages with timestamps before it can receive the date from the backend, and since any messages it sends with timestamps are effectively T1 broadcasts).

The synchronization error of these algorithms is equal to half the difference between the delay in sending to the remote peer and the delay in receiving from the remote peer, so the upper bound on the error in time offset estimation is half of the round-trip time between the peers. If we put these messages on the Event Synchronization protocol, that delay is variable (depending on the number of other active events to be sent) and not necessarily symmetric. In the best case, where no other events are active on either peer, the round-trip time may range from 0 ms to 60 ms (for an upper bound on error of +/- 30 ms, which is good enough for us). In challenging conditions with many simultaneously active events, the delay may be 1 s or even higher (which is really bad), so we should probably ignore/cancel synchronization attempts where the round-trip time is greater than some threshold (e.g. 30 ms, if we find that we don't need to ignore too many synchronization attempts). If this isn't an option, then we can't layer Clock Synchronization over State Synchronization; instead, we'd need to generate the clock synchronization timestamps right before the state gets sent by State Synchronization.
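To make the error bound concrete, here is a rough sketch of a PTP-style two-way offset estimate with a round-trip-time cutoff. The `Exchange` type, the function names, and the 30 ms default are all assumptions for illustration, not existing code in this repo; timestamps are assumed to be in milliseconds, with `t1`/`t4` read from the firmware clock and `t2`/`t3` from the backend clock.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Exchange:
    """Four timestamps (ms) from one hypothetical request/response pair."""
    t1: float  # firmware clock: firmware sends its timestamped message
    t2: float  # backend clock: backend receives that message
    t3: float  # backend clock: backend sends its reply
    t4: float  # firmware clock: firmware receives the reply


def round_trip_time(x: Exchange) -> float:
    """Total time spent in transit in both directions.

    The unknown clock offset cancels out because each clock appears once
    with each sign.
    """
    return (x.t2 - x.t1) + (x.t4 - x.t3)


def estimate_offset(x: Exchange, max_rtt_ms: float = 30.0) -> Optional[float]:
    """Estimate (backend clock - firmware clock), or None if the attempt is rejected.

    The estimate is exact only when both directions have equal delay; its
    error is half the difference between the two delays, so it can never
    exceed half the round-trip time.
    """
    rtt = round_trip_time(x)
    if rtt < 0 or rtt > max_rtt_ms:
        return None  # too slow (or inconsistent) to trust; ignore this attempt
    return ((x.t2 - x.t1) - (x.t4 - x.t3)) / 2


# Example: true offset is 1_000_000 ms, 10 ms to the backend, 20 ms back.
good = Exchange(t1=5000.0, t2=1_005_010.0, t3=1_005_012.0, t4=5032.0)
# Example: same offset, but 500 ms of queueing delay in each direction.
slow = Exchange(t1=5000.0, t2=1_005_500.0, t3=1_005_502.0, t4=6002.0)
```

In the first example the estimate is off by 5 ms, which is half of the 10 ms difference between the two one-way delays. The second example is rejected outright because its round-trip time is far above the threshold.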

Due to the way State Synchronization and Event Synchronization work, on startup the backend will receive messages with timestamps before it has the actual time offset from the firmware. The backend could initialize the delay to some estimate (e.g. 30 ms) in order to calculate an initial inaccurate offset, and then once it has properly measured the delay via an additional request/response pair, it could improve its estimate of the delay and thus of the time offset.
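A minimal sketch of that two-stage startup, assuming millisecond timestamps and a hypothetical `OffsetEstimator` class in the backend (none of these names exist in the codebase):

```python
from typing import Optional


class OffsetEstimator:
    """Tracks the estimated (backend clock - firmware clock) offset in ms."""

    def __init__(self, assumed_delay_ms: float = 30.0) -> None:
        # Until a real measurement arrives, fall back to a guessed one-way delay.
        self.delay_ms = assumed_delay_ms
        self.offset_ms: Optional[float] = None

    def on_firmware_message(self, fw_send_ms: float, backend_recv_ms: float) -> float:
        """Compute an offset from a one-way message using the current delay estimate."""
        self.offset_ms = backend_recv_ms - self.delay_ms - fw_send_ms
        return self.offset_ms

    def on_measured_delay(self, one_way_delay_ms: float) -> None:
        """Replace the guessed delay with a measured one and correct the offset."""
        correction = self.delay_ms - one_way_delay_ms
        self.delay_ms = one_way_delay_ms
        if self.offset_ms is not None:
            self.offset_ms += correction


est = OffsetEstimator()
first_guess = est.on_firmware_message(fw_send_ms=5000.0, backend_recv_ms=1_005_010.0)
est.on_measured_delay(15.0)  # e.g. half of a measured 30 ms round trip
```

The correction step just shifts the stored offset by the difference between the guessed and measured delays, so timestamps received before the measurement can be re-mapped with the improved offset if needed.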

ethanjli commented 2 years ago

Ideally, if/when we need to resynchronize clocks due to clock drift, the timestamps shouldn't jump around discontinuously, but rather should be slewed so that times stay monotonic (see pages 8-9 of these slides). We could try to do this using the STM32H7's internal RTC (which supports smooth digital calibration), but that's probably too complicated; instead, it may be better to gradually adjust the offsets in the backend server. This could probably be done by gradually interpolating the offset between a previous offset value and a target offset value.
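The interpolation idea could look roughly like the sketch below. `SlewedOffset`, its parameters, and the 10 s slew duration are made up for illustration; the interpolation is driven by the firmware timestamp being converted, so a given firmware timestamp always maps to the same wall-clock time.

```python
class SlewedOffset:
    """Linearly interpolates an offset (ms) from its previous value to a target."""

    def __init__(self, initial_offset_ms: float, slew_duration_ms: float = 10_000.0) -> None:
        if slew_duration_ms <= 0:
            raise ValueError("slew_duration_ms must be positive")
        self._start_offset = initial_offset_ms
        self._target_offset = initial_offset_ms
        self._start_time = 0.0
        self._duration = slew_duration_ms

    def set_target(self, target_offset_ms: float, now_ms: float) -> None:
        """Begin slewing from the currently applied offset toward a new target."""
        self._start_offset = self.value(now_ms)
        self._target_offset = target_offset_ms
        self._start_time = now_ms

    def value(self, now_ms: float) -> float:
        """Offset to apply at firmware time now_ms."""
        fraction = (now_ms - self._start_time) / self._duration
        fraction = min(max(fraction, 0.0), 1.0)  # clamp to the slew window
        return self._start_offset + (self._target_offset - self._start_offset) * fraction

    def to_wall_clock(self, fw_time_ms: float) -> float:
        """Convert a firmware timestamp to backend wall-clock time."""
        return fw_time_ms + self.value(fw_time_ms)


forward = SlewedOffset(1000.0)
forward.set_target(1100.0, now_ms=0.0)

backward = SlewedOffset(1000.0)
backward.set_target(900.0, now_ms=0.0)
backward_times = [backward.to_wall_clock(float(t)) for t in range(0, 12_000, 100)]
```

Converted times stay strictly increasing as long as a backward correction is smaller in magnitude than the slew duration (here, 100 ms spread over 10 s). A larger backward correction would need a longer slew or a discontinuous jump.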

However, we will need to be careful about adjusting offsets when either clock changes discontinuously, e.g. when the RPi's system time is changed by user adjustment in the frontend, or when the firmware restarts while the backend is running. In NTP, the clock is slewed when the measured offset is less than 128 ms, while a discontinuous jump (step) is used when it is greater than 128 ms.
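If we borrow NTP's rule of thumb, the decision could be as simple as the following sketch. The `Adjustment` enum and `choose_adjustment` function are hypothetical, and whether 128 ms is the right threshold for us is an open question.

```python
from enum import Enum


class Adjustment(Enum):
    NONE = "none"  # offsets already agree
    SLEW = "slew"  # small error: adjust gradually so timestamps stay monotonic
    STEP = "step"  # large error: jump straight to the new offset


def choose_adjustment(
    current_offset_ms: float,
    measured_offset_ms: float,
    step_threshold_ms: float = 128.0,  # borrowed from NTP; an assumption for our system
) -> Adjustment:
    """Decide how to move from the applied offset to a newly measured one."""
    error = abs(measured_offset_ms - current_offset_ms)
    if error == 0:
        return Adjustment.NONE
    if error > step_threshold_ms:
        return Adjustment.STEP
    return Adjustment.SLEW
```

A user changing the system time or a firmware restart would typically produce an error far above the threshold and so would be handled as a step, while ordinary drift corrections would be slewed.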

ethanjli commented 2 years ago

When I show seconds in the Event Log (rather than only hours and minutes), I can see timestamps for log events generated in the firmware which are greater than timestamps for subsequent log events generated in the backend. So the current inaccuracy of clock synchronization is a significant issue for the accuracy of the event log.

Additionally, timestamps for events generated in the simulator backend (but not the application backend) are completely incorrect in a different way: the seconds of the timestamp get stuck for a while, then jump ahead to the current time, then get stuck again, etc.

pez-globo / pufferfish-software

Improve backend-firmware clock synchronization #424