Improve synchronisation between tedge daemons

albinsuresh commented 2 years ago

Is your feature request related to a problem? Please describe.

There are many interactions between different tedge daemons where one expects the other daemon to be up and running to respond to any requests that it sends. Here are some examples:

Once the c8y-mapper starts, request the software list from tedge-agent
Once the c8y-bridge is created, make c8y-config-plugin send the list of all supported config files
Once the c8y-bridge is created, make c8y-log-plugin send the list of all supported log files
Once the c8y-mapper starts, make tedge-agent send the status of the last "executing" operation, if any

Because of these dependencies, some of these messages could be lost if the requestee service is not up and running when the requester service sends a request.

Currently, we rely on the MQTT broker's persistent-session feature: the broker stores these requests while the requestee service is down and delivers them when that service starts later. However, this persistent-session feature is not very stable on the mosquitto broker that we use, hence we need a solution that doesn't fully rely on it.

Describe the solution you'd like

Make daemons request data from other daemons only once their liveness is validated. For example, c8y-mapper should request the software list from tedge-agent only once it can confirm that tedge-agent is up and running. The tedge/health endpoints of these daemons could be used to check this liveness.
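As a rough illustration of this idea, the sketch below gates a request on a peer's last known health status. It makes a few assumptions that are not confirmed in this issue: that health messages arrive on topics of the form tedge/health/<service>, and that their payload is JSON with a "status" field. PeerLiveness is a hypothetical helper, not an existing thin-edge API, and the MQTT client wiring is left out on purpose.

```python
import json


class PeerLiveness:
    """Tracks the last known health status of peer daemons (hypothetical helper)."""

    HEALTH_PREFIX = "tedge/health/"  # assumed topic layout

    def __init__(self):
        self._status = {}

    def on_message(self, topic, payload):
        """Record the status carried by a health message; ignore other topics."""
        if not topic.startswith(self.HEALTH_PREFIX):
            return
        service = topic[len(self.HEALTH_PREFIX):]
        try:
            # Assumed payload shape: {"status": "up", ...}
            status = json.loads(payload).get("status")
        except (ValueError, AttributeError):
            status = None  # malformed payload: the peer is not confirmed up
        self._status[service] = status

    def is_up(self, service):
        """True only if the last health message from `service` said "up"."""
        return self._status.get(service) == "up"


liveness = PeerLiveness()
liveness.on_message("tedge/health/tedge-agent", '{"status": "up"}')
if liveness.is_up("tedge-agent"):
    pass  # here the mapper would publish its software list request
```

A peer that has never reported its health is treated as not up, so a request is simply held back until the first "up" status is seen.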

Describe alternatives you've considered

Defining systemd service dependencies could be an alternative, but there are many cases where service pairs depend on each other, leading to cyclic dependencies. Even without that problem, it would be a systemd-specific solution.

didier-wenzek commented 2 years ago

I want to highlight that the issue goes beyond the stability issues we observed on mosquitto 2.0.

The agent can successfully receive a message published by the mapper even if it was down when the message was published. However, the agent must have been launched at least once, so that a subscription is persisted in a named session. Messages sent just after the very first installation might be lost. This is why we introduced the tedge_agent --init option to create this persisted session on install ... with a major drawback: messages are consumed and discarded if the option is used after install.
The issue with the bridge to Cumulocity is deeper. The bridge is created on tedge connect c8y and removed on tedge disconnect c8y. After the first install and after a disconnect, there is no bridge any more, i.e. the c8y/# topics have no subscribers. Any message published on a c8y/# topic before a tedge connect c8y will be lost. Here the --init trick to create a session cannot work, because the subscriber is mosquitto itself.

I see the fix proposed here as the right approach.

The thin-edge daemons must send an up-status on start and a down-status as last will.
Init messages sent to peers must be published only in reaction to up-status messages received from these peers.
It's okay to re-send init messages after each restart of the bridge or of one of the thin-edge daemons. However, we must avoid resending them after each health-check response.
The --init option of the mapper and the agent must no longer create a session, because of the risk of discarding messages when --init is used inadvertently.
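The second and third points above can be sketched as an edge-triggered check: init messages are due only when a peer goes from "not up" to "up", so repeated up-status replies to health checks do not trigger a resend. This is a minimal sketch, not the implementation from the PRs; InitOnPeerUp and the "up"/"down" status strings are assumptions made for illustration.

```python
class InitOnPeerUp:
    """Decides when init messages to a peer are due (hypothetical helper)."""

    def __init__(self):
        self._peer_is_up = {}

    def on_status(self, peer, status):
        """Return True only when `peer` goes from not-up to up."""
        was_up = self._peer_is_up.get(peer, False)
        now_up = status == "up"
        self._peer_is_up[peer] = now_up
        return now_up and not was_up


tracker = InitOnPeerUp()
for status in ["up", "up", "down", "up"]:
    if tracker.on_status("tedge-agent", status):
        pass  # here the daemon would (re)publish its init messages
```

In this run the init messages would be published on the first and the last status only. The sketch relies on the down-status (the last will) being observed between two up-statuses; a restart that is fast enough to hide the down-status would need another signal to be detected.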

PradeepKiruvale commented 1 year ago

Addressed this through the PRs below:

https://github.com/thin-edge/thin-edge.io/pull/2065

https://github.com/thin-edge/thin-edge.io/pull/2050

Created a follow-up ticket: https://github.com/thin-edge/thin-edge.io/issues/2070

Improve synchronisation between tedge daemons #1201