thin-edge / thin-edge.io

The open edge framework for lightweight IoT devices
https://thin-edge.io
Apache License 2.0

All service statuses are set to down and remain down after restarting mosquitto #1774

Closed reubenmiller closed 1 year ago

reubenmiller commented 1 year ago

Describe the bug

When the mosquitto MQTT broker is restarted on a thin-edge device, the service status of each thin-edge component is reported as "down", even though the systemd services are in fact still running fine.

To Reproduce

  1. Start all services (mosquitto, tedge-agent, tedge-mapper-*, c8y-log-plugin, c8y-configuration-plugin etc.)
  2. Check that the service statuses are "up" (green) in the Cumulocity IoT Device Management window
  3. Restart the mosquitto service (sudo systemctl restart mosquitto)
  4. Check the services in the UI: all of them are reported as "down", even though they are still running under systemd. Only the "up" status is not resent.

Expected behavior

Each service that reports its health status to the MQTT broker should send a service "up" message once the connection has been re-established.


PradeepKiruvale commented 1 year ago

Looks like this is the downside of using the MQTT last will message feature to send the service status when the connection goes down.

reubenmiller commented 1 year ago

No, this is not a limitation of the last will and testament message; that worked just fine. We just need to change how we resend the "up" status when each client reconnects (and confirm that it has been sent, e.g. using QoS >= 1).

didier-wenzek commented 1 year ago

I'm afraid this is not a direct fix. This is not just about re-sending an up message. One needs to detect that the connection has been lost and recovered. There is an event raised by rumqtt; but that event is not propagated to the MQTT channel used by thin-edge daemons.
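To make the "lost and recovered" detection concrete, here is a minimal sketch of how raw client events could be turned into connection notifications. All of the names (`ClientEvent`, `ChannelItem`, `ConnectionTracker`) are hypothetical and only loosely modelled on what an MQTT client event loop might report; they are not the rumqtt or thin-edge types.

```rust
/// Hypothetical low-level events, as an MQTT client event loop might report them.
/// These are illustrative only, not the actual rumqtt types.
#[derive(Debug, Clone, PartialEq)]
pub enum ClientEvent {
    Connected,
    Disconnected,
    Incoming(String),
}

/// Hypothetical items a daemon would receive if connection changes
/// were forwarded along its MQTT channel.
#[derive(Debug, Clone, PartialEq)]
pub enum ChannelItem {
    Message(String),
    ConnectionLost,
    ConnectionRecovered,
}

/// Tracks the connection state so that a reconnection can be told apart
/// from the very first connection.
#[derive(Debug, Default)]
pub struct ConnectionTracker {
    connected: bool,
    was_connected_before: bool,
}

impl ConnectionTracker {
    /// Translates one raw event into at most one channel item.
    pub fn on_event(&mut self, event: ClientEvent) -> Option<ChannelItem> {
        match event {
            ClientEvent::Connected => {
                // Only a connection that follows a loss counts as "recovered".
                let recovered = self.was_connected_before && !self.connected;
                self.connected = true;
                self.was_connected_before = true;
                if recovered {
                    Some(ChannelItem::ConnectionRecovered)
                } else {
                    None
                }
            }
            ClientEvent::Disconnected => {
                // Report the loss once, even if the client emits repeated errors.
                if self.connected {
                    self.connected = false;
                    Some(ChannelItem::ConnectionLost)
                } else {
                    None
                }
            }
            ClientEvent::Incoming(payload) => Some(ChannelItem::Message(payload)),
        }
    }
}
```

A daemon receiving `ConnectionRecovered` could then decide for itself whether to republish its "up" status.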

I see two options - with a preference for the second one:

reubenmiller commented 1 year ago

I'm afraid this is not a direct fix. This is not just about re-sending an up message. One needs to detect that the connection has been lost and recovered. There is an event raised by rumqtt; but that event is not propagated to the MQTT channel used by thin-edge daemons.

I see two options - with a preference for the second one:

  • Forward connection events along the MQTT channels as a new type of message. This would give the daemons full freedom on how to react, and could even be extended later with other low-level events. However, this would impact all users of the MQTT channels, who would then have to deal with these new events.
  • Attach to each MQTT channel a function that creates an init message. A fresh init message will then be generated and sent behind the scenes each time the connection is established. Config::with_init_message(create: Fn(()) -> Message).

Yes, option two sounds like a good way to go.
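For illustration, option two could be sketched roughly as follows. This is a simplified model, not the thin-edge implementation: `Message`, `Config`, `Connection`, `on_connected` and `service_up` are hypothetical stand-ins, and the topic and payload are only examples. The one name taken from the proposal above is `with_init_message`.

```rust
/// Hypothetical message type; a real MQTT message carries more fields (e.g. QoS).
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub topic: String,
    pub payload: String,
    pub retain: bool,
}

/// Hypothetical channel configuration holding an optional init-message factory.
#[derive(Default)]
pub struct Config {
    init_message: Option<Box<dyn Fn() -> Message + Send>>,
}

impl Config {
    /// Registers a function that builds a fresh init message on demand.
    pub fn with_init_message(mut self, create: impl Fn() -> Message + Send + 'static) -> Self {
        self.init_message = Some(Box::new(create));
        self
    }
}

/// Stand-in for the connection loop; `sent` records what would be published.
pub struct Connection {
    config: Config,
    pub sent: Vec<Message>,
}

impl Connection {
    pub fn new(config: Config) -> Self {
        Connection { config, sent: Vec::new() }
    }

    /// Assumed to be called each time the broker connection is (re)established.
    pub fn on_connected(&mut self) {
        if let Some(create) = &self.config.init_message {
            // Build the message at connection time, so it is fresh on every reconnect.
            self.sent.push(create());
        }
    }
}

/// Example factory for a service "up" status (illustrative topic and payload).
pub fn service_up(service: &str) -> Message {
    Message {
        topic: format!("tedge/health/{service}"),
        payload: r#"{"status":"up"}"#.to_string(),
        retain: true,
    }
}
```

A daemon would only configure its channel once, e.g. `Config::default().with_init_message(|| service_up("tedge-agent"))`, and the "up" message would then be sent again after every reconnection without the daemon having to handle connection events itself.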

gligorisaev commented 1 year ago

Checked, it is OK. Covered by /cumulocity/service_monitoring/service_monitoring.robot