thin-edge / thin-edge.io

The open edge framework for lightweight IoT devices
https://thin-edge.io
Apache License 2.0

service status is out of sync with cloud after a reboot (using yocto kirkstone image) #2498

Closed: reubenmiller closed this issue 9 months ago

reubenmiller commented 9 months ago

Describe the bug

After a device reboot, the thin-edge.io services are all shown as "down" (red icon) in Cumulocity IoT; however, on the local MQTT broker the service statuses are healthy (and the systemd services also look healthy).

# tedge mqtt sub 'te/device/main/service/+/status/health'
INFO: Connected
[te/device/main/service/mosquitto-c8y-bridge/status/health] 1
[te/device/main/service/tedge-configuration-plugin/status/health] {"pid":358,"status":"up","time":"2023-12-04T22:18:54.902342628Z"}
[te/device/main/service/tedge-agent/status/health] {"pid":357,"status":"up","time":"2023-12-04T22:18:54.920607091Z"}
[te/device/main/service/tedge-log-plugin/status/health] {"pid":359,"status":"up","time":"2023-12-04T22:18:54.90731798Z"}
[te/device/main/service/tedge-mapper-c8y/status/health] {"pid":360,"status":"up","time":"2023-12-04T22:18:54.893884313Z"}

Restarting the tedge-mapper-c8y service refreshes the service status in Cumulocity IoT.

Workaround

Publish a health check MQTT message to force all services to update their statuses:

tedge mqtt pub -r 'te/device/main///cmd/health/check' '{}'
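
A quick way to check that the workaround has taken effect is to subscribe to the health topics again and confirm that the time fields have been refreshed (the services should then also show as "up" in Cumulocity IoT):

    # each service republishes its health status with a new timestamp
    tedge mqtt sub 'te/device/main/service/+/status/health'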

To Reproduce

  1. Install thin-edge.io and set everything up

  2. Reboot device (and wait for device to reboot)

  3. Check the service status in Cumulocity IoT

  4. Check the service status on the local MQTT broker

    tedge mqtt sub 'te/device/main/service/+/status/health'
  5. Restart the tedge-mapper-c8y manually (check if this updates the service status)

    systemctl restart tedge-mapper-c8y

    Alternatively, a health check request message can be published to trigger all of the services to update their statuses:

    tedge mqtt pub -r 'te/device/main///cmd/health/check' '{}'
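
In addition to the MQTT-level check in step 4, the systemd view of the services can be confirmed directly (a quick sketch, assuming the default unit names used above):

    # all three units should report active (running)
    systemctl status tedge-agent tedge-mapper-c8y mosquitto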

Expected behavior

The service status in the cloud should align with the local MQTT broker status (provided the MQTT bridge connection is functional).
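
To rule out a broken bridge, the bridge status topic shown earlier can be checked, and the built-in connection test can be run (assuming the tedge connect c8y --test subcommand is available in this version):

    # the retained bridge status should be 1 (connected)
    tedge mqtt sub 'te/device/main/service/mosquitto-c8y-bridge/status/health'

    # ask tedge to verify the connection to Cumulocity IoT
    tedge connect c8y --test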

Screenshots

Environment (please complete the following information):

OS [incl. version]: Poky (Yocto Project Reference Distro) 4.0.14 (kirkstone)
Hardware [incl. revision]: Raspberry Pi 4 Model B Rev 1.1
System-Architecture: Linux rpi4-dca632486720 5.15.34-v8 #1 SMP PREEMPT Tue Apr 19 19:21:26 UTC 2022 aarch64 GNU/Linux
thin-edge.io version: tedge 0.13.2~118+g8457fa6
mosquitto version: 2.0.18

NOTES: The Linux distribution does not make use of a hardware clock (hwclock) or a fake-hwclock, so after a reboot the system time reverts to 1970 until the network is established and the system clock has been synchronized via NTP.
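
A quick way to confirm this clock behaviour on the device (assuming systemd-timesyncd or another systemd-managed NTP client, which is typical for a Poky image):

    # shows whether the system clock is synchronized and whether NTP is active
    timedatectl status

    # journal entries written before synchronization carry 1970 timestamps,
    # e.g. "Jan 01 00:00:07", as seen in the journald logs later in this thread
    journalctl --boot | head -n 20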

Additional context

reubenmiller commented 9 months ago

@Bravo555 I've created some Yocto images for a Raspberry Pi 3 and 4 (64-bit) that you should be able to use to reproduce it (if you can't reproduce it using a container).

https://github.com/thin-edge/meta-tedge-project/releases/tag/20231205.18XX

reubenmiller commented 9 months ago

Here are the journald logs (since boot). The procedure and how the logs were collected are described below.

journalctl --boot

Procedure

  1. Reboot the device (this part is not in the logs but shouldn't be that interesting)
  2. Wait for the device to boot (after ~15 seconds, the service status is out of sync with the cloud)
  3. Manually restart tedge-mapper-c8y using systemctl restart tedge-mapper-c8y; afterwards the service state in the cloud is in sync

rpi3-b827ebe1f7d6.journald.txt
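
For reference, the attached log was presumably exported with something along these lines (the file name simply mirrors the device hostname):

    # dump the journal for the current boot into a text file
    journalctl --boot --no-pager > rpi3-b827ebe1f7d6.journald.txt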

Bravo555 commented 9 months ago

I was able to reproduce a very similar, but not quite the same, issue on a development container by deleting the mosquitto message database under /var/lib/mosquitto/mosquitto.db:

  1. Connect to Cumulocity using tedge connect c8y
  2. Stop mosquitto: systemctl stop mosquitto.
  3. Verify that on the Cumulocity "Services" page the tedge-mapper-c8y status is "down" and tedge-mapper is "up".
  4. Delete the /var/lib/mosquitto/mosquitto.db file.
  5. Start mosquitto: systemctl start mosquitto.
  6. Verify that on the Cumulocity "Services" page tedge-mapper-c8y is still "down" and additionally tedge-mapper is now also "down".
  7. Verify that on the local MQTT broker, both services are marked as "up":
    [te/device/main/service/mosquitto-c8y-bridge/status/health] 1
    [te/device/main/service/tedge-agent/status/health] {"pid":10269,"status":"up","time":"2023-12-18T14:08:12.895593918Z"}
    [te/device/main/service/tedge-mapper-c8y/status/health] {"pid":10268,"status":"up","time":"2023-12-18T14:08:12.896283834Z"}

Additionally, it's not that the status is merely out of sync: the mapper is actually broken and doesn't convert messages, which can be trivially confirmed by publishing a measurement message and seeing that the mapper doesn't publish the expected message on the c8y/measurement/measurements/create topic. That's probably because the broker lost the client subscriptions (along with the deleted message database), and the clients do not resubscribe after reconnecting, so there are no active subscriptions and the clients do not receive new messages.
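
For reference, this check can be run from the device with two terminals (a sketch assuming the default te/device/main topic scheme; the measurement type "environment" is arbitrary):

    # terminal 1: watch the topic the mapper should publish converted measurements to
    tedge mqtt sub 'c8y/measurement/measurements/create'

    # terminal 2: publish a test measurement on the thin-edge.io topic
    tedge mqtt pub 'te/device/main///m/environment' '{"temperature": 23.5}'

If the mapper is healthy, a Cumulocity measurement appears in terminal 1; in the broken state nothing is published.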

In the log I found these lines:

Jan 01 00:00:07 rpi3-b827ebe1f7d6 systemd[1]: Mounting /var/volatile...
Jan 01 00:00:07 rpi3-b827ebe1f7d6 systemd[1]: Mounted /var/volatile.
Jan 01 00:00:07 rpi3-b827ebe1f7d6 systemd[1]: Bind mount volatile /var/cache was skipped because of a failed condition check (ConditionPathIsReadWrite=!/var/cache).
Jan 01 00:00:07 rpi3-b827ebe1f7d6 systemd[1]: Bind mount volatile /var/lib was skipped because of a failed condition check (ConditionPathIsReadWrite=!/var/lib).

...

Jan 01 00:00:09 rpi3-b827ebe1f7d6 systemd[1]: Bind mount volatile /var/lib was skipped because of a failed condition check (ConditionPathIsReadWrite=!/var/lib).
Jan 01 00:00:09 rpi3-b827ebe1f7d6 systemd[1]: Bind mount volatile /var/spool was skipped because of a failed condition check (ConditionPathIsReadWrite=!/var/spool).

Previously, with @Ruadhri17, we stumbled on the default Yocto behaviour of mounting a volatile filesystem under /var/volatile and then binding /var/log and /var/lib to /var/volatile, which was the source of another bug that I can't quite remember.

As the end devices don't have tons of spare space to store logs, this is probably not something we want to disable in our Yocto layer; instead we should make the actors more resilient against these kinds of broker restarts where messages are not preserved.
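
To check whether this is a factor on a given image, the mount backing the mosquitto state directory can be inspected (assuming findmnt from util-linux is present on the image):

    # show which filesystem /var/lib/mosquitto actually lives on
    findmnt -T /var/lib/mosquitto

    # list the volatile/tmpfs mounts set up by the image
    mount | grep -E 'volatile|tmpfs'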

One thing is still not clear though: mosquitto starts first, and only then our thin-edge daemons. At no point while they are running is the broker stopped and previously sent messages removed. When trying to replicate that scenario directly (stop all tedge daemons and mosquitto, disable networking, delete the message database, start mosquitto, start the tedge daemons, enable networking), I couldn't reproduce this behaviour. In the case I found, sending a health check message does not fix the problem, because the mapper is completely broken, but in the reported case it does.
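
For reference, the replication attempt described above corresponds roughly to the following sequence (eth0 is only a placeholder for the actual network interface):

    # stop the thin-edge.io daemons and the broker
    systemctl stop tedge-agent tedge-mapper-c8y mosquitto

    # take the network down and remove the broker's persisted message database
    ip link set eth0 down
    rm /var/lib/mosquitto/mosquitto.db

    # bring everything back up
    systemctl start mosquitto
    systemctl start tedge-agent tedge-mapper-c8y
    ip link set eth0 up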

Still, daemons ending up broken after reconnecting to a broker that has lost previously sent messages is not ideal and should be fixed regardless, and there is some possibility that fixing it would also resolve this issue.

reubenmiller commented 9 months ago

> Previously, with @Ruadhri17, we stumbled on the default Yocto behaviour of mounting a volatile filesystem under /var/volatile and then binding /var/log and /var/lib to /var/volatile, which was the source of another bug that I can't quite remember.

In the image under test, the mosquitto message database is kept in a persistent location (the path is /data/mosquitto/mosquitto.db), so this shouldn't be an influencing factor.
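
For context, keeping the message database outside of /var/lib is usually done with mosquitto settings along these lines (the exact configuration used in the image may differ):

    # e.g. /etc/mosquitto/conf.d/persistence.conf (illustrative)
    persistence true
    persistence_location /data/mosquitto/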

reubenmiller commented 9 months ago

Updated images can be found here: https://github.com/thin-edge/meta-tedge-project/releases/tag/20231219.0941

rina23q commented 9 months ago

I could reproduce this bug on the 20231219.0941 image. However, it is no longer reproducible on the 20231220.2304 image.


Here is the firmware version reported. [Screenshot 2023-12-22 at 11:31:11]

I triggered the restart at 2023-12-22 11:33:22

[Screenshot 2023-12-22 at 11:34:20]

After triggering the restart, I saw the tedge-mapper-c8y and tedge-agent services go down and then come back up. They were definitely reported after the restart, as the last update timestamp is 2023-12-22 11:34:07.

[Screenshot 2023-12-22 at 11:37:46]

So I will close the ticket now. If we see the same behaviour again, feel free to reopen it.