Closed by reubenmiller 9 months ago
@Bravo555 I've created some Yocto images for a Raspberry Pi 3 and 4 (64-bit) that you should be able to use to reproduce it (if you can't reproduce it using a container).
https://github.com/thin-edge/meta-tedge-project/releases/tag/20231205.18XX
Here are the journald logs (since boot). Below describes the procedure and how the logs were collected.

```sh
journalctl --boot
```
Procedure

1. `systemctl restart tedge-mapper-c8y`
   - afterwards, the service state in the cloud is in sync

I was able to reproduce a very similar, but not quite the same, issue on a development container by deleting the mosquitto message database under `/var/lib/mosquitto/mosquitto.db`:
1. `tedge connect c8y`
2. `systemctl stop mosquitto`. The `tedge-mapper-c8y` status is "down" and `tedge-mapper` is "up".
3. Delete the `/var/lib/mosquitto/mosquitto.db` file.
4. `systemctl start mosquitto`. `tedge-mapper-c8y` is still down and additionally `tedge-mapper` is also down.

```
[te/device/main/service/mosquitto-c8y-bridge/status/health] 1
[te/device/main/service/tedge-agent/status/health] {"pid":10269,"status":"up","time":"2023-12-18T14:08:12.895593918Z"}
[te/device/main/service/tedge-mapper-c8y/status/health] {"pid":10268,"status":"up","time":"2023-12-18T14:08:12.896283834Z"}
```
Additionally, it's not that the status is merely out of sync: the mapper is actually broken and doesn't convert messages, which can be trivially confirmed by publishing a measurement message and observing that the mapper doesn't publish the expected message on the `c8y/measurement/measurements/create` topic. That's probably because the broker lost the client subscriptions after the restart, but the clients do not resend them, so there are no subscriptions and the clients do not receive new messages.
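That failure mode can be sketched with a small in-memory model (a toy stand-in for the broker and its clients, not the real MQTT stack): subscriptions live only in the broker's session state, so wiping that state while a client still believes it is subscribed silently stops delivery.

```python
class ToyBroker:
    """Minimal stand-in for an MQTT broker's session state (illustrative only)."""

    def __init__(self):
        self.subscriptions = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscriptions.setdefault(topic, []).append(callback)

    def publish(self, topic, payload):
        for cb in self.subscriptions.get(topic, []):
            cb(payload)

    def wipe_state(self):
        # Models deleting mosquitto.db / restarting without persisted sessions.
        self.subscriptions.clear()


broker = ToyBroker()
received = []
broker.subscribe("te/device/main///m/", received.append)

broker.publish("te/device/main///m/", '{"temperature": 25}')
assert len(received) == 1  # delivery works while the subscription exists

broker.wipe_state()  # broker loses session state; the client is unaware

broker.publish("te/device/main///m/", '{"temperature": 26}')
assert len(received) == 1  # silently dropped: no subscriptions remain
```

Nothing errors and nothing is logged; from the client's point of view the connection is fine, it just never receives another message.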
In the log I found these lines:
```
Jan 01 00:00:07 rpi3-b827ebe1f7d6 systemd[1]: Mounting /var/volatile...
Jan 01 00:00:07 rpi3-b827ebe1f7d6 systemd[1]: Mounted /var/volatile.
Jan 01 00:00:07 rpi3-b827ebe1f7d6 systemd[1]: Bind mount volatile /var/cache was skipped because of a failed condition check (ConditionPathIsReadWrite=!/var/cache).
Jan 01 00:00:07 rpi3-b827ebe1f7d6 systemd[1]: Bind mount volatile /var/lib was skipped because of a failed condition check (ConditionPathIsReadWrite=!/var/lib).
...
Jan 01 00:00:09 rpi3-b827ebe1f7d6 systemd[1]: Bind mount volatile /var/lib was skipped because of a failed condition check (ConditionPathIsReadWrite=!/var/lib).
Jan 01 00:00:09 rpi3-b827ebe1f7d6 systemd[1]: Bind mount volatile /var/spool was skipped because of a failed condition check (ConditionPathIsReadWrite=!/var/spool).
```
Previously, with @Ruadhri17, we stumbled on the default Yocto behaviour of mounting a volatile filesystem under `/var/volatile` and then binding `/var/log` and `/var/lib` to `/var/volatile`, which was the source of another bug that I can't quite remember.
As the end devices don't have tons of spare space to store logs, this is probably not something we want to disable in our yocto layer, but we should instead make actors more resilient against these kinds of broker restarts where the messages are not preserved.
One thing is still not clear though: mosquitto starts first, and only then our thin-edge daemons. At no point while they're running is the broker stopped and the previously sent messages removed. When trying to replicate the same scenario (stop all tedge daemons and mosquitto, disable networking, delete the message database, start mosquitto, start the tedge daemons, enable networking), I couldn't reproduce this behaviour. In the case I found, sending a health check message does not fix the problem, because the mapper is completely broken, but in the reported case, it does.
Still, daemons breaking after reconnecting when messages were previously lost by the broker is not ideal and should be fixed regardless, and there is some possibility that it could fix this issue.
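The usual resilience pattern here (a sketch of the general idea, not thin-edge.io's actual implementation) is for each client to remember its own subscriptions and replay them from its on-connect hook, the way paho-style clients resubscribe inside an `on_connect` callback, so a broker restart that wiped session state heals itself on reconnect:

```python
class ToyBroker:
    """Minimal stand-in for an MQTT broker's session state (illustrative only)."""

    def __init__(self):
        self.subscriptions = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscriptions.setdefault(topic, []).append(callback)

    def publish(self, topic, payload):
        for cb in self.subscriptions.get(topic, []):
            cb(payload)


class ResilientClient:
    """Client that keeps its own subscription list and replays it on every
    (re)connect, instead of relying on the broker to remember it."""

    def __init__(self, topics):
        self.topics = topics
        self.inbox = []

    def on_connect(self, broker):
        # Resubscribing here means a broker restart with lost session
        # state stops mattering as soon as the client reconnects.
        for topic in self.topics:
            broker.subscribe(topic, self.inbox.append)


client = ResilientClient(["te/device/main///m/"])

broker = ToyBroker()
client.on_connect(broker)
broker.publish("te/device/main///m/", "m1")

# Broker "restarts" with empty state; the client reconnects and resubscribes.
broker = ToyBroker()
client.on_connect(broker)
broker.publish("te/device/main///m/", "m2")

assert client.inbox == ["m1", "m2"]  # no messages missed after the restart
```

Messages published while the broker is down are still lost, of course; this only guarantees the client keeps receiving once the connection is back.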
> Previously, with @Ruadhri17, we stumbled on the default Yocto behaviour of mounting a volatile filesystem under `/var/volatile` and then binding `/var/log` and `/var/lib` to `/var/volatile`, which was the source of another bug that I can't quite remember.
In the image under test, the mosquitto data is stored in a persisted location, so this shouldn't be an influencing factor (the path is `/data/mosquitto/mosquitto.db`).
Updated images can be found here: https://github.com/thin-edge/meta-tedge-project/releases/tag/20231219.0941
I could reproduce this bug on the `20231219.0941` image. However, it is no longer reproducible on the `20231220.2304` image.
Here is the firmware version reported.
I triggered the restart at 2023-12-22 11:33:22. After triggering the restart, I saw the `tedge-mapper-c8y` and `tedge-agent` services go down. Then, they went up again. They are reported after the restart for sure, as the last update is 2023-12-22 11:34:07.

So, I would close the ticket now. If we find the same behaviour again, feel free to reopen this.
Describe the bug
After a device reboot, the thin-edge.io services are all set to "down" (red icon) in Cumulocity IoT; however, on the local MQTT broker the service status is healthy (and the systemd services also look healthy).
Restarting the `tedge-mapper-c8y` service refreshes the service status in Cumulocity IoT.

Workaround

Publish a health check MQTT message to force all services to update their statuses.
To Reproduce

1. Install thin-edge.io and set everything up
2. Reboot the device (and wait for the device to reboot)
3. Check the service status in Cumulocity IoT
4. Check the service status on the local MQTT broker
5. Restart the `tedge-mapper-c8y` manually (check if this updates the service status)

Alternatively, a health check request message can be published to trigger all of the services to update their statuses.
Expected behavior
The service status in the cloud should align with the local MQTT broker status (provided the MQTT bridge connection is functional).
Screenshots
Environment (please complete the following information):

- Poky (Yocto Project Reference Distro) 4.0.14 (kirkstone)
- Raspberry Pi 4 Model B Rev 1.1
- Linux rpi4-dca632486720 5.15.34-v8 #1 SMP PREEMPT Tue Apr 19 19:21:26 UTC 2022 aarch64 GNU/Linux
- tedge 0.13.2~118+g8457fa6
- 2.0.18
NOTES: The Linux distribution is not making use of a hardware clock (hwclock) or a fake-hwclock, so after a reboot the system time reverts to 1970 until the network is established and the system clock has been synchronized using NTP.
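A possible mechanism worth noting (my speculation, not confirmed anywhere in this thread): health messages published before NTP sync carry 1970-era timestamps, so any consumer that keeps only the "newest" status by comparing timestamps would treat a fresh post-boot status as older than the last pre-reboot one and discard it. A minimal illustration, using the timestamp format seen in the health payloads above:

```python
from datetime import datetime, timezone

# Timestamp format as it appears in the health payloads in this issue
# (fractional seconds truncated to the 6 digits strptime's %f supports).
fmt = "%Y-%m-%dT%H:%M:%S.%fZ"

# A status published before NTP sync (system clock still at the 1970 epoch)...
pre_ntp = datetime.strptime("1970-01-01T00:00:09.000000Z", fmt).replace(tzinfo=timezone.utc)
# ...versus the last status recorded before the reboot.
last_known = datetime.strptime("2023-12-18T14:08:12.896283Z", fmt).replace(tzinfo=timezone.utc)

# A naive "newest wins" comparison would treat the post-boot status as stale.
assert pre_ntp < last_known
```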
Additional context