thin-edge / thin-edge.io

The open edge framework for lightweight IoT devices
https://thin-edge.io
Apache License 2.0

Restart operation stuck in EXECUTING state #1135

Closed rina23q closed 1 year ago

rina23q commented 2 years ago

Describe the bug: After triggering a restart operation, the device was restarted successfully. However, the operation status in Cumulocity is still in EXECUTING.

To Reproduce: Trigger a restart operation from Cumulocity.

Expected behavior: The tedge-agent reports "SUCCESSFUL" to Cumulocity.

Screenshots: Screenshot from 2022-05-11 09-09-55

albinsuresh commented 2 years ago

This is the result of a Mosquitto bug: after a broker restart, Mosquitto drops any new messages meant for a persistent client until that client reconnects. Here is a demonstration using the mosquitto clients:

  1. Start a persistent subscriber to a test topic
  2. Stop that subscriber
  3. Publish the first message to that test topic while the persistent subscriber is down
  4. Restart mosquitto
  5. Publish a second message to that test topic while the persistent subscriber is still down
  6. Restart the persistent client with the same client id and notice that it only receives the first message published before mosquitto restart and not the second message published after the restart
  7. Stop that subscriber
  8. Publish a third message to that test topic while the persistent subscriber is down
  9. Restart that persistent subscriber again and see that the third message is also received but the second message is still missing
alsu@alsu-VirtualBox:~$ mosquitto_sub -v -i tracing-client -c -q 1 -t 'test/topic'
alsu@alsu-VirtualBox:~$ mosquitto_pub -q 1 -t test/topic -m hello1
alsu@alsu-VirtualBox:~$ sudo systemctl restart mosquitto.service 
alsu@alsu-VirtualBox:~$ mosquitto_pub -q 1 -t test/topic -m hello2
alsu@alsu-VirtualBox:~$ mosquitto_sub -v -i tracing-client -c -q 1 -t 'test/topic'
test/topic hello1
alsu@alsu-VirtualBox:~$ mosquitto_pub -q 1 -t test/topic -m hello3
alsu@alsu-VirtualBox:~$ mosquitto_sub -v -i tracing-client -c -q 1 -t 'test/topic'
test/topic hello3

As shown above, any new messages meant for a persistent client that arrive after a broker restart are dropped until that client reconnects. The odd part is that messages received before the restart are still delivered to the client as soon as it reconnects.
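
The observed behaviour can be captured in a small in-memory model. This is only a sketch of what the transcript shows, not Mosquitto's actual implementation: `ToyBroker` and its `muted` set are hypothetical names, and live delivery to connected clients is deliberately left out.

```python
class ToyBroker:
    """Toy model of the observed behaviour; not Mosquitto's real code."""

    def __init__(self):
        self.queues = {}     # client id -> messages queued while it is offline
        self.online = set()  # client ids currently connected
        self.muted = set()   # persistent clients not yet seen since the last restart

    def connect(self, client_id):
        """Connect a persistent client and return what was queued for it."""
        self.online.add(client_id)
        self.muted.discard(client_id)
        queue = self.queues.setdefault(client_id, [])
        delivered = list(queue)
        queue.clear()
        return delivered

    def disconnect(self, client_id):
        self.online.discard(client_id)

    def publish(self, message):
        for client_id, queue in self.queues.items():
            if client_id in self.online:
                continue  # live delivery is not modelled in this sketch
            if client_id in self.muted:
                continue  # the bug: dropped until the client reconnects
            queue.append(message)

    def restart(self):
        """Sessions and queued messages survive, but every client is muted."""
        self.online.clear()
        self.muted = set(self.queues)


def reproduce():
    """Replay steps 1 to 9 and return what each reconnect receives."""
    broker = ToyBroker()
    received = []
    broker.connect("tracing-client")                   # step 1
    broker.disconnect("tracing-client")                # step 2
    broker.publish("hello1")                           # step 3
    broker.restart()                                   # step 4
    broker.publish("hello2")                           # step 5: dropped
    received.append(broker.connect("tracing-client"))  # step 6
    broker.disconnect("tracing-client")                # step 7
    broker.publish("hello3")                           # step 8
    received.append(broker.connect("tracing-client"))  # step 9
    return received
```

Calling `reproduce()` returns `[["hello1"], ["hello3"]]`, matching the transcript: `hello2` never arrives.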

In our case, the status update for the restart operation is indeed sent by the agent after the device restarts. But the c8y-mapper typically starts later, so it is not yet connected when tedge-agent sends this message. The mapper should still receive it on connecting, since it has a persistent session with the broker, but it doesn't because of the message drop bug shown above.

albinsuresh commented 2 years ago

Found that Mosquitto recognised this issue as https://github.com/eclipse/mosquitto/issues/769 and fixed it in v1.6.11. Unfortunately, that version on the 1.6 release line is not available in the official Ubuntu apt repo, so we will have to install it manually or switch to the latest 2.x release line.
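
As a rough aid, a version string can be checked against the fixed release. This is a minimal sketch: the function names are hypothetical, and it assumes that every upstream release from 1.6.11 onwards (including 2.x) carries the fix. A distribution package may backport patches, so an older package version does not strictly prove the bug is present.

```python
import re

FIXED_IN = (1, 6, 11)  # release named in the comment above


def parse_version(text):
    """Extract the first MAJOR.MINOR.PATCH triple from a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def has_persistent_session_fix(version_text):
    """Assumption: all upstream releases >= 1.6.11 include the fix."""
    return parse_version(version_text) >= FIXED_IN
```

For example, `has_persistent_session_fix("1.6.9")` is `False`, while `"1.6.11"` and `"2.0.11"` both give `True`.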

albinsuresh commented 2 years ago

Since Mosquitto 1.6.11 or higher is not even available in the public Debian repositories, and since the 2.x line also has several bugs around persistent sessions, we'll have to fix this issue in thin-edge itself to work around this persistent session bug in Mosquitto.

This can be done by introducing a new request-response contract between the mapper and the agent. In addition to the agent sending the status of the last executed operation on its startup, we'll also introduce another topic like tedge/init on tedge-agent which can be triggered by the mapper to make the agent send the status of the last executed operation.
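
The proposed contract might look roughly like the sketch below. It is an in-memory illustration only, not thin-edge code: `LossyBus`, `Agent`, and `Mapper` are hypothetical stand-ins, `demo/restart/status` is a placeholder topic, and only `tedge/init` comes from the proposal above. The bus drops anything published before a subscriber exists, which mimics the effect of the broker bug.

```python
from collections import defaultdict

INIT_TOPIC = "tedge/init"             # trigger topic proposed above
STATUS_TOPIC = "demo/restart/status"  # hypothetical placeholder topic


class LossyBus:
    """In-memory stand-in for a broker that drops messages nobody is subscribed to."""

    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in list(self.handlers[topic]):
            handler(payload)


class Agent:
    def __init__(self, bus, last_status):
        self.bus = bus
        self.last_status = last_status
        # New part of the contract: re-send the status when asked.
        bus.subscribe(INIT_TOPIC, lambda _payload: self.report())

    def start(self):
        # Existing behaviour: report the last operation status on startup.
        self.report()

    def report(self):
        self.bus.publish(STATUS_TOPIC, self.last_status)


class Mapper:
    def __init__(self, bus):
        self.bus = bus
        self.seen = []

    def start(self):
        self.bus.subscribe(STATUS_TOPIC, self.seen.append)
        # Ask the agent for its current state instead of relying on the
        # startup message, which may have been dropped.
        self.bus.publish(INIT_TOPIC, "")


def run_demo():
    bus = LossyBus()
    agent = Agent(bus, "SUCCESSFUL")
    mapper = Mapper(bus)
    agent.start()                # mapper not started yet: message is lost
    missed = list(mapper.seen)   # still empty at this point
    mapper.start()               # trigger makes the agent report again
    return missed, mapper.seen
```

Here `run_demo()` returns `([], ["SUCCESSFUL"])`: the startup message is lost, and the trigger recovers the status. The mapper subscribes before publishing the trigger so that the reply cannot be missed.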

In the future, this trigger can be extended to make the tedge-agent send all kinds of init messages, such as supported operations and the current software list. This will let any external plugin or mapper get a summary of the agent's current status with a simple trigger.

reubenmiller commented 1 year ago

Closing this ticket, as Debian buster and earlier will not get the updated Mosquitto versions with this bug fix now that they have moved to oldstable (and below) status. Debian bullseye is the current stable version, and Mosquitto 2.0.11 is the publicly available version.

However, if you would like to install a newer version of Mosquitto that includes the bug fix, you can use the Debian repositories provided directly by the Mosquitto project.