Closed: reubenmiller closed this issue 1 year ago
~~Very closely related to a similar issue #1934~~
The symptom (long service restart times) is the same as #1934.

> Very closely related to a similar issue #1934
Not so similar, even if the effect is the same.
The actor itself handles shutdown requests, but there is a blocking timeout in the message handler itself:

```rust
tokio::time::timeout(Duration::from_secs(10), mqtt_con.received.next()).await
```
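One way to make such a wait responsive is to deliver shutdown requests through the same mailbox the handler is already waiting on, so a pending shutdown is observed immediately rather than after the blocking timeout expires. In the real actor this would be done with `tokio::select!` over the shutdown channel and the timed receive; the shape of the idea can be sketched with std-only primitives (all names below are illustrative, not the actual thin-edge.io APIs):

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError};
use std::time::Duration;

// Illustrative mailbox events; the real actor has its own message types.
enum Event {
    Message(String),
    Shutdown,
}

/// Wait for the next event, returning early when a shutdown request
/// arrives on the mailbox, instead of always blocking the full timeout.
fn next_event(rx: &Receiver<Event>, timeout: Duration) -> Option<Event> {
    match rx.recv_timeout(timeout) {
        Ok(event) => Some(event),
        Err(RecvTimeoutError::Timeout) => None,
        // A closed mailbox is treated as an implicit shutdown request.
        Err(RecvTimeoutError::Disconnected) => Some(Event::Shutdown),
    }
}

fn main() {
    let (tx, rx) = channel();
    tx.send(Event::Message("mqtt payload".into())).unwrap();
    tx.send(Event::Shutdown).unwrap();

    while let Some(event) = next_event(&rx, Duration::from_secs(10)) {
        match event {
            Event::Message(payload) => println!("handling {payload}"),
            Event::Shutdown => {
                println!("shutting down promptly");
                break;
            }
        }
    }
}
```

The key property is that the shutdown case is just another branch of the same wait, so it cannot be starved by the 10-second message timeout.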
A duplicate has also been introduced and there are now two JWT handlers :-(
The issue is present in both.
So this is in fact more subtle, and it is due to an old friend.
The c8y mapper needs to know the internal id of the device and loops until this internal id is fetched from Cumulocity. This internal id is required only for some operations (notably to post the software list).
Currently, `Runtime::Shutdown` requests are ignored during this loop.
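A minimal shape for a fix is to check for a pending shutdown request between retry attempts instead of looping unconditionally until the id is fetched. This is only a sketch under assumed names (`Init`, `fetch_internal_id` and the closure standing in for the Cumulocity lookup are all hypothetical):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Outcome of the mapper's init loop (illustrative names).
#[derive(Debug, PartialEq)]
enum Init {
    Ready(String),
    Aborted,
}

/// Retry fetching the internal id, but bail out as soon as a shutdown
/// has been requested. `fetch` stands in for the lookup by external id,
/// returning `None` while the id is not yet available.
fn fetch_internal_id(
    shutdown: &AtomicBool,
    mut fetch: impl FnMut() -> Option<String>,
) -> Init {
    loop {
        // Check for a pending shutdown between attempts, so that
        // shutdown requests are no longer ignored by the loop.
        if shutdown.load(Ordering::Relaxed) {
            return Init::Aborted;
        }
        if let Some(id) = fetch() {
            return Init::Ready(id);
        }
        // The real mapper would also delay between attempts, with the
        // delay itself raced against the shutdown signal.
    }
}

fn main() {
    let shutdown = AtomicBool::new(false);
    let mut attempts = 0;
    let result = fetch_internal_id(&shutdown, || {
        attempts += 1;
        if attempts < 3 { None } else { Some("12345".into()) }
    });
    assert_eq!(result, Init::Ready("12345".into()));

    let shutdown = AtomicBool::new(true);
    assert_eq!(fetch_internal_id(&shutdown, || None), Init::Aborted);
    println!("init loop honours shutdown requests");
}
```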
This can be fixed using different approaches:

- Store on disk the internal id to avoid having to request it each time the mapper is launched.

@reubenmiller what would be your preference?
The last one please. Though in addition, there should be some invalidation logic when using the internal id. For example, if any of the Cumulocity HTTP requests which use the internal id return a 403 or 404 HTTP status code, then the internal id should be discarded and looked up again using the external id.
The following HTTP response status codes can occur for the following reasons:

- 404 (Not Found): Occurs if the managed object is deleted in Cumulocity (this actually occurs more often than you think, especially in development).
- 403 (Permission denied): The device is no longer the owner of the managed object. This can happen if someone has been changing the owner of the managed object, or moved the external id from one managed object to another (migration scenario).
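The invalidation rule above can be sketched as a small cache type (illustrative names only; the real mapper issues HTTP requests to Cumulocity rather than receiving bare status codes):

```rust
/// Cached internal id with the invalidation rule described above.
struct InternalIdCache {
    id: Option<String>,
}

impl InternalIdCache {
    /// React to the status of an HTTP request that used the internal id.
    /// 404: the managed object was deleted in Cumulocity.
    /// 403: the device is no longer the owner, or the external id moved.
    /// In either case the cached id is stale and must be discarded,
    /// forcing a fresh lookup by external id.
    fn on_response(&mut self, status: u16) {
        if status == 403 || status == 404 {
            self.id = None;
        }
    }
}

fn main() {
    let mut cache = InternalIdCache { id: Some("12345".into()) };
    cache.on_response(200);
    assert!(cache.id.is_some()); // still considered valid
    cache.on_response(404);
    assert!(cache.id.is_none()); // discarded, re-lookup needed
    println!("stale internal id invalidated");
}
```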
Tested and it is ok
Describe the bug

When the `tedge-mapper-c8y` is trying to get a Cumulocity IoT JWT (token) but is having issues, it waits 60 seconds and tries again. It looks like the "wait 60 seconds" logic (or the JWT retriever actor) does not subscribe to the shutdown signal. This is visible because when restarting the service via systemd (e.g. `systemctl restart tedge-mapper-c8y`), it takes the service ~30 seconds to restart.

The actor runtime at least handles this "unresponsiveness" as shown by the log snippet below. However, in normal scenarios (e.g. when the JWT token retry logic is running), the `tedge-mapper-c8y` should not produce `Runtime: Timeout` errors.

To Reproduce
Assuming you have an already running thin-edge.io setup (including a connection to Cumulocity IoT):
1. Modify the `/etc/tedge/mosquitto-conf/c8y-bridge.conf` bridge settings to create invalid settings (though then you will need to restart `mosquitto` using `systemctl restart mosquitto`)
2. Restart `tedge-mapper-c8y` using `systemctl restart tedge-mapper-c8y` (if the restart takes < 2 seconds, then try restarting the service again)

Expected behavior
Screenshots
The journalctl logs from the `tedge-mapper-c8y` service are shown below. The snippet shows the lead-up to executing the `systemctl restart tedge-mapper-c8y` command, and after the service restarts.

Environment (please complete the following information):
Additional context