thin-edge / thin-edge.io

The open edge framework for lightweight IoT devices
https://thin-edge.io
Apache License 2.0
221 stars 54 forks source link

tedge-mapper-c8y does not shutdown cleanly if JWT retriever is retrying #1962

Closed reubenmiller closed 1 year ago

reubenmiller commented 1 year ago

Describe the bug

When the tedge-mapper-c8y is trying to get a Cumulocity IoT JWT (token) but is having issues, it waits 60 seconds and tries again. It looks like the "wait 60 seconds" logic (or the JWT retriever actor) does not subscribe to the shutdown signal. This is visible because when restarting the service via systemd (e.g. `systemctl restart tedge-mapper-c8y), it takes the service ~30 seconds to restart.

The actor runtime at least handles this "unresponsiveness" as shown be the log snippet below, however the tedge-mapper-c8y in normal scenarios (e.g. when the JWT token retry logic is running) the Runtime: Timeout errors should not occur.

2023-05-11T06:28:17.213441502Z ERROR Runtime: Timeout waiting for all actors to shutdown

To Reproduce

Assuming you have an already running thin-edge.io setup (including a connection to Cumulocity IoT):

  1. Disconnect the network (or alternatively, edit the /etc/tedge/mosquitto-conf/c8y-bridge.conf bridge settings to create invalid settings, though then you will need to restart mosquitto using systemctl restart mosquitto)
  2. Restart tedge-mapper-c8y using systemctl restart tedge-mapper-c8y (if the restart takes < 2 seconds, then try restarting the service again)

Expected behavior

Screenshots

The journalctl logs from the tedge-mapper-c8y service are shown below. The snippet shows the lead up to executing the systemctl restart tedge-mapper-c8y command, and after the service restarts

May 11 06:26:30 d1d433801948 tedge-mapper[87]: 2023-05-11T06:26:30.844659577Z  INFO C8Y-REST => JWT: send ()
May 11 06:26:30 d1d433801948 tedge-mapper[87]: 2023-05-11T06:26:30.844693577Z  INFO C8YJwtRetriever: recv Some((0, ()))
May 11 06:26:40 d1d433801948 tedge-mapper[87]: 2023-05-11T06:26:40.84598579Z  INFO C8YJwtRetriever: send (0, Err(NoResponse))
May 11 06:26:40 d1d433801948 tedge-mapper[87]: 2023-05-11T06:26:40.846059123Z  INFO C8Y-REST => JWT: recv Some(Err(NoResponse))
May 11 06:26:40 d1d433801948 tedge-mapper[87]: 2023-05-11T06:26:40.846067915Z ERROR c8y_http_proxy::actor: An error occurred while retrieving internal Id, operation will retry in 60 seconds
May 11 06:26:40 d1d433801948 tedge-mapper[87]:  Error: CustomError("JWT token not available")
May 11 06:27:17 d1d433801948 systemd[1]: Stopping tedge-mapper-c8y converts Thin Edge JSON measurements to Cumulocity JSON format....
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.211797724Z  INFO Signal-Handler: send Shutdown
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.211839224Z  INFO Runtime: Shutting down
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.211851224Z  INFO Runtime: Successfully sent shutdown request to MQTT-2
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.211866724Z  INFO Runtime: Successfully sent shutdown request to CumulocityMapper-8
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.211870349Z  INFO Runtime: Successfully sent shutdown request to MQTT-9
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.21187339Z  INFO Runtime: Successfully sent shutdown request to Signal-Handler-0
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.211875932Z  INFO Runtime: Successfully sent shutdown request to C8YJwtRetriever-3
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.211888307Z  INFO Runtime: Successfully sent shutdown request to C8Y-REST-5
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.211903432Z  INFO Runtime: Successfully sent shutdown request to HealthMonitorActor-1
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.211907599Z  INFO Runtime: Successfully sent shutdown request to Timer-7
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.211910474Z  INFO Runtime: Successfully sent shutdown request to FsWatcher-6
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.211912765Z  INFO Runtime: Successfully sent shutdown request to HTTP-4
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.211927307Z  INFO HTTP: recv None
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.212101224Z  INFO Runtime: Actor has finished: HTTP-4
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.212158849Z  INFO MQTT: recv Err(Shutdown)
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.212181807Z  INFO Runtime: Actor has finished: MQTT-9
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.212195474Z  INFO tedge-mapper-c8y: recv None
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.212221057Z ERROR Runtime: Actor FsWatcher-6 has finished unsuccessfully: ChannelError(ReceiveError)
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.21226839Z  INFO Signal-Handler: recv None
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.212280224Z  INFO C8YJwtRetriever: recv None
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.212288474Z  INFO Runtime: Actor has finished: C8YJwtRetriever-3
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.212344474Z  INFO Runtime: Actor has finished: HealthMonitorActor-1
May 11 06:27:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:17.212381807Z  INFO Runtime: Actor has finished: Signal-Handler-0
May 11 06:27:40 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:40.846804943Z  INFO C8Y-REST => JWT: send ()
May 11 06:27:40 d1d433801948 tedge-mapper[87]: 2023-05-11T06:27:40.846840902Z ERROR c8y_http_proxy::actor: An error occurred while retrieving internal Id, operation will retry in 60 seconds
May 11 06:27:40 d1d433801948 tedge-mapper[87]:  Error: FromChannel(SendError(SendError { kind: Disconnected }))
May 11 06:28:17 d1d433801948 tedge-mapper[87]: 2023-05-11T06:28:17.213441502Z ERROR Runtime: Timeout waiting for all actors to shutdown
May 11 06:28:17 d1d433801948 systemd[1]: tedge-mapper-c8y.service: Succeeded.
May 11 06:28:17 d1d433801948 systemd[1]: Stopped tedge-mapper-c8y converts Thin Edge JSON measurements to Cumulocity JSON format..
May 11 06:28:17 d1d433801948 systemd[1]: Started tedge-mapper-c8y converts Thin Edge JSON measurements to Cumulocity JSON format..
May 11 06:28:17 d1d433801948 tedge-mapper[7211]: 2023-05-11T06:28:17.223618836Z  INFO flockfile::unix: Lockfile created "/run/lock/tedge-mapper-c8y.lock"
May 11 06:28:17 d1d433801948 tedge-mapper[7211]: 2023-05-11T06:28:17.223785419Z  INFO Runtime: Started
May 11 06:28:17 d1d433801948 tedge-mapper[7211]: 2023-05-11T06:28:17.223805544Z  INFO Runtime: Running Signal-Handler-0
May 11 06:28:17 d1d433801948 tedge-mapper[7211]: 2023-05-11T06:28:17.223815044Z  INFO Runtime: Running HealthMonitorActor-1
May 11 06:28:17 d1d433801948 tedge-mapper[7211]: 2023-05-11T06:28:17.223856752Z  INFO tedge-mapper-c8y: send Message { topic: Topic { name: "tedge/health/tedge-mapper-c8y" }, payload: "{"pid":7211,"status":"up","time":1683786497}", qos: AtLeastOnce, retain: true }
May 11 06:28:17 d1d433801948 tedge-mapper[7211]: 2023-05-11T06:28:17.227124086Z  INFO Runtime: Running MQTT-2
May 11 06:28:17 d1d433801948 tedge-mapper[7211]: 2023-05-11T06:28:17.227154544Z  INFO Runtime: Running C8YJwtRetriever-3
May 11 06:28:17 d1d433801948 tedge-mapper[7211]: 2023-05-11T06:28:17.227160127Z  INFO Runtime: Running HTTP-4
May 11 06:28:17 d1d433801948 tedge-mapper[7211]: 2023-05-11T06:28:17.227184586Z  INFO Runtime: Running C8Y-REST-5
May 11 06:28:17 d1d433801948 tedge-mapper[7211]: 2023-05-11T06:28:17.227191627Z  INFO Runtime: Running FsWatcher-6
May 11 06:28:17 d1d433801948 tedge-mapper[7211]: 2023-05-11T06:28:17.227197002Z  INFO Runtime: Running Timer-7
May 11 06:28:17 d1d433801948 tedge-mapper[7211]: 2023-05-11T06:28:17.227477002Z  INFO C8Y-REST: start initialisation
May 11 06:28:17 d1d433801948 tedge-mapper[7211]: 2023-05-11T06:28:17.227520961Z  INFO C8Y-REST => JWT: send ()
May 11 06:28:17 d1d433801948 tedge-mapper[7211]: 2023-05-11T06:28:17.227540169Z  INFO C8YJwtRetriever: recv Some((0, ()))
May 11 06:28:17 d1d433801948 tedge-mapper[7211]: 2023-05-11T06:28:17.228134002Z  INFO Runtime: Running CumulocityMapper-8
May 11 06:28:17 d1d433801948 tedge-mapper[7211]: 2023-05-11T06:28:17.228144336Z  INFO Runtime: Running MQTT-9
May 11 06:28:17 d1d433801948 tedge-mapper[7211]: 2023-05-11T06:28:17.229476294Z  INFO MQTT: recv Ok(Some(Message { topic: Topic { name: "tedge/health/tedge-mapper-c8y" }, payload: "{"pid":7211,"status":"up","time":1683786497}", qos: AtLeastOnce, retain: true }))

Environment (please complete the following information):

Additional context

reubenmiller commented 1 year ago

~Very closely related to a similar issue #1934~

The symptom (long service restart times) is the same as #1934

didier-wenzek commented 1 year ago

Very closely related to a similar issue #1934

Not so similar, even if the effect is the same.

The actor itself handle shutdown requests but there is a blocking timeout in the message handler itself: tokio::time::timeout(Duration::from_secs(10), mqtt_con.received.next()).await

A duplicate has also been introduced and there are now two JWT handlers :-(

The issue is present in both.

didier-wenzek commented 1 year ago

So, this is in fact more subtle and this is due to an old friend.

The c8y mapper needs to know the internal id of the device and loop till this internal id is fetched from Cumulocity. This internal id is required only for some operation (notably to post the software list).

Currently Runtime::Shutdown requests are ignored during this loop.

This can be fixed using different approaches:

@reubenmiller what would be your preference?

reubenmiller commented 1 year ago
  • Store on disk the internal id to avoid having to request it each time the mapper is launched.

The last one please. Though in addition there should be some invalidation logic when using the internal-id code. For example if any of the Cumulocity HTTP requests which use the internal-id return; 403 or 404 HTTP status codes, then the internal id should be discarded and the internal-id should be looked up again using the external id.

The following HTTP response status codes can occur for the following reasons:

404 (Not Found): Occurs if the managed object is deleted in Cumulocity (this actually occurs more often than you think, especially in development). 403 (Permission denied): The device is no longer the owner of the managed object, this can happen if someone has been changing the owner of the managed object, or moved the external id from one managed object to another (migration scenario).

gligorisaev commented 1 year ago

Tested and it is ok