thin-edge / thin-edge.io

The open edge framework for lightweight IoT devices
https://thin-edge.io
Apache License 2.0
221 stars 54 forks source link

flaky test: Update tedge version from base to current using Cumulocity #2623

Closed reubenmiller closed 8 months ago

reubenmiller commented 9 months ago

Describe the bug

There is a flaky system tests:

Failed runs are seen here:

To Reproduce

Expected behavior

Screenshots

Environment (please complete the following information):

Additional context

reubenmiller commented 9 months ago

After investigating, this is a bug in the tedge-agent (not the test). It is related to https://github.com/thin-edge/thin-edge.io/issues/2501

The following contains log entries which show the missing 503 message which results in the software update operation remaining in the EXECUTING state after the agent has performed the update.

Case: Failed

In the failed case, the 503,c8y_SoftwareUpdate message is not sent before the tedge-agent process exits, and the process does not attempt to send a message when it starts up.

full failed.log

Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.381178295Z  INFO tedge_agent::software_manager::actor: Checking if tedge got self updated
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.381197381Z  INFO tedge_agent::software_manager::actor: Current running version: 0.0.1
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.38183023Z  INFO tedge_agent::software_manager::actor: Installed binary version: 1.0.0-rc.2~124+ge534ed3
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.381843875Z  INFO tedge_agent::software_manager::actor: Self update detected. Requesting shutdown...
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.381848895Z ERROR tedge_agent::software_manager::actor: Tedge-agent is no more running the latest-version => a restart is required
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.381893579Z ERROR Runtime: Actor SoftwareManagerActor-11 has finished unsuccessfully: ActorError(NotRunningLatestVersion)
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.381903047Z  INFO Runtime: Shutting down on error: Tedge-agent is not running the latest version
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.381959553Z  INFO Runtime: Actor has finished: FsWatcher-5
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.382056465Z  INFO tedge_agent::file_transfer_server::actor: Shutdown
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.382100728Z  INFO mqtt_channel::connection: MQTT connection closed
Jan 25 08:33:27 03d75d9284e7 mosquitto[1185]: 1706171607: Received PUBLISH from tedge-agent#te/device/main// (d0, q1, r1, m17, 'te/device/main/service/tedge-agent/status/health', ... (27 bytes))
Jan 25 08:33:27 03d75d9284e7 mosquitto[1185]: 1706171607: Sending PUBLISH to tedge-mapper-c8y (d0, q1, r0, m64, 'te/device/main/service/tedge-agent/status/health', ... (27 bytes))
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.382205815Z  INFO Runtime: Actor has finished: MQTT-4
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.382216245Z  INFO Runtime: Actor has finished: Signal-Handler-3
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.382222677Z  INFO Runtime: Actor has finished: HealthMonitorActor-14
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.382227977Z  INFO Runtime: Actor has finished: TedgetoTeConverter-0
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.382233046Z  INFO Runtime: Actor has finished: LogManager-9
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.382244037Z  INFO Runtime: Actor has finished: Downloader-6
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.382249287Z  INFO Runtime: Actor has finished: RestartManagerActor-10
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.382254286Z  INFO Runtime: Actor has finished: ConfigManager-8
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.382259045Z  INFO Runtime: Actor has finished: Script-12
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.382264005Z  INFO Runtime: Actor has finished: Uploader-7
Jan 25 08:33:27 03d75d9284e7 mosquitto[1185]: 1706171607: Sending PUBLISH to auto-65FF3E37-8DCB-5F97-81E3-35C7AFC6ADB9 (d0, q0, r0, m0, 'te/device/main/service/tedge-agent/status/health', ... (27 bytes))
Jan 25 08:33:27 03d75d9284e7 mqtt-logger[428]: {"timestamp":1706171607.382505558,"message":{"tst":"2024-01-25T08:33:27.382522+0000","topic":"te/device/main/service/tedge-agent/status/health","qos":0,"retain":0,"payloadlen":27,"payload":"{\"pid\":518,\"status\":\"down\"}"},"payload_hex":"7b22706964223a3531382c22737461747573223a22646f776e227d"}
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.382268823Z  INFO Runtime: Actor has finished: HttpFileTransferServer-1
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.382273983Z  INFO Runtime: Actor has finished: FileCacheActor-2
Jan 25 08:33:27 03d75d9284e7 mosquitto[1185]: 1706171607: Sending PUBACK to tedge-agent#te/device/main// (m17, rc0)
Jan 25 08:33:27 03d75d9284e7 mqtt-logger[428]: {"timestamp":1706171607.382617468,"message":{"tst":"2024-01-25T08:33:27.382626+0000","topic":"c8y/s/us/TST_concentrate_warm_wave:device:main:service:tedge-agent","qos":0,"retain":0,"payloadlen":8,"payload":"104,down"},"payload_hex":"3130342c646f776e"}
Jan 25 08:33:27 03d75d9284e7 mosquitto[1185]: 1706171607: Received DISCONNECT from tedge-agent#te/device/main//
Jan 25 08:33:27 03d75d9284e7 mosquitto[1185]: 1706171607: Client tedge-agent#te/device/main// disconnected.
Jan 25 08:33:27 03d75d9284e7 mosquitto[1185]: 1706171607: Received PUBACK from tedge-mapper-c8y (Mid: 64, RC:0)
Jan 25 08:33:27 03d75d9284e7 mosquitto[1185]: 1706171607: Received PUBLISH from tedge-mapper-c8y (d0, q1, r0, m24, 'c8y/s/us/TST_concentrate_warm_wave:device:main:service:tedge-agent', ... (8 bytes))
Jan 25 08:33:27 03d75d9284e7 mosquitto[1185]: 1706171607: Sending PUBLISH to Cumulocity (d0, q1, r0, m71, 's/us/TST_concentrate_warm_wave:device:main:service:tedge-agent', ... (8 bytes))
Jan 25 08:33:27 03d75d9284e7 mosquitto[1185]: 1706171607: Sending PUBLISH to auto-65FF3E37-8DCB-5F97-81E3-35C7AFC6ADB9 (d0, q0, r0, m0, 'c8y/s/us/TST_concentrate_warm_wave:device:main:service:tedge-agent', ... (8 bytes))
Jan 25 08:33:27 03d75d9284e7 mosquitto[1185]: 1706171607: Sending PUBACK to tedge-mapper-c8y (m24, rc0)
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.409678889Z ERROR Runtime: Actor TedgeOperationConverter-13 has finished unsuccessfully: ChannelError(SendError(SendError { kind: Disconnected }))
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.409694438Z  INFO Runtime: All actors have finished
Jan 25 08:33:27 03d75d9284e7 tedge-agent[518]: 2024-01-25T08:33:27.409915012Z ERROR sm-agent: tedge_actors::runtime: Aborted due to Tedge-agent is not running the latest version
Jan 25 08:33:27 03d75d9284e7 systemd[1]: tedge-agent.service: Main process exited, code=exited, status=1/FAILURE
Jan 25 08:33:27 03d75d9284e7 systemd[1]: tedge-agent.service: Failed with result 'exit-code'.
Jan 25 08:33:27 03d75d9284e7 mosquitto[1185]: 1706171607: Received PUBACK from Cumulocity (Mid: 71, RC:0)
Jan 25 08:33:32 03d75d9284e7 systemd[1]: tedge-agent.service: Scheduled restart job, restart counter is at 1.

Case: Successful

In the succesful case, the 503,c8y_SoftwareUpdate message is being published before the tedge-agent process exits.

full success.log

Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.583704052Z  INFO tedge_agent::software_manager::actor: Current running version: 0.0.1
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Received PUBLISH from tedge-agent#te/device/main// (d0, q1, r1, m17, 'te/device/main///cmd/software_update/c8y-mapper-2024-01-25T16:10:05.977029922Z', ... (442 bytes))
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.623880614Z  INFO tedge_agent::tedge_operation_converter::actor: Waiting successful software_update operation to be cleared
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBLISH to tedge-mapper-c8y (d0, q1, r0, m63, 'te/device/main///cmd/software_update/c8y-mapper-2024-01-25T16:10:05.977029922Z', ... (442 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBLISH to tedge-agent#te/device/main// (d0, q1, r0, m11, 'te/device/main///cmd/software_update/c8y-mapper-2024-01-25T16:10:05.977029922Z', ... (442 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBLISH to auto-32D5F4C4-6D06-1993-F568-0E105D61F996 (d0, q0, r0, m0, 'te/device/main///cmd/software_update/c8y-mapper-2024-01-25T16:10:05.977029922Z', ... (442 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBACK to tedge-agent#te/device/main// (m17, rc0)
Jan 25 16:11:08 9036ddba237f mqtt-logger[623]: {"timestamp":1706199068.643524206,"message":{"tst":"2024-01-25T16:11:08.643678+0000","topic":"te/device/main///cmd/software_update/c8y-mapper-2024-01-25T16:10:05.977029922Z","qos":0,"retain":0,"payloadlen":442,"payload":"{\"status\":\"successful\",\"updateList\":[{\"modules\":[{\"action\":\"install\",\"name\":\"tedge\",\"version\":\"0.13.2~405+gb83d18b\"},{\"action\":\"install\",\"name\":\"tedge-mapper\",\"version\":\"0.13.2~405+gb83d18b\"},{\"action\":\"install\",\"name\":\"tedge-agent\",\"version\":\"0.13.2~405+gb83d18b\"},{\"action\":\"install\",\"name\":\"tedge-watchdog\",\"version\":\"0.13.2~405+gb83d18b\"},{\"action\":\"install\",\"name\":\"tedge-apt-plugin\",\"version\":\"0.13.2~405+gb83d18b\"}],\"type\":\"default\"}]}"},"payload_hex":"7b22737461747573223a227375636365737366756c222c227570646174654c697374223a5b7b226d6f64756c6573223a5b7b22616374696f6e223a22696e7374616c6c222c226e616d65223a227465646765222c2276657273696f6e223a22302e31332e327e3430352b6762383364313862227d2c7b22616374696f6e223a22696e7374616c6c222c226e616d65223a2274656467652d6d6170706572222c2276657273696f6e223a22302e31332e327e3430352b6762383364313862227d2c7b22616374696f6e223a22696e7374616c6c222c226e616d65223a2274656467652d6167656e74222c2276657273696f6e223a22302e31332e327e3430352b6762383364313862227d2c7b22616374696f6e223a22696e7374616c6c222c226e616d65223a2274656467652d7761746368646f67222c2276657273696f6e223a22302e31332e327e3430352b6762383364313862227d2c7b22616374696f6e223a22696e7374616c6c222c226e616d65223a2274656467652d6170742d706c7567696e222c2276657273696f6e223a22302e31332e327e3430352b6762383364313862227d5d2c2274797065223a2264656661756c74227d5d7d"}
Jan 25 16:11:08 9036ddba237f mqtt-logger[623]: {"timestamp":1706199068.647093299,"message":{"tst":"2024-01-25T16:11:08.647150+0000","topic":"c8y/s/us","qos":0,"retain":0,"payloadlen":22,"payload":"503,c8y_SoftwareUpdate"},"payload_hex":"3530332c6338795f536f667477617265557064617465"}
Jan 25 16:11:08 9036ddba237f mqtt-logger[623]: {"timestamp":1706199068.647259799,"message":{"tst":"2024-01-25T16:11:08.647278+0000","topic":"te/device/main///cmd/software_update/c8y-mapper-2024-01-25T16:10:05.977029922Z","qos":0,"retain":0,"payloadlen":0,"payload":null},"payload_hex":""}
Jan 25 16:11:08 9036ddba237f mqtt-logger[623]: {"timestamp":1706199068.647297716,"message":{"tst":"2024-01-25T16:11:08.647344+0000","topic":"te/device/main///cmd/software_list/c8y-mapper-2024-01-25T16:11:08.64443525Z","qos":0,"retain":0,"payloadlen":17,"payload":"{\"status\":\"init\"}"},"payload_hex":"7b22737461747573223a22696e6974227d"}
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Received PUBACK from tedge-mapper-c8y (Mid: 63, RC:0)
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Received PUBACK from tedge-agent#te/device/main// (Mid: 11, RC:0)
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Received PUBLISH from tedge-mapper-c8y (d0, q1, r0, m23, 'c8y/s/us', ... (22 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBLISH to Cumulocity (d0, q1, r0, m70, 's/us', ... (22 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBLISH to auto-32D5F4C4-6D06-1993-F568-0E105D61F996 (d0, q0, r0, m0, 'c8y/s/us', ... (22 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBACK to tedge-mapper-c8y (m23, rc0)
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Received PUBLISH from tedge-mapper-c8y (d0, q1, r1, m24, 'te/device/main///cmd/software_update/c8y-mapper-2024-01-25T16:10:05.977029922Z', ... (0 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBLISH to tedge-mapper-c8y (d0, q1, r0, m65, 'te/device/main///cmd/software_update/c8y-mapper-2024-01-25T16:10:05.977029922Z', ... (0 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBLISH to tedge-agent#te/device/main// (d0, q1, r0, m12, 'te/device/main///cmd/software_update/c8y-mapper-2024-01-25T16:10:05.977029922Z', ... (0 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBLISH to auto-32D5F4C4-6D06-1993-F568-0E105D61F996 (d0, q0, r0, m0, 'te/device/main///cmd/software_update/c8y-mapper-2024-01-25T16:10:05.977029922Z', ... (0 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBACK to tedge-mapper-c8y (m24, rc0)
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Received PUBLISH from tedge-mapper-c8y (d0, q1, r1, m25, 'te/device/main///cmd/software_list/c8y-mapper-2024-01-25T16:11:08.64443525Z', ... (17 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBLISH to tedge-mapper-c8y (d0, q1, r0, m67, 'te/device/main///cmd/software_list/c8y-mapper-2024-01-25T16:11:08.64443525Z', ... (17 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBLISH to tedge-agent#te/device/main// (d0, q1, r0, m13, 'te/device/main///cmd/software_list/c8y-mapper-2024-01-25T16:11:08.64443525Z', ... (17 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBLISH to auto-32D5F4C4-6D06-1993-F568-0E105D61F996 (d0, q0, r0, m0, 'te/device/main///cmd/software_list/c8y-mapper-2024-01-25T16:11:08.64443525Z', ... (17 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBACK to tedge-mapper-c8y (m25, rc0)
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Received PUBACK from tedge-agent#te/device/main// (Mid: 12, RC:0)
Jan 25 16:11:08 9036ddba237f mqtt-logger[623]: {"timestamp":1706199068.744399050,"message":{"tst":"2024-01-25T16:11:08.744445+0000","topic":"te/device/main/service/tedge-agent/status/health","qos":0,"retain":0,"payloadlen":27,"payload":"{\"pid\":714,\"status\":\"down\"}"},"payload_hex":"7b22706964223a3731342c22737461747573223a22646f776e227d"}
Jan 25 16:11:08 9036ddba237f mqtt-logger[623]: {"timestamp":1706199068.744492759,"message":{"tst":"2024-01-25T16:11:08.744497+0000","topic":"tedge/commands/req/software/list","qos":0,"retain":0,"payloadlen":49,"payload":"{\"id\":\"c8y-mapper-2024-01-25T16:11:08.64443525Z\"}"},"payload_hex":"7b226964223a226338792d6d61707065722d323032342d30312d32355431363a31313a30382e36343434333532355a227d"}
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.684253645Z  INFO tedge_agent::software_manager::actor: Installed binary version: 0.13.2~405+gb83d18b
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.684292478Z  INFO tedge_agent::software_manager::actor: Self update detected. Requesting shutdown...
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.684300728Z ERROR tedge_agent::software_manager::actor: Tedge-agent is no more running the latest-version => a restart is required
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.684424437Z ERROR Runtime: Actor SoftwareManagerActor-11 has finished unsuccessfully: ActorError(NotRunningLatestVersion)
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.684528895Z  INFO Runtime: Shutting down on error: Tedge-agent is not running the latest version
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.684645104Z  INFO Runtime: Actor has finished: FsWatcher-5
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.684850271Z  INFO tedge_agent::file_transfer_server::actor: Shutdown
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.684872188Z  INFO Runtime: Actor has finished: HttpFileTransferServer-1
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.684930272Z  INFO Runtime: Actor has finished: RestartManagerActor-10
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.685032897Z  INFO Runtime: Actor has finished: HealthMonitorActor-14
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.685053897Z  INFO Runtime: Actor has finished: FileCacheActor-2
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.685611565Z  INFO Runtime: Actor has finished: Signal-Handler-3
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.685648023Z  INFO Runtime: Actor has finished: Downloader-6
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.685671315Z  INFO Runtime: Actor has finished: Script-12
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.685690232Z  INFO Runtime: Actor has finished: LogManager-9
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Received PUBACK from tedge-agent#te/device/main// (Mid: 13, RC:0)
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.685710357Z  INFO Runtime: Actor has finished: Uploader-7
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.711210881Z  INFO Runtime: Actor has finished: TedgetoTeConverter-0
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.711261381Z  INFO Runtime: Actor has finished: ConfigManager-8
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.724579749Z  INFO Runtime: Actor has finished: MQTT-4
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.724754999Z  INFO mqtt_channel::connection: MQTT connection closed
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Received PUBACK from tedge-mapper-c8y (Mid: 65, RC:0)
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Received PUBACK from tedge-mapper-c8y (Mid: 67, RC:0)
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Received PUBACK from Cumulocity (Mid: 70, RC:0)
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Received PUBLISH from tedge-agent#te/device/main// (d0, q1, r1, m18, 'te/device/main/service/tedge-agent/status/health', ... (27 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBLISH to tedge-mapper-c8y (d0, q1, r0, m69, 'te/device/main/service/tedge-agent/status/health', ... (27 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBLISH to auto-32D5F4C4-6D06-1993-F568-0E105D61F996 (d0, q0, r0, m0, 'te/device/main/service/tedge-agent/status/health', ... (27 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBACK to tedge-agent#te/device/main// (m18, rc0)
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Received DISCONNECT from tedge-agent#te/device/main//
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Client tedge-agent#te/device/main// disconnected.
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Received PUBLISH from tedge-mapper-c8y (d0, q1, r0, m26, 'tedge/commands/req/software/list', ... (49 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBLISH to auto-32D5F4C4-6D06-1993-F568-0E105D61F996 (d0, q0, r0, m0, 'tedge/commands/req/software/list', ... (49 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBACK to tedge-mapper-c8y (m26, rc0)
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Received PUBACK from tedge-mapper-c8y (Mid: 69, RC:0)
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.750192399Z  INFO Runtime: Actor has finished: TedgeOperationConverter-13
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.750259232Z  INFO Runtime: All actors have finished
Jan 25 16:11:08 9036ddba237f tedge-agent[714]: 2024-01-25T16:11:08.750389399Z ERROR sm-agent: tedge_actors::runtime: Aborted due to Tedge-agent is not running the latest version
Jan 25 16:11:08 9036ddba237f systemd[1]: tedge-agent.service: Main process exited, code=exited, status=1/FAILURE
Jan 25 16:11:08 9036ddba237f systemd[1]: tedge-agent.service: Failed with result 'exit-code'.
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Received PUBLISH from tedge-mapper-c8y (d0, q1, r0, m27, 'c8y/s/us/TST_composite_selfish_aluminum:device:main:service:tedge-agent', ... (8 bytes))
Jan 25 16:11:08 9036ddba237f mqtt-logger[623]: {"timestamp":1706199068.773179833,"message":{"tst":"2024-01-25T16:11:08.773222+0000","topic":"c8y/s/us/TST_composite_selfish_aluminum:device:main:service:tedge-agent","qos":0,"retain":0,"payloadlen":8,"payload":"104,down"},"payload_hex":"3130342c646f776e"}
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBLISH to Cumulocity (d0, q1, r0, m71, 's/us/TST_composite_selfish_aluminum:device:main:service:tedge-agent', ... (8 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBLISH to auto-32D5F4C4-6D06-1993-F568-0E105D61F996 (d0, q0, r0, m0, 'c8y/s/us/TST_composite_selfish_aluminum:device:main:service:tedge-agent', ... (8 bytes))
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Sending PUBACK to tedge-mapper-c8y (m27, rc0)
Jan 25 16:11:08 9036ddba237f mosquitto[1380]: 1706199068: Received PUBACK from Cumulocity (Mid: 71, RC:0)
Jan 25 16:11:08 9036ddba237f systemd[1]: tedge-agent.service: Consumed 13.648s CPU time.
Jan 25 16:11:13 9036ddba237f systemd[1]: tedge-agent.service: Scheduled restart job, restart counter is at 1.
didier-wenzek commented 9 months ago

The issue is due to a race condition between actors.

  1. When a software update operation has been finalized, the software_manager checks the tedge-agent version and triggers a process restart if a new version have been installed.
  2. The very last step of a software update operation, has been for the software_manager actor to forward the operation status to the tedge_operation converter which publishes this status over MQTT, the recipient being the c8y mapper.

Once the operation status has been sent by the software_manager to the tedge_operation converter, both threads are executed concurrently. The bug happens when the shutdown is effective before the status got a chance to be published over MQTT (being in the pending queue of the tedge_operation converter or of the tedge_mqtt_ext actor.

A proper fix would be to trigger the agent restart only when received the acknowledgment that the operation status has been actually published over MQTT. However, the current design / code base is not equipped to handle that easily.

A way to mitigate the issue is, on shutdown of the MQTT actor, to publish all the messages already queued for publishing. However, this is not enough as the race condition can occur only between the software_manager and the tedge_operation converter (this is the case in the failing case reported by this ticket).

A quick way to reduce the likelihood of the issue is to delay a bit the agent restart when a self-update has been detected.

gligorisaev commented 8 months ago

QA has thoroughly checked the bug and here are the results: