thin-edge / thin-edge.io

The open edge framework for lightweight IoT devices
https://thin-edge.io
Apache License 2.0

Improve error reporting on workflow definition error #3074

didier-wenzek closed this issue 1 month ago

didier-wenzek commented 2 months ago

Is your feature improvement request related to a problem? Please describe.

If for some reason tedge-agent rejects a workflow definition, the rejection happens on start and the error is properly reported in the agent log:

2024-08-21T13:51:20.195143529Z ERROR sm-agent: tedge_agent::operation_workflows: Ignoring operation workflow definition from "/etc/tedge/operations/firmware_update.toml": Parsing TOML content

Caused by:
    Unknown action: restart

The commands for that operation (in this case firmware_update) are then ignored. This can be seen in the agent log:

2024-08-21T14:17:11.075445799Z  INFO tedge_agent::operation_workflows::actor: Ignoring firmware_update operation which is not registered

So far so good. However, the error is not reported to the clients, and the operation stays in the pending state on Cumulocity forever, with no clue for the end user.

Describe the solution you'd like

When a workflow definition for an operation has been given to the agent but rejected because it is ill-formed, the agent must not ignore the commands for that operation but fail them, reporting the root cause.
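
To make the requested behavior concrete, here is a minimal sketch in Python. It is not the agent's actual implementation (tedge-agent is written in Rust), and every name in it (`WorkflowRegistry`, `InvalidWorkflow`, `KNOWN_ACTIONS`, the shape of the returned status) is a hypothetical stand-in. The idea is only that a rejected definition is remembered together with its error, so that a later command for that operation can be failed with the root cause instead of being silently ignored.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

# Illustrative subset of actions; an assumption, not the agent's real list.
KNOWN_ACTIONS = {"proceed", "cleanup"}


@dataclass
class Workflow:
    """Stand-in for a definition that was accepted."""
    operation: str


@dataclass
class InvalidWorkflow:
    """Keeps the rejection reason so it can be reported on each command."""
    operation: str
    source: str
    reason: str


class WorkflowRegistry:
    def __init__(self) -> None:
        self._entries: Dict[str, Union[Workflow, InvalidWorkflow]] = {}

    def register(self, operation: str, source: str, actions: List[str]) -> None:
        """Register a definition; keep the error rather than dropping the operation."""
        unknown = [action for action in actions if action not in KNOWN_ACTIONS]
        if unknown:
            self._entries[operation] = InvalidWorkflow(
                operation, source, f"Unknown action: {unknown[0]}"
            )
        else:
            self._entries[operation] = Workflow(operation)

    def handle_command(self, operation: str) -> Optional[dict]:
        """Return the next command status, or None if the operation is unknown."""
        entry = self._entries.get(operation)
        if entry is None:
            # No definition was ever provided: nothing to report.
            return None
        if isinstance(entry, InvalidWorkflow):
            # A definition was provided but rejected: fail with the root cause.
            return {
                "status": "failed",
                "reason": f"Invalid workflow definition {entry.source}: {entry.reason}",
            }
        return {"status": "executing"}
```

With such a registry, registering firmware_update from /etc/tedge/operations/firmware_update.toml with an unknown `restart` action and then receiving a firmware_update command would produce a failed status whose reason mentions `Unknown action: restart`, which is what the client needs to see.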

Describe alternatives you've considered

Another approach would be to let tedge-agent fail hard when parsing an ill-formed workflow definition. However, this would be far too fragile. It is better for the agent to run in a degraded mode. For instance, the Unknown action: restart error above was caused not by a user mistake but by an agent upgrade, the new version using a new syntax.

Additional context

gligorisaev commented 2 months ago

Reviewed https://github.com/thin-edge/thin-edge.io/pull/3079 and the test included in that PR, and also checked for flakiness. The improvement seems to be implemented as described.