thin-edge / thin-edge.io

The open edge framework for lightweight IoT devices
https://thin-edge.io
Apache License 2.0
Support Cumulocity IoT c8y_Firmware operation handling for child devices #1696

Closed reubenmiller closed 1 year ago

reubenmiller commented 1 year ago

Is your feature request related to a problem? Please describe.

There is no mechanism currently available to support firmware update operations on child devices.

Currently a custom operation handler can be written for the main device, however the child devices do not have such support.

Describe the solution you'd like

Disclaimer: The implementation focuses on support for the Cumulocity IoT c8y_Firmware operation for child devices only!

The support for the c8y_Firmware operation for child devices should follow a flow very similar to the configuration management for child devices, with the addition of sending one extra SmartREST message via MQTT before transitioning the operation to SUCCESSFUL.

For the initial implementation, the firmware operation handler should be implemented as a new service called c8y-firmware-operation. This is subject to change in the future after the refactoring ticket is complete.

The child device feature should be activated by:

The flow is:

  1. Receive the firmware SmartREST message: 515,myChild-1,ubuntu core,20.04.3,http://test.com, where it follows the schema of 515,{device-id|child-id},{firmware_name},{firmware_version},{firmware_url}

  2. Download the firmware from the url, and make it available via the local http server (the same one used by the c8y-configuration-plugin). The file should be stored in a local file cache location (outside of direct view of the http server, e.g. /var/tedge/cache). A symlink should be created under the child device structure (under /var/tedge/file-transfer) which links it to the file-cache location.

  3. Publish an MQTT message on the topic tedge/${CHILD_SN}/commands/req/firmware_update, with the following payload:

    {
        "id": "{request_id}",
        "attempt": 1,                                           // Attempt number starting from 1
        "name": "{firmware_name}",
        "version": "{firmware_version}",
        "sha256": "{sha256_of_firmware_file}",
        "url":"http://${PARENT_IP}:${HTTP_PORT}/tedge/file-transfer/${child-id}/firmware_update/${file-cache-key}"
    }

    Note

    • ${file-cache-key} is the sha256 checksum of the .url string (as received from the server), e.g. http://test.com. This is used to uniquely identify if the file exists in the local file cache or not.
    • {request_id} is a unique identifier for the operation. The request id should be used in all corresponding replies from the child device connector

    The local firmware url should be the url where the file can be downloaded via the local http server. The local http server url should include the child id, however it should just use symlinks to link the downloaded file to the applicable child device. If the file is to be applied to multiple child devices, then there will be 1 symlink per child device, and all the symlinks will reference the same file (which is stored in the file cache area, outside of the http server view). Using this structure makes it easier to write ACL rules for the HTTP server based on child id, as they can be purely URL path based. It also ensures that the same file is available to multiple child devices without copying it (reducing the disk space usage required).

  4. The child device connector should send the following optional message to indicate that the firmware operation is being processed by the child device. The MQTT message should follow this schema:

    Topic

    tedge/${CHILD_SN}/commands/res/firmware_update

    Payload

    {
        "id": "{request_id}",
        "status": "executing"
    }

    On receiving this message, tedge should send an MQTT message to the c8y/s/us/<child-id> topic with the payload 501,c8y_Firmware to indicate that the operation is being processed.

  5. The child device connector then sends either a success or failed message to indicate if the firmware operation was successful or not.

    If the child device connector sends a successful/failed message before tedge has received an executing message, then tedge should send the 501,c8y_Firmware MQTT message automatically. This is the same handling as what is already implemented in the child configuration management. The idea is to make the child connector implementation easier by reducing the number of mandatory messages whilst still allowing finer-grained control.

    If Successful

    When successful, the device connector should send the following message to the tedge/${CHILD_SN}/commands/res/firmware_update topic.

    {
        "id": "{request_id}",
        "status": "successful"
    }

    If Failed

    When failed, the device connector should send the following message to the tedge/${CHILD_SN}/commands/res/firmware_update topic.

    {
        "id": "{request_id}",
        "status": "failed",
        "reason": "Child device driver failed due to an unknown error"
    }

    Note

    If the child device connector sends back a status other than successful, failed or executing (i.e. an invalid status), then the operation should be treated as failed. The comparison of the status fragment should, however, be case insensitive to make the tedge handling more developer friendly.

  6. Depending on the response received in the previous step (from the child device connector), one of the following steps should be executed by the firmware plugin.

    1. If the operation was successful, then the following SmartREST message should be sent to indicate to Cumulocity IoT that the firmware name/version/url has now changed on the device

      c8y/s/us/<child-id> with a payload of 115,{firmware_name},{firmware_version},{firmware_url}. Note the firmware_url is the original url received from Cumulocity IoT, not the local url!

      Then transition the operation to successful by sending an MQTT message to the c8y/s/us/<child-id> topic with the payload 503,c8y_Firmware.

    2. If the operation was not successful, then transition the operation to failed without any additional MQTT messages.

      Send a message to the c8y/s/us/<child-id> topic with the payload 502,c8y_Firmware,{failure_reason}. The failure reason should be provided by the child device connector in the previous step under the reason property. If no reason is given then a default reason should be used, e.g. "unknown error. The child device connector did not specify the error reason", or something to that effect.

      Example payload

      502,c8y_Firmware,Failed to download the firmware artifact. Permission denied
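The mapping from a child device response to the outgoing SmartREST messages (steps 4 to 6) can be sketched roughly as follows. This is only an illustration under assumptions, not the actual thin-edge.io implementation: the function names, the `executing_sent` flag and the exact default reason text are hypothetical, and the 515 parsing is a plain split that ignores SmartREST/CSV quoting.

```python
import json

# Assumed wording; the ticket only asks for "something to that effect".
DEFAULT_REASON = "unknown error. The child device connector did not specify the error reason"


def parse_firmware_request(message: str) -> dict:
    """Split a 515 message into its fields (naive split, no CSV quoting)."""
    template, device_id, name, version, url = message.split(",", 4)
    if template != "515":
        raise ValueError(f"not a firmware request: {template}")
    return {"device_id": device_id, "name": name, "version": version, "url": url}


def smartrest_for_response(request: dict, response_payload: str, executing_sent: bool) -> list:
    """Return the (topic, payload) pairs to publish for one child response."""
    topic = f"c8y/s/us/{request['device_id']}"
    response = json.loads(response_payload)
    status = str(response.get("status", "")).lower()  # case-insensitive comparison
    messages = []
    if not executing_sent:
        # 501 is sent automatically if the child skipped the optional message.
        messages.append((topic, "501,c8y_Firmware"))
    if status == "executing":
        return messages
    if status == "successful":
        # Report the original (cloud) url, not the local file-transfer url.
        messages.append((topic, f"115,{request['name']},{request['version']},{request['url']}"))
        messages.append((topic, "503,c8y_Firmware"))
    else:
        # "failed" and any invalid status are both treated as a failure.
        reason = response.get("reason") or DEFAULT_REASON
        messages.append((topic, f"502,c8y_Firmware,{reason}"))
    return messages
```

A caller would track per request id whether 501 has already been published and pass that in as `executing_sent`.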

Configuration

The default timeout for the firmware update can be changed via the tedge.toml file

sudo tedge config set firmware.child.update.timeout <seconds>

# Example, change the timeout to 2 hours
sudo tedge config set firmware.child.update.timeout 7200

Additional constraints

Describe alternatives you've considered

No other alternative solution was considered, as the child device firmware operation support follows the same design/api as the configuration management support for child devices.

Additional context

The correct sequence of the Cumulocity IoT Firmware update support is detailed in the following link:

didier-wenzek commented 1 year ago

I have some questions.

Receive the firmware SmartREST message: 515,ubuntu core,20.04.3,http://test.com, where it follows the schema of 515,{firmware_name},{firmware_version},{firmware_url}

The child device id seems to be missing. Isn't it?

c8y/s/us with a payload of 115,{firmware_name},{firmware_version},{firmware_url}.

Again no child device id. Is this id required by Cumulocity? Is the firmware operation independent of the child device?

The timeout of the firmware operation should be configurable via the c8y_Firmware operation file. The default operation timeout should be 6 hours (as firmware operation generally can take longer to apply).

Is this 6 hours between "executing" and "success/failure" or is this 6 hours to get a first reaction from the child device?

Firmware artifact caching.

When is the file deleted? Do we try to avoid to download twice a firmware to be installed on 2 devices in a row?

reubenmiller commented 1 year ago

I have some questions.

Receive the firmware SmartREST message: 515,ubuntu core,20.04.3,http://test.com, where it follows the schema of 515,{firmware_name},{firmware_version},{firmware_url}

The child device id seems to be missing. Isn't it?

Yes, you're right, I forgot to put the child topic as c8y/s/us/<child-id>. I have corrected this in the original ticket description.

c8y/s/us with a payload of 115,{firmware_name},{firmware_version},{firmware_url}.

Again no child device id. Is this id required by Cumulocity? Is the firmware operation independent of the child device?

Yes, again that was a mistake on my part. I have fixed it to include the proper second field which is the target external id of the device, e.g. device-id or child-id.

The timeout of the firmware operation should be configurable via the c8y_Firmware operation file. The default operation timeout should be 6 hours (as firmware operation generally can take longer to apply).

Is this 6 hours between "executing" and "success/failure" or is this 6 hours to get a first reaction from the child device?

For the first implementation I would say that it is the timeout between any two child device communications, e.g. between the request and the initial 'set-to-executing' message, or between that message and the successful/failure message. The key would be to make the timeout configurable (though I am also open to having two different timeouts if necessary).
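One way to picture a timeout "between any child device communication" is an inactivity timer that is reset on every message from the child. The sketch below is only an assumption about how this could look; the class and method names are hypothetical, and the clock is injectable purely to make it testable.

```python
import time


class OperationTimer:
    """Inactivity timer for one firmware operation (hypothetical helper)."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen = clock()  # start counting when the request is sent

    def touch(self):
        """Call on every message from the child (e.g. "executing")."""
        self._last_seen = self._clock()

    def expired(self):
        """True once the child has been silent for the whole timeout."""
        return self._clock() - self._last_seen >= self._timeout
```

With this shape, a single configurable value covers both the wait for a first reaction and the wait for the final result; two separate timeouts would need two such timers.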

Firmware artifact caching.

When is the file deleted?

For first implementation I would not worry about deleting the artifact as a simple cronjob could be written to delete them after they are x days old. We would need a more robust artifact retention concept before we could implement a wider feature.

Do we try to avoid to download twice a firmware to be installed on 2 devices in a row?

Good question, yes we should avoid the same component downloading the same artifact twice for two child devices (since the cache is only checked for a completed download). Though we could postpone more advanced download/caching topics for a second phase (e.g. limit number of parallel clients downloading artifacts, automatic cache eviction etc.)
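As a rough sketch of avoiding a second download while the first is still in progress, a per-cache-key lock can serialise requests for the same artifact inside one process. Everything here is an assumption for illustration (the function name, the `.part` suffix and the `download` callback are hypothetical); it does not cover multiple processes or cache eviction.

```python
import threading
from pathlib import Path
from typing import Callable

_locks: dict = {}
_locks_guard = threading.Lock()


def fetch_once(cache_dir: Path, key: str, download: Callable[[Path], None]) -> Path:
    """Download the artifact for `key` at most once, even if requested concurrently."""
    target = cache_dir / key
    with _locks_guard:
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        if not target.exists():
            # Write to a temporary name so the cache only ever
            # contains completed downloads.
            partial = target.with_name(target.name + ".part")
            download(partial)
            partial.rename(target)
    return target
```

A second request for the same key waits on the lock and then finds the completed file in the cache instead of starting its own download.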

reubenmiller commented 1 year ago

After a discussion with @rina23q, the following proposal was made to control which files are exposed by the http server.

Proposal for the http file structure

File cache schema

/var/tedge/cache/${cache_key}
/var/tedge/file-transfer/${CHILD_ID}/firmware_update/${cache_key}

cache_key is a unique key identifying the file in the cache (e.g. the sha256 checksum of the url string)

Example

File cache structure

The following files are NOT directly accessible by the http server, they are only exposed via symlinks.

/var/tedge/cache/
|_ aaaaaaa
|_ bbbbbbb
|_ ccccccc

Example file-transfer symlink

/var/tedge/file-transfer/child01/firmware_update/aaaaaaa  (symlink to /var/tedge/cache/aaaaaaa)
/var/tedge/file-transfer/child02/firmware_update/aaaaaaa  (symlink to /var/tedge/cache/aaaaaaa)
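The same layout could be produced programmatically along these lines. This is a minimal sketch assuming the cache key is the sha256 of the url string, as described in the ticket; the function names are hypothetical and the directories are passed in rather than hard-coded.

```python
import hashlib
import os
from pathlib import Path


def cache_key(firmware_url: str) -> str:
    """sha256 of the url string as received from the server."""
    return hashlib.sha256(firmware_url.encode("utf-8")).hexdigest()


def expose_to_child(cache_dir: Path, transfer_dir: Path, child_id: str, firmware_url: str) -> Path:
    """Create <transfer_dir>/<child_id>/firmware_update/<key> -> <cache_dir>/<key>."""
    key = cache_key(firmware_url)
    target = cache_dir / key
    link = transfer_dir / child_id / "firmware_update" / key
    link.parent.mkdir(parents=True, exist_ok=True)
    if not link.is_symlink():
        os.symlink(target, link)  # one symlink per child, one shared file
    return link
```

For example, `expose_to_child(Path("/var/tedge/cache"), Path("/var/tedge/file-transfer"), "child01", url)` would yield a path of the form shown above, with the key being a 64-character hex digest rather than the shortened `aaaaaaa`.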

Try it out using a manual symlink

ln -s /var/tedge/cache/aaaaaaa  /var/tedge/file-transfer/child01/firmware_update/aaaaaaa

rina23q commented 1 year ago

Try it out using a manual symlink

ln -s /var/tedge/cache/aaaaaaa  /var/tedge/file-transfer/child01/firmware_update/aaaaaaa

And this approach should work 👍 I did a quick try. Could get the content of the original file via GET request.

reubenmiller commented 1 year ago

Try it out using a manual symlink

ln -s /var/tedge/cache/aaaaaaa  /var/tedge/file-transfer/child01/firmware_update/aaaaaaa

And this approach should work 👍 I did a quick try. Could get the content of the original file via GET request.

I've updated the ticket description to reflect this approach