thin-edge / thin-edge.io

The open edge framework for lightweight IoT devices
https://thin-edge.io
Apache License 2.0
221 stars 54 forks source link

registering commands on nested child devices results in duplicate devices when the tedge-mapper-c8y is restarted #2409

Closed reubenmiller closed 8 months ago

reubenmiller commented 11 months ago

Describe the bug

When known command is registered on a nested child device (e.g. a child of a child) via MQTT, then the definition of the child device changes to an immediate child device, when the tedge-mapper-c8y service is restarted.

To Reproduce

  1. Register an immediate child device

    tedge mqtt pub -r te/device/child-level1// '{"@type":"child-device","name":"child-level1"}'
  2. Register a nested child device (from the previous child device)

    tedge mqtt pub -r te/device/child-level2// '{"@type":"child-device","name":"child-level2","@parent":"device/child-level1//"}'
  3. Register an operation on the nested child device

    tedge mqtt pub -r te/device/child-level2///cmd/restart '{}'
  4. Verify in Cumulocity the following device hierarchy

    graph TD
        main --> child-level1 --> child-level2
  5. Restart the tedge-mapper-c8y

    sudo systemctl restart tedge-mapper-c8y
  6. Check the thin-edge.io registration message on the te/+/+/+/+ topic.

    tedge mqtt sub te/device/child-level2/+/+

    Notice that the '@parent' property is no longer part of the message for the nested child device.

    [te/device/child-level2//] {"@id":"ginger:device:child-level2","@type":"child-device","name":"child-level2"}

Expected behavior

The tedge-mapper-c8y should use the registration information on the MQTT broker over any information in the /etc/tedge/operations/c8y folder when building the entity store. This will ensure that it does not override existing registration messages with incorrect information.

Screenshots

Environment (please complete the following information):

Property Value
OS [incl. version] Debian GNU/Linux 12 (bookworm)
Hardware [incl. revision] Raspberry Pi 3 Model B Rev 1.2
System-Architecture Linux ginger 6.1.0-rpi4-rpi-v8 #1 SMP PREEMPT Debian 1:6.1.54-1+rpt2 (2023-10-05) aarch64 GNU/Linux
thin-edge.io version tedge 0.13.1~12+g5dcb430

Additional context

The root cause looks to be the mapper's interpretation of the operation files inside the /etc/tedge/operations/c8y folder which don't supported nested files.

On the first registration, the ginger:device:child-level2 folder is created under /etc/tedge/operations/c8y and the entity store knows that the ginger:device:child-level2 is a nested child devices of ginger:device:child-level1, however when the tedge-mapper-c8y is restarted, it reads the operation list, and generates registration messages which override the previous registration message.

Below is an example of the operations folder after the registration messages have been sent:

ls -l /etc/tedge/operations/c8y
total 16
-rw-r--r-- 1 tedge tedge    0 Oct 25 15:03 c8y_DownloadConfigFile
-rw-r--r-- 1 tedge tedge    0 Oct 25 15:04 c8y_LogfileRequest
-rw-r--r-- 1 tedge tedge   91 Nov  3 13:08 c8y_RemoteAccessConnect
-rw-r--r-- 1 tedge tedge    0 Oct 25 15:04 c8y_Restart
-rw-r--r-- 1 tedge tedge    0 Oct 25 15:04 c8y_SoftwareUpdate
-rw-r--r-- 1 tedge tedge    0 Oct 25 15:03 c8y_UploadConfigFile
drwxr-xr-x 2 tedge tedge 4096 Nov  3 14:19 ginger:device:child-level1
drwxr-xr-x 2 tedge tedge 4096 Nov  3 14:16 ginger:device:child-level2
albinsuresh commented 11 months ago

Disabling commands support for nested child devices was just a short term solution. Here is a proposal for a permanent solution, that re-enables command support for nested child devices as well:

  1. Do not use etc/tedge/operations/c8y directory for dynamic child device creation and just use it as a persistent store for supported operation files for all the registered entities. So, we don't introduce any new hierarchy support in this directory to accommodate nested child devices and services either. We keep the flat directory structure for all entities as the directory names(external IDs) are unique for every entity.
  2. Disable the inotify mechanism that generates the supported operations message (114), whenever this directory is updated. This will prevent the automatic creation of child devices in the cloud, with unexpected metadata. The 114 message will be sent only when a new capability message is received from an entity, by aggregating all the operations listed under its operations directory. Sending the 114 message as part of the capability message processing will ensure that the entity is registered first, with the right metadata, before its supported operations list is updated.
  3. Introduce a new shared directory to keep the custom operation definition files, in the same format as it is today. When a device declares that it supports that custom operation using the corresponding capability message (e.g: te/device/<device-id>///cmd/c8y_Command) with an empty payload, a symlink to the main supported operation definition file will be added to the corresponding device's operations directory.
  4. If a device wants to have its own operation file definition instead of reusing a pre-configured one, it should first upload its operation definition file to the file-transfer repository and then provide that URL in the capability message that it sends, so that the mapper can move that uploaded file into that device's operations directory (This might be an optional requirement and can be deferred for later).
didier-wenzek commented 11 months ago

Disabling commands support for nested child devices was just a short term solution. Here is a proposal for a permanent solution, that re-enables command support for nested child devices as well:

Points 1 and 2 make sense.

  1. Introduce a new shared directory to keep the custom operation definition files, in the same format as it is today. When a device declares that it supports that custom operation using the corresponding capability message (e.g: te/device/<device-id>///cmd/c8y_Command) with an empty payload, a symlink to the main supported operation definition file will be added to the corresponding device's operations directory.

When are these files used? What's their purpose?

  1. If a device wants to have its own operation file definition instead of reusing a pre-configured one, it should first upload its operation definition file to the file-transfer repository and then provide that URL in the capability message that it sends, so that the mapper can move that uploaded file into that device's operations directory (This might be an optional requirement and can be deferred for later).

Why an URL in the capability message? Why not directly the expected payload?

reubenmiller commented 10 months ago

Since https://github.com/thin-edge/thin-edge.io/pull/2466 has been merged, this ticket is unblocked.

albinsuresh commented 9 months ago
  1. Introduce a new shared directory to keep the custom operation definition files, in the same format as it is today. When a device declares that it supports that custom operation using the corresponding capability message (e.g: te/device/<device-id>///cmd/c8y_Command) with an empty payload, a symlink to the main supported operation definition file will be added to the corresponding device's operations directory.

When are these files used? What's their purpose?

The idea was to reuse a single custom command definition file across all the child devices that supports it.

  1. If a device wants to have its own operation file definition instead of reusing a pre-configured one, it should first upload its operation definition file to the file-transfer repository and then provide that URL in the capability message that it sends, so that the mapper can move that uploaded file into that device's operations directory (This might be an optional requirement and can be deferred for later).

Why an URL in the capability message? Why not directly the expected payload?

Since the operation file content is TOML, I thought it would be weird to include that in the cmd payload which is generally JSON. So, uploading the TOML file content first, followed by the command message with this URL in the payload was a way to get around that.

That being said, both points 3 and 4 can be ignored as this was a proposal to enable custom operation support for child devices before the workflow feature was in-place. Since workflows is the way forward for cloud agnostic operation support, there's no point in introducing yet another cloud-specific mechanism for the same. So, we will focus only on points 1 and 2 for now.

albinsuresh commented 9 months ago

Reaffirming the scope of this ticket:

  1. The etc/tedge/operations/c8y directory won't be used for dynamic child device creation and just use it as a persistent store for supported operation files for all the registered entities. We keep the flat directory structure for all entities (immediate and nested entities) as the directory names(external IDs) are unique for every entity.
  2. Disable the child device list reconciliation mechanism on mapper startup that checks the difference between the child devices retrieved from the cloud and the child device directories present under etc/tedge/operations/c8y, and registers the child devices for any extra child device directories found. This will prevent the directories of nested child devices being wrongly interpreted as immediate child devices.
  3. On startup, the supported operations list (114) derived from the contents of the operations dir would still be sent for the main device, but not for the child devices, to avoid these 114 messages accidentally registering those devices implicitly. The supported operation list for the child devices would only be published when their device registration messages, or capability messages are processed.
  4. The inotify mechanism to dynamically detect updates to the operations directory would also be limited to the main device only, to avoid the same implicit child device registration in the cloud, before those entities are explicitly registered.
didier-wenzek commented 9 months ago

What is the source of truth? The retained MQTT messages on te/+/+/+/+? The persisted version of the entity store? The /etc/tedge/operations/c8y directory?

I dislike the idea of an asymmetric treatment for the child devices compared to the main one. This will be a continuous source of issues.

Also, beware that this PR, related to operation declaration might interact with this issue.

albinsuresh commented 9 months ago

What is the source of truth? The retained MQTT messages on te/+/+/+/+? The persisted version of the entity store? The /etc/tedge/operations/c8y directory?

Until now, there were two sources of truth. The operations directory and the retained messages on MQTT broker, both trying to sync with each other. With this work, we are eliminating the ops directory as a source of truth in a way that it just reflects the mapper's view of supported ops for each entity.

The persisted entity store is not a source of truth but just reflects the in-memory entity store. Even when the mapper restores its state from this file on startup, it is updated when the retained messages arrive from the broker, if those retained messages contains anything new.

I dislike the idea of an asymmetric treatment for the child devices compared to the main one. This will be a continuous source of issues.

It would have been nice to eliminate this API completely, including for the main device. But unfortunately, we're forced to keep it for now, as many customers are using this mechanism to add custom operations support to the main device. Once the workflow APIs are made available to the end-users, then they'll be able to migrate their existing custom operation mechanism to workflows and then we can get rid of this completely.

albinsuresh commented 8 months ago

Test Plan

The reported issue is already tested in bootstrap.robot::Mapper restart does not alter device hierarchy. The fix also removed the following features:

  1. Creating a child device directory under /etc/tedge/operations/c8y/<child-id> does not create the same child device in the cloud anymore.
  2. Adding a new operation to a child device by creating a file in the child device operations directory is not supported.
  3. The only mechanism for a child device to register an operation is by declaring a capability message via te/device/<child-id>///cmd/<cmd-id> topic which will result in the creation of the corresponding operation file in the child device ops directory and an updated operation list message with the <cmd-id> sent to C8y.
  4. Removing a supported operation from a child device by simply removing the operation file from the operations directory is not supported anymore.
gligorisaev commented 8 months ago

QA has thoroughly checked the bug and here are the results: