project-akri / akri

A Kubernetes Resource Interface for the Edge
https://docs.akri.sh/
Apache License 2.0

Akri architecture for IoT protocols with standalone devices #348

Closed jiayihu closed 3 years ago

jiayihu commented 3 years ago

As requested during the meeting, I'm opening this issue to discuss how Akri should deal with IoT protocols where the device is not associated with any cluster node. My view is that the current Akri architecture is heavily influenced by the K8s Device Plugin API. However, that API was meant for use cases like dispatching a Pod to a node that has a GPU or SSD, i.e. hardware attached to the node itself. Edge devices, on the other hand, can live independently of the nodes and, in fact, most of them do.

During the implementation of CoAP and MQTT, I think we are starting to see the consequences of this mismatch between the behaviour expected by Akri and the Kubernetes Device Plugin API and the behaviour of IoT protocols. Akri expects a 1:1 relationship between node and device, whereas most IoT devices don't have any kind of relationship to any node per se. We would just like the devices to be known within the cluster (I'm leaving out further discussions about listing the device vs its resources/topics).

For instance, with CoAP, this results in an agent + discovery handler per node and each of them sends discovery requests to the CoAP devices using IP endpoints. The latter are bombarded with discovery requests from each node. A second issue is that each discovery handler thinks it has discovered a new device because if two discovery handlers send a discovery request to the same device, both will receive a response and each will believe it has discovered a distinct device. The same device would be listed twice.

These are two quick issues that have come to mind, but do not consider them an exhaustive list. I would not suggest focusing too much on these two issues; they are just examples. I think there's an underlying issue with the assumptions Akri makes about the edge.

This discussion is not easy. We could even argue that the primary issue is that Kubernetes was not designed for the edge, which has unique characteristics compared to the cloud. On the other hand, it is highly desirable for many use cases to have the K8s capabilities in scaling, availability, and orchestration when working with the edge. K8s has proved to be mature for workload orchestration, but device discovery and orchestration are a whole new challenge for the system. And that's why we're discussing it.

kate-goldenring commented 3 years ago

Hi @jiayihu. Thanks for starting this discussion. For your first issue, I think Akri can avoid bombarding devices with aliveness checks by changing how discovery handlers are deployed. For your second, I believe Akri already solves the issue of establishing distinct devices by representing each one as an Akri Instance.

For instance, with CoAP, this results in an agent + discovery handler per node and each of them sends discovery requests to the CoAP devices using IP endpoints.

This is a deployment strategy. I could see us discussing changes in it. Maybe discovery handlers are jobs. They discover the devices, tell the agent and then go away. There would be no way to detect if devices go offline unless the discovery handler is re-deployed. This way the devices are not being continually pinged. Or one discovery handler in the cluster could keep polling the device. If it sees the device go offline it can tell the Agent, the Agent will delete the instance, which will result in all agents deleting their device plugins for the device.

A second issue is that each discovery handler thinks it has discovered a new device because if two discovery handlers send a discovery request to the same device, both will receive a response and each will believe it has discovered a distinct device. The same device would be listed twice.

This should not happen. There should be one instance created per device. Akri handles this by expecting discovery handlers to give each device an ID that is specific to that device, so only one instance is created for it. For ONVIF, for example, the ID is the MAC address. There are device plugins made for each node for the device, so each node can advertise that it can see the device. Just as nodes advertise how much CPU they have, this allows nodes to advertise whether they have an IP camera, for example.
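To illustrate the idea, here is a minimal sketch (not Akri's actual code) of how reports from several nodes collapse into one entry per device when they are keyed by a device-specific ID. The `Report` struct, its fields, and `group_by_device` are hypothetical names made up for this example.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One discovery report: which node's handler saw which device ID.
/// Hypothetical type, for illustration only.
pub struct Report<'a> {
    pub node: &'a str,
    pub device_id: &'a str,
}

/// Groups reports by device ID so that each device yields exactly one
/// entry, together with the set of nodes that can see it.
pub fn group_by_device(reports: &[Report<'_>]) -> BTreeMap<String, BTreeSet<String>> {
    let mut instances: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for report in reports {
        instances
            .entry(report.device_id.to_string())
            .or_default()
            .insert(report.node.to_string());
    }
    instances
}
```

If two nodes report the same MAC address, the map holds a single key for that device with both node names in its set, which mirrors one instance being advertised by several nodes.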

While Device Plugin was made for static hardware rather than shared IoT devices, we have architected Akri specifically with the aim of tailoring DP for IoT devices. It is the Akri Instance that enables that abstraction of a shared IoT device.

bfjelds commented 3 years ago

There should be one instance created per device

This is underpinned by an assumption that each device has a unique way to identify itself. For ip cameras, we use <ip>.<mac> or just <mac>. This gets converted to the instance hash suffix, which allows multiple nodes to "share" a single camera.

If an IoT device does not have a unique identifier, our sharing concept breaks down.

jiayihu commented 3 years ago

This is a deployment strategy. I could see us discussing changes in it. Maybe discovery handlers are jobs. They discover the devices, tell the agent and then go away. There would be no way to detect if devices go offline unless the discovery handler is re-deployed. This way the devices are not being continually pinged. Or one discovery handler in the cluster could keep polling the device.

I think both approaches could work. AFAIK, the discovery handler is already called periodically, so it is working basically as a Job. One idea is for the Akri controller to schedule the Job periodically. Or is it mandatory for the communication to be between agents and discovery handlers? If we remove the association between device and node, what's the role of the agent? Why do we need to have an agent on each node? On the other hand, we need to continue supporting devices that are actually attached to a node. Would it be possible to deploy a Job on each node so that they gather the node devices?

There are device plugins made for each node for the device, so each node can advertise that it can see the device.

Does it mean that the device would be listed multiple times? I can't test it right now because I don't have my Rasp cluster up, but I seem to remember that my device resulted in 3 instances because I had three agents/nodes. What happens when 2 distinct discovery handlers return the same device as a result of using the IP as identifier? The answer lies in the following code:

// For local devices, include node hostname in id_to_digest so instances have unique names
if !shared {
    id_to_digest = format!(
        "{}{}",
        &id_to_digest,
        query.get_env_var("AGENT_NODE_NAME").unwrap()
    );
}

So probably I was having multiple instances of the same device because I used to set the protocol as not shared. But the doc mentions that shared means that the device is accessible by all the nodes. Which broker is used then when an application wants to communicate with the device?

If an IoT device does not have a unique identifier, our sharing concept breaks down.

I think that the assumption is sound. Identifying IoT devices is indeed a research question in IoT, since devices can lose connection and change IP. But I think it's reasonable to expect each device to be identifiable, no matter what standard is used.

kate-goldenring commented 3 years ago

AFAIK, the discovery handler is already called periodically, so it is working basically as a Job.

By Job I mean that the DH pod terminates. Also, the DH can decide how often it polls the device, so for CoAP, for example, the loop in the discover function could sleep for 10 minutes between polls.
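A rough sketch of such a polling loop follows. It is not the real discovery handler interface: `poll_for_offline` and the `probe` closure are hypothetical, and the loop is bounded by `rounds` only so the example terminates.

```rust
use std::collections::BTreeSet;
use std::thread::sleep;
use std::time::Duration;

/// Calls `probe` once per round, sleeping `interval` between rounds.
/// Returns the IDs of devices that were seen in an earlier round but
/// were missing from the last probe.
pub fn poll_for_offline<F>(mut probe: F, interval: Duration, rounds: usize) -> BTreeSet<String>
where
    F: FnMut() -> BTreeSet<String>,
{
    let mut known: BTreeSet<String> = BTreeSet::new();
    let mut offline: BTreeSet<String> = BTreeSet::new();
    for round in 0..rounds {
        let visible = probe();
        // Anything known from earlier rounds but not visible now is offline.
        offline = known.difference(&visible).cloned().collect();
        known.extend(visible);
        // Skip the sleep after the final round.
        if round + 1 < rounds {
            sleep(interval);
        }
    }
    offline
}
```

With `Duration::from_secs(600)` as the interval, each device would receive one discovery request every 10 minutes from this handler; the returned set is what the handler would report so that the corresponding instances can be removed.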

If we remove the association between device and node, what's the role of the agent? Why do we need to have an agent on each node?

The agent on each node is to create Device Plugins, which create extended resources. Here is the K8s Documentation on Device Plugins.

On the other hand, we need to continue supporting devices that are actually attached to a node. Would it be possible to deploy a Job on each node so that they gather the node devices?

A device plugin needs to be made on each node for the device, whether the device is local or network based. So an agent needs to be on each node (it creates the device plugins), and a DH needs to be there at some point as well (or be embedded in the Agent).

I used to set the protocol as not shared

CoAP is discovering shared devices. Since you didn't set shared to true, the ID also included the node name, so the same device ended up with multiple instances.
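The effect of the shared flag can be sketched as follows. This is an illustration under assumptions, not Akri's implementation: `instance_name` is a hypothetical helper, and the standard library's `DefaultHasher` stands in for whatever digest Akri actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Builds an instance name from a configuration name and a device ID.
/// For non-shared (local) devices the node name is appended before
/// hashing, so every node produces a different name for the same ID.
pub fn instance_name(config: &str, device_id: &str, shared: bool, node_name: &str) -> String {
    let mut id_to_digest = device_id.to_string();
    if !shared {
        id_to_digest.push_str(node_name);
    }
    // Assumption: DefaultHasher is a stand-in digest for this sketch.
    let mut hasher = DefaultHasher::new();
    id_to_digest.hash(&mut hasher);
    format!("{}-{:016x}", config, hasher.finish())
}
```

With `shared` set to true, agents on `node-a` and `node-b` compute the same name for the same device ID and therefore refer to one instance. With `shared` set to false, each node computes its own name, which matches the behaviour of seeing one instance per node.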

kate-goldenring commented 3 years ago

@jiayihu and @shantanoo-desai, does this discussion answer questions you had about Akri's support of IoT devices? Should I close the issue?

kate-goldenring commented 3 years ago

Please re-open to continue discussion if questions have not been answered!