oam-dev / spec

Open Application Model (OAM).
https://oam.dev

Proposal for Hydra Edge Extensions #181

Open jiria opened 5 years ago

jiria commented 5 years ago

Hydra Edge Extensions

"Cloud native applications" is often used as a synonym for distributed applications that can scale almost without limit to meet the needs of Cloud-deployed services. The problem of scaling distributed applications is, however, not specific to Cloud environments. To enable the Intelligent Edge, these distributed applications need to run and scale directly on the Edge. Unlike Cloud applications, which scale based on incoming requests, Edge applications scale based on the environment they interact with. For example, a security camera that detects no motion does not require any additional compute for processing. However, when motion is detected, additional compute is needed to understand what objects are causing the motion, and if a person is detected, additional processing is scheduled to run facial recognition. Compute resources on the Edge are limited, so Edge applications need to scale up and down based on demand to make sure they do not starve each other of resources.

Device Capabilities

Hydra introduces a new application model that allows for composition of Cloud native applications. In the Cloud, these applications can leverage resources such as CPU, memory, storage and GPU. On the Edge, these resources are available as well, but there is more. For example, a temperature sensor might be attached to a compute node. In this case, a Component of the application is specifically designed to read the input from the temperature sensor and act on it. For this temperature-reading Component to work, it needs access to the temperature sensor. More generally, the Component needs access to specific Device Capabilities. A temperature sensor, an actuator, a camera and a TPM chip are all examples of Device Capabilities.

A Device Capability can be connected directly to a compute device, for example over USB (we refer to this as a Dedicated Device Capability), or it can communicate over a shared bus such as Wi-Fi (we refer to this as a Shared Device Capability). In the Dedicated Device Capability case, the temperature-reading Component needs to be scheduled to run on the compute device the temperature sensor is attached to. This can be accomplished using an Extended Resources annotation on the Component. In the Shared Device Capability case, the temperature-reading Component needs to be scheduled to run on a compute device that has access to the bus which can communicate with the temperature sensor.

To generalize the Shared Device Capability case, we need to be able to control where a Component gets scheduled to run. This is not specific to the Edge either. For example, in the Cloud, a Component which requires a lot of disk I/O would run better if it could be scheduled on compute resources with SSD disks. Hydra currently offers no way to target where a Component is scheduled to run other than through Resources. One could argue that having an SSD could be exposed as an Extended Resource, but not every attribute can be.
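To make the Dedicated Device Capability case concrete, here is a minimal Python sketch of how a scheduler might restrict a Component to nodes that advertise a given Extended Resource. It is an illustration only, not part of Hydra: the Node class, the nodes_with_capability helper, the node names and the resource name "example.com/temperature-sensor" are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical compute node, used only for this illustration."""
    name: str
    # Extended resources the node advertises, mapped to available counts.
    extended_resources: dict = field(default_factory=dict)


def nodes_with_capability(nodes, resource, count=1):
    """Return the names of nodes advertising at least `count` of `resource`."""
    return [
        node.name
        for node in nodes
        if node.extended_resources.get(resource, 0) >= count
    ]


# Assumed inventory: only edge-a has the temperature sensor attached.
nodes = [
    Node("edge-a", {"example.com/temperature-sensor": 1}),
    Node("edge-b"),
]

eligible = nodes_with_capability(nodes, "example.com/temperature-sensor")
print(eligible)
```

A check like this works for countable, attached devices. It does not express attributes such as "can reach the Wi-Fi bus", which is the gap the proposal addresses next.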

At the Edge, we also need to deal with cases more complicated than those mentioned above. For example, suppose a compute node is a robot with three attached cameras. While we could refer to these as Extended Resources, the topology of the cameras is important. Perhaps the first camera is a back camera, the second camera is a left camera and the third camera is a right camera. The position and orientation of the cameras are important to the Application Developer, so we cannot just attach the three cameras to the Component using Extended Resources without additional context.

We propose to extend Hydra traits with NodeSelector and DeviceBinding traits to solve the aforementioned problems.

NodeSelector Trait

The NodeSelector trait is a Standard trait rather than a Core trait, since a true serverless runtime might not support the concept of compute nodes. The NodeSelector trait would allow the Application Operator to control more precisely which compute nodes should be considered for scheduling of the attached Component. The NodeSelector trait would contain a filter expression that would be passed through to the runtime scheduler.

The NodeSelector trait can be used to target deployment to nodes that have SSDs or that have access to a bus exposing specific Shared Device Capabilities.
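The proposal does not define the syntax of the filter expression. As a rough sketch, assuming the simplest possible form (equality matching on node labels), the selection could look like the Python below. The label keys "disk" and "bus", the node names, and the matches and select_nodes helpers are hypothetical and chosen only to mirror the SSD and shared-bus examples.

```python
def matches(selector, labels):
    """True if every key/value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select_nodes(selector, nodes):
    """Return the sorted names of nodes whose labels satisfy the selector."""
    return sorted(name for name, labels in nodes.items() if matches(selector, labels))


# Assumed node inventory with illustrative labels.
nodes = {
    "cloud-1": {"disk": "ssd"},
    "cloud-2": {"disk": "hdd"},
    "edge-1": {"disk": "ssd", "bus": "wifi"},
}

ssd_nodes = select_nodes({"disk": "ssd"}, nodes)
wifi_nodes = select_nodes({"bus": "wifi"}, nodes)
print(ssd_nodes)
print(wifi_nodes)
```

A real runtime would pass the expression to its own scheduler, which may support richer operators than equality.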

DeviceBinding Trait

The DeviceBinding trait is a Standard trait, since Cloud based runtimes might not require the concept of late device binding. The DeviceBinding trait is used when the Application Developer requires certain Device Capabilities, but does not know upfront how these capabilities are exposed. It is the responsibility of the Application Operator to bind the actual device capabilities to the resource requirements expected by the Component. The Application Operator leverages the DeviceBinding trait to accomplish that.

The DeviceBinding trait can be used along with the NodeSelector trait to solve the three cameras problem. In this case, we can leverage the NodeSelector trait to schedule a Component that controls the robot to run on the robot. The Component uses Extended Resources to express the need for access to the left and right cameras. The Application Developer does not know upfront which cameras to use as the left or right cameras, as that is specific to the model of the robot. A different robot model might have a different camera topology. The Application Operator uses the DeviceBinding trait as part of the Application Configuration to bind the specific cameras to the Extended Resources expressed by the Component. This works well on the Edge, as the Application Operator understands well what nodes are available, where the application should run and what Device Capabilities it needs access to.

hongchaodeng commented 4 years ago

What a nice proposal!

I don't quite understand these sentences:

The DeviceBinding trait is used when the Application Developer requires certain Device Capabilities, but does not know upfront how these capabilities are exposed. It is the responsibility of the Application Operator to bind the actual device capabilities to the resource requirements expected by the Component.

Could you provide more examples? Thanks!

jiria commented 4 years ago

Thanks Hongchao.

The example I had in mind: A robot might have three physical cameras attached, exposed as /dev/video0, /dev/video1 and /dev/video2. They correspond to the left, right, and back cameras. The Application Developer would not know which camera is the back camera and which are used for left and right. On different models of robots, these might also be exposed as different devices (/dev/video*). As such, the Application Operator would bind the "Left Camera" Extended Resource to /dev/video0 as part of the Application Configuration. This would be the "late device binding". Let me know if I should provide more examples.
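A minimal Python sketch of this late binding, under stated assumptions: the Component declares logical names only, and the Application Operator supplies a per-robot mapping to device paths. The bind_devices helper, the logical names "left-camera", "right-camera" and "back-camera", and the second robot model are hypothetical. The first mapping follows the /dev/video0 to /dev/video2 example above.

```python
def bind_devices(required, bindings):
    """Resolve each logical capability the Component requires to a device path.

    Raises ValueError if the operator-supplied bindings miss a requirement.
    """
    missing = [name for name in required if name not in bindings]
    if missing:
        raise ValueError(f"no binding for: {', '.join(missing)}")
    return {name: bindings[name] for name in required}


# Declared by the Application Developer: logical names only, no device paths.
required = ["left-camera", "right-camera"]

# Supplied by the Application Operator for one robot model.
robot_model_a = {
    "left-camera": "/dev/video0",
    "right-camera": "/dev/video1",
    "back-camera": "/dev/video2",
}

# Hypothetical second model that exposes the same cameras in another order.
robot_model_b = {
    "left-camera": "/dev/video2",
    "right-camera": "/dev/video0",
    "back-camera": "/dev/video1",
}

bound_a = bind_devices(required, robot_model_a)
bound_b = bind_devices(required, robot_model_b)
print(bound_a)
print(bound_b)
```

The Component stays identical across both models; only the operator-side mapping changes.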

A small aside: I am starting vacation today, so my responses will be sporadic, but Rachit on my team will take over.