margo / specification

Margo Specification
https://specification.margo.org/

Proposal/Discussion: Describe the device capabilities specification #9

Open ajcraig opened 1 month ago

ajcraig commented 1 month ago

Below I have outlined a proposal for the Device Capability Specification used by the Workload Orchestration Software (WOS). This file informs the WOS of the properties, resources, and components of Margo-compliant devices.

The associated workflow / use case for this is detailed below:

  1. Device Owner creates the Device Capability file associated with the device and stores it locally.
  2. Workload Orchestration Software enrolls the Workload Orchestration Agent into its platform.
  3. During the enrollment process, the agent sends the device capabilities specification to the WOS.
  4. The WOS then uses this information to ensure compatibility with applications, show the device's information to the user, and manage usage of the device so that it doesn't go over capacity.
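The workflow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Wos` class, its method names, and the device ID value are hypothetical, the capability file is shown as already-parsed data, and only memory is tracked for the capacity check.

```python
from dataclasses import dataclass, field


@dataclass
class Wos:
    """Toy in-memory stand-in for the Workload Orchestration Software (hypothetical)."""
    devices: dict = field(default_factory=dict)    # deviceId -> capability document
    allocated: dict = field(default_factory=dict)  # deviceId -> memory (MB) already assigned

    def enroll(self, capability):
        """Step 3: the agent hands over its capability document during enrollment."""
        device_id = capability["properties"]["deviceId"]
        self.devices[device_id] = capability
        self.allocated[device_id] = 0
        return device_id

    def can_deploy(self, device_id, app_memory_mb):
        """Step 4: check whether a deployment would put the device over capacity."""
        resources = self.devices[device_id]["properties"]["deviceResources"]
        free = resources["memoryCapacity"] - self.allocated[device_id]
        return app_memory_mb <= free

    def deploy(self, device_id, app_memory_mb):
        """Record the memory used by a deployment, refusing it if it does not fit."""
        if not self.can_deploy(device_id, app_memory_mb):
            raise ValueError("device would be over capacity")
        self.allocated[device_id] += app_memory_mb


# Step 1: the device owner authors the capability file (shown here as parsed data).
capability = {
    "kind": "devicecapabilityspecification",
    "properties": {
        "deviceId": "example-device-01",  # hypothetical value
        "deviceResources": {"memoryCapacity": 4096, "storageCapacity": 32768},  # MB
    },
}

wos = Wos()
device_id = wos.enroll(capability)       # steps 2 and 3
wos.deploy(device_id, 3000)              # step 4: fits, 1096 MB remain
print(wos.can_deploy(device_id, 2000))   # False
```

A real WOS would track every resource in the file, not just memory, and would receive the document over whatever enrollment API the specification ends up defining.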

Proposed Margo Device Capability Specification

kind: devicecapabilityspecification
properties:
    deviceId: 
    deviceVendor:
    modelNumber:
    serialNumber:
    margoDeviceRole:
        role1:
    deviceResources:
        cpuArchitecture:
        cpuModel:
        vcpuCount:
        cpuFrequency: 
        memoryCapacity: 
        storageCapacity:
    devicePeripherals:
        peripheral1:
            - name:
            - type: 
            - modelNumber: 
            - description:
        peripheral2:
            - name:
            - type:
            - modelNumber:
            - description:
    deviceInterfaces:
        interface1:
            - name:
            - type: 
            - modelNumber: 
        interface2:
            - name:
            - type:
            - modelNumber: 

Device Capability Attributes

Top-level Attributes

| Attribute | Type | Required? | Description |
|---|---|---|---|
| properties | Properties | Y | Metadata element specifying characteristics of the device. See the Properties section below. |

Properties Attributes

| Attribute | Type | Required? | Description |
|---|---|---|---|
| deviceId | string | Y | Unique device ID assigned to the device by the Device Orchestration Software. |
| deviceVendor | string | Y | Defines the device vendor. |
| modelNumber | string | Y | Defines the model number of the device. |
| serialNumber | string | Y | Defines the serial number of the device. |
| margoDeviceRole | Margo Device Role | Y | Spec element that defines the role(s) the device can provide to the Margo environment. See the Margo Device Role section below. |
| deviceResources | Device Resources | Y | Spec element that defines the device's resources available to the applications deployed on the device. See the Device Resources section below. |
| devicePeripherals | Device Peripherals | Y | Spec element that defines the device's peripherals available to the applications deployed on the device. See the Device Peripherals section below. |
| deviceInterfaces | Device Interfaces | Y | Spec element that defines the device's interfaces available to the applications deployed on the device. See the Device Interfaces section below. |

Margo Device Role Attributes

| Attribute | Type | Required? | Description |
|---|---|---|---|
| role | string | Y | Defines the role(s) the device can provide to the Margo environment. |

Device Resources

| Attribute | Type | Required? | Description |
|---|---|---|---|
| cpuArchitecture | string | Y | Defines the CPU architecture, e.g., ARM or x86. |
| cpuModel | string | Y | Defines the CPU model of the device. |
| vcpuCount | integer | Y | Defines the vCPU count available on the device. |
| cpuFrequency | integer | Y | Defines the frequency of the CPU. |
| memoryCapacity | integer | Y | Defines the memory capacity available for applications on the device. This MUST be defined in MB. |
| storageCapacity | integer | Y | Defines the storage capacity available for applications to utilize. This MUST be defined in MB. |
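
As a rough illustration of how a WOS might check the Device Resources section against the table above, here is a minimal Python sketch. The function name and the sample values are hypothetical, and the unit of cpuFrequency is left open because the proposal does not state one.

```python
# Required Device Resources attributes and their types, taken from the table above.
DEVICE_RESOURCES_SCHEMA = {
    "cpuArchitecture": str,
    "cpuModel": str,
    "vcpuCount": int,
    "cpuFrequency": int,     # unit not stated in the proposal
    "memoryCapacity": int,   # MB
    "storageCapacity": int,  # MB
}


def validate_device_resources(resources):
    """Return a list of problems; an empty list means the section looks valid."""
    problems = []
    for name, expected in DEVICE_RESOURCES_SCHEMA.items():
        if name not in resources:
            problems.append(f"missing required attribute: {name}")
            continue
        value = resources[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{name} must be of type {expected.__name__}")
    return problems


# Hypothetical input with a unit string where an integer is required
# and with storageCapacity left out.
resources = {
    "cpuArchitecture": "x86",
    "cpuModel": "example-cpu",
    "vcpuCount": 8,
    "cpuFrequency": 2400,
    "memoryCapacity": "16 GB",
}
for problem in validate_device_resources(resources):
    print(problem)
# memoryCapacity must be of type int
# missing required attribute: storageCapacity
```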

Device Peripherals

| Attribute | Type | Required? | Description |
|---|---|---|---|
| peripheral | Peripheral | Y | Defines a peripheral that is present on the edge device. One or more peripherals can be described in this section. See the Peripheral Attributes section below. |

Peripheral Attributes

| Attribute | Type | Required? | Description |
|---|---|---|---|
| name | string | Y | Name of the peripheral. |
| type | string | Y | Type of the peripheral, e.g., GPU. |
| modelNumber | string | Y | Model number of the peripheral. |
| description | string | Y | Description of the peripheral, which can be displayed in the WOS GUI. |

Device Interfaces

| Attribute | Type | Required? | Description |
|---|---|---|---|
| interface | Interface | Y | Defines an interface that is present on the edge device. One or more interfaces can be described in this section. See the Interface Attributes section below. |

Interface Attributes

| Attribute | Type | Required? | Description |
|---|---|---|---|
| name | string | Y | Name of the interface. |
| type | string | Y | Type of the interface, e.g., Ethernet NIC. |
| modelNumber | string | Y | Model number of the interface. |
| description | string | Y | Description of the interface, which can be displayed in the WOS GUI. |

ajcraig commented 1 month ago

@margo/technical-wg Please review the proposal above.

Thanks!

phil-abb commented 1 month ago

Here's a proposed set of modifications

apiVersion: device.margo/v1
kind: DeviceCapability
properties:
  id: 
  vendor:
  modelNumber:
  serialNumber:
  roles:
  resources:
    cpu:
      architecture:
      model:
      cores:
      frequency: 
    capacity:
      memory: 
      storage:
  peripherals:
    - name: 
      type: 
      modelNumber: 
      properties:
  interfaces:
    - name: 
      type: 
      modelNumber: 
      properties:

Simple Example

apiVersion: device.margo/v1
kind: DeviceCapability
properties:
  id: northstarida.xtapro.edge
  vendor: Northstar Industrial Applications
  modelNumber: 332ANZE1-N1
  serialNumber: PF45343-AA
  roles:
    - standalone cluster
    - cluster lead
  resources:
    cpu:
      architecture: Intel x64
      model: i9-14900KS
      cores: 24
      frequency: 6.2 GHz
    capacity:
      memory: 64.0 GB
      storage: 2 TB
  peripherals:
    - name: NVIDIA GeForce RTX 4070 Ti SUPER OC Edition Graphics Card 
      type: GPU  
      modelNumber: TUF-RTX4070TIS-O16G
      properties:
        manufacturer: NVIDIA
        series: NVIDIA GeForce RTX 40 Series
        gpu: GeForce RTX 4070 Ti SUPER
        ram: 16 GB
        clockSpeed: 2640 MHz
  interfaces:
    - name: RTL8125 NIC 2.5G Gigabit LAN Network Card
      type: Ethernet
      modelNumber: RTL8125 
      properties:
        maxSpeed: 2.5 Gbps
    - name: WiFi 6E Intel AX411NGW M.2 Cnvio2
      type: Wi-Fi
      modelNumber: AX411NGW
      properties:
        bands: ["2.4 GHz", "5 GHz", "6 GHz"]
        maxSpeed: 2.4 Gbps
phil-abb commented 1 month ago

For peripherals.type and interfaces.type I think we'll need to make a list of types that should be used. We may not be able to come up with a complete list, but we should include as many as we can think of so people use the types consistently. If we don't, matching devices to application requirements will be too difficult to automate.
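
To show why an agreed list of types helps automated matching, here is a small Python sketch. The vocabulary and aliases are hypothetical placeholders; the real list would be whatever the working group agrees on.

```python
# Hypothetical controlled vocabulary: lookup key -> canonical spelling.
KNOWN_TYPES = {"gpu": "GPU", "ethernet": "Ethernet", "wi-fi": "Wi-Fi"}

# Hypothetical aliases for spellings that vendors might otherwise use.
ALIASES = {"wifi": "wi-fi", "ethernet nic": "ethernet", "graphics card": "gpu"}


def canonical_type(raw):
    """Map a free-text type to its canonical spelling, or None if it is unknown."""
    key = " ".join(raw.lower().split())  # case-fold and collapse whitespace
    key = ALIASES.get(key, key)
    return KNOWN_TYPES.get(key)


print(canonical_type("WiFi"))             # Wi-Fi
print(canonical_type(" Ethernet  NIC "))  # Ethernet
print(canonical_type("toaster"))          # None
```

With a fixed list, a WOS can compare types by equality; without one, every consumer has to maintain its own alias table like the one above.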

For anything with a unit (GB, TB, GHz, MHz, Gbps, etc.), we'll probably need to define what the unit should be (or at least a consistent naming convention) if the intention is to pair these resources with the application's requirements. Otherwise the same matching problem applies.
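
One possible naming convention can be sketched as follows. The choice of base units (MB, MHz, Mbps), the decimal multipliers, and the function name are all assumptions made for illustration, not part of the proposal.

```python
import re

# Assumed convention: decimal multipliers; capacities normalized to MB,
# frequencies to MHz, link speeds to Mbps. Unit spelling is case-sensitive.
UNIT_FACTORS = {
    "MB": ("MB", 1), "GB": ("MB", 1000), "TB": ("MB", 1000000),
    "MHz": ("MHz", 1), "GHz": ("MHz", 1000),
    "Mbps": ("Mbps", 1), "Gbps": ("Mbps", 1000),
}

_QUANTITY = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def normalize(text):
    """Parse a string such as '64.0 GB' into (integer value, base unit)."""
    match = _QUANTITY.match(text)
    if match is None:
        raise ValueError(f"not a quantity: {text!r}")
    number, unit = match.groups()
    if unit not in UNIT_FACTORS:
        raise ValueError(f"unknown unit: {unit!r}")
    base, factor = UNIT_FACTORS[unit]
    return round(float(number) * factor), base


print(normalize("64.0 GB"))   # (64000, 'MB')
print(normalize("2 TB"))      # (2000000, 'MB')
print(normalize("2.5 Gbps"))  # (2500, 'Mbps')
```

Under this sketch a misspelled unit such as "Mhz" raises a ValueError, which is the kind of inconsistency a defined convention is meant to catch.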

phil-abb commented 1 month ago

Do we need to include any information about the container platform it is running (e.g., Docker, Podman, Kubernetes distribution, version, etc.)?

julienduquesnay-se commented 1 month ago

In the resource section, "cpu" and "capacity" are not at the same semantic level ("cpu" is a device, while "capacity" is a characteristic), so I think they should be separated. Some platforms provide multiple CPUs with different characteristics, or different types of core in the same SoC. For example, some Arm SoCs combine Cortex-A and Cortex-R cores, and some Intel CPUs combine P-cores and E-cores. The frequency can also be changed programmatically (sometimes per core), so I would use maxFrequency.

I would propose to:

  1. Make "cpu" its own section, with a list to allow multiple CPUs and also a list to allow different cores.
  2. Replace "frequency" with "maxFrequency".

apiVersion: device.margo/v1 kind: DeviceCapability properties: id: vendor: modelNumber: serialNumber: roles: resources: capacity: memory: storage: cpu:

phil-abb commented 1 month ago

In the resource section, "cpu" and "capacity" are not at the same semantic level ("cpu" being a device and "capacity" a characteristic).

Fair point. Maybe we don't need to group memory and storage under capacity:

resources:
  memory:
  storage:
  cpu:
  - model:
    architecture:
    cores:
    - type:
      count:
      maxFrequency:
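
To illustrate how a consumer might read the nested cpu and cores lists above, here is a short Python sketch. All values are made up for illustration (including the units, which the thread has not settled), and the helper names are hypothetical.

```python
# Parsed form of a "resources" section following the structure above.
# Values are illustrative only; memory/storage in MB and maxFrequency in MHz are assumptions.
resources = {
    "memory": 64000,
    "storage": 2000000,
    "cpu": [
        {
            "model": "example-hybrid-cpu",  # hypothetical
            "architecture": "x86_64",
            "cores": [
                {"type": "P", "count": 8, "maxFrequency": 6200},
                {"type": "E", "count": 16, "maxFrequency": 4500},
            ],
        }
    ],
}


def total_cores(res):
    """Sum the core counts across every CPU and every core type."""
    return sum(core["count"] for cpu in res["cpu"] for core in cpu["cores"])


def peak_frequency(res):
    """Highest maxFrequency advertised by any core type."""
    return max(core["maxFrequency"] for cpu in res["cpu"] for core in cpu["cores"])


print(total_cores(resources))     # 24
print(peak_frequency(resources))  # 6200
```
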
phil-abb commented 1 month ago

Here is a real use case we have for an application requiring a specific piece of hardware

An application vendor has an application requiring an NVIDIA GPU on the device. The application is designed to work only with NVIDIA GPUs, and the vendor recommends that the GPU have at least 16 GB of RAM to run efficiently. For the application to run on the device, the application vendor expects the device to have the NVIDIA device drivers installed, as well as the NVIDIA GPU Operator on Kubernetes or the NVIDIA Container Toolkit for Docker Compose-based deployments. The drivers, operator, and toolkit require elevated permissions to install, so the expectation is that these are installed and configured by the device owner.
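
The hardware part of this use case could be checked against the capability file roughly as follows. This is a sketch under assumptions: it uses the attribute names from the simple example earlier in the thread (peripherals, type, properties.manufacturer, properties.ram), assumes ram is written as "<number> GB", and uses hypothetical function names. It does not cover drivers, the operator, or the toolkit, which the proposed file does not describe.

```python
def ram_gb(value):
    """Parse a RAM string such as '16 GB' (the only form this sketch accepts)."""
    number, unit = value.split()
    if unit != "GB":
        raise ValueError(f"unsupported unit: {unit}")
    return float(number)


def find_matching_gpu(capability, manufacturer, min_ram_gb):
    """Return the name of the first GPU meeting the requirement, or None."""
    for peripheral in capability["properties"].get("peripherals", []):
        if peripheral.get("type") != "GPU":
            continue
        props = peripheral.get("properties") or {}
        if props.get("manufacturer") != manufacturer:
            continue
        if "ram" in props and ram_gb(props["ram"]) >= min_ram_gb:
            return peripheral["name"]
    return None


# Trimmed copy of the simple example from earlier in the thread.
capability = {
    "apiVersion": "device.margo/v1",
    "kind": "DeviceCapability",
    "properties": {
        "peripherals": [
            {
                "name": "NVIDIA GeForce RTX 4070 Ti SUPER OC Edition Graphics Card",
                "type": "GPU",
                "modelNumber": "TUF-RTX4070TIS-O16G",
                "properties": {"manufacturer": "NVIDIA", "ram": "16 GB"},
            }
        ]
    },
}

print(find_matching_gpu(capability, "NVIDIA", 16) is not None)  # True
print(find_matching_gpu(capability, "NVIDIA", 24))              # None
```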

@margo/technical-wg I think it would be good if we could provide actual use cases we have where applications require a specific piece of hardware so we can see what the requirements are instead of trying to guess what they might be. I think this will help us figure out what needs to be in this file. Does anyone else have any real use cases?

gunjald commented 1 month ago

A few observations on the use case above:

  1. In such use cases, even NVIDIA GPUs have different architectures, and their CUDA capabilities may also differ, so more specifics will be needed to determine which GPUs are compatible with the application.
  2. I'm not sure applications are supposed to consume a Kubernetes operator themselves, since it is something the management platform uses to manage cluster resources or perform similar functions.

ajcraig commented 3 weeks ago

The following notes were captured during the Device Requirements call. These will be incorporated into a Pull Request to finalize the Device Capabilities file.