kubernetes-sigs / wg-device-management

Prototypes and experiments for WG Device Management.
Apache License 2.0

Changes based on review feedback #5

Closed johnbelamaric closed 4 months ago

johnbelamaric commented 5 months ago

Adds changes based on reviews with thockin, pohly, and klueska.

k8s-ci-robot commented 5 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubernetes-sigs/wg-device-management/blob/main/OWNERS)~~ [johnbelamaric]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

johnbelamaric commented 5 months ago

@klueska I started populating this model with data from your prototype (see nvidia.go). The main differences are:

Example:

- apiVersion: devmgmtproto.k8s.io/v1alpha1
  kind: DevicePool
  metadata:
    creationTimestamp: null
    name: nvidia-01-dgxa100
  spec:
    attributes:
    - name: vendor
      stringValue: nvidia
    - name: model
      stringValue: dgxa100
    devices:
    - attributes:
      - intValue: 0
        name: minor
      - intValue: 0
        name: index
      - name: uuid
        stringValue: GPU-cd300afb-e675-4278-af34-484347eb0d09
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0
      requests:
        memory: 40Gi
    - attributes:
      - name: mig-profile
        stringValue: 1g.5gb
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-1g.5gb-0
      requests:
        copy-engines: "1"
        memory: 4864Mi
        memory-slices-00: "1"
        multiprocessors: "14"
    - attributes:
      - name: mig-profile
        stringValue: 1g.5gb
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-1g.5gb-1
      requests:
        copy-engines: "1"
        memory: 4864Mi
        memory-slices-01: "1"
        multiprocessors: "14"
    - attributes:
      - name: mig-profile
        stringValue: 1g.5gb
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-1g.5gb-2
      requests:
        copy-engines: "1"
        memory: 4864Mi
        memory-slices-02: "1"
        multiprocessors: "14"
    - attributes:
      - name: mig-profile
        stringValue: 1g.5gb
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-1g.5gb-3
      requests:
        copy-engines: "1"
        memory: 4864Mi
        memory-slices-03: "1"
        multiprocessors: "14"
    - attributes:
      - name: mig-profile
        stringValue: 1g.5gb
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-1g.5gb-4
      requests:
        copy-engines: "1"
        memory: 4864Mi
        memory-slices-04: "1"
        multiprocessors: "14"
    - attributes:
      - name: mig-profile
        stringValue: 1g.5gb
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-1g.5gb-5
      requests:
        copy-engines: "1"
        memory: 4864Mi
        memory-slices-05: "1"
        multiprocessors: "14"
    - attributes:
      - name: mig-profile
        stringValue: 1g.5gb
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-1g.5gb-6
      requests:
        copy-engines: "1"
        memory: 4864Mi
        memory-slices-06: "1"
        multiprocessors: "14"
    - attributes:
      - name: mig-profile
        stringValue: 2g.10gb
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-2g.10gb-0
      requests:
        copy-engines: "2"
        decoders: "1"
        memory: 9856Mi
        memory-slices-00: "1"
        memory-slices-01: "1"
        multiprocessors: "28"
    - attributes:
      - name: mig-profile
        stringValue: 2g.10gb
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-2g.10gb-2
      requests:
        copy-engines: "2"
        decoders: "1"
        memory: 9856Mi
        memory-slices-02: "1"
        memory-slices-03: "1"
        multiprocessors: "28"
    - attributes:
      - name: mig-profile
        stringValue: 2g.10gb
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-2g.10gb-4
      requests:
        copy-engines: "2"
        decoders: "1"
        memory: 9856Mi
        memory-slices-04: "1"
        memory-slices-05: "1"
        multiprocessors: "28"
    - attributes:
      - name: mig-profile
        stringValue: 3g.20gb
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-3g.20gb-0
      requests:
        copy-engines: "3"
        decoders: "2"
        memory: 19968Mi
        memory-slices-00: "1"
        memory-slices-01: "1"
        memory-slices-02: "1"
        memory-slices-03: "1"
        multiprocessors: "42"
    - attributes:
      - name: mig-profile
        stringValue: 3g.20gb
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-3g.20gb-4
      requests:
        copy-engines: "3"
        decoders: "2"
        memory: 19968Mi
        memory-slices-04: "1"
        memory-slices-05: "1"
        memory-slices-06: "1"
        memory-slices-07: "1"
        multiprocessors: "42"
    - attributes:
      - name: mig-profile
        stringValue: 4g.20gb
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-4g.20gb-0
      requests:
        copy-engines: "4"
        decoders: "2"
        memory: 19968Mi
        memory-slices-00: "1"
        memory-slices-01: "1"
        memory-slices-02: "1"
        memory-slices-03: "1"
        multiprocessors: "56"
    - attributes:
      - name: mig-profile
        stringValue: 7g.40gb
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-7g.40gb-0
      requests:
        copy-engines: "7"
        decoders: "5"
        jpeg-engines: "1"
        memory: 40192Mi
        memory-slices-00: "1"
        memory-slices-01: "1"
        memory-slices-02: "1"
        memory-slices-03: "1"
        memory-slices-04: "1"
        memory-slices-05: "1"
        memory-slices-06: "1"
        memory-slices-07: "1"
        multiprocessors: "98"
        ofa-engines: "1"
    - attributes:
      - name: mig-profile
        stringValue: 1g.5gb+me
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-1g.5gb-me-0
      requests:
        copy-engines: "1"
        decoders: "1"
        jpeg-engines: "1"
        memory: 4864Mi
        memory-slices-00: "1"
        multiprocessors: "14"
        ofa-engines: "1"
    - attributes:
      - name: mig-profile
        stringValue: 1g.5gb+me
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-1g.5gb-me-1
      requests:
        copy-engines: "1"
        decoders: "1"
        jpeg-engines: "1"
        memory: 4864Mi
        memory-slices-01: "1"
        multiprocessors: "14"
        ofa-engines: "1"
    - attributes:
      - name: mig-profile
        stringValue: 1g.5gb+me
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-1g.5gb-me-2
      requests:
        copy-engines: "1"
        decoders: "1"
        jpeg-engines: "1"
        memory: 4864Mi
        memory-slices-02: "1"
        multiprocessors: "14"
        ofa-engines: "1"
    - attributes:
      - name: mig-profile
        stringValue: 1g.5gb+me
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-1g.5gb-me-3
      requests:
        copy-engines: "1"
        decoders: "1"
        jpeg-engines: "1"
        memory: 4864Mi
        memory-slices-03: "1"
        multiprocessors: "14"
        ofa-engines: "1"
    - attributes:
      - name: mig-profile
        stringValue: 1g.5gb+me
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-1g.5gb-me-4
      requests:
        copy-engines: "1"
        decoders: "1"
        jpeg-engines: "1"
        memory: 4864Mi
        memory-slices-04: "1"
        multiprocessors: "14"
        ofa-engines: "1"
    - attributes:
      - name: mig-profile
        stringValue: 1g.5gb+me
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-1g.5gb-me-5
      requests:
        copy-engines: "1"
        decoders: "1"
        jpeg-engines: "1"
        memory: 4864Mi
        memory-slices-05: "1"
        multiprocessors: "14"
        ofa-engines: "1"
    - attributes:
      - name: mig-profile
        stringValue: 1g.5gb+me
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-1g.5gb-me-6
      requests:
        copy-engines: "1"
        decoders: "1"
        jpeg-engines: "1"
        memory: 4864Mi
        memory-slices-06: "1"
        multiprocessors: "14"
        ofa-engines: "1"
    - attributes:
      - name: mig-profile
        stringValue: 1g.10gb
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-1g.10gb-0
      requests:
        copy-engines: "1"
        decoders: "1"
        memory: 9856Mi
        memory-slices-00: "1"
        memory-slices-01: "1"
        multiprocessors: "14"
    - attributes:
      - name: mig-profile
        stringValue: 1g.10gb
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-1g.10gb-2
      requests:
        copy-engines: "1"
        decoders: "1"
        memory: 9856Mi
        memory-slices-02: "1"
        memory-slices-03: "1"
        multiprocessors: "14"
    - attributes:
      - name: mig-profile
        stringValue: 1g.10gb
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-1g.10gb-4
      requests:
        copy-engines: "1"
        decoders: "1"
        memory: 9856Mi
        memory-slices-04: "1"
        memory-slices-05: "1"
        multiprocessors: "14"
    - attributes:
      - name: mig-profile
        stringValue: 1g.10gb
      - name: product-name
        stringValue: Mock NVIDIA A100-SXM4-40GB
      - name: brand
        stringValue: Nvidia
      - name: architecture
        stringValue: Ampere
      - name: cuda-compute-capability
        semVerValue: 8.0.0
      - name: driver-version
        semVerValue: 550.54.15
      - name: cuda-driver-version
        semVerValue: 12.4.0
      name: gpu-0-mig-1g.10gb-6
      requests:
        copy-engines: "1"
        decoders: "1"
        memory: 9856Mi
        memory-slices-06: "1"
        memory-slices-07: "1"
        multiprocessors: "14"
    driver: gpu.nvidia.com/dra
    nodeName: nvidia-01
    resources:
    - capacity: 40Gi
      name: memory
    - capacity: "98"
      name: multiprocessors
    - capacity: "7"
      name: copy-engines
    - capacity: "5"
      name: decoders
    - capacity: "0"
      name: encoders
    - capacity: "1"
      name: jpeg-engines
    - capacity: "1"
      name: ofa-engines
    - capacity: "1"
      name: memory-slices-00
    - capacity: "1"
      name: memory-slices-01
    - capacity: "1"
      name: memory-slices-02
    - capacity: "1"
      name: memory-slices-03
    - capacity: "1"
      name: memory-slices-04
    - capacity: "1"
      name: memory-slices-05
    - capacity: "1"
      name: memory-slices-06
    - capacity: "1"
      name: memory-slices-07
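
The example above models MIG partitions as devices whose `requests` draw on the pool-level `resources`. As a rough illustration of how that could be checked, here is a minimal Go sketch. The `Pool` type, `CanAllocate`, and `examplePool` are hypothetical names that are not part of the prototype, plain integers stand in for `resource.Quantity`, and memory is expressed in Mi.

```go
package main

import "fmt"

// Pool is a simplified, hypothetical stand-in for DevicePool.
type Pool struct {
	Capacity map[string]int64            // shared resource name -> total capacity
	Requests map[string]map[string]int64 // device name -> resource name -> amount
}

// CanAllocate reports whether all named devices can be allocated at the
// same time without oversubscribing any shared resource.
func (p Pool) CanAllocate(devices ...string) bool {
	used := map[string]int64{}
	for _, name := range devices {
		req, ok := p.Requests[name]
		if !ok {
			return false // unknown device
		}
		for res, amount := range req {
			used[res] += amount
			if used[res] > p.Capacity[res] {
				return false
			}
		}
	}
	return true
}

// examplePool copies a few entries from the YAML above (memory in Mi).
func examplePool() Pool {
	return Pool{
		Capacity: map[string]int64{
			"memory": 40960, "multiprocessors": 98, "copy-engines": 7,
			"decoders": 5, "memory-slices-00": 1, "memory-slices-01": 1,
		},
		Requests: map[string]map[string]int64{
			"gpu-0-mig-1g.5gb-0": {
				"memory": 4864, "multiprocessors": 14, "copy-engines": 1,
				"memory-slices-00": 1,
			},
			"gpu-0-mig-1g.5gb-1": {
				"memory": 4864, "multiprocessors": 14, "copy-engines": 1,
				"memory-slices-01": 1,
			},
			"gpu-0-mig-2g.10gb-0": {
				"memory": 9856, "multiprocessors": 28, "copy-engines": 2,
				"decoders": 1, "memory-slices-00": 1, "memory-slices-01": 1,
			},
		},
	}
}

func main() {
	p := examplePool()
	// Two 1g.5gb partitions on different memory slices fit together.
	fmt.Println(p.CanAllocate("gpu-0-mig-1g.5gb-0", "gpu-0-mig-1g.5gb-1")) // true
	// 1g.5gb-0 and 2g.10gb-0 both need memory-slices-00, so they conflict.
	fmt.Println(p.CanAllocate("gpu-0-mig-1g.5gb-0", "gpu-0-mig-2g.10gb-0")) // false
}
```

The per-slice resources with capacity "1" are what make overlapping partitions mutually exclusive in this sketch; the other counters only bound the totals.
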
johnbelamaric commented 5 months ago

cc @pohly @thockin

johnbelamaric commented 4 months ago

Working on it. Will push commit soon

On Mon, May 6, 2024 at 1:04 PM Tim Hockin wrote:

@.**** commented on this pull request.

I see a lot of resolved comments but not sure what the resolution is?

In k8srm-prototype/pkg/api/capacity_types.go https://github.com/kubernetes-sigs/wg-device-management/pull/5#discussion_r1591464951 :

@@ -40,9 +40,17 @@ type DevicePoolSpec struct {
  // +optional

All lists should be declared +listType=atomic or have merge keys
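
To make the list-type suggestion concrete, here is a minimal Go sketch of how the markers might look. `+listType=atomic`, `+listType=map`, and `+listMapKey` are the standard Kubernetes API markers; the trimmed-down types, the choice of which list gets which marker, and the `duplicateKey` helper are assumptions for illustration only.

```go
package main

import "fmt"

// Attribute is a trimmed-down, hypothetical version of the prototype type.
type Attribute struct {
	Name        string  `json:"name"`
	StringValue *string `json:"stringValue,omitempty"`
}

// Device is a trimmed-down, hypothetical version of the prototype type.
type Device struct {
	Name string `json:"name"`

	// Entries are keyed by name, so individual attributes can be merged.
	//
	// +listType=map
	// +listMapKey=name
	// +optional
	Attributes []Attribute `json:"attributes,omitempty"`
}

// DevicePoolSpec is a trimmed-down, hypothetical version of the prototype type.
type DevicePoolSpec struct {
	// The list is replaced as a whole on update.
	//
	// +listType=atomic
	// +optional
	Devices []Device `json:"devices,omitempty"`
}

// duplicateKey returns the first repeated attribute name, if any.
// A +listType=map list requires its merge keys to be unique.
func duplicateKey(attrs []Attribute) (string, bool) {
	seen := map[string]bool{}
	for _, a := range attrs {
		if seen[a.Name] {
			return a.Name, true
		}
		seen[a.Name] = true
	}
	return "", false
}

func main() {
	spec := DevicePoolSpec{Devices: []Device{{
		Name:       "gpu-0",
		Attributes: []Attribute{{Name: "brand"}, {Name: "architecture"}, {Name: "brand"}},
	}}}
	name, dup := duplicateKey(spec.Devices[0].Attributes)
	fmt.Println(name, dup)
}
```
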

In k8srm-prototype/pkg/api/capacity_types.go https://github.com/kubernetes-sigs/wg-device-management/pull/5#discussion_r1591439982 :

// +required
  • DeviceCount int `json:"count,omitempty"`
  • Devices []Device `json:"devices,omitempty"`

Here's what I dislike most - ambiguity.

One way to deal with what you describe is to decompose this into two separate models

Kevin, how do you think that would manifest in API? Something like this?

DevicePool:
  spec:
    driver: example.com/frobnicator
    devices:
    - simple:
        name: frob0
        attributes: [ ... ]

vs.

DevicePool:
  spec:
    driver: example.com/frobnicator
    devices:
    - macro:
        name: frob0
        sharedResources: { ... }
        attributes: [ ... ]
        devices:
        - name: frob0-sub1
          attributes: [ ... ]
        - name: frob0-sub2
          attributes: [ ... ]
        - name: frob0-sub3
          attributes: [ ... ]

or do we push it up even higher?

DevicePool:
  spec:
    driver: example.com/frobnicator
    mode: Macro
    sharedResources: { ... }
    devices:
    - name: frob0
      attributes: [ ... ]
    - name: frob0-sub1
      attributes: [ ... ]
    - name: frob0-sub2
      attributes: [ ... ]
    - name: frob0-sub3
      attributes: [ ... ]

In k8srm-prototype/pkg/api/capacity_types.go https://github.com/kubernetes-sigs/wg-device-management/pull/5#discussion_r1591458325 :

@@ -40,9 +40,17 @@ type DevicePoolSpec struct {
  // +optional
  Attributes []Attribute `json:"attributes,omitempty"`

  • // DeviceCount contains the total number of devices in the pool.
  • // Resources are pooled resources that are shared by all devices in the
  • // pool. This is typically used when representing a partitionable
  • // device, and need not be populated otherwise.
  • //
  • // +optional
  • Resources []ResourceCapacity `json:"resources,omitempty"`

By "linking" claims, you're implicitly defining a sort of sub-claim concept and doing "immediate" allocation - is that the right reading?

In k8srm-prototype/pkg/api/claim_types.go https://github.com/kubernetes-sigs/wg-device-management/pull/5#discussion_r1591489552 :

  • // Driver will limit the scope of devices considered to only those
  • // published by the specified driver. If the DeviceClass specifies a
  • // Driver, this should be left empty. If it is not, then it MUST match
  • // the Driver in the DeviceClass.
+// DeviceClaimInstance captures a claim which must be satisfied,
+// or a group for which one must be satisfied.
+type DeviceClaimInstance struct {
  • // At least one of AllOf and OneOf must be populated.
  • // If fields of DeviceClaimDetail are populated, OneOf should
  • // be empty.
  • DeviceClaimDetail `json:",inline"`

Alternate option:

Simple

claims:
- className: foo
  constraints: dev.foo = "bar"

Which really means:

claimsMode: AllOf
claims:
- className: foo
  constraints: dev.foo = "bar"

But if you need it, you can do OneOf:

claimsMode: OneOf
claims:
- className: circle
  constraints: dev.foo = "bar"
- className: square
  constraints: dev.model > 3
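
A minimal Go sketch of how the AllOf/OneOf semantics above might be evaluated. All names here (`ClaimsMode`, `Claim`, `Satisfied`, the `Foo` and `Model` fields) are hypothetical, a Go predicate stands in for the constraint expression, and the sketch ignores that one device cannot normally satisfy two claims at once.

```go
package main

import "fmt"

// ClaimsMode is a hypothetical enum; the empty value is treated as AllOf.
type ClaimsMode string

const (
	AllOf ClaimsMode = "AllOf"
	OneOf ClaimsMode = "OneOf"
)

// Device is a hypothetical, flattened view of an available device.
type Device struct {
	ClassName string
	Foo       string
	Model     int
}

// Claim pairs a class name with a predicate standing in for "constraints".
type Claim struct {
	ClassName  string
	Constraint func(Device) bool
}

// matchesAny reports whether at least one device satisfies the claim.
func (c Claim) matchesAny(devices []Device) bool {
	for _, d := range devices {
		if d.ClassName == c.ClassName && c.Constraint(d) {
			return true
		}
	}
	return false
}

// Satisfied applies the mode: OneOf needs any claim to match,
// AllOf (the default) needs every claim to match.
func Satisfied(mode ClaimsMode, claims []Claim, devices []Device) bool {
	matched := 0
	for _, c := range claims {
		if c.matchesAny(devices) {
			matched++
		}
	}
	if mode == OneOf {
		return matched > 0
	}
	return matched == len(claims)
}

// exampleClaims mirrors the circle/square claims in the comment above.
func exampleClaims() []Claim {
	return []Claim{
		{ClassName: "circle", Constraint: func(d Device) bool { return d.Foo == "bar" }},
		{ClassName: "square", Constraint: func(d Device) bool { return d.Model > 3 }},
	}
}

func main() {
	devices := []Device{{ClassName: "square", Model: 4}}
	fmt.Println(Satisfied(OneOf, exampleClaims(), devices)) // true
	fmt.Println(Satisfied(AllOf, exampleClaims(), devices)) // false
}
```
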

In k8srm-prototype/pkg/api/claim_types.go https://github.com/kubernetes-sigs/wg-device-management/pull/5#discussion_r1591493166 :

  • // DevicePoolName is the name of the DevicePool to which this
  • // device belongs. The driver for that device pool owns this
  • // entry.
  • // +required
  • DevicePoolName string `json:"devicePoolName"`
  • // DeviceName contains the name of the allocated Device.
  • // +required
  • DeviceName string `json:"deviceName,omitempty"`
  • // Conditions contains the latest observation of the device's state.
  • Conditions []metav1.Condition `json:"conditions"`
  • // DeviceIP contains the IP allocated for the device, if appropriate.
  • // +optional
  • DeviceIP *string `json:"deviceIP,omitempty"`

I think this is a second-order decision, but I see both sides of it. It's convenient to have known attributes here to avoid unnecessary indirections, but it is a slippery slope.

In k8srm-prototype/pkg/api/capacity_types.go https://github.com/kubernetes-sigs/wg-device-management/pull/5#discussion_r1591473110 :

  • //
  • // +optional
  • Resources []ResourceCapacity `json:"resources,omitempty"`
+}
  • +type ResourceCapacity struct {

  • // Name is the resource name/type.
  • Name string `json:"name"`
  • // Capacity is the total capacity of the named resource.
  • // +required
  • Capacity resource.Quantity `json:"capacity"`
  • // BlockSize is the increments in which capacity is consumed. For
  • // example, if you can only allocate memory in 4k pages, then the
  • // block size should be "4Ki". Default is 1.

We don't communicate it for standard CPU and memory, and that seems OK?

On one hand, if we know we will need it, we should include it so we don't have to deal with version skew. On the other hand, do we KNOW we will need it?
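
For reference, a minimal Go sketch of what BlockSize would mean in practice, under the assumption that a request is rounded up to the next multiple of the block size. `ConsumedCapacity` is a hypothetical helper, and plain integers (bytes) stand in for `resource.Quantity`.

```go
package main

import (
	"errors"
	"fmt"
)

// ConsumedCapacity returns how much capacity a request consumes when the
// resource can only be handed out in multiples of blockSize.
// A blockSize of 0 is treated as unset and defaults to 1.
func ConsumedCapacity(request, blockSize int64) (int64, error) {
	if request < 0 || blockSize < 0 {
		return 0, errors.New("request and blockSize must be non-negative")
	}
	if blockSize == 0 {
		blockSize = 1
	}
	// Integer ceiling division, then scale back up to whole blocks.
	blocks := (request + blockSize - 1) / blockSize
	return blocks * blockSize, nil
}

func main() {
	// A 10000-byte request against 4Ki (4096-byte) pages takes 3 pages.
	consumed, err := ConsumedCapacity(10000, 4096)
	fmt.Println(consumed, err) // 12288 <nil>
}
```
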


johnbelamaric commented 4 months ago

Ok, pushed a commit that addresses most of the comments. I am working on a revised set of pod spec examples. Here are the changes in the commit, along with some open questions.

Claim Model

Capacity Model

NVIDIA DGX A100 Example

Open Questions

johnbelamaric commented 4 months ago

Pushed one more commit that updates the classes.yaml for the new "label-selector based class of classes".

One limitation here is that, since we cannot combine a label selector with Constraints, we cannot create classes that apply constraints across another set of classes. So, to do "any 1Gbps SR-IOV NIC", I need to create a bunch of classes, one per vendor, each constrained to 1Gbps NICs and carrying an appropriate DeviceClass label, and then create the "any" class based on those labels. That's not ideal, but it does work. In this case, the user could also just use an "any SR-IOV NIC" class and put the constraint directly in the claim.
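
A minimal Go sketch of the workaround described above, assuming a class carries either a constraint or a label selector but never both. The types, the `example.com/nic` label, and the vendor names are all hypothetical and do not come from classes.yaml.

```go
package main

import "fmt"

// NIC is a hypothetical device description.
type NIC struct {
	Vendor string
	Gbps   int
	SRIOV  bool
}

// DeviceClass has either a Constraint (leaf class) or a Selector
// (class of classes), never both.
type DeviceClass struct {
	Name       string
	Labels     map[string]string
	Constraint func(NIC) bool
	Selector   map[string]string
}

// selects reports whether every selector entry is present in labels.
func selects(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// Admits reports whether the class accepts the NIC. A selector class
// accepts whatever any selected leaf class accepts.
func Admits(class DeviceClass, all []DeviceClass, nic NIC) bool {
	if class.Selector == nil {
		return class.Constraint != nil && class.Constraint(nic)
	}
	for _, c := range all {
		if c.Selector == nil && selects(class.Selector, c.Labels) && Admits(c, all, nic) {
			return true
		}
	}
	return false
}

// vendorClass builds the per-vendor "1Gbps SR-IOV NIC" leaf class.
func vendorClass(vendor string) DeviceClass {
	return DeviceClass{
		Name:   vendor + "-sriov-1g",
		Labels: map[string]string{"example.com/nic": "sriov-1g"},
		Constraint: func(n NIC) bool {
			return n.Vendor == vendor && n.SRIOV && n.Gbps == 1
		},
	}
}

// exampleClasses returns the "any" class and the per-vendor classes.
func exampleClasses() (DeviceClass, []DeviceClass) {
	anyClass := DeviceClass{
		Name:     "any-sriov-1g",
		Selector: map[string]string{"example.com/nic": "sriov-1g"},
	}
	return anyClass, []DeviceClass{vendorClass("vendor-a"), vendorClass("vendor-b")}
}

func main() {
	anyClass, all := exampleClasses()
	fmt.Println(Admits(anyClass, all, NIC{Vendor: "vendor-a", Gbps: 1, SRIOV: true}))  // true
	fmt.Println(Admits(anyClass, all, NIC{Vendor: "vendor-b", Gbps: 10, SRIOV: true})) // false
}
```
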

johnbelamaric commented 4 months ago

I think we need to merge this soon and continue additional changes in a new PR. It's getting unwieldy. Plus I need something to show tomorrow.

johnbelamaric commented 4 months ago

I am going to merge this in preparation for the meeting tomorrow. I have attempted to gather all the notes and open questions in an MD file here. Please let me know if I missed anything.