In putting this together, it's become obvious that a lot of what is being "templated" would be repeated in each and every slice. Would it be possible to create a separate API server object to hold the "template" objects for a given driver that can then be referenced by its resource slices? Possibly even leveraging a ConfigMap to do it instead of defining a new type.
> In putting this together, it's become obvious that a lot of what is being "templated" would be repeated in each and every slice. Would it be possible to create a separate API server object to hold the "template" objects for a given driver that can then be referenced by its resource slices? Possibly even leveraging a ConfigMap to do it instead of defining a new type.
Certainly it's possible, the question is whether it is worth the complexity, since then you have another independent object that can change or be missing, etc. This would help for any of the options 2, 4+ actually.
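To make the quoted suggestion a bit more concrete, here is a very rough sketch of the ConfigMap variant; every name, key, and field below is a hypothetical illustration rather than an existing or proposed API:

```yaml
# Hypothetical sketch only: per-driver "template" data published once and
# referenced by that driver's ResourceSlices, instead of being repeated in each slice.
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-driver-device-templates   # illustrative name
  namespace: gpu-driver               # illustrative namespace
data:
  a100-sxm4-40gb: |
    commonAttributes:
      productName: NVIDIA A100-SXM4-40GB
      architecture: ampere
    partitionShapes:                  # one entry per shape, not one per partition
      - name: 1g.5gb
        memory: 5Gi
      - name: 2g.10gb
        memory: 10Gi
```

Each slice would then carry only a small reference (e.g. the ConfigMap name plus a key) instead of repeating the template data itself.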
One thing we may want to think about is which factors drive scale and which are likely to grow fastest over time:

- `Ps` (partition shapes per device) - if we think GPU sizes / memory blocks are going to increase dramatically, this number will increase
- `Ppd` (partitions per device) - similarly, this will increase even more as memory blocks increase
- `Dpn` (devices per node) - I expect this will stay around 8-16 for quite some time, WDYT?
- `N` (nodes in the cluster) - varies per cluster, but we should think O(10,000) at least, if not 10x that in the long run based on historical trends
- `Spn` (slices per node) - depends on the particular slice size and specific slice design choices

We can characterize each suggestion then based on which of these scaling factors are relevant. Everything scales with `N` (holding the others fixed), but suggestions like the one quoted above can reduce the scale constant for some of the options. Nonetheless, for now let's think per node:

- `O(Ppd * Dpn)`
- `O(Spn)` for the factored out common attributes, plus `O(Ppd * Dpn)` for the rest
- `O(Spn * Ppd)` for the device shape, plus `O(Dpn)` for the rest
- `O(Spn * Ps)` (I suspect that `Ps = O(log Ppd)`, so this is an improvement, to basically `O(Spn * log Ppd)`), plus `O(Dpn)` for the rest
- `O(Spn * Ps)` since you list each partition shape, plus `O(Ppd * Dpn)` since you explicitly list each device partition

The suggestion above would change these (on a per node basis):

- `O(Spn)` for the common attributes, but the scaling factor would change
- `O(Spn)` for the device shape, since it would just be a constant reference
- `O(Spn)` for the device shape
- `O(Spn)` for the templates

Setting that aside, going back to the options without that suggestion, it would be possible to merge options 4 and 6 (option "10"...no, better stick with 7), such that: 1) we capture each partition shape once like in option 6; 2) we implicitly generate partitions like in option 4. If we did that, we would have:

- `O(Spn * Ps)` for the shapes/templates
- `O(Dpn)` for the rest

which seems like the best we can do while keeping the repeated items in the slice (rough numbers sketched just below).
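To put rough numbers on that per-node comparison, take some purely illustrative assumptions (not measurements from any real driver): `Dpn = 8`, `Ppd = 28`, `Ps = 7`, `Spn = 4`. Listing every partition of every device explicitly versus capturing each shape once and generating partitions implicitly then comes out to roughly:

```math
P_{pd} \cdot D_{pn} = 28 \cdot 8 = 224
\qquad \text{vs.} \qquad
S_{pn} \cdot P_s + D_{pn} = 4 \cdot 7 + 8 = 36
```

entries per node; either way, the cluster-wide footprint is that figure multiplied by `N`.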
Thinking more, I really do think that the things that will likely increase the most in the next 3-5 years are `N`, `Ps`, and `Ppd`.

This means that factoring out things that are duplicated per slice is a good idea, as the number of slices will increase with `N`. Not only that, but if the "front matter" - the duplicated things like shapes/templates - increases in size, we leave less and less space for the actual devices. This causes an increase in slices per node!

In other words, let's try to prevent growth being a multiplicative factor of `N` with either `Ps` or `Ppd`.
This makes me think our best bet is going to be a separate object for the "front matter", which grows with `O(Ps * Ppd)` but NOT with `N`, plus per-node slices that only reference it (so their size no longer depends on `Ps` and `Ppd`). Thus, the total for this becomes `O(N * Spn * Dpn)`. Since we expect `Dpn` to be relatively fixed, and since we moved all the "growth" out of the slice, `Spn` will also be fixed, so this is effectively `O(N)`, which is really the best we can do.

I hadn't put the numbers together, but your conclusion at the end is where my head was when suggesting this. There will still need to be some per-slice "template" data (e.g. the `pcie-root`
attributes from my example), but it would be info that is relevant just to the devices in the slice, so it actually lives in the appropriate place.
I picture one "front matter" object per GPU type which defines everything that is non-node-specific. And then each device in a resource slice has fields that point to a specific "front matter" object and then pull bits and pieces from it as appropriate.
Simple devices can still be just a named list of attributes, but if you want anything more sophisticated you have to start using this more complex structure.
> I picture one "front matter" object per GPU type which defines everything that is non-node-specific. And then each device in a resource slice has fields that point to a specific "front matter" object and then pull bits and pieces from it as appropriate.
Yes, that's what I am thinking too. Basically, push the stuff that is invariant across nodes into a separate object, and then refer to it. Those "front matter" pieces are probably constant for a given combination of hardware, firmware, and driver versions.
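A minimal sketch of how that could look from the device's side, assuming a hypothetical `GpuFrontMatter` kind and `frontMatterRef` field (the ResourceSlice shape is simplified as well; none of this is an agreed API):

```yaml
# Hypothetical sketch only -- kind, group, and field names are illustrative.
apiVersion: resource.k8s.io/v1alpha2   # whatever API group/version this would land in
kind: GpuFrontMatter                   # hypothetical: one object per GPU type
metadata:
  name: nvidia-a100-sxm4-40gb
spec:
  attributes:                          # everything that is not node-specific
    productName: NVIDIA A100-SXM4-40GB
    architecture: ampere
  partitionShapes:
    - name: 1g.5gb
      memory: 5Gi
---
# Simplified, illustrative ResourceSlice: each device just points at the shared object.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceSlice
metadata:
  name: node-a-gpus
spec:
  nodeName: node-a
  devices:
    - name: gpu-0
      frontMatterRef:                  # hypothetical field referencing the front matter
        name: nvidia-a100-sxm4-40gb
      attributes:                      # only the node-specific bits stay in the slice
        uuid: "<node-local-uuid>"
```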
FYI I added this as "Option 6" as well as "Option 7" here: https://github.com/kubernetes-sigs/wg-device-management/issues/20#issuecomment-2168189769
In relation to what came up in the call tonight ...
Instead of having a single centralized object with all of the "front matter", we could have one "front matter" object per node that all of the slices for that node refer to. It would likely have redundant information compared to most other nodes, but then we at least keep the front matter separate from the resource slices that consume it (and if a driver does want to go through the headache of centralizing it, they still can).
I haven't yet written this up properly (or added any code for it), but I wanted to push something out there with my thoughts around how to support partitioning in a more compact way.
Below is the (incomplete) YAML for what one A100 GPU with MIG disabled, one A100 with MIG enabled, and one H100 GPU (regardless of MIG mode) would look like. I am currently only showing the full GPUs and the `1g.*gb` devices (because I wrote this by hand), but you can imagine how it would be expanded with the rest.

Most of it is self-explanatory, except for one thing -- what the new `sharedCapacityInstances` field on a device implies. It is a way to define a "boundary" for any shared capacity referenced in a device template, meaning that all devices that provide the same mappings for a given `sharedCapacityInstance` will pull from the same `SharedCapacity`.

I will add more details soon (as well as a full prototype), but I wanted to get this out for initial comments before then.
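As a rough illustration of that boundary (a hypothetical fragment, not the actual YAML referenced in this comment; apart from `sharedCapacityInstances` and the shared-capacity/template concepts described above, every name and value is an assumption):

```yaml
# Hypothetical fragment only -- not the YAML referenced in this comment.
sharedCapacities:
  - name: gpu-0-memory              # one memory pool per physical GPU (assumed layout)
    capacity: 40Gi
  - name: gpu-1-memory
    capacity: 40Gi
deviceTemplates:
  - name: mig-1g.5gb
    consumesCapacity:
      memory: 5Gi                   # drawn from whichever shared capacity "memory" maps to
devices:
  - name: gpu-0-mig-1g.5gb-0
    template: mig-1g.5gb
    sharedCapacityInstances:
      memory: gpu-0-memory          # same mapping as the next device -> same pool
  - name: gpu-0-mig-1g.5gb-1
    template: mig-1g.5gb
    sharedCapacityInstances:
      memory: gpu-0-memory
  - name: gpu-1-mig-1g.5gb-0
    template: mig-1g.5gb
    sharedCapacityInstances:
      memory: gpu-1-memory          # different mapping -> pulls from gpu-1's pool
```

Here the first two devices provide the same mapping for `memory`, so they pull from the same `SharedCapacity`; the third maps to a different instance and therefore has its own budget.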