Minimal changes for partitionable devices in DRA evolution prototype

johnbelamaric commented 3 months ago

This PR adds the minimal fields needed to support partitionable devices. A few notes for consideration:

Terminology chosen is SharedAllocatable for the shared pooled resources, and SharedAllocatableConsumed for the device values that consume items from the pool.
SharedAllocatable is a []ResourceCapacity which is a struct with just name and quantity. This leaves out BlockSize and IntRange stuff to keep it as simple as possible.
SharedAllocatableConsumed is a map[string]resource.Quantity to mirror PodSpec requests. Given that this is the capacity model, not the claim model, consistency may not be needed here. In that case, we can probably change this to a struct instead, which would give us more room for expansion in the future.
Max in each list is arbitrarily set to 32.
Since SharedAllocatable is directly in ResourcePool, that means each partitionable device needs to be its own pool. We could consider two other options:
- Create another type for groups of SharedAllocatables. Kevin had this in an earlier version. Then each physical GPU would be in one of those, and the devices would name the group in addition to the pool in their SharedAllocatableConsumed:
```go
type ResourcePool struct {
	...
	Devices []Device
	SharedAllocatable []AllocatableGroup
}

type AllocatableGroup struct {
	Name string
	Allocatable []ResourceCapacity
}

type Device struct {
	...
	SharedAllocatableConsumed []ResourceRequest
}

type ResourceRequest struct {
	AllocatableGroupName string
	ResourceName string
	Quantity resource.Quantity
}
```
- Instead of the pool containing `Devices []Device`, it would contain `DeviceGroups []DeviceGroup`, where:
```go
type DeviceGroup struct {
SharedAllocatable []ResourceCapacity
Devices []Device
}
```
- I am sort of OK with the pool-per-GPU, as it is the most minimal change. I could also see going for the AllocatableGroup option as well, since it is incremental. I don't at this point like the last option much.
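
For reference, here is a minimal sketch of the pool-per-GPU shape described in the notes above. It is an illustration, not the code in this PR: quantities are plain `int64` rather than `resource.Quantity` so the snippet has no dependencies, and `maxListSize`, `Validate`, and the resource names in `main` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// maxListSize mirrors the arbitrary limit of 32 from the notes (hypothetical name).
const maxListSize = 32

// ResourceCapacity is a named quantity in the shared pool.
// Assumption: int64 stands in for resource.Quantity.
type ResourceCapacity struct {
	Name     string
	Capacity int64
}

// Device lists what it draws from the pool's shared resources.
type Device struct {
	Name                      string
	SharedAllocatableConsumed map[string]int64
}

// ResourcePool holds one partitionable device's shared resources and partitions.
type ResourcePool struct {
	Name              string
	SharedAllocatable []ResourceCapacity
	Devices           []Device
}

// Validate (hypothetical helper) checks the list limits and that every device
// only consumes resources the pool declares, within the declared capacity.
func (p ResourcePool) Validate() error {
	if len(p.SharedAllocatable) > maxListSize {
		return errors.New("too many shared allocatable entries")
	}
	capacity := make(map[string]int64, len(p.SharedAllocatable))
	for _, c := range p.SharedAllocatable {
		capacity[c.Name] = c.Capacity
	}
	for _, d := range p.Devices {
		if len(d.SharedAllocatableConsumed) > maxListSize {
			return fmt.Errorf("device %q: too many consumed entries", d.Name)
		}
		for name, qty := range d.SharedAllocatableConsumed {
			total, ok := capacity[name]
			if !ok {
				return fmt.Errorf("device %q consumes unknown resource %q", d.Name, name)
			}
			if qty > total {
				return fmt.Errorf("device %q consumes %d of %q, pool has %d", d.Name, qty, name, total)
			}
		}
	}
	return nil
}

func main() {
	// Made-up example values: one physical GPU published as its own pool.
	pool := ResourcePool{
		Name: "node0-gpu0",
		SharedAllocatable: []ResourceCapacity{
			{Name: "memory", Capacity: 40},
			{Name: "slices", Capacity: 7},
		},
		Devices: []Device{
			{Name: "gpu0-full", SharedAllocatableConsumed: map[string]int64{"memory": 40, "slices": 7}},
			{Name: "gpu0-half", SharedAllocatableConsumed: map[string]int64{"memory": 20, "slices": 3}},
		},
	}
	fmt.Println("validation error:", pool.Validate())
}
```

Because the shared resources live directly on the pool, a device never has to name a group; that is the simplification the pool-per-GPU option buys.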

k8s-ci-robot commented 3 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/kubernetes-sigs/wg-device-management/blob/main/OWNERS)~~ [johnbelamaric]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

johnbelamaric commented 3 months ago

Partitionable devices, along with driver-side magic, can also support the idea of "compound devices". Here's how it would work.

Suppose we have two drivers, one for GPUs and one for NICs. We have nodes with 8 GPUs and 4 NICs. We want to allow certain "valid" combinations of these to be consumed as a unit. The "rules" here are:

NICs are not available without a GPU
GPUs are available with or without a NIC, but specific GPUs go along with specific NICs, like:
- gpu0,1 go with nic0
- gpu2,3 go with nic1
- gpu4,5 go with nic2
- gpu6,7 go with nic3

This implies the following valid "molecules" for the triplet (gpu0, gpu1, nic0):

gpu0
gpu0 + nic0 (leaving gpu1 only available by itself, without a NIC)
gpu1
gpu1 + nic0 (leaving gpu0 only available by itself, without a NIC)
gpu0 + gpu1 + nic0

Similarly, there are 5 valid combinations for each of the other triplets.
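
The enumeration above is mechanical, so a compound driver could generate it. The following is a rough sketch under the stated rules only; `CompoundDevice`, `molecules`, and `allMolecules` are hypothetical names, and joining names with a hyphen is just one possible convention.

```go
package main

import "fmt"

// CompoundDevice is a hypothetical type: a device name plus the
// underlying items it consumes (one unit each).
type CompoundDevice struct {
	Name           string
	SharedConsumed map[string]int64
}

// molecules returns the five valid combinations for one (gpuA, gpuB, nic)
// triplet: each GPU alone, each GPU with the NIC, and both GPUs with the NIC.
// A NIC never appears without a GPU.
func molecules(gpuA, gpuB, nic string) []CompoundDevice {
	return []CompoundDevice{
		{Name: gpuA, SharedConsumed: map[string]int64{gpuA: 1}},
		{Name: gpuA + "-" + nic, SharedConsumed: map[string]int64{gpuA: 1, nic: 1}},
		{Name: gpuB, SharedConsumed: map[string]int64{gpuB: 1}},
		{Name: gpuB + "-" + nic, SharedConsumed: map[string]int64{gpuB: 1, nic: 1}},
		{Name: gpuA + "-" + gpuB + "-" + nic, SharedConsumed: map[string]int64{gpuA: 1, gpuB: 1, nic: 1}},
	}
}

// allMolecules builds the devices for a node where gpu(2i) and gpu(2i+1)
// go with nic(i), for i in [0, nics).
func allMolecules(nics int) []CompoundDevice {
	var devices []CompoundDevice
	for i := 0; i < nics; i++ {
		gpuA := fmt.Sprintf("gpu%d", 2*i)
		gpuB := fmt.Sprintf("gpu%d", 2*i+1)
		nic := fmt.Sprintf("nic%d", i)
		devices = append(devices, molecules(gpuA, gpuB, nic)...)
	}
	return devices
}

func main() {
	// 4 NICs and 8 GPUs, as in the example node.
	for _, d := range allMolecules(4) {
		fmt.Println(d.Name, d.SharedConsumed)
	}
}
```

With 4 triplets and 5 combinations each, this yields 20 compound devices for the example node.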

So, what we do is create a "compound device" driver that runs on the node but acts as an intermediary between the K8s control plane and the drivers for the underlying devices. It contains an in-process mini API server that serves the ResourcePool API, and we point the GPU and NIC drivers at that local instance. The compound device driver uses those to construct a new compound pool on top of those drivers that follows the rules above, using this partitionable model:

apiVersion: resource.k8s.io/v1alpha3
kind: ResourcePool
metadata:
  name: node0-compound0
spec:
  driver: compound.example.com
  nodeName: node0
  devices:
  - name: gpu0
    sharedConsumed:
      gpu0: 1
  - name: gpu0-nic0
    sharedConsumed:
      gpu0: 1
      nic0: 1
  - name: gpu1
    sharedConsumed:
      gpu1: 1
  - name: gpu1-nic0
    sharedConsumed:
      gpu1: 1
      nic0: 1
  - name: gpu0-gpu1-nic0
    sharedConsumed:
      gpu0: 1
      gpu1: 1
      nic0: 1
...
  sharedConsumable:
  - name: gpu0
    capacity: 1
  - name: gpu1
    capacity: 1
  - name: gpu2
    capacity: 1
  - name: gpu3
    capacity: 1
  - name: gpu4
    capacity: 1
  - name: gpu5
    capacity: 1
  - name: gpu6
    capacity: 1
  - name: gpu7
    capacity: 1
  - name: nic0
    capacity: 1
  - name: nic1
    capacity: 1
  - name: nic2
    capacity: 1
  - name: nic3
    capacity: 1

The compound device driver is the only one that actually publishes anything to the K8s control plane. It is also what kubelet makes calls to, and it in turn calls down to the other drivers.
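
To illustrate why the `sharedConsumed` / `sharedConsumable` entries are enough to enforce the rules, here is a sketch of the bookkeeping an allocator could do against such a pool. This is an assumption about how allocation might work, not the scheduler's actual implementation; `Pool`, `NewPool`, `CanAllocate`, and `Allocate` are hypothetical names.

```go
package main

import "fmt"

// Pool tracks the remaining capacity of each shared consumable item
// (hypothetical type, quantities simplified to int64).
type Pool struct {
	remaining map[string]int64
}

// NewPool copies the declared capacities so the caller's map is not mutated.
func NewPool(capacity map[string]int64) *Pool {
	r := make(map[string]int64, len(capacity))
	for name, qty := range capacity {
		r[name] = qty
	}
	return &Pool{remaining: r}
}

// CanAllocate reports whether everything the device consumes is still available.
// Items missing from the pool count as zero capacity.
func (p *Pool) CanAllocate(consumed map[string]int64) bool {
	for name, qty := range consumed {
		if p.remaining[name] < qty {
			return false
		}
	}
	return true
}

// Allocate deducts the consumed items, or returns false and changes nothing.
func (p *Pool) Allocate(consumed map[string]int64) bool {
	if !p.CanAllocate(consumed) {
		return false
	}
	for name, qty := range consumed {
		p.remaining[name] -= qty
	}
	return true
}

func main() {
	pool := NewPool(map[string]int64{"gpu0": 1, "gpu1": 1, "nic0": 1})

	gpu0nic0 := map[string]int64{"gpu0": 1, "nic0": 1}
	gpu1 := map[string]int64{"gpu1": 1}
	gpu1nic0 := map[string]int64{"gpu1": 1, "nic0": 1}

	fmt.Println(pool.Allocate(gpu0nic0))    // true: gpu0 and nic0 are now taken
	fmt.Println(pool.CanAllocate(gpu1))     // true: gpu1 is still free on its own
	fmt.Println(pool.CanAllocate(gpu1nic0)) // false: nic0 is already consumed
}
```

Once `gpu0-nic0` is allocated, `gpu1` remains available only by itself, which matches the "leaving gpu1 only available by itself" note in the list of molecules.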

There are lots of details to work out for this, of course. For example, ideally users don't need to know they are using this intermediary, except maybe based on the class they choose. This would mean that the CEL-based attributes they use should still be the ones used by the underlying devices, rather than some that are particular to the compound device driver (which also may have some). For that, we may need to make sure that attributes are qualified, always, rather than allowing the short-hand of "unqualified means from the driver". Otherwise I can see a lot of confusion, especially during copy-and-paste situations.

There are also a few limitations:

Underlying devices will only be available as part of a compound device. You cannot make them available both as independent devices managed by their own drivers, and as part of a compound device.
Theoretically it would be possible to make a compound device that includes partitions from underlying partitionable devices, but this would likely be pretty difficult and is not something we should make a priority (in fact the compound driver probably should not allow it, at least for the foreseeable future).
Similarly, when we add allocatable device resources as a way of creating fractional devices, we probably will want to exclude that from the compound device driver functionality. It may theoretically be possible to support fractional devices in the compound device by utilizing fractional allocation of underlying devices, but that would be...complex.

klueska commented 3 months ago

I'm not sold on the complex-device scenario you proposed here, but I think we could iterate on that later. The more important thing is to agree on the API for partitionable devices, and I'm fairly happy with the naming / structure I proposed in my comment here: https://github.com/kubernetes-sigs/wg-device-management/pull/27/files#r1634768662

johnbelamaric commented 3 months ago

> I'm not sold on the complex-device scenario you proposed here, but I think we could iterate on that later. The more important thing is to agree on the API for partitionable devices, and I'm fairly happy with the naming / structure I proposed in my comment here: https://github.com/kubernetes-sigs/wg-device-management/pull/27/files#r1634768662

Yeah, that is 100% on top of this without affecting what this looks like. It's something I want to prototype before too long - but it would be out-of-tree anyway :)

johnbelamaric commented 3 months ago

SGTM

pohly commented 3 months ago

The rationale for that:

The quantities are for machines, not humans, so efficiency trumps readability.
No nesting means that the maximum slice size can be higher, with the driver deciding how they want to use that.

thockin commented 3 months ago

Since this is "what's in the KEP" I think we should merge it and rebase all the options on it, so they appear as diffs. But I screwed up and LGTM'ed option 2 (#29). I don't have super on this repo, so I cannot manually fix it.

johnbelamaric commented 3 months ago

Ok, this matches the KEP. Merging.
