kubernetes / enhancements

Enhancements tracking repo for Kubernetes
Apache License 2.0
3.46k stars 1.49k forks source link

DRA: Resource Claim Status with possible standardized network interface data #4817

Open klueska opened 3 months ago

klueska commented 3 months ago

Enhancement Description

kannon92 commented 3 months ago

/sig node

squeed commented 2 months ago

FYI, here is the CNI result type.

A typical result looks something like:

{
    "ips": [
        {
          "address": "10.1.0.5/16",
          "gateway": "10.1.0.1",
          "interface": 2
        },
        {
          "address": "2001:db8::42/64",
          "gateway": "2001:db8::1",
          "interface": 2
        }
    ],
    "routes": [
      {
        "dst": "0.0.0.0/0"
      },
      {
        "dst": "::/0"
      }
    ],
    "interfaces": [
        {
            "name": "cni0",
            "mac": "00:11:22:33:44:55"
        },
        {
            "name": "veth3243",
            "mac": "55:44:33:22:11:11"
        },
        {
            "name": "eth0",
            "mac": "99:88:77:66:55:44",
            "sandbox": "/var/run/netns/blue"
        }
    ]
}

The CNI result type is mostly not exposed to k8s end-users. That said, IP addresses of the PodSandbox would be the bare minimum.

Note that IP addresses are, of course, mutable; if a node reboots, the PodSandbox will be recreated and new IPs allocated.

aojea commented 2 months ago

I was more leaning towards standardize by convention and not via Kubernetes API fields, think that it will be very complex on reach an agreement and support all the use cases, and also very difficult to iterate if everytime we need to go through API review. Adding a new field to routes or interfaces per example, will require to go through all the process, and there will be always niche use cases for some protocols

I was thinking using the CNI result types as foundation, but there are fields we don't want on the API for users, sandbox in the example.

I'm proposing to add a new field unstructured to Claim Status (right now the Configuration is also unstructured) where DRA drivers can write the results of each allocation, and use the CNI types to create a convention object we can share across networking DRA drivers

// DeviceConfiguration must have exactly one field set. It gets embedded
// inline in some other structs which have other fields, so field names must
// not conflict with those.
type DeviceConfiguration struct {
    // Opaque provides driver-specific configuration parameters.
    //
    // +optional
    // +oneOf=ConfigurationType
    Opaque *OpaqueDeviceConfiguration `json:"opaque,omitempty" protobuf:"bytes,1,opt,name=opaque"`
}

cc: @danwinship

LionelJouin commented 2 months ago

I made a PoC about the ResourceClaim status for networking: https://gist.github.com/LionelJouin/5cfc11eecf73663b5657ed3be1eb6c00

The AllocatedDeviceStatus has been extended to contain two new fields:

For reference for the NetworkDeviceInfo field, Multus works with annotations and reports the status in a special annotation. Here is the API it uses for it: https://github.com/k8snetworkplumbingwg/network-attachment-definition-client/blob/v1.7.1/pkg/apis/k8s.cni.cncf.io/v1/types.go#L103

This PoC uses CNI, the result is parsed and the IPs, Interface Name and Mac address is added to the NetworkDeviceInfo field and the full CNI result is added to the DeviceInfo field.

Here are the changes to the API:

// AllocatedDeviceStatus contains the status of an allocated device, if the
// driver chooses to report it. This may include driver-specific information.
type AllocatedDeviceStatus struct {
    ...
    // DeviceInfo contains Arbitrary driver-specific data.
    //
    // +optional
    DeviceInfo runtime.RawExtension `json:"deviceInfo,omitempty" protobuf:"bytes,6,rep,name=deviceInfo"`

    // NetworkDeviceInfo contains network-related information specific to the device.
    //
    // +optional
    NetworkDeviceInfo NetworkDeviceInfo `json:"networkDeviceInfo,omitempty" protobuf:"bytes,7,rep,name=networkDeviceInfo"`
}

// NetworkDeviceInfo provides network-related details for the allocated device.
// This information may be filled by drivers or other components to configure
// or identify the device within a network context.
type NetworkDeviceInfo struct {
    // Interface specifies the name of the network interface associated with
    // the allocated device. This might be the name of a physical or virtual
    // network interface.
    //
    // +optional
    Interface string `json:"interface,omitempty" protobuf:"bytes,1,rep,name=interface"`

    // IPs lists the IP addresses assigned to the device's network interface.
    // This can include both IPv4 and IPv6 addresses.
    //
    // +optional
    IPs []string `json:"ips,omitempty" protobuf:"bytes,2,rep,name=ips"`

    // Mac represents the MAC address of the device's network interface.
    //
    // +optional
    Mac string `json:"mac,omitempty" protobuf:"bytes,3,rep,name=mac"`
}

Here is an example of the final ResourceClaim after running the demo of this PoC:

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  name: macvlan-eth0-attachment
spec:
  devices:
    config:
    - opaque:
        driver: poc.dra.networking
        parameters:
          config: '{ "cniVersion": "1.0.0", "name": "macvlan-eth0", "plugins": [ {
            "type": "macvlan", "master": "eth0", "mode": "bridge", "ipam": { "type":
            "host-local", "ranges": [ [ { "subnet": "10.10.1.0/24" } ] ] } } ] }'
          interface: net1
      requests:
      - macvlan-eth0
    requests:
    - allocationMode: ExactCount
      count: 1
      deviceClassName: cni-v1
      name: macvlan-eth0
status:
  allocation:
    devices:
      config:
      - opaque:
          driver: poc.dra.networking
          parameters:
            config: '{ "cniVersion": "1.0.0", "name": "macvlan-eth0", "plugins": [
              { "type": "macvlan", "master": "eth0", "mode": "bridge", "ipam": { "type":
              "host-local", "ranges": [ [ { "subnet": "10.10.1.0/24" } ] ] } } ] }'
            interface: net1
        requests:
        - macvlan-eth0
        source: FromClaim
      results:
      - device: cni
        driver: poc.dra.networking
        pool: kind-worker
        request: macvlan-eth0
    nodeSelector:
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values:
          - kind-worker
  deviceStatuses:
  - conditions: null
    device: cni
    deviceInfo:
      cniVersion: 1.0.0
      interfaces:
      - mac: 1e:32:6c:b7:c9:66
        name: net1
        sandbox: /var/run/netns/cni-5b7c0846-7995-9450-f441-a177399d08d5
      ips:
      - address: 10.10.1.2/24
        gateway: 10.10.1.1
        interface: 0
    driver: poc.dra.networking
    networkDeviceInfo:
      interface: net1
      ips:
      - 10.10.1.2/24
      mac: 1e:32:6c:b7:c9:66
    pool: kind-worker
    request: macvlan-eth0
  reservedFor:
  - name: demo-a
    resource: pods
    uid: 2bd46adf-b478-4e25-9e37-828539799169
pohly commented 2 months ago

The AllocatedDeviceStatus has been extended to contain two new fields:

Not just that, the entire claim.status.deviceStatuses field is new. Who writes that field? I see in https://github.com/LionelJouin/kubernetes/commits/dra-device-status/ that you grant kubelet permission to write the claim status, but not the actual code.

The direction from SIG Auth is to move away from having kubelet publish information on behalf of node agents. It's better to authorize those agents and let them do the writes themselves. That enables version skew (kubelet from 1.31, control plane from 1.32 with a different API) and keeps kubelet out of the authorization path.

I don't have a strong opinion about what information should be published. API reviewers will want a strong justification for using a RawExtension. But a strongly typed NetworkDeviceInfo struct isn't an easy sell either unless there's consensus about which information should be standardized like that.

      - opaque:
          driver: poc.dra.networking
          parameters:
            config: '{ "cniVersion": "1.0.0", "name": "macvlan-eth0", "plugins": [
              { "type": "macvlan", "master": "eth0", "mode": "bridge", "ipam": { "type":
              "host-local", "ranges": [ [ { "subnet": "10.10.1.0/24" } ] ] } } ] }'
            interface: net1

Why is config a string? Wouldn't this be better:

      - opaque:
          driver: poc.dra.networking
          parameters:
            config:
              cniVersion: "1.0.0"
              name: "macvlan-eth0
              ...
            interface: net1

There's no kind and version in the parameters content. You will have to think about versioning those parameters when going forward with standardizing them.

LionelJouin commented 2 months ago

Oops, that's correct, my bad, sorry, here it the API change:

// ResourceClaimStatus tracks whether the resource has been allocated and what
// the result of that was.
type ResourceClaimStatus struct {
    ...

    // DeviceStatuses contains the status of each device allocated for this
    // claim, as reported by the driver. This can include driver-specific
    // information. Entries are owned by their respective drivers.
    //
    // +optional
    // +listType=map
    // +listMapKey=devicePoolName
    // +listMapKey=deviceName
    DeviceStatuses []AllocatedDeviceStatus `json:"deviceStatuses,omitempty" protobuf:"bytes,4,opt,name=deviceStatuses"`
}

// AllocatedDeviceStatus contains the status of an allocated device, if the
// driver chooses to report it. This may include driver-specific information.
type AllocatedDeviceStatus struct {
    // Request is the name of the request in the claim which caused this
    // device to be allocated. Multiple devices may have been allocated
    // per request.
    //
    // +required
    Request string `json:"request" protobuf:"bytes,1,rep,name=request"`

    // Driver specifies the name of the DRA driver whose kubelet
    // plugin should be invoked to process the allocation once the claim is
    // needed on a node.
    //
    // Must be a DNS subdomain and should end with a DNS domain owned by the
    // vendor of the driver.
    //
    // +required
    Driver string `json:"driver" protobuf:"bytes,2,rep,name=driver"`

    // This name together with the driver name and the device name field
    // identify which device was allocated (`<driver name>/<pool name>/<device name>`).
    //
    // Must not be longer than 253 characters and may contain one or more
    // DNS sub-domains separated by slashes.
    //
    // +required
    Pool string `json:"pool" protobuf:"bytes,3,rep,name=pool"`

    // Device references one device instance via its name in the driver's
    // resource pool. It must be a DNS label.
    //
    // +required
    Device string `json:"device" protobuf:"bytes,4,rep,name=device"`

    // Conditions contains the latest observation of the device's state.
    // If the device has been configured according to the class and claim
    // config references, the `Ready` condition should be True.
    //
    // +optional
    // +listType=atomic
    Conditions []metav1.Condition `json:"conditions" protobuf:"bytes,5,rep,name=conditions"`

    // DeviceInfo contains Arbitrary driver-specific data.
    //
    // +optional
    DeviceInfo runtime.RawExtension `json:"deviceInfo,omitempty" protobuf:"bytes,6,rep,name=deviceInfo"`

    // NetworkDeviceInfo contains network-related information specific to the device.
    //
    // +optional
    NetworkDeviceInfo NetworkDeviceInfo `json:"networkDeviceInfo,omitempty" protobuf:"bytes,7,rep,name=networkDeviceInfo"`
}

// NetworkDeviceInfo provides network-related details for the allocated device.
// This information may be filled by drivers or other components to configure
// or identify the device within a network context.
type NetworkDeviceInfo struct {
    // Interface specifies the name of the network interface associated with
    // the allocated device. This might be the name of a physical or virtual
    // network interface.
    //
    // +optional
    Interface string `json:"interface,omitempty" protobuf:"bytes,1,rep,name=interface"`

    // IPs lists the IP addresses assigned to the device's network interface.
    // This can include both IPv4 and IPv6 addresses.
    //
    // +optional
    IPs []string `json:"ips,omitempty" protobuf:"bytes,2,rep,name=ips"`

    // Mac represents the MAC address of the device's network interface.
    //
    // +optional
    Mac string `json:"mac,omitempty" protobuf:"bytes,3,rep,name=mac"`
}

The ResourceClaim Status is set via the Kubernetes API by the component calling the CNI add. The flow is the following:

  1. NodePrepareResources is called (Server is Containerd)
    1. The ResourceClaim is retrieved via the Kubernetes API
  2. RunPodSandbox is called (Server is Containerd)
    1. The CNIs are called
    2. The ResourceClaim status is updated via the Kubernetes API

In this PoC, The container runtime acts as DRA-Driver (I mixed several PoCs together), so it calls the CNIs and sets the status via the Kubernetes API. I used the kubelet kubeconfig to access the Kubernetes API (as for simplicity since Containerd was not using the Kubernetes API before).

Yes, that's right, the config could be in better format, for this PoC I mainly focused on the ResourceClaim Status and DRA-Driver in Containerd.

pohly commented 2 months ago

I used the kubelet kubeconfig to access the Kubernetes API (as for simplicity since Containerd was not using the Kubernetes API before).

So it wasn't kubelet itself, thus avoiding the version skew problem there at the cost moving it into containerd. How to secure that write will be tricky.

johnbelamaric commented 2 months ago

@LionelJouin this is great to see. Antonio, Tim, and I put together some thoughts on the API options (open to dev@kubernetes.io):

https://docs.google.com/document/d/1smjEm7WxFm8btQwywdr_F5BZkiifatNqDvrQUaesu1M/edit?usp=sharing

Personally, I like Option 5 - real, built-in API.

If code snippets help, I can add those. Once we decide on the rough API, I can start the KEP. We'll need to figure out the authorization bit: this needs to be writeable by the drivers, but only for claims on the same node as the driver instance. @pohly I know you have looked a lot at the node authorizer, do you think we can use that here?

johnbelamaric commented 2 months ago

/assign @aojea

pohly commented 2 months ago

@pohly I know you have looked a lot at the node authorizer, do you think we can use that here?

Better ask SIG Auth. The way I remember it, the answer is currently "no, node authorizer is only for kubelet, not agents running on a node". But I think there were plans to change that, which might be sufficient for the this issue (start as alpha without locking down access, make it a requirement for beta that it gets locked down through some SIG Auth KEP).

haircommander commented 2 months ago

/milestone v1.32 /label lead-opted-in

johnbelamaric commented 2 months ago

/assign @LionelJouin

Lionel is getting a PR ready.

jenshu commented 2 months ago

Hello @klueska @pohly @johnbelamaric @thockin @aojea 👋, Enhancements team here.

Just checking in as we approach enhancements freeze on 02:00 UTC Friday 11th October 2024 / 19:00 PDT Thursday 10th October 2024.

This enhancement is targeting for stage alpha for v1.32 (correct me, if otherwise)

Here's where this enhancement currently stands:

For this KEP, we would just need to update the following:

The status of this enhancement is marked as at risk for enhancement freeze. Please keep the issue description up-to-date with appropriate stages as well.

If you anticipate missing enhancements freeze, you can file an exception request in advance. Thank you!

jenshu commented 1 month ago

1.32 Enhancements team here. I see that the KEP and PRR have been merged, so I've updated the status to tracked for enhancements freeze!

chanieljdan commented 1 month ago

Hi @klueska @pohly @johnbelamaric @thockin @aojea 👋, 1.32 Release Docs Lead here.

Does this enhancement work planned for 1.32 require any new docs or modification to existing docs?

If so, please follows the steps here to open a PR against dev-1.32 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Thursday October 24th 2024 18:00 PDT.

Also, take a look at Documenting for a release to get yourself familiarize with the docs requirement for the release.

Thank you!

mbianchidev commented 1 month ago

Hey hey @LionelJouin @aojea @johnbelamaric 👋 from the v1.32 Communications Team!

We'd love for you to consider writing a feature blog about your enhancement. Some reasons why you might want to write a blog for this feature include (but are not limited to) if this introduces breaking changes, is important to our users, or has been in progress for a long time and it is graduating.

To opt-in, let us know by opening a Feature Blog placeholder PR against the website repository by 30th Oct 2024. For more information about writing a blog see the blog contribution guidelines.

Note: In your placeholder PR, use XX characters for the blog date in the front matter and file name. We will work with you on updating the PR with the publication date once we finalize the blog schedule.

chanieljdan commented 1 month ago

Hi @klueska @pohly @johnbelamaric @thockin @aojea 👋, 1.32 Release Docs Lead here.

Does this enhancement work planned for 1.32 require any new docs or modification to existing docs?

If so, please follows the steps here to open a PR against dev-1.32 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Thursday October 24th 2024 18:00 PDT.

Also, take a look at Documenting for a release to get yourself familiarize with the docs requirement for the release.

Thank you!

Hi @klueska @pohly @johnbelamaric @thockin @aojea 👋,

Just a reminder to open a placeholder PR against dev-1.32 branch in the k/website repo for this (steps available here). The deadline for this is a week away at Thursday October 24, 2024 18:00 PDT.

Thanks,

Daniel

jenshu commented 1 month ago

Hey again @klueska @pohly @johnbelamaric @thockin @aojea 👋, v1.32 Enhancements team here.

Just checking in as we approach code freeze at 02:00 UTC Friday 8th November 2024 / 19:00 PDT Thursday 7th November 2024.

Here's where this enhancement currently stands:

For this enhancement, it looks like the following PRs need to be merged before code freeze (and we need to update the Issue description to include all the related PRs of this KEP):

If you anticipate missing code freeze, you can file an exception request in advance.

Also, please let me know if there are other PRs in k/k we should be tracking for this KEP. As always, we are here to help if any questions come up. Thanks!

jenshu commented 3 weeks ago

Hello @klueska @pohly @johnbelamaric @thockin @aojea 👋, v1.32 Enhancements team here.

With all the implementation (code related) PRs merged as per the issue description:

This enhancement is now marked as tracked for code freeze for the v1.32 Code Freeze!

pohly commented 1 week ago

/wg device-management