ffromani opened this issue 2 years ago
quick demo:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  creationTimestamp: "2022-04-21T11:30:27Z"
  name: noderesourcetopologies.topology.node.k8s.io
spec:
  conversion:
    strategy: None
  group: topology.node.k8s.io
  names:
    kind: NodeResourceTopology
    listKind: NodeResourceTopologyList
    plural: noderesourcetopologies
    shortNames:
    - nrt
    singular: noderesourcetopology
  scope: Cluster
  versions:
  - name: v1beta1
    schema:
      openAPIV3Schema:
        description: NodeResourceTopology describes node resources and their topology.
        properties:
          apiVersion:
            description: 'APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
            type: string
          kind:
            description: 'Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
            type: string
          metadata:
            type: object
          status:
            description: NodeResourceTopologyStatus defines the observed state of NodeResourceTopology
            properties:
              zones:
                description: ZoneList contains an array of Zone objects.
                items:
                  description: Zone represents a resource topology zone, e.g. socket, node, die or core.
                  properties:
                    attributes:
                      description: AttributeList contains an array of AttributeInfo objects.
                      items:
                        description: AttributeInfo contains one attribute of a Zone.
                        properties:
                          name:
                            type: string
                          value:
                            type: string
                        required:
                        - name
                        - value
                        type: object
                      type: array
                    costs:
                      description: CostList contains an array of CostInfo objects.
                      items:
                        description: CostInfo describes the cost (or distance) between two Zones.
                        properties:
                          name:
                            type: string
                          value:
                            format: int64
                            type: integer
                        required:
                        - name
                        - value
                        type: object
                      type: array
                    name:
                      type: string
                    parent:
                      type: string
                    resources:
                      description: ResourceInfoList contains an array of ResourceInfo objects.
                      items:
                        description: ResourceInfo contains information about one resource type.
                        properties:
                          allocatable:
                            anyOf:
                            - type: integer
                            - type: string
                            description: Allocatable quantity of the resource, corresponding to allocatable in node status, i.e. total amount of this resource available to be used by pods.
                            pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
                            x-kubernetes-int-or-string: true
                          available:
                            anyOf:
                            - type: integer
                            - type: string
                            description: Available is the amount of this resource currently available for new (to be scheduled) pods, i.e. Allocatable minus the resources reserved by currently running pods.
                            pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
                            x-kubernetes-int-or-string: true
                          capacity:
                            anyOf:
                            - type: integer
                            - type: string
                            description: Capacity of the resource, corresponding to capacity in node status, i.e. total amount of this resource that the node has.
                            pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
                            x-kubernetes-int-or-string: true
                          name:
                            description: Name of the resource.
                            type: string
                        required:
                        - allocatable
                        - available
                        - capacity
                        - name
                        type: object
                      type: array
                    type:
                      type: string
                  required:
                  - name
                  - type
                  type: object
                type: array
            type: object
        required:
        - zones
        type: object
    served: true
    storage: true
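For context, here is a hypothetical object conforming to the demo schema above (all names and values are invented; note that zones sits under status per the proposed spec/status split):

```yaml
apiVersion: topology.node.k8s.io/v1beta1
kind: NodeResourceTopology
metadata:
  name: worker-0
status:
  zones:
  - name: node-0
    type: Node
    costs:
    - name: node-1
      value: 20
    resources:
    - name: cpu
      capacity: "8"
      allocatable: "6"
      available: "4"
```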
xref: https://github.com/k8stopologyawareschedwg/noderesourcetopology-api/issues/22 ("report TM scope alongside policy")
Can we also include TopologyManagerPolicyOptions? I believe it should be of type []string.
I'm unconvinced about this. We decided to add the TM policy and then we mixed it with the scope, so clearly separating the scope makes sense. But arguably adding the TM policy was already a stretch, so I'm not keen on adding even more data which is not directly related to the NUMA zone resource tracking (TM data is not).
That said, TM config is related to the problem we are handling here, so we can totally discuss new extra objects to add to the general umbrella of this API.
Yeah, I was thinking about either option. I was confused that NodeTopology is mixed with kubelet configuration. Maybe we can introduce a new Custom Resource to track TM configuration.
This is something we can (and should!) totally do; I'm very open to exploring this direction. We will need to design an object which is not too tied to the TM status, but I'm confident we can come up with something nice for v1beta1. Looking forward to hearing your thoughts in this area.
Sounds good! Is there any design doc or PR where we can keep track of this?
Not yet, we will need to figure out the timeline. Word of warning: this is very unlikely to progress much in the next 2-3 weeks. When the design starts, I'll link it here. We will need a way to collaborate, so it's possible it will be a design doc or a PR to reason about (or both).
- make topologyPolicy a single string
- report TM scope alongside policy
- a more radical approach: split the TM configuration data into a new object
I am inclined towards the more radical approach here :) We should figure out a way to represent Topology Manager policies, scope and the newly introduced policy options. Perhaps something like a map[string][]string, so that we can take in kubelet flags as keys, with their values captured in an array of strings. That can be useful if the sys-admin decides to configure multiple policy options (when we have more of them; even though that may be unlikely, it should not be ruled out).
Also, I noticed that we left out this field in the CRD spec above. Was that intentional, so we can have a discussion on this topic?
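A minimal sketch of the map[string][]string idea floated above, with kubelet flag names as keys and the configured values as a slice. All names here are illustrative only, not part of the NRT API:

```go
package main

import "fmt"

// TopologyConfig is a hypothetical field type: kubelet flag name -> values.
type TopologyConfig map[string][]string

func main() {
	cfg := TopologyConfig{
		"topology-manager-policy":         {"restricted"},
		"topology-manager-scope":          {"pod"},
		"topology-manager-policy-options": {"prefer-closest-numa-nodes"},
	}
	// Looking up a flag yields every value the admin configured for it.
	fmt.Println(cfg["topology-manager-scope"][0]) // pod
}
```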
I agree with @fromanirh that the use of topologyPolicies for capturing both the topology manager policy and the scope has been a bit of an abuse of this field, but at the same time I see the value of representing the topology-related kubelet configuration as part of this API. One reason for that is observability. Another is to enable additional use cases that could potentially make use of this API for custom purposes, like the one @PiotrProkop mentioned he has been working on here.
@PiotrProkop If you have the time and would like, feel free to submit a PR that would unblock your work. Even though we don't have the bandwidth to drive this work, we will definitely make the time to review it :)
Adding a note that use of maps in the APIs is discouraged: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#lists-of-named-subobjects-preferred-over-maps so we can't go with the approach I recommended earlier.
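Following the linked conventions, the same data could be modeled as a list of named subobjects instead of a map. A sketch; the type and field names are illustrative only:

```go
package main

import "fmt"

// TopologyManagerOption is a named subobject, as the API conventions prefer
// over map keys: the "name" field identifies the entry in the list.
type TopologyManagerOption struct {
	Name   string   `json:"name"`
	Values []string `json:"values,omitempty"`
}

// Lookup mimics map access over the named-subobject list.
func Lookup(opts []TopologyManagerOption, name string) []string {
	for _, o := range opts {
		if o.Name == name {
			return o.Values
		}
	}
	return nil
}

func main() {
	opts := []TopologyManagerOption{
		{Name: "topology-manager-policy", Values: []string{"restricted"}},
		{Name: "topology-manager-scope", Values: []string{"container"}},
	}
	fmt.Println(Lookup(opts, "topology-manager-scope")[0]) // container
}
```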
Since TopologyOptions are not dynamic and are built into kubelet, we don't have to use maps or lists; we can have explicit types for those options. The consequence is that every time a new option is added, the API has to be updated. Or we can embed the PolicyOption struct and just rely on the k8s version.
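The explicitly-typed alternative could look like the sketch below: every Topology Manager knob gets its own field, so adding a new kubelet option means updating this API too. Field names (including PreferClosestNUMANodes) are assumptions for illustration, not a merged API:

```go
package main

import "fmt"

// TopologyManagerConfiguration models each knob as an explicit field.
type TopologyManagerConfiguration struct {
	Policy string `json:"policy,omitempty"` // e.g. "single-numa-node"
	Scope  string `json:"scope,omitempty"`  // "pod" or "container"
	// PreferClosestNUMANodes stands in for a policy option; future
	// kubelet options would become new fields here.
	PreferClosestNUMANodes bool `json:"preferClosestNUMANodes,omitempty"`
}

func main() {
	cfg := TopologyManagerConfiguration{Policy: "single-numa-node", Scope: "pod"}
	fmt.Println(cfg.Policy, cfg.Scope) // single-numa-node pod
}
```

The trade-off versus the map or list forms is API churn on every new option, in exchange for strong typing and validation.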
@swatisehgal @fromanirh should we move this discussion to https://github.com/kubernetes/kubernetes/pull/96275? Or is the intent to merge v1alpha1 and then create a KEP to start the discussion around v1beta1? Right now the only thing that is blocking me is that there is no way to retrieve the scope for policies other than single-numa, so should I ask to add tmScope to v1alpha1, or add the missing policies:
RestrictedPodLevel
RestrictedContainerLevel
BestEffortPodLevel
BestEffortContainerLevel
And if I should ask for it, should I do it here or in the kubernetes PR 😄
Sure, let's move the discussion to https://github.com/kubernetes/kubernetes/pull/96275. I would be inclined to merge as v1alpha1 and then handle v1beta1 separately. But I am open to discussion, and it depends on how the API review on the PR proceeds.
Irrespective of the approach we take, we should add these missing policies for the sake of completeness, I will update my PR: 96275 as well as here.
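The missing policies listed above could be added as constants next to the existing ones; a sketch, with the exact spellings still up to what the PR settles on:

```go
package main

import "fmt"

// TopologyManagerPolicy mirrors the string type already used by the NRT API.
type TopologyManagerPolicy string

// The four missing policy+scope combinations from the discussion above.
const (
	RestrictedPodLevel       TopologyManagerPolicy = "RestrictedPodLevel"
	RestrictedContainerLevel TopologyManagerPolicy = "RestrictedContainerLevel"
	BestEffortPodLevel       TopologyManagerPolicy = "BestEffortPodLevel"
	BestEffortContainerLevel TopologyManagerPolicy = "BestEffortContainerLevel"
)

func main() {
	fmt.Println(RestrictedPodLevel) // RestrictedPodLevel
}
```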
Sounds good! I will leave a comment on https://github.com/kubernetes/kubernetes/pull/96275 to provide visibility on why this field is changed.
Note we will also need to change the updating agents (NFD, RTE). Edit: and make sure the consumers (most notably the scheduler plugin) can tolerate this change.
So probably in this repo:
Restricted TopologyManagerPolicy = "Restricted"
BestEffort TopologyManagerPolicy = "BestEffort"
should stay, and in the kubernetes repo only the policies mixed with scope will remain? And when NFD and RTE are updated to use NodeResourceTopology from Kubernetes, will the removal of the old policies be handled?
these are good points. We need to evaluate and explore the tradeoffs here.
Yes, that's right. All the consumers of the API would have to be updated.
I think we should keep the "Restricted" and "BestEffort" policies for https://github.com/k8stopologyawareschedwg/noderesourcetopology-api, but in the kubernetes PR (API in staging) we should move to policy+scope for the sake of consistency. It would seem a bit odd to have "Restricted" and "BestEffort" policies alongside their scope-qualified counterparts; I foresee it raising eyebrows. When NFD, RTE and the Scheduler Plugin transition to consuming the NRT API from the staging repo, we would have to handle it carefully.
:+1:
I'm very convinced we should remove configuration data (= TM policy, TM scope) from the NRT API proper. The question is: how to represent it, if at all? The more I think about it, the more I'm convinced we should not represent it at all. Perhaps only define labels (and their values) and depend on other components (e.g. NFD) to add these labels. Assuming nodes are properly labeled, we have the exact same information we currently expose in v1alpha1, and, I'd say, in a cleaner and more kubernetes-y way.
I agree. We are already relying on NFD to advertise NRT resources, so relying on specific labels makes sense.
I like the idea of representing this information as labels. It gives us what we need and is much simpler than exposing a new object. Given that NFD is a well recognized software component intended to expose both hardware features and system configurations, I believe it is a good fit for storing configuration information like this.
@marquiz WDYT about NFD exposing Topology manager policy and scope as labels? Perhaps, we can expose Topology Manager policy options and CPU Manager policy/policy options as well?
Continuing this train of thought, I think we should expose not just the TM policy, but something that abstracts a bit the node behavior enforced by that scope. Random brainstorm:
- topology.node.k8s.io/numa-alignment: strong represents TM policy single-numa-node
- topology.node.k8s.io/numa-alignment: best-effort represents TM policy best-effort or the none policy
- how to represent restricted? no clear idea yet :)
The scope is easier:
- topology.node.k8s.io/numa-scope: pod
- topology.node.k8s.io/numa-scope: container
TL;DR: labels should represent concepts enabled by the TM config, and should not reflect the TM config 1:1.
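The brainstormed labels could be pinned down as constants that both the publisher (e.g. NFD) and the consumers share. The keys and values below are proposals from this thread, not an agreed-upon API:

```go
package main

import "fmt"

// Proposed label keys for exposing TM behavior on a Node object.
const (
	NUMAAlignmentLabel = "topology.node.k8s.io/numa-alignment" // strong | best-effort
	NUMAScopeLabel     = "topology.node.k8s.io/numa-scope"     // pod | container
)

func main() {
	// Labels a component like NFD could set on a node whose kubelet runs
	// the single-numa-node policy with pod scope.
	nodeLabels := map[string]string{
		NUMAAlignmentLabel: "strong",
		NUMAScopeLabel:     "pod",
	}
	fmt.Println(nodeLabels[NUMAAlignmentLabel], nodeLabels[NUMAScopeLabel]) // strong pod
}
```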
What benefit do you see in abstracting the concept as opposed to representing the configuration 1:1? Representing the topology manager configuration (or, in general, the kubelet configuration) directly, without abstracting it, gives a clear view of what the label represents (the kubelet configuration on the node) and is much more intuitive than having to interpret what strong alignment is indicative of.
Abstracting the policies means that NFD has to translate the policies to an abstract label and some component has the responsibility of correctly interpreting it, which makes me a bit nervous.
Just a general comment: please stop using the term "NUMA" anywhere in the API, and please don't put it in new ones. It is not technically correct, and it matches neither hardware existing on the market nor hardware upcoming in the near future.
point taken.
Proposed changes for v1beta1, to be discussed:
- status: NRT describes the status of resources on a node, so it makes sense to honor the spec/status split and to have the fields reported as status
- node-res-topo to nrt