ffromani opened this issue 2 years ago
quick demo:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  creationTimestamp: "2022-04-21T11:30:27Z"
  name: noderesourcetopologies.topology.node.k8s.io
spec:
  conversion:
    strategy: None
  group: topology.node.k8s.io
  names:
    kind: NodeResourceTopology
    listKind: NodeResourceTopologyList
    plural: noderesourcetopologies
    shortNames:
    - nrt
    singular: noderesourcetopology
  scope: Cluster
  versions:
  - name: v1beta1
    schema:
      openAPIV3Schema:
        description: NodeResourceTopology describes node resources and their topology.
        properties:
          apiVersion:
            description: 'APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
            type: string
          kind:
            description: 'Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
            type: string
          metadata:
            type: object
          status:
            description: NodeResourceTopologyStatus defines the observed state of NodeResourceTopology
            properties:
              zones:
                description: ZoneList contains an array of Zone objects.
                items:
                  description: Zone represents a resource topology zone, e.g. socket, node, die or core.
                  properties:
                    attributes:
                      description: AttributeList contains an array of AttributeInfo objects.
                      items:
                        description: AttributeInfo contains one attribute of a Zone.
                        properties:
                          name:
                            type: string
                          value:
                            type: string
                        required:
                        - name
                        - value
                        type: object
                      type: array
                    costs:
                      description: CostList contains an array of CostInfo objects.
                      items:
                        description: CostInfo describes the cost (or distance) between two Zones.
                        properties:
                          name:
                            type: string
                          value:
                            format: int64
                            type: integer
                        required:
                        - name
                        - value
                        type: object
                      type: array
                    name:
                      type: string
                    parent:
                      type: string
                    resources:
                      description: ResourceInfoList contains an array of ResourceInfo objects.
                      items:
                        description: ResourceInfo contains information about one resource type.
                        properties:
                          allocatable:
                            anyOf:
                            - type: integer
                            - type: string
                            description: Allocatable quantity of the resource, corresponding to allocatable in node status, i.e. total amount of this resource available to be used by pods.
                            pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
                            x-kubernetes-int-or-string: true
                          available:
                            anyOf:
                            - type: integer
                            - type: string
                            description: Available is the amount of this resource currently available for new (to be scheduled) pods, i.e. Allocatable minus the resources reserved by currently running pods.
                            pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
                            x-kubernetes-int-or-string: true
                          capacity:
                            anyOf:
                            - type: integer
                            - type: string
                            description: Capacity of the resource, corresponding to capacity in node status, i.e. total amount of this resource that the node has.
                            pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
                            x-kubernetes-int-or-string: true
                          name:
                            description: Name of the resource.
                            type: string
                        required:
                        - allocatable
                        - available
                        - capacity
                        - name
                        type: object
                      type: array
                    type:
                      type: string
                  required:
                  - name
                  - type
                  type: object
                type: array
            type: object
        required:
        - zones
        type: object
    served: true
    storage: true
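For context, here is a hypothetical object conforming to the demo schema above (all names and values are invented; note that zones sits under status per the proposed spec/status split):

```yaml
apiVersion: topology.node.k8s.io/v1beta1
kind: NodeResourceTopology
metadata:
  name: worker-0
status:
  zones:
  - name: node-0
    type: Node
    costs:
    - name: node-1
      value: 20
    resources:
    - name: cpu
      capacity: "8"
      allocatable: "6"
      available: "4"
```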
xref: https://github.com/k8stopologyawareschedwg/noderesourcetopology-api/issues/22 ("report TM scope alongside policy")
Can we also include TopologyManagerPolicyOptions? I believe it should be of type []string.
I'm unconvinced about this. We decided to add the TM policy and then we mixed it with the scope, so clearly separating the scope makes sense. But arguably adding the TM policy was already a stretch, so I'm not keen on adding even more data which is not directly related to the NUMA zone resource tracking (TM data is not).
That said, TM config is related to the problem we are handling here, so we can totally discuss new extra objects to add to the general umbrella of this API.
Yeah, I was thinking about either option. I was confused that NodeTopology is mixed with kubelet configuration. Maybe we can introduce a new Custom Resource to track TM configuration.
This is something we can (and should!) totally do; I'm very open to exploring this direction. We will need to design an object which is not too tied to the TM status, but I'm confident we can come up with something nice for v1beta1. Looking forward to hearing your thoughts in this area.
Sounds good! Is there any design doc or PR where we can keep track of this?
Not yet, we will need to figure out the timeline. Word of warning: this is very unlikely to progress much in the next 2-3 weeks. When the design starts, I'll link it here. We will need a way to collaborate, so it's possible it will be a design doc or a PR to reason about (or both).
- make topologyPolicy a single string
- report TM scope alongside policy
- a more radical approach: split the TM configuration data into a new object
I am inclined towards the more radical approach here :) We should figure out a way to represent Topology Manager policies, scope and the newly introduced policy options. Perhaps something like a map[string][]string, so that we can take in kubelet flags as keys, with their values captured in an array of strings. That can be useful if the sys-admin decides to configure multiple policy options (when we have more of them; even though that may be unlikely, it should not be ruled out).
Also, I noticed that we left out this field in the CRD spec above. Was that intentional, so we can have a discussion on this topic?
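A minimal sketch of the map[string][]string idea floated above, with kubelet flag names as keys and the configured values as a slice. All names here are illustrative only, not part of the NRT API:

```go
package main

import "fmt"

// TopologyConfig is a hypothetical field type: kubelet flag name -> values.
type TopologyConfig map[string][]string

func main() {
	cfg := TopologyConfig{
		"topology-manager-policy":         {"restricted"},
		"topology-manager-scope":          {"pod"},
		"topology-manager-policy-options": {"prefer-closest-numa-nodes"},
	}
	// Looking up a flag yields every value the admin configured for it.
	fmt.Println(cfg["topology-manager-scope"][0]) // pod
}
```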
I agree with @fromanirh that the use of topologyPolicies for capturing both the topology manager policy and the scope has been a bit of an abuse of this field, but at the same time I see the value of representing the topology-related kubelet configuration as part of this API. One reason for that is observability. Another is to enable additional use cases that could potentially make use of this API for custom purposes, like the one @PiotrProkop mentioned he has been working on here.
@PiotrProkop If you have the time and would like, feel free to submit a PR that would unblock your work. Even though we don't have the bandwidth to drive this work, we will definitely make the time to review it :)
Adding a note that use of maps in the APIs is discouraged: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#lists-of-named-subobjects-preferred-over-maps so we can't go with the approach I recommended earlier.
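Following the linked conventions, the same data could be modeled as a list of named subobjects instead of a map. A sketch; the type and field names are illustrative only:

```go
package main

import "fmt"

// TopologyManagerOption is a named subobject, as the API conventions prefer
// over map keys: the "name" field identifies the entry in the list.
type TopologyManagerOption struct {
	Name   string   `json:"name"`
	Values []string `json:"values,omitempty"`
}

// Lookup mimics map access over the named-subobject list.
func Lookup(opts []TopologyManagerOption, name string) []string {
	for _, o := range opts {
		if o.Name == name {
			return o.Values
		}
	}
	return nil
}

func main() {
	opts := []TopologyManagerOption{
		{Name: "topology-manager-policy", Values: []string{"restricted"}},
		{Name: "topology-manager-scope", Values: []string{"container"}},
	}
	fmt.Println(Lookup(opts, "topology-manager-scope")[0]) // container
}
```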
Since TopologyOptions are not dynamic and are built into kubelet, we don't have to use maps or lists; we can have explicit types for those options. The consequence is that every time a new option is added, the API has to be updated. Or we can embed the PolicyOption struct and just rely on the k8s version.
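The explicitly-typed alternative could look like the sketch below: every Topology Manager knob gets its own field, so adding a new kubelet option means updating this API too. Field names (including PreferClosestNUMANodes) are assumptions for illustration, not a merged API:

```go
package main

import "fmt"

// TopologyManagerConfiguration models each knob as an explicit field.
type TopologyManagerConfiguration struct {
	Policy string `json:"policy,omitempty"` // e.g. "single-numa-node"
	Scope  string `json:"scope,omitempty"`  // "pod" or "container"
	// PreferClosestNUMANodes stands in for a policy option; future
	// kubelet options would become new fields here.
	PreferClosestNUMANodes bool `json:"preferClosestNUMANodes,omitempty"`
}

func main() {
	cfg := TopologyManagerConfiguration{Policy: "single-numa-node", Scope: "pod"}
	fmt.Println(cfg.Policy, cfg.Scope) // single-numa-node pod
}
```

The trade-off versus the map or list forms is API churn on every new option, in exchange for strong typing and validation.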
@swatisehgal @fromanirh should we move this discussion to https://github.com/kubernetes/kubernetes/pull/96275? Or is the intent to merge v1alpha1 and then create a KEP to start the discussion around v1beta1? Right now the only thing that is blocking me is that there is no way to retrieve the scope for policies other than single-numa, so should I ask to add tmScope to v1alpha1, or add the missing policies:
RestrictedPodLevel
RestrictedContainerLevel
BestEffortPodLevel
BestEffortContainerLevel
And if I should ask for it, should I do it here or in the kubernetes PR 😄
Sure, let's move the discussion to https://github.com/kubernetes/kubernetes/pull/96275. I would be inclined to merge as v1alpha1 and then handle v1beta1 separately. But I am open to discussion, and it depends on how the API review on the PR proceeds.
Irrespective of the approach we take, we should add these missing policies for the sake of completeness, I will update my PR: 96275 as well as here.
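The missing policies listed above could be added as constants next to the existing ones; a sketch, with the exact spellings still up to what the PR settles on:

```go
package main

import "fmt"

// TopologyManagerPolicy mirrors the string type already used by the NRT API.
type TopologyManagerPolicy string

// The four missing policy+scope combinations from the discussion above.
const (
	RestrictedPodLevel       TopologyManagerPolicy = "RestrictedPodLevel"
	RestrictedContainerLevel TopologyManagerPolicy = "RestrictedContainerLevel"
	BestEffortPodLevel       TopologyManagerPolicy = "BestEffortPodLevel"
	BestEffortContainerLevel TopologyManagerPolicy = "BestEffortContainerLevel"
)

func main() {
	fmt.Println(RestrictedPodLevel) // RestrictedPodLevel
}
```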
Sounds good! I will leave a comment on https://github.com/kubernetes/kubernetes/pull/96275 to provide visibility on why this field is changed.
Note we will also need to change the updating agents (NFD, RTE). Edit: and make sure the consumers (most notably the scheduler plugin) can tolerate this change.
So probably in this repo:
Restricted TopologyManagerPolicy = "Restricted"
BestEffort TopologyManagerPolicy = "BestEffort"
should stay, and in the kubernetes repo only the policies mixed with scope will remain? And when NFD and RTE are updated to use NodeResourceTopology from Kubernetes, will the removal of the old policies be handled?
these are good points. We need to evaluate and explore the tradeoffs here.
Yes, that's right. All the consumers of the API would have to be updated.
I think we should keep the "Restricted" and "BestEffort" policies for https://github.com/k8stopologyawareschedwg/noderesourcetopology-api, but in the kubernetes PR (API in staging) we should move to policy+scope for the sake of consistency. It would seem a bit odd to have "Restricted" and "BestEffort" policies alongside their scope-qualified counterparts; I foresee it raising eyebrows. When NFD, RTE and the Scheduler Plugin transition to consuming the NRT API from the staging repo, we would have to handle it carefully.
:+1:
I'm very convinced we should remove configuration data (= TM policy, TM scope) from the NRT API proper. The question is: how to represent it, if at all? The more I think about it, the more I'm convinced we should not represent it at all. Perhaps only define labels (and their values) and depend on other components (e.g. NFD) to add these labels. Assuming nodes are properly labeled, we have the exact same information we currently expose in v1alpha1, and, I'd say, in a cleaner and more kubernetes-y way.
I agree. We are already relying on NFD to advertise NRT resources, so relying on specific labels makes sense.
I like the idea of representing this information as labels. It gives us what we need and is much simpler than exposing a new object. Given that NFD is a well recognized software component intended to expose both hardware features and system configurations, I believe it is a good fit for storing configuration information like this.
@marquiz WDYT about NFD exposing Topology manager policy and scope as labels? Perhaps, we can expose Topology Manager policy options and CPU Manager policy/policy options as well?
Continuing this train of thought, I think we should expose not just the TM policy, but something that abstracts a bit the node behavior enforced by that scope. Random brainstorm:
- topology.node.k8s.io/numa-alignment: strong represents TM policy single-numa-node
- topology.node.k8s.io/numa-alignment: best-effort represents TM policy best-effort or the none policy
- how to represent restricted? no clear idea yet :)
The scope is easier:
- topology.node.k8s.io/numa-scope: pod
- topology.node.k8s.io/numa-scope: container
TL;DR: labels should represent concepts enabled by the TM config, and should not reflect the TM config 1:1.
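The brainstormed labels could be pinned down as constants that both the publisher (e.g. NFD) and the consumers share. The keys and values below are proposals from this thread, not an agreed-upon API:

```go
package main

import "fmt"

// Proposed label keys for exposing TM behavior on a Node object.
const (
	NUMAAlignmentLabel = "topology.node.k8s.io/numa-alignment" // strong | best-effort
	NUMAScopeLabel     = "topology.node.k8s.io/numa-scope"     // pod | container
)

func main() {
	// Labels a component like NFD could set on a node whose kubelet runs
	// the single-numa-node policy with pod scope.
	nodeLabels := map[string]string{
		NUMAAlignmentLabel: "strong",
		NUMAScopeLabel:     "pod",
	}
	fmt.Println(nodeLabels[NUMAAlignmentLabel], nodeLabels[NUMAScopeLabel]) // strong pod
}
```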
What benefit do you see in abstracting the concept as opposed to representing the configuration 1:1? Representing the topology manager configuration (or, in general, the kubelet configuration) directly, without abstracting it, gives a clear view of what the label represents (the kubelet configuration on the node) and is much more intuitive than having to interpret what strong alignment is indicative of.
Abstracting the policies means that NFD has to translate the policies to an abstract label and some component has the responsibility of correctly interpreting it, which makes me a bit nervous.
Just a general comment: please stop using the term "NUMA" anywhere in the API, and please don't put it in new ones. It is not technically correct, and it matches neither hardware existing on the market nor hardware upcoming in the near future.
point taken.
Proposed changes for v1beta1, to be discussed:
- status: NRT describes the status of resources on a node, so it makes sense to honor the spec/status split and to have the fields reported as status
- node-res-topo to nrt