Closed: ffromani closed this issue 1 year ago
In v1alpha1, the NodeResourceTopology (NRT) object exposes a field TopologyPolicies at object level (i.e. at compute node level) which maps 1:1 to the combination of the kubelet topology manager scope and policy configured on the node. See here for details.

This data is exposed to make life easier for the scheduler plugin, the sole consumer of this information. The scheduler plugin uses it to replicate the behavior the Topology Manager (TM) would apply on the node, so it can filter out and/or score nodes appropriately.

Additionally, this field is (planned to be) used to group together nodes using the same TM configuration. The scheduler can verify that nodes are properly grouped by the same TM config by checking this value. This verification is not very strong, but it is the only data the scheduler can consume for this purpose.

It is worth stressing that the primary intended use, by and large, is to run the same logic as the TM does on the scheduler side.

The information about the TM configuration, however, doesn't really belong in NRT. NRT should report only resources, not configuration data. We can get the same information using a more Kubernetes-native mechanism: labels. Updating agents (RTE, NFD, ...) can add labels to convey this information. The pending open question before we can implement this design, and thus deprecate and then remove the TopologyPolicies field, is:

1. should labels represent 1:1 the configuration option names and values from the kubelet configuration, or rather
2. should labels abstract the kubelet configuration into their intended behavior?

Pros and cons:

- expose the kubelet configuration, duplicating the kubeletconfiguration API
  - PRO: straightforward mapping and trivial implementation
  - CON: leaks and duplicates kubelet configuration details
  - CON: needs to be kept in sync with kubelet
- just expose the kubeletconfiguration API
  - PRO: straightforward mapping and trivial implementation
  - CON: adds another dependency: clients need to consume this API and the kubeletconfiguration API
- new set of labels derived from the kubelet configuration
  - PRO: decoupled from the kubelet configuration
  - CON: the translation layer has a cost (documentation, implementation)
  - ???: still needs to be kept in sync with kubelet, but less urgently
My take: NRT wants to generalize what the TM does, so binding ourselves to the TM is a step back. For these reasons it seems better not to depend directly on the kubelet configuration; we should define our own labels and our own enum values for these labels.

The TM scope concept is generic and forward-looking enough that we can translate it into something like topology.node.k8s.io/alignment-scope = ["container", "pod"] (or resource-alignment-scope?).

A less mature proposal for expressing the TM policy constraints could be topology.node.k8s.io/alignment-constraint = ["none", "minimize", "full"] (values very much subject to change), mapping roughly as follows (see the sketch after this list):

- full -> "fully constrained" -> single NUMA node
- minimize -> "use the minimal number of zones" -> restricted
- none (or no label) -> "best-effort" or "none" (aka no guarantees/constraints)
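To make the translation idea above a bit more concrete, here is a minimal Go sketch assuming the label keys and values floated in this thread; none of these names are part of any released API, and the mapping is only illustrative.

```go
// Hypothetical label keys and values, following the names floated above.
package labels

const (
	// Proposed label carrying the resource alignment scope of the node.
	AlignmentScope = "topology.node.k8s.io/alignment-scope" // "container" or "pod"
	// Proposed label carrying the resource alignment constraint of the node.
	AlignmentConstraint = "topology.node.k8s.io/alignment-constraint" // "none", "minimize", "full"
)

// ConstraintFromTMPolicy sketches the translation layer from the kubelet
// Topology Manager policy name to the abstracted constraint value.
func ConstraintFromTMPolicy(policy string) string {
	switch policy {
	case "single-numa-node":
		return "full" // fully constrained: single NUMA node
	case "restricted":
		return "minimize" // use the minimal number of zones
	case "best-effort", "none":
		return "none" // no guarantees/constraints
	default:
		return "none" // unknown policies treated as unconstrained
	}
}
```

This is exactly the translation-layer cost noted in the cons above: the mapping has to be documented and kept loosely in sync with kubelet.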
One thing we're missing here is TopologyManagerPolicyOptions and how we want to express them via labels. Something like topology.node.k8s.io/alignment-hints?

Also, I think the best-effort and none policies should be kept separate, because if we can pick the right node, the best-effort policy will align resources the same way the restricted policy does.
Right, both are good points. Exposing options becomes hairier. From what you can see at this point in time, would we need to expose all options or just a selected few, from the scheduler's perspective?
Right now it's easy because we only have one option 😄 I feel like there won't be a lot of them in the future thanks to DRA and NRI, so I am happy with exposing only a selected few, and we should include only prefer-closest-numa-nodes in the initial support.
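For reference, a minimal sketch of surfacing just that single selected option, reusing the hypothetical alignment-hints label name from above; nothing here is an agreed API.

```go
package labels

const (
	// Hypothetical label for selected Topology Manager policy options.
	AlignmentHints = "topology.node.k8s.io/alignment-hints"
	// The only option proposed for the initial support in this thread.
	PreferClosestNUMANodes = "prefer-closest-numa-nodes"
)
```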
Thanks. This is going to be a balancing act. We're shaping up the label API and the design direction with help and feedback from the NFD maintainers; I'll keep you in the loop, tagging you on GH discussions as needed.

Here's the (very) initial proposal, so we can look at some code and start to get a feel for how it could look: https://github.com/k8stopologyawareschedwg/noderesourcetopology-api/pull/25
I'm just wondering, are labels the way we want to advertise this information? Just a thought, but could the recently introduced NodeFeature CRD in NFD be used for this?
I think what we need is a way to categorize and group nodes using common attributes. Labels can be one way to do this, but I am (and I think we are) very open to alternatives. I'm not up to speed with the NodeFeature CRD, so I'll happily read up on it.
BTW, one unrelated improvement in the types would be to also enumerate ZoneTypes other than just "NUMA", e.g. all the CPU topology levels identified by the Linux kernel, i.e. package, die, cluster (a bit of an overloaded term in the K8s context 😅), core(?).
Probably instead of "numa" -> "memory node" or something similar, with the emphasis on memory partitioning. For package/die/cluster/core -> those should be something like "cpu xxxx" to emphasize partitioning by CPU topology.
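A rough sketch of what such an extended enumeration could look like; the ZoneType type and all the value names below are assumptions for illustration, not existing API.

```go
package v1alpha2sketch

// ZoneType enumerates the kinds of topology zones a node could report.
type ZoneType string

const (
	// Memory-partitioning zone (what is reported as a NUMA node today).
	ZoneTypeMemoryNode ZoneType = "MemoryNode"
	// CPU-topology zones, mirroring the levels exposed by the Linux kernel.
	ZoneTypeCPUPackage ZoneType = "CPUPackage"
	ZoneTypeCPUDie     ZoneType = "CPUDie"
	ZoneTypeCPUCluster ZoneType = "CPUCluster"
	ZoneTypeCPUCore    ZoneType = "CPUCore"
)
```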
add constants for values we use repeatedly, like "Node" (and possibly more). Also add constants for max supported NUMA zones (64)
I think this makes sense in general and we should do it, but I would like to point out that the maximum number of supported NUMA zones is 8, as can be seen in the topology manager documentation.
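A tiny sketch of the constants under discussion, with the 8-zone limit from the Topology Manager documentation; the identifier names are assumptions.

```go
package v1alpha2sketch

const (
	// Zone type value currently repeated as a string literal.
	ZoneTypeNode = "Node"
	// Maximum number of NUMA zones handled by the kubelet Topology Manager,
	// per the upstream documentation (8, not 64).
	MaxSupportedNUMAZones = 8
)
```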
should labels represent 1:1 the configuration option names and values from the kubelet configuration, or should labels abstract the kubelet configuration into their intended behavior?

my take: NRT wants to generalize what the TM does, so binding ourselves to the TM is a step back. For these reasons it seems better not to depend directly on the kubelet configuration; we should define our own labels and our own enum values for these labels.
Can you elaborate on why you think using TM policies is a step back? We are running an algorithm in the scheduler plugin to approximate the allocation decision made by the Topology Manager in kubelet. I notice that we are trying to decouple from the kubelet configuration, but I am probably not fully understanding why that is a problem. I am open to learning and getting a better understanding of this.

I am biased towards labels with a 1:1 mapping between the configured kubelet policy/policy options and the created labels, rather than abstracted versions of them. It seems more intuitive and saves us from having to translate and maintain additional documentation. Maybe it's just me, but I feel it might cause confusion and would be hard to explain that the labels exposed by NFD are related to the Topology Manager configuration but still don't match the configured values. What if someone independent of the topology-aware scheduling use case wants to use these labels to target a workload at a node that has a specific topology manager policy configured? With these new labels, we would be diverging from the well-recognized policies documented in the official Kubernetes documentation.

When it comes to keeping the values synced with kubelet: since there are plans to graduate the Topology Manager to GA, I expect the names of the existing policies and policy options to continue to exist for the foreseeable future. If new policies or policy options are added, we would have to take them into account regardless of whether we have a 1:1 mapping or an abstracted label representation.

Maybe the abstraction of the policies is motivated by the ongoing discussions in the community about making resource management pluggable in kubelet for more flexibility? Thinking out loud here, but if that is the case, we would have to figure out the fate of the Topology Manager and its policies moving forward. I expect the resource managers and the existing policies to remain supported for the next few releases (especially now that there are plans to graduate to GA). A clear migration path would have to be identified in the community in case responsibility is offloaded to resource plugins and the resource managers are deprecated.

Having said that, I recognize we are heading in the right direction by moving the TopologyPolicies information out of the NRT API and into NFD, where it is more suitable, so I don't want to slow down the momentum. We can pick the approach everyone prefers and pivot later if needed.
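To contrast the two options concretely, a 1:1 mapping would mean labels that mirror the kubelet configuration verbatim, roughly as sketched below; the label keys are hypothetical and not an agreed naming.

```go
package labels

const (
	// Hypothetical labels mirroring the kubelet Topology Manager
	// configuration verbatim (the 1:1 mapping option).
	TopologyManagerPolicy = "topology.node.k8s.io/topology-manager-policy" // e.g. "single-numa-node"
	TopologyManagerScope  = "topology.node.k8s.io/topology-manager-scope"  // e.g. "container"
)
```

Consumers outside the topology-aware scheduling use case could then select nodes by the well-known kubelet policy names, which is the intuitiveness argument made above.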
Thanks @swatisehgal, that's a good point. I'm reconsidering the approach to the label representation a bit, also taking into account the earlier feedback from @marquiz. I'll take the minimum time needed to properly formalize my thoughts and then update this thread.
I had an offline chat with others, including @kad, and we agreed that the API should not define the labels after all. It should be a contract between the producer and the consumer, and it is at the very least too early to have labels and values frozen in the API, if we should do this at all.

We also agreed that an Attributes field like the one we already have in Zone makes sense in the top-level object, so I'm adding it to the proposal (and into #25).

The combination of these two factors enables a nice upgrade path: NFD (or RTE, or any other agent) can add the TM configuration of the node as top-level attributes, which the scheduler plugin (or any other consumer) can read and use.
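For illustration, a minimal sketch of what the top-level Attributes could look like, mirroring the zone-level attributes; the field and type names here are assumptions and not necessarily what was merged in #25.

```go
package v1alpha2sketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// AttributeInfo is a simple name/value pair, mirroring the zone-level attributes.
type AttributeInfo struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// AttributeList holds a set of attributes.
type AttributeList []AttributeInfo

// NodeResourceTopology sketches the top-level object with the new field.
type NodeResourceTopology struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Attributes carries node-level properties, e.g. the TM scope and policy
	// published by the updating agent (NFD, RTE, ...).
	Attributes AttributeList `json:"attributes,omitempty"`

	// Zones and the other existing fields are elided here.
}
```

An agent could then publish, say, an attribute named topologyManagerScope with value container, and the scheduler plugin would read that instead of the deprecated TopologyPolicies field.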
Also, I guess the TTL we discussed on the call for cleaning up stale NRT CRs can go in the top-level attributes too.
Also, just a reminder from the same call: it would be good to add a short name to the CRD, probably nrt.
Also, I guess the TTL we discussed on the call for cleaning up stale NRT CRs can go in the top-level attributes too.

Yes. We're going to add other extra fields as well.
Also, just a reminder from the same call: it would be good to add a short name to the CRD, probably nrt.

Planned for v1beta1.
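As a sketch of how the short name could be wired in, assuming the CRD manifest is generated with controller-gen (whether this repo does so is an assumption here):

```go
package v1alpha2sketch

// The kubebuilder markers below are standard controller-gen markers; only
// the shortName value comes from this thread.
// +kubebuilder:object:root=true
// +kubebuilder:resource:shortName=nrt
type NodeResourceTopology struct {
	// existing fields unchanged
}
```

With the short name in place, kubectl get nrt would work as an alias for the full resource name.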
closed by #25
Proposed changes for v1alpha2, to be discussed:

- Attributes field, like we already have for Zones
- Attributes object

Spin-off from https://github.com/k8stopologyawareschedwg/noderesourcetopology-api/issues/21 to enable more incremental evolution.