kubernetes-sigs / node-feature-discovery

Node feature discovery for Kubernetes
Apache License 2.0

feature to allow optionally setting taints based on node properties #540

Closed · rptaylor closed this issue 1 year ago

rptaylor commented 3 years ago

What would you like to be added:

It would be nice if NFD could be configured with options to set node taints as well as labels, based on certain features of nodes. Would you consider that in scope of NFD?

Why is this needed: Cluster operators may wish to automatically taint nodes with certain features, for example tainting a node that has GPUs to prevent other pods from running on it if they don't actually need (tolerate) the GPUs.
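For concreteness, a minimal sketch of the pattern being requested (the taint key, node name, and image are illustrative, not an NFD API): today an operator would taint GPU nodes by hand and have GPU workloads tolerate the taint; the proposal is for NFD to apply such taints automatically based on discovered features.

```yaml
# Illustrative only: an operator taints a GPU node manually, e.g.
#   kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
# Then a pod that actually needs the GPU tolerates the taint:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-job
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.0.0-base-ubuntu22.04  # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
```

Pods without the toleration are repelled from the tainted node, keeping the GPUs free for workloads that need them.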

marquiz commented 3 years ago

Hi @rptaylor. Yes, this would be useful, and I've been thinking about it myself as part of the work I've done on #464 and particularly #468, both of which are still very much at the prototype level.

I've done some initial experiments and started to think about whether it should be possible to taint only some of the nodes (a configurable proportion). WDYT? This complicates the implementation quite a bit, though, so maybe that would be a future enhancement.

rptaylor commented 3 years ago

Okay, nice @marquiz. It makes sense to me that NFD could have the flexibility and generality to apply arbitrary properties (taints as well as labels) based on the features of nodes.

What would be the use case to only taint a portion of nodes with a given feature and configuration?

marquiz commented 3 years ago

What would be the use case to only taint a portion of nodes with a given feature and configuration?

Reserving some of the nodes for general usage or alternatively reserving only a fraction of the nodes for special workloads. Dunno if that is useful in practice 🤔

zvonkok commented 3 years ago

Tainting only a subset of nodes in a cluster makes some sense for extended resources. If a Pod requests an extended resource, the ExtendedResourceToleration admission plugin will automatically add a toleration for the extended-resource taint.
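Assuming this refers to the in-tree ExtendedResourceToleration admission plugin (which must be enabled on the API server), the pattern looks roughly like the sketch below; names are illustrative:

```yaml
# Taint the node with the extended-resource name as the taint key, e.g.
#   kubectl taint nodes gpu-node-1 nvidia.com/gpu:NoSchedule
# A pod that requests the extended resource gets a matching toleration
# injected automatically by the admission plugin, so no manual
# tolerations are needed:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-consumer
spec:
  containers:
    - name: main
      image: registry.example.com/gpu-app:latest  # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1
```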

But we need to be careful about when we apply this taint. In hardware-enablement flows, the right time to add the taint is usually when the extended resources are exposed.

It also depends on how you want to partition your cluster. We used taints and tolerations for "hard partitioning": no workloads are allowed that do not tolerate the taint, so the taint repels everything else.

Alternatively there is "soft partitioning", e.g. with priority classes: workloads stay mixed, but special workloads can have higher priorities.
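A sketch of the soft-partitioning idea (the class name, value, and image are hypothetical):

```yaml
# Nodes stay shared; special workloads preempt general ones via priority.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: special-workloads  # hypothetical name
value: 100000
globalDefault: false
description: "Higher priority for workloads that need feature-specific nodes."
---
# A pod opts in by referencing the class:
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  priorityClassName: special-workloads
  containers:
    - name: main
      image: registry.example.com/trainer:latest  # hypothetical image
```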

Another use case would be behavioural partitioning. Say you have one cluster and want to run an AI/ML pipeline: one could imagine tainting some nodes for inference and others for training or the data lake, resembling a pipeline within one cluster rather than running several clusters, each for one specific feature.

rptaylor commented 3 years ago

If the extended resources are equivalent across a number of nodes, making the nodes fungible, it doesn't make sense to me to divide them into separate hard partitions. In a traditional batch system, partitioning creates significant challenges in practice, especially at large scales; there it would be handled by fair-share scheduling instead, but that is a big missing feature of Kubernetes (I think Volcano may have it). PriorityClasses are not enough.

There is a fundamental trade-off in scheduling theory between latency and throughput. Partitioning inevitably reduces usage efficiency (throughput) but can improve latency, since reserved nodes are available right away. That has to be balanced against the risk of filling up your own partition, and against the probably larger benefit of being able to use any available node if everything were in one shared pool instead.

Even with a relatively steady-state workload (as opposed to a dynamic and bursty one), wouldn't it be better to use resource quotas for each app (inference/training/etc.) as a floating reservation across any available node, rather than locking certain apps to a specific subset of nodes? Anyway, my perspective comes from a scientific HPC background; other situations could have totally different needs and considerations that I am not familiar with. Best to build a tool that provides sufficient options so anyone can use it however they need :)
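A sketch of that alternative (the namespace and numbers are hypothetical): give each app a namespace with a ResourceQuota, which acts as a floating reservation over the shared node pool rather than pinning the app to particular nodes.

```yaml
# Caps the inference app's aggregate requests cluster-wide without
# reserving any specific nodes for it.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: inference-quota
  namespace: inference   # hypothetical per-app namespace
spec:
  hard:
    requests.cpu: "200"
    requests.memory: 800Gi
    requests.nvidia.com/gpu: "16"   # extended resources can be quota'd too
```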

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

marquiz commented 2 years ago

I have plans to implement this on top of #553.

@rptaylor I think I agree with you. Partial/proportional tainting is much more complicated, with problematic corner cases (e.g. with cluster auto-scaling), not to mention the problems of optimal scheduling and resource usage you described above.

/remove-lifecycle stale

marquiz commented 2 years ago

For consistency, we'd need to support this in the nfd-worker config (configuration of the custom source) as well, I think. This means we need to update our gRPC interface, too, to send the taints from worker to master. We probably also need to add an annotation for bookkeeping, similar to nfd.node.kubernetes.io/feature-labels and nfd.node.kubernetes.io/extended-resources.
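A sketch of what that bookkeeping could look like on a node object; the taints annotation name and value format below are assumptions mirroring the existing annotations, not a settled API:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: worker-1
  annotations:
    # existing bookkeeping (illustrative values):
    nfd.node.kubernetes.io/feature-labels: "cpu-cpuid.AVX512F,kernel-version.major"
    # hypothetical new annotation recording NFD-managed taints:
    nfd.node.kubernetes.io/taints: "special-node=true:NoSchedule"
spec:
  taints:
    - key: special-node
      value: "true"
      effect: NoSchedule
```

The annotation would let nfd-master know which taints it owns, so it can remove them if the corresponding feature or rule goes away.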

marquiz commented 2 years ago

Moving to v0.12.0

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

marquiz commented 2 years ago

We still want this. It's not a huge deal in terms of implementation, but somebody® just has to do it.

/remove-lifecycle stale

fmuyassarov commented 1 year ago

I'm interested in working on this.

/assign

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

fmuyassarov commented 1 year ago

/remove-lifecycle stale
/lifecycle active

fmuyassarov commented 1 year ago

This is being reviewed right now in https://github.com/kubernetes-sigs/node-feature-discovery/pull/910.
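Based on the direction discussed above (building on the NodeFeatureRule CRD work from #553), a tainting rule could plausibly look like the sketch below; the exact field names are an assumption pending the final API in that PR:

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: gpu-taint-rule
spec:
  rules:
    - name: "taint nodes that have an NVIDIA GPU"
      taints:
        - key: "nvidia.com/gpu"
          value: "present"
          effect: NoSchedule
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["10de"]}  # NVIDIA PCI vendor ID
```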