VishDev12 opened this issue 2 years ago
Me being greedy
Part of the Provisioner's design is that the Provisioner defines a bounded universe for pods. This enables a separation of concerns between operators and developers, who may have different permissions in the cluster. Any taint (or node property, for that matter) that exists in the cluster must be defined by some provisioner.
Okay, that makes a lot of sense, I can see how not having the provisioner at least define the labels/taints will collapse that boundary. It was definitely a stretch in any case.
What are your thoughts on the taints though? Does this part still hold up from our earlier chat?
Taint specified in the provisioner:
- key: company.com/team
  value: "*"
  effect: NoSchedule

Toleration:
- key: company.com/team
  value: datascience
  effect: NoSchedule
I'm a fan of:
Taint specified in the provisioner (value is undefined):
- key: company.com/team
  effect: NoSchedule

Toleration:
- key: company.com/team
  value: datascience
  effect: NoSchedule
Wildcards are fairly rare in k8s APIs; instead, similar intent is expressed using the lack of a value, e.g. value is "" (unconstrained) rather than value can be "*" (anything).
If we have an undefined value in the taint, how would the coexistence of isolated and non-isolated workloads work in this scenario?
Looking at a few cases:

Provisioner taint:
- key: company.com/team
  effect: NoSchedule

No toleration.

Taint generated on node:
- key: company.com/team
  effect: NoSchedule

The taint gets created on the node, but without a value, which lines up exactly with the current behaviour when a taint is specified in the provisioner without a value: we expect only pods with a matching toleration (only the key needs to match) to be scheduled on the node.

The "*" value is just an indicator to differentiate that empty-value case from the next one. As long as we have that differentiation in some form, we should be good. Let's call this the generated-taints use case.
Provisioner taint:
- key: company.com/team
  value: "*"
  effect: NoSchedule

No toleration.

No taint is generated. The provisioner now allows generic, non-isolated workloads with no tolerations to run on nodes that have no taints specified.
If we conflate this generated-taints case with the empty-value case above, we're saying that every node launched by the provisioner will, at the very least, have the taint specified without the value, which means that generic workloads can only be scheduled if they have a matching toleration.
The ideal behaviour here is: don't generate a taint on the launched nodes if the pod's tolerations don't specify it.
Provisioner:
- key: company.com/team
  value: "*"
  effect: NoSchedule

Toleration:
- key: company.com/team
  value: datascience
  effect: NoSchedule

Taint generated on node:
- key: company.com/team
  value: datascience
  effect: NoSchedule
Taint specified with wildcard value, but toleration isn't specified at all (no key, no value)
I'm not convinced the generated case is the behavior we want. In my opinion, if you are specifying a taint on your provisioner, even keeping wildcards in mind, your intent is that a taint will be placed on the nodes the Provisioner generates. If you don't want this taint, you can create another Provisioner that doesn't contain taints.
In this case, I think we should generate the taint even without a toleration; the taint should just have an empty value.
Concerns with Empty Taint Value
My concern with using an empty value is more that an empty value is itself a valid value for a taint, meaning we are potentially conflating a user's intent to specify an empty value with a user's intent to specify a flexible value (maybe this isn't a common scenario and we can ignore it).
As it stands "*" is not a valid taint value on a node so we could use this as a wildcard value without conflating these two intents.
I agree with your concerns about empty taint values. That's its own valid case, and shouldn't be conflated with this new feature.
You can create another Provisioner that doesn't contain taints.
One of the points, when this discussion started, was to co-locate isolated and general purpose workloads within a single provisioner. And that's possible only when a node can be added without a taint so that any workload can be scheduled to it.
So any mechanism at all (doesn't have to be wildcards) that ensures that the empty value taint case and the generated taint case are separate will be perfect. Just so that there can be a subset of nodes without generated taints to schedule pods that don't have the matching tolerations.
Another point of note is that if we have this separation, we can use a single provisioner to support generated taints with different keys. Workloads can "choose" which taints should be on the nodes on which they get scheduled. For example:
key: company.com/team
key: client/name
key: product/name
key: modelTraining/trialName
The alternative would be to implement 4 provisioners in addition to the default provisioner just to support these taints. And the concept of selecting a random subset of these would just not be possible. For example, nodes dedicated to product A and client X, or team M and product B.
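To make that concrete, here's a sketch (assuming the proposed wildcard semantic and reusing the illustrative keys above) of a single provisioner carrying all four taints:

taints:
  - key: company.com/team
    value: "*"
    effect: NoSchedule
  - key: client/name
    value: "*"
    effect: NoSchedule
  - key: product/name
    value: "*"
    effect: NoSchedule
  - key: modelTraining/trialName
    value: "*"
    effect: NoSchedule

A pod that tolerates only company.com/team and product/name would then land on a node carrying just those two generated taints, while a pod with no tolerations would get an untainted node.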
Question: Why can't we just add dummy tolerations to the workloads for the taints that won't be used, for example product/name: general-purpose and client/name: general? This would allow the use of a single provisioner.

Answer: A modelTraining/trialName toleration, for example, only makes sense for a highly particular use-case, and forcing all other workloads to add a generic value for that toleration just to support that use-case wouldn't make sense.

I realize I'm placing a lot of weight on making a single provisioner do all this work, but I feel that having fewer provisioners to maintain would be ideal. Without this, over time, I can see an explosion of provisioners in larger K8s clusters used by multiple teams, created just to support different taints and for no other reason, which would be painful to maintain.
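For illustration, without this feature a generic workload on a fully tainted shared provisioner would be forced to carry dummy tolerations like these (the values are the hypothetical placeholders mentioned above):

tolerations:
  - key: product/name
    value: general-purpose   # placeholder from the example above
    effect: NoSchedule
  - key: client/name
    value: general           # placeholder from the example above
    effect: NoSchedule
  - key: modelTraining/trialName
    value: unused            # hypothetical placeholder just to satisfy the taint
    effect: NoSchedule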
This would be an amazing feature to have and would introduce incredible versatility to workload isolation and make it quite robust.
Love the discussion here. Looks like empty taints are allowed, so we'll need another way to express flexibility. Wildcard seems fine, since we have it elsewhere in the spec. I'm not a huge fan of the semantic, but I don't see a more natural signal.
@VishDev12
Another point of note is that if we have this separation, we can use a single provisioner to support generated taints with different keys. Workloads can "choose" which taints should be on the nodes on which they get scheduled.
Is there any reason why you can't just rely on a single very specific taint to do the workload isolation if you need more granular control over your workload isolation?
I understand that you may have to create 4 different provisioners (5 if we are including generic workloads) for these different kinds of isolation, but this seems doable and maintainable and, as an added bonus, each of your Provisioners now represents its own boundary between the types of workloads it can and can't deploy.
I'm not sure that I'm seeing the use-case for creating subsets of tolerations from a Provisioner that has a larger set of taints. For this one, could you help by giving me a more concrete use-case?
Of course @jonathan-innis, here's an example of what I was planning to use it for.
Team: A => Isolated; needs a dedicated node group
Environment: Staging => Isolated; needs a dedicated node group
Products: [1, 2, 3, 4, 5] => Unisolated; can share a node group
Product: [6] => Isolated; needs a dedicated node group

Total node groups needed: 2

- Node group 1: taints team: A and environment: staging.
- Node group 2: taints team: A, environment: staging, and product: 6.

Workloads for products 1-5 specify tolerations for team: A and environment: staging. They don't need a toleration for product at all, even a generic one, because the first node group has no product taint. Workloads for product 6 specify tolerations for team: A, environment: staging, and product: 6.

Team: B => Isolated; needs a dedicated node group
Environment: Dev => Unisolated; can share a node group
Products: [1, 2, 3] => Unisolated; can share a node group
Microservice of product 1: KFP => Isolated; needs a dedicated node group

- Node group 1: taint team: B.
- Node group 2: taints team: B and microservice: KFP.

The shared workloads specify a toleration for team: B; the KFP microservice workloads specify tolerations for team: B and microservice: KFP. We don't even have to consider the environment or product tolerations because the workloads are able to share nodes at those levels.
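To illustrate the first scenario under the proposed wildcard semantic, a single provisioner could declare team, environment, and product wildcard taints, and the two groups of workloads would simply tolerate different subsets (a sketch; keys and values follow the scenario above):

# Product 1-5 workloads: land on node group 1 (team + environment taints only)
tolerations:
  - key: team
    value: A
    effect: NoSchedule
  - key: environment
    value: staging
    effect: NoSchedule

# Product 6 workloads: land on node group 2 (team + environment + product taints)
tolerations:
  - key: team
    value: A
    effect: NoSchedule
  - key: environment
    value: staging
    effect: NoSchedule
  - key: product
    value: "6"
    effect: NoSchedule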
If we don't have the ability to let the provisioner discard generated taints based on the pod's lack of tolerations for it:
There will be an explosion of provisioners with different combinations of taints. For example, in this case (team: A, environment: staging, and product: 6), we can't just have the product taint because a product can be in multiple environments. So we'd need a dedicated provisioner with the combination of these three taints.
We can get a version of the above working with a provisioner whose nodes will always have all four taints: team, environment, product, and microservice. But this will necessitate us having to figure out generic values to handle the shared node group case.
Products 1, 2, and 3 can share a node group? What's the shared value of the product toleration that will let them share a set of nodes? What dummy values do we insert into the microservice and environment tolerations? And who manages this and standardizes these dummy values so there's no confusion?
This complexity would only keep growing unless there's a way to allow for taints to "disappear" if the pod doesn't need them.
I've also only covered one scenario so far.
There's the scenario where a specific domain comes into the picture. For example, if we have Machine Learning workloads and we wanted to track them by trialName, we could use a trialName taint and toleration. But then would we have to assign a dummy toleration for trialName to every workload that uses that provisioner?
There's the scenario where time becomes a factor. Let's say two months after creating a provisioner with all four taints (team, environment, product, and microservice), I decided to add a taint called client. We'd have to add a generic toleration to every existing workload on the provisioner nodes and synchronize it with the restart of every existing node in the provisioner so the new generic taint gets added to them.
There's also the argument against repetition. Except for highly specific scenarios with specific AMIs, etc., one provisioner can handle the vast majority of use cases. So is it justifiable to repeat the spec of every provisioner for nothing more than getting a new taint? And to add the task of keeping all these provisioners in sync to the cluster admin?
This, for example, is a part of my requirements, and I'd rather repeat it as few times as possible so that it doesn't become necessary to copy-paste and keep multiple provisioners in sync:
requirements:
  # Prioritize spot, fallback to OD.
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]
  # Exclude instances in these categories.
  - key: karpenter.k8s.aws/instance-category # We prevent instances with a GPU, etc. from being provisioned.
    operator: NotIn
    values:
      - g   # GPUs
      - p   # Large GPUs
      - inf # Inferentia accelerator
      - dl  # Gaudi accelerator
      - cc  # Cluster compute, for HPC
      - d   # Dense storage, for parallel data processing
      - i   # Storage optimized
      - im  # Storage optimized, Graviton
      - is  # Storage optimized, Graviton
      - h   # Storage optimized, high throughput
      - u   # Absurdly high memory, for in-memory DBs
      - vt  # Video transcoding
      - x   # Xeon processors, high memory
      - z   # High mem, high CPU, high frequency
  # Exclude instances in these families.
  - key: karpenter.k8s.aws/instance-family
    operator: NotIn
    values:
      - c1   # 1st gen
      - c3   # 3rd gen
      - c6i  # Intel processors, twice the cost in Mumbai
      - c6id
      - m1   # 1st gen
      - m2   # 2nd gen
      - m3   # 3rd gen
      - m5zn # High frequency
      - m6i  # Intel processors
      - m6id
      - r3   # 3rd gen
      - r6i  # Intel processors
      - r6id
      - t1   # 1st gen
      - g3   # 3rd gen, old GPU
      - g3s  # 3rd gen, old GPU
  # We'd rather have smaller instances to encourage aggressive deprovisioning.
  # Prevent instances with more than 50 GB of memory from being launched.
  - key: karpenter.k8s.aws/instance-memory
    operator: Lt
    values:
      - "50000"
  # Prevent instances with more than 24 CPUs from being launched.
  - key: karpenter.k8s.aws/instance-cpu
    operator: Lt
    values:
      - "25"
I'm wondering if a semantic like optionalTaints is more appropriate here. I'm concerned with the * semantic in the existing taints field, since there is an understanding at this point that anything that appears in the taints or labels fields will always get populated onto the nodes.

If we impose something like .spec.taints[*].optional, we can imply that a taint value may not be applied to the node unless there is a corresponding toleration on the pod. This, combined with the wildcard semantic, could give you the behavior you are looking for.
taints:
  # Only apply this taint if there is a matching toleration
  - key: example.com/taint1
    effect: NoSchedule
    value: "*"
    optional: true
  # Apply this taint always to the node
  - key: example.com/taint1
    effect: NoSchedule
    value: "*"
I'm concerned that just specifying the wildcard in the value isn't intuitive enough for most users and may actually surprise a lot of people in practice.

As a user, I personally would find it surprising if I specified a taint on my provisioner with a wildcard and no taint appeared on a node produced by the provisioner. Adding an optional parameter to the taint makes it very clear that this taint will only appear if there is a toleration on the pod that can support it.

I'm curious to hear @ellistarn's thoughts on this though.
- key: example.com/taint1
  effect: NoSchedule
  value: "*"
  optional: true
This is definitely an option. Do we need value: "*" if we also have optional: true? These express the same intent in my mind.

- key: example.com/taint1
  effect: NoSchedule
  optional: true
That said, as I continue to think about this, I've been thinking about the semantic more generally. One way to think of `*` is that it's any value, including the lack of a value. I think it might be logically reasonable for a "wildcard taint" to only be applied if a workload requests it, and that a valid value for a wildcard taint is to simply not exist. I'm weighing how (un)natural this semantic is against the cognitive burden of creating a new concept that must be understood by users.
Tentatively, I think I'm supportive of the wildcard approach, where taints may or may not be generated, depending on the tolerations of incoming pods.
These express the same intent in my mind.
I'm not sure they do. One expresses the intent that I would like you to generate the value for me based on the pod's toleration, otherwise give me the empty string value "". The other, optional, says that if the pod doesn't have the toleration, then don't generate a taint for the node.
One thing this prevents us from doing is creating an "optional" taint with a specified value. There's no way to have optional taints and constrain the value if we choose to just go with the pure wildcard semantic.
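For example, the pure wildcard form has no way to express a hypothetical combination like this, where the taint is only applied when tolerated but its value is pinned:

taints:
  - key: example.com/taint1
    effect: NoSchedule
    value: team-a    # constrained value
    optional: true   # only applied if the pod has a matching toleration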
The other thing I was considering is that if we want to extend the wildcard semantic to other things like effect, we are not able to do this without using something like optional, since the optionality of the taint will be communicated by the value: "*" semantic.
@ellistarn and I discussed offline and decided, based on the use-cases as you described them @VishDev12, that the wildcard semantic (which can also mean no taint if there is a pod that does not have a toleration for the matching taint key):

- key: example.com/taint1
  effect: NoSchedule
  value: "*"

makes the most sense going forward. I'll be working on trying to get this in as a feature add to the project.
That's awesome! Thank you both!
@VishDev12 There's been more discussion around the PR I put up with the maintainers, specifically around the non-deterministic nature of the dynamic/generated taints proposal. Take the following example given a single provisioner with the proposal as described:
Provisioner Taints

taints:
  - key: team-name
    effect: NoSchedule
    value: "*"

Pending Pods

metadata:
  name: pod1
spec:
  tolerations: []

metadata:
  name: pod2
spec:
  tolerations:
    - key: team-name
      value: team-a
...
If pod1 goes through our scheduler algorithm first, we run into an issue: we will provision a node with no taints, because the taints were optional (as specified above). Then, when the second pod pod2 comes to be scheduled, we will schedule it to the same node because it can fit on that node. This is most likely not what was intended by the user.

In the reverse case, where pod2 goes through the scheduling loop first, we will create two nodes (one with the team-name taint and the other with no taints).
We see this as a concern since behaviors can randomly diverge depending on how pods come into the scheduler. We are wondering if the following solves your use-case of enforcing isolation:

- Create generic taints for isolation "categories" when you want to isolate based on some value
- When specifying a generic taint for a category, specify the matching generic toleration on the workload pod and ensure that a nodeSelector or requiredNodeAffinity is set on that same workload pod to enforce the isolation

Provisioner
requirements:
  - key: "team-name"
    operator: Exists
taints:
  - key: "team-name"
    effect: NoSchedule
Pending Pods

...
nodeSelector:
  team-name: team-a
tolerations:
  - key: team-name
    operator: Exists

...
nodeSelector:
  team-name: team-b
tolerations:
  - key: team-name
    operator: Exists
In this case, you would still have to do a couple of things, including making sure that teams set the relevant nodeSelectors on their workload pods. Ideally, this would solve your isolation scenario without requiring any feature adds to the current set of controllers.
The remaining concern from the Provisioner spec you specified above is the need to continually re-specify the same set of requirements across different Provisioners. This might be solved by an entirely different mechanism, something like a RequirementsClass that we could add to the API to prevent the need for re-specification of common requirements.
Thank you for the explanation, but please allow me to take a bit more of your time. I understand this concern:
We see this as a concern since behaviors can randomly diverge depending on how pods come into the scheduler.
But this was expected behaviour: dynamic taints are only one part of the story. Just like dynamic labels and node-selectors don't enforce node isolation, dynamic taints and tolerations won't enforce node selection.
In the use case I laid out, one provisioner would've been used both for the no-taints case and the generated taints case, so the expectation was always that both dynamic labels/node-selectors and dynamic tolerations/taints would be required to get the desired behaviour.
So the user would add to their workload both a node-selector (for the dynamically generated label) and a toleration (for the dynamically generated taint). If either of those two pieces seems like it doesn't make sense on its own (the toleration certainly seems without purpose), it's because they're meant to be two opposing parts that make up a single story, one to enforce selection and one to enforce isolation, and together they guarantee complete isolation. They have to be used together, just like labels/node-selectors and taints/tolerations are used in regular EKS or Spotinst Ocean node groups to guarantee isolation.
I think the users that make use of dynamic taints will have a clear understanding of this fact if they ever reach that point in their scheduling journey, especially since wildcards aren't something they could add by accident to their provisioner taints.
I should have definitely elaborated on this point, but the intent was captured in this line on the first message of this thread:
It seems that in addition to dynamic labelling, a mechanism to dynamically taint the nodes would help guarantee workload isolation at the node level.
Just to complete the picture:
Provisioner
requirements:
  - key: "team-name"
    operator: Exists
taints:
  - key: "team-name"
    value: "*"
    effect: NoSchedule

Pod Spec
nodeSelector:
  team-name: team-a
tolerations:
  - key: team-name
    value: team-a
What is the gap missing in the picture above?
- Create generic taints for isolation "categories" when you want to isolate based on some value
- When specifying a generic taint for a category, specify the matching generic toleration on the workload pod and ensure that a nodeSelector or requiredNodeAffinity is set on that same workload pod to enforce the isolation
Is there a specific need to specify taint values exactly, or can you not guarantee that when a workload is expected to be isolated it contains both a generic taint and the relevant nodeSelector for which of the tainted nodes it should be scheduled to?
That's definitely a workable solution and is what I'm using right now to separate workloads in a dedicated provisioner. The main issue is just that the premise of this issue is lost, which was colocating generic and isolated workloads within a single provisioner. This will no longer be possible, since it requires the creation of provisioners for each combination of taints, as you've already pointed out. But as you also noted, that burden can be lessened in a different way through something like a RequirementsClass.
I'll be happy to continue with the existing functionality at the moment if you and the team think dynamic taint support is a risky addition. I want to make it clear that I'm not blocked in any way; I just believe that this is a great enhancement to Karpenter.
Though I hope we revisit it sometime in the future! My personal opinion is that dynamic tainting is deterministic and just needs the right note and usage warnings as laid out here to accompany the wildcard taint feature description in the docs.
Thank you for all your support and the discussion!
which was colocating generic and isolated workloads within a single provisioner
Since we are distilling this issue down to this use-case, can you provide some details around the number of different provisioners that you require right now since you are currently having to specify a different provisioner for each combination of taints?
My personal opinion is that dynamic tainting is deterministic and just needs the right note
I agree that with nodeSelectors there can be determinism in the scheduling; however, it seems that this determinism falls apart when the nodeSelectors which strictly isolate workloads aren't specified.
It's a bit hard to put a number to it, because it varies based on the availability of dynamic taints.
With dynamic taint support, even at the initial stages of use, I'd be using 10-15 different dynamic labels and taints. Since it's just one provisioner, I'd have no regard for the different possible subsets and could be quite liberal with this.

Without it, because each combination of dynamic labels/taints requires a provisioner, this can cause a combinatorial explosion. So I'd be far more conservative and focus heavily on the most granular level of specification; I'd say 10-15 provisioners at most, coming mostly from combinations of 5-7 different dynamic labels/taints. If I went with the same behaviour as the previous case, I'd easily be looking at 20-30 provisioners for the combinations, which seems hard to manage and make sense of.
it seems that this determinism falls apart when the nodeSelectors which strictly isolate workloads aren't specified
Yeah... But to be fair, can the misuse of a feature be called nondeterminism? Because Karpenter and the scheduler would be doing precisely as instructed, the onus should be on the user to specify both the nodeSelectors and tolerations.

Karpenter taking on this responsibility might introduce an unnecessary coupling between the nodeSelector and toleration, if we force the user to specify both, for example.
Also, this could be thought of as a feature and not a bug. A sort of soft preference:
If you already have nodes that these pods can tolerate, schedule the pods there. If not, I'd rather you spin up nodes dedicated to these workloads.
(Okay the last part was a stretch, but I had to try)
Was there any more movement on this discussion? We now have two different workloads that need to run on nodes isolated from the other workloads, but the provisioner is exactly the same for all 3, so having so much extra code is just asking for trouble on future modifications and a lot of toil. If we were able to just taint those nodes based on the workloads, then those nodes would be used only by these workloads, without impact to the standard workloads. Thanks.
@FernandoMiguel Do you have any thoughts on how you would expect some of the edge-cases to behave in the case where the ordering of pods causes non-deterministic behavior similar to the scenarios that are listed in the discussion above?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

/remove-lifecycle stale
Going to revive this thread
Do you have any thoughts on how you would expect some of the edge-cases to behave in the case where the ordering of pods causes non-deterministic behavior similar to the scenarios that are listed in the discussion above?
I can give some insight into what I am looking into @jonathan-innis.
I have a larger use case: I am talking about hundreds of NodePools vs 1-2 NodePools, because we want to provide isolation between "services".
IMO the non-deterministic behavior is expected, because the state of the world and what the pod needs are fundamentally different. If a pod is asking to land on a node with no taints (because it has no toleration), we need a node without taints. That also means there is no expectation for this pod to land on an isolated node, because it doesn't have a toleration. On the other hand, since the second pod didn't have a node selector, there is no expectation about which node it runs on, hence it ran on a node with another pod from a different team.

I don't see this as a bug nor a feature, because there is a difference between what a taint does and what a node selector does, and we shouldn't expect convergence of behavior just because the order of operations is different, nor should we solve for convergence of these behaviors. This feels like saying we have "inconsistent" behavior if we set node affinity and node selector differently, when they behave differently and have different expectations. To me this isn't non-deterministic behavior; it is deterministic behavior with a different state of the world.
Adding a TLDR on this thread for those who are coming back to it more recently. When you need to ensure different tenants don't schedule onto the same node, you can orchestrate this in different ways: either by creating a separate NodePool (with its own fixed taint) per tenant, or by using a generic taint together with a nodeSelector/requiredNodeAffinity on each tenant's pods, as discussed above.

Both of the above workarounds have the downside of moving further away from Karpenter best practices. If Karpenter supported a "*" value in taints, Karpenter could generally say that any pod with a matching toleration key is compatible, where the value of the taint on the node is then the value on the pod's toleration. Yet this only ensures isolation if pods are also selecting on a set of compatible NodePools, preventing bad-acting tolerating pods from racing and potentially resulting in nodes without taints (not ensuring tenant isolation).
Tell us about your request
Using this scheduling technique, it's possible to add a custom label to a node based on a corresponding key/value pair specified in the node-selector or node-affinity. This allows us to have nodes from a single provisioner that have differing custom labels based on the workloads that get scheduled on them, and ultimately, this allows for workload isolation at the node level.
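As a sketch of that technique (mirroring the team-name example used elsewhere in this thread; key and value are illustrative), the provisioner only declares that the label key exists, and the pod's node selector supplies the value that ends up on the node:

Provisioner (fragment):
requirements:
  - key: company.com/team
    operator: Exists

Pod (fragment):
nodeSelector:
  company.com/team: datascience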
But this is only true if every pod using that provisioner specifies that custom label in the node-selector or node-affinity. If a workload doesn't specify it, it can now be scheduled on any node launched by the provisioner and isolation is no longer a guarantee.
It seems that in addition to dynamic labelling, a mechanism to dynamically taint the nodes would help guarantee workload isolation at the node level. This would be critical for multi-tenant scenarios where it's one thing to say that isolation is 'possible', but with this feature, isolation would be inviolable.
@ellistarn had the suggestion of defining taints in the provisioner with an empty or '*' value and then specifying the corresponding tolerations in the workload.
Taint specified in the provisioner:
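- key: company.com/team
  value: "*"
  effect: NoSchedule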
Toleration:
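- key: company.com/team
  value: datascience
  effect: NoSchedule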
If the corresponding toleration is not specified in the workload, the understanding is that the node on which it will be scheduled never gets a taint. This allows both isolated and non-isolated workloads to use the same provisioner.
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
In a multi-tenant scenario, we prefer to have workloads cleanly segregated by nodes. Currently this is achievable by creating a new provisioner, with the clear expectation that everyone using it specifies the custom label in their node-selector or node-affinity; that is what determines the workload isolation, because their workloads would always be scheduled on nodes with matching labels. What would be more ideal is to be able to do this directly from the default provisioner, where both isolated and non-isolated workloads are able to coexist and it is impossible for the non-isolated pods to be scheduled on the isolated nodes.
Me being greedy
The above would be more than cool enough already. But if it's possible to do away with the necessity of specifying the custom labels or taints on the provisioner, that would be the ultimate level of flexibility. Unlimited node groups controlled by nothing but the workloads. Perhaps something along these lines:
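For illustration, the pod spec might look something like this (the customLabel/ prefix and the values are hypothetical):

nodeSelector:
  customLabel/team: datascience
tolerations:
  - key: customLabel/team
    value: datascience
    effect: NoSchedule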
The expectation being that when Karpenter sees 'customLabel/' in the nodeSelector, it understands that it needs to generate a corresponding node label. The same would apply for generating taints when it sees the 'customLabel/' prefix in the tolerations.