stevehipwell opened this issue 2 years ago
Does annotating the pods of your long-running jobs with `karpenter.sh/do-not-evict` work for you?
@ellistarn wouldn't this allow more jobs to schedule onto the node and basically make it unterminatable?
The node will still expire and cordon, but it won't start draining until the pods with that annotation are done. @njtran can you confirm?
Yep! If you have a pod with a `do-not-evict` annotation, Karpenter will not drain any of the pods on a node, but it will still cordon the node. A node is not drained if it has a `do-not-evict` pod, and we will not terminate a node until that node is considered drained.
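For illustration, a minimal pod sketch carrying that annotation (the pod name and image are placeholders, not from this thread):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: long-running-worker            # placeholder name
  annotations:
    # Karpenter will not voluntarily drain the node while this pod is running
    karpenter.sh/do-not-evict: "true"
spec:
  containers:
    - name: worker
      image: example.com/worker:latest # placeholder image
```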
Does this fit your use case?
@njtran this fits my use case happy path perfectly, but it opens up a separate concern: the `do-not-evict` annotation could keep a node running indefinitely. If there isn't a way to handle this already, I'm happy to close this issue as a "didn't read the docs correctly" and open a new issue regarding a `do-not-evict` grace period?
Sorry for the delay! You're correct in that if a `do-not-evict` pod exists, then Karpenter would never try to scale down the node.

> open a new issue regarding a do-not-evict grace period?

Could you expand on what you mean here? What's the use case for having an additional grace period for a `do-not-evict` pod? You could programmatically delete these pods once they've completed, in which case the node would then be able to be deleted if there were no other `do-not-evict` pods.
@njtran fundamentally it comes down to reducing the blast radius. If anyone who can start a container can keep a node running, that starts to look like a potential problem. I'd like to know that after a given time, no matter what a user has done, a node will terminate and be replaced.
> additional grace period for a do-not-evict pod

Is there a difference between this and `terminationGracePeriodSeconds`?
> I'd like to know that after a given time, no matter what a user has done, a node will terminate and be replaced.
Even if it would violate a pod disruption budget?
> Is there a difference between this and terminationGracePeriodSeconds?

@ellistarn this would terminate the node even if there was a pod with a `karpenter.sh/do-not-evict` annotation running. Effectively, after the TTL expires the annotation would be ignored.
> Even if it would violate a pod disruption budget?

@tzneal I think that's a separate concern, but pretty important. A TTL which guarantees the node is terminated would make sense; having pods stuck due to user error and blocking node maintenance is a common problem when running a platform.
Renamed this issue to more accurately reflect the discussion above.
> Yep! If you have a pod with a do-not-evict annotation, Karpenter will not drain any of the pods on a node, but it will still cordon the node.

@njtran Curious if this is still the expected behaviour. When enabling consolidation, I experienced Nodes being cordoned but not drained. I'm wondering if this annotation on a Pod is possibly what's happening.
> A TTL which guarantees the node is terminated would make sense; having pods stuck due to user error and blocking node maintenance is a common problem when running a platform.

+1, a TTL would be helpful.
@alekhrycaiko sorry for the late response here, you should be able to see the kube-events that show if a node was unable to be deprovisioned due to a blocking pod.
And @stevehipwell sorry for the late response! This makes sense to me, as it essentially guarantees (besides involuntary disruptions) that a workload will be up for at least the duration it's decided to be. This would require some design, and would be great for anyone to take up implementing if they want some more experience in the code.
Thinking about it more, is there a reason you can't rely on `activeDeadlineSeconds` for this? https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup

If a pod is terminal, we ignore its `do-not-evict` annotation as well. So is there something other than jobs that you want a `do-not-evict` TTL for?
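As a sketch of that suggestion (Job name, image, and command are placeholders): `activeDeadlineSeconds` caps how long the Job may run, so the `do-not-evict` annotation can only hold a node for at most that long.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: e2e-tests                      # placeholder name
spec:
  activeDeadlineSeconds: 3600          # Kubernetes terminates the Job's pods after 1 hour
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-evict: "true"
    spec:
      restartPolicy: Never
      containers:
        - name: runner
          image: example.com/ci-runner:latest  # placeholder image
          command: ["./run-e2e.sh"]            # placeholder command
```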
@njtran this functionality request is to taint a node so no further pods are scheduled on it, allowing either the running pods to finish or the expiry TTL to trigger the pod(s) to be killed. A use case would be to have a grace period between the taint TTL and the expiry TTL to support clean node terminations when pod age is within a range lower than the grace period; for example, a CI/CD system.
Ah, I think I fully understand now. So this `TTLSecondsUntilTainted` (or maybe better named `TTLSecondsUntilCordoned`) would act as a programmatic schedulable window for all nodes. If you combine this with `ttlSecondsUntilExpired`, then you get a grace period of time where you can guarantee your workloads won't be expired (although you can't guarantee it for other disruptions). Am I right?
Since the only disruption mechanism that is driven by time is expiration, this `TTLUntilCordoned` seems more likely to make it harder for Consolidation to help cost, and without any enabled disruption mechanisms, this could result in a lot of unused capacity. I think this would be better driven at the application level, using `activeDeadlineSeconds` for jobs with `do-not-evict`, or adding a `do-not-evict-ignore-ttl` if there's a reason for pods that aren't jobs.
@njtran for the use case I'm referring to the pods are explicitly not jobs and are expensive to re-run, for example a CI/CD system running E2E tests where terminating the pod would have a high cost. The desired behaviour would be for Karpenter to be able to cordon a node to stop any new pods being scheduled, while allowing existing pods to either complete successfully or time out before the node is replaced. It might be best to drive this as a pod label such as `cordon-timeout-seconds: "120"`. Consolidation could then de-prioritise nodes with pods labeled for this behaviour.
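A rough sketch of how that proposed label might look on a pod (the label is hypothetical, as are the pod name and image; nothing here is an existing Karpenter API):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: e2e-runner                     # placeholder name
  labels:
    cordon-timeout-seconds: "120"      # hypothetical label from the proposal above
spec:
  containers:
    - name: runner
      image: example.com/e2e:latest    # placeholder image
```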
My apologies on the length of time on this discussion, I've lost track of this issue many times 😞.
If the pods are explicitly not jobs, where terminating the pod would have a high cost, how do you coordinate when the pod is okay to be evicted? Is there some special semantic for this CI/CD pod that signifies completion to another controller in your cluster that isn't represented by a kubernetes-native construct?
It seems odd to me to create a node-level time-based restriction on scheduling that is actually motivated by a pod-level time-based restriction. While you can get into a situation where a node is not optimally utilized due to unluckiness in how these pods get rescheduled, if you're eventually expecting the node to be expired, the node should still be utilized since Karpenter doesn't cordon it. Once no pods have scheduled to it, then Karpenter could execute expiration. But you're right that if `do-not-evict` pods unluckily always opt the node out of deprovisioning, Karpenter may never deprovision this node.
In my eyes, given the tight correlation with Consolidation and Expiration here, I'm not convinced that this is something that makes sense to include in the API. I think this would better be modeled as https://github.com/aws/karpenter-core/issues/622. Thoughts @stevehipwell ?
I think that aws/karpenter-core#622 would probably solve enough of this use case to be viable.
Architecturally speaking, though, I'm not 100% convinced that a pod (and therefore potentially a low-permissioned user) should be able to impact and effectively own a node's lifecycle, as is the case currently and even if aws/karpenter-core#622 is implemented; this looks like a privilege escalation to me. Providing a mechanism for the cluster admin to set a maximum node TTL, beyond the point where its lifecycle should have ended, would address this issue.
> should be able to impact and effectively own a node lifecycle
Can you try controlling this with something like Kyverno or some other admission webhook mechanism that would block "normal" users from setting this annotation value on a pod to avoid the privilege escalation?
> Can you try controlling this with something like Kyverno or some other admission webhook mechanism that would block "normal" users from setting this annotation value on a pod to avoid the privilege escalation?

I can and likely will run OPA Gatekeeper to control this when we get Karpenter into production, but not everyone who runs Karpenter will have any policy capability.
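For readers who do run a policy engine, a rough sketch of such a guardrail with Kyverno (the policy name is a placeholder; any exemptions for trusted namespaces or admins would still need to be added via `exclude` rules, which are omitted here). This assumes Kyverno's validate-pattern negation anchor for forbidding a key:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-karpenter-do-not-evict     # placeholder name
spec:
  validationFailureAction: Enforce
  background: false
  rules:
    - name: block-do-not-evict-annotation
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "The karpenter.sh/do-not-evict annotation may only be set by cluster admins."
        pattern:
          metadata:
            # X() negation anchor: the annotation key must not be present
            =(annotations):
              X(karpenter.sh/do-not-evict): "null"
```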
Hi there, I hope I'm barging into the right discussion here; if not, I am truly sorry. I want to give an example of a case to see if it makes any sense.

I have a Karpenter provisioner with consolidation disabled that holds a StatefulSet with 4 replicas on 4 nodes. The pods have a do-not-evict annotation because I do not want unwanted evictions if a new AMI is released or changes happen.

All PDBs are in place. In case I need to upgrade my worker nodes (cluster upgrades or AMI upgrades), a restart of the StatefulSet will start by terminating a pod, then wait for Karpenter to catch the pending pod and spin up a node. If I could use the drift feature, the node would be ready beforehand and expedite the process.

I would love to have a way to "trigger" a drift on nodes so I can replace them gracefully, with an annotation or something of the sort. Is that possible and I'm missing something? Is that planned?
Karpenter v0.30.0
I've been reading through the design document referenced above (https://github.com/aws/karpenter-core/pull/516) and read that to mean the controls for when a pod is going to be considered for eviction are placed on the NodePool. We have pods that take 20 minutes to start up and checkpoint every 5 minutes after that. The grace period mentioned in this issue, on the pod, would allow us to block eviction of the pod for a significant time so the task can get traction before being considered for consolidation. Having it at the NodePool level would mean that each workload with a long start-up time and checkpointing would need its own NodePool. We experimented with separating our workloads into node pools and saw a more than 50% increase in our cluster costs immediately.
Please forgive me if this is not the right place for the above feedback, willing to move the conversation to the appropriate venue.
> read that to mean the controls for when a pod is going to be considered for eviction are placed on the NodePool

Not necessarily. In the context of this issue, I could imagine the solution changing the value of the annotation to a `time.Duration`-formatted string, which Karpenter could use with `CreationTimestamp` to know when it can be deleted. Yet for NodePool-level controls, I could see something like a `minNodeLifetime` being included, which would opt out nodes that aren't old enough for disruption.
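To make those two shapes concrete (both are hypothetical sketches of ideas floated here, not existing Karpenter API; names and values are illustrative):

```yaml
# Hypothetical: duration-valued annotation, measured from the pod's CreationTimestamp
apiVersion: v1
kind: Pod
metadata:
  name: checkpointing-worker             # placeholder name
  annotations:
    karpenter.sh/do-not-disrupt: "30m"   # hypothetical duration form of the annotation
spec:
  containers:
    - name: worker
      image: example.com/worker:latest   # placeholder image
---
# Hypothetical: NodePool-level control opting nodes out of disruption until they reach a minimum age
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    minNodeLifetime: 1h                  # hypothetical field, not part of the current API
```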
> We experimented with separating our workloads into node pools and saw a more than 50% increase in our cluster costs immediately.
This seems wrong, or at least unexpected. Would be good to figure out if there's something wrong here. Can you open an issue for this? Or let's sync up on slack. I think another issue would be a better place to track this.
Copying a conversation over from https://github.com/kubernetes-sigs/karpenter/pull/926#discussion_r1444156935. Playing around with passing this same grace period down to a node from the NodeClaim, since it's possible to use the `karpenter.sh/do-not-disrupt` annotation across all NodeClaims/Nodes in v1beta1 🤔
> I wonder if there is room in the disruption controls for something like "do not disrupt within first 5m of life"
We have this as an annotation that can get propagated onto the Node from the NodeClaim today. Seems like this might be a really nice place to do #752 and push the `karpenter.sh/do-not-disrupt` annotation, which can either be configured at the Pod spec level or at the NodeClaim level (which could be passed down from the NodePool's `spec.template.metadata.annotations`, so you could do something like):

```yaml
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: 10m
```
Love this. Applied at both pod and node level? Could make a new annotation alongside the existing one, called `disruptAfter`.
> Could make a new annotation alongside the existing one called disruptAfter

Any reason you think that it should be a new annotation vs. a value for the existing one?
> new annotation
Just conceptual alignment, at the cost of a new indefinitely deprecated annotation.
A few things to consider:
It may not make sense to try to align all of these concepts, but worth exploring.
I like the idea that we can just set `do-not-disrupt=5m` on the NodePool under annotations and that the do-not-disrupt annotation will be propagated to all nodes in the NodePool.

If the use case is more temporary, one can set it on the nodes themselves manually. Disruption control at the node level is a very useful tool to have at your disposal, so forcing it into the NodePool API itself removes a bit of that freedom.

It gives more control at the annotation level vs. just sticking it somewhere in the API where that control is tied to the NodePool.

One could also argue that having something that controls disruption outside of the disruption controls leads to a fragmented experience. I can't just go to the NodePool disruption controls to understand the full behavior of disruption (there are other things like PDBs etc. that may block consolidation, but I am referring to Karpenter settings).
> It gives more control at the annotation level vs. just sticking it somewhere in the API where that control is tied to the NodePool

There's an argument that you could make this field "part of the API" by making it part of the NodeClaim, since that's basically the place where Karpenter is doing API configuration for the Node. Annotations (in a way) are an extension of the API when you don't have full control over the schema yourself. This makes sense in the context of pods since we don't control the Pod API, but makes less sense with Nodes where we do have a representation of the thing already (NodeClaim).
To play devil's advocate, the counter-argument to this is that using `karpenter.sh/do-not-disrupt` in both places means there's a single concept that users can reason about everywhere rather than having fragmented concepts. I don't think there's necessarily a huge burden here if we decided not to go the same route with both API concepts, but food for thought.
We reconcile anything in the template as drift, though I believe we ignore any mutations of the underlying nodeclaim / node. Does it make sense to implement it at the NodePool level as a first class API, and then also as an annotation that can be manually applied to pods and nodes?
When discussing the implementation at the NodePool level, it's important to note that we already have some similar concepts in disruption controls. We have `consolidateAfter` and `expireAfter`, so `disruptAfter` seems to fit right into the group.
```yaml
spec: # This is not a complete NodePool spec.
  disruption:
    consolidationPolicy: WhenUnderutilized || WhenEmpty
    consolidateAfter: 10m || Never # metav1.Duration
    # add disruptAfter: 10m || Never
    expireAfter: 10m || Never # Equivalent to v1alpha5 TTLSecondsUntilExpired
```
> To play devil's advocate, the counter-argument to this is that using karpenter.sh/do-not-disrupt in both places means there's a single concept that users can reason about everywhere rather than having fragmented concepts.

I like the idea of both; if not for drift, we could potentially control disruption across the whole NodePool like so:
```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      # Annotations are arbitrary key-values that are applied to all nodes
      annotations:
        karpenter.sh/consolidateAfter: 10m
        karpenter.sh/disruptAfter: "10m"
```
`consolidateAfter`, `expireAfter`, and potentially `disruptAfter` should all be annotations. It still makes sense to have this concept of disruption controls under the CRD for controlling disruption alongside the annotations. Maybe it's a bit redundant, but that's OK; let users pick what they want to use if it's not too high of a burden.
> ConsolidateAfter, expireAfter, and potentially disruptAfter should all be annotations

I'm not sure I love the idea of spreading out the API like this. I definitely think that there's benefit to having `karpenter.sh/do-not-disrupt` for nodes for ad-hoc debugging. That was the reason the feature was originally added in the first place. If we added support for the duration value on pods, we may just be stuck implementing it on the Node as well for parity, and you just get the feature for free if you pass the annotations down like above.
From the perspective of defining minNodeLifetime at the NodePool level, I could see the argument for adding this into the NodePool so that it doesn't drift everything. We're actually having a similar discussion in #834 where there's a bit of a back-and-forth around whether we should avoid drifting on certain properties of the NodePool.
Similar to the `terminationGracePeriod` feature that we are talking about, this is another field that I could see living on the NodeClaim and being templated down from the NodePool. `minLifetime` really has a meaning in the context of stand-alone NodeClaims (really `maxLifetime` does as well, but we currently have this as expiry in the NodePool disruption block). We may be stuck with expiry in the disruption block at this point, most importantly because expiration has to be reasoned about for the disruption budgets.

It does raise a question though: are there values on NodeClaims that can also be "control" mechanisms that don't cause drift on Nodes but are updated in-place, similar to the disruption block in the NodePool?
I'm relatively new to Karpenter so I may be mistaken here, but from my experience so far it feels like there are at least two kinds of disruptions that should be supported:

1. Disruptions that ignore `do-not-disrupt` annotations entirely -- workloads should not be able to block disruption. Cluster administrators are not frequently executing disruptions in this mode (e.g. once a month, once a quarter -- whatever the update policy is in the organization). But in any case, the disruption should be immediate (as if `do-not-disrupt` were not present) and not require any grace period (although I suppose it doesn't hurt to have one). Without this, cluster maintenance is frankly just annoying -- the cluster admin has to find all the annotated workloads and remove the annotation or else kill nodes by hand.
2. Disruptions that respect `do-not-disrupt` in all cases. In these cases, the workload owner has chosen that they don't care about saving money (or other considerations) and want their workload to run to completion on the existing infrastructure (e.g. a long-running ML job).

Right now, it seems like mode 1 is not a possibility. A workload owner can trivially and indefinitely block a node from de-provisioning, which is a strange power for a workload owner to have. Adding OPA rules to prevent labeling workloads is also not a solution, since it's completely valid for workload owners to exercise this power under normal circumstances -- just not when a cluster admin is trying to perform maintenance.
Additionally, having a single TTL / ignore-disruptions setting for both of these cases doesn't really seem to make sense. If I have a disruption TTL for cluster maintenance events, I strictly do not want it to apply to consolidation events. These are completely separate concerns. Consolidation events can have their own TTL, but I don't think it makes sense to have a single, global setting.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, `lifecycle/stale` is applied
- After a further period of inactivity once `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After a further period of inactivity once `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/lifecycle frozen
@jcmcken you should look to terminationGracePeriod, which should solve your use-case for case 1: https://github.com/kubernetes-sigs/karpenter/pull/916/files
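As a rough sketch of how that could look on a NodePool (field placement and API version assumed from the linked PR; the duration is illustrative): once the grace period elapses, Karpenter can proceed with draining even if `do-not-disrupt` pods or blocking PDBs remain.

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      # After this period, draining proceeds even past do-not-disrupt pods and blocking PDBs
      terminationGracePeriod: 48h
```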
**Tell us about your request**

I'd like the ability to taint a node after a given period via `ttlSecondsUntilTainted` to allow running pods to finish before the node is terminated; this should still respect the `ttlSecondsUntilExpired` and `ttlSecondsAfterEmpty`.

**Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?**

When running jobs such as CI/CD pipelines in K8s there are long-running jobs which shouldn't be terminated due to the high cost of re-running them; by adding `ttlSecondsUntilTainted` we can have nodes that expire and are replaced without the cost of killing potentially long-running jobs.

**Are you currently working around this issue?**

Not using Karpenter.
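Putting the request in context of the v1alpha5 Provisioner API it was written against (a sketch: `ttlSecondsUntilTainted` is the proposed, hypothetical field; the other two fields exist, and all values are illustrative):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 2592000   # 30 days
  ttlSecondsUntilTainted: 2588400   # hypothetical: cordon roughly an hour before expiry so running pods can finish
```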
**Additional context**

n/a

**Attachments**

n/a