The pods would have to have replication controllers for that to work. @ddysher can say if decommissioning is currently possible, or what is planned, and how that would interact with stopping of pods.
Yes, having replication controllers is always the recommended way, but since Kubernetes allows for manual pod scheduling (with a pod descriptor instead of a replica controller descriptor) I think it would also be nice to support something like:
We are not there yet. The best way is to use replication controllers for now.
After PR #3733 lands, I'll work on node lifecycle management and decommissioning. Node has a Terminated phase, which is when k8s tries to re-schedule pods. This is for your first point.
For the second point, it'll be a little subtle to have all the other components understand node phase, especially the interactions between the node controller, replication controller, and scheduler. I think the best way is to have the node controller mark all pods as unscheduled, which triggers the scheduler to re-schedule them. From the perspective of the replication controller, the pod never fails, so we don't have to distinguish manual pods vs. controller-managed pods. There is probably more involved here, like restart policy, volumes, etc. I'll think about it more.
I don't think we want to do the second point. Pods should not come back to life on a different node after they have been stopped on one node. We want the phase state machine to be a DAG not a general graph.
I personally think there's a reasonably-big usability gain from allowing pods to declare a restart policy that includes node failure and little downside. We already allow pods that are scheduled on a machine to restart as new containers if they crash, so the phase state machine already has cycles. And it seems strange to allow restarts only if containers on a node fail, but ban them when they'd start on a new node.
As a not-so-strawman example, setting "onNodeFailure: reschedule" seems like a reasonable user request.
Do we have cycles in the current state machine? Restarting containers seems to be a cycle, but containers are not primitives in k8s. We do create pods on failure from the replication controller, but the newly created pods are not the same entities as the previous ones; they have different UIDs, at least.
What the second point (re-scheduling) really sounds like is migrating pods from the failed node to a new node. This is tricky, not to mention volumes and sessions. I have no objection to the DAG approach, but I agree with @alex-mohr that we need to do something here for usability. To do this, we'll at least need to distinguish manually-created pods from replication-controller-created pods, maybe via reverse label lookup? I don't know if that would ever work.
We need to be able to drain machines. We should use an approach consistent with other resources. See #1535 re. graceful termination.
@alex-mohr We've discussed this before. Pods are replaced by replication controllers. They are not rescheduled. In fact, I'd like the replication controller to be able to replace pods ahead of killing the ones they are replacing in the case of a planned move. Let's not derail this issue with the "forever pod" discussion.
At the moment, the system doesn't have reasonable behavior in the cases of either planned or unplanned node outages. Let's fix that -- #1366 covers the unplanned case. This issue can cover the planned case.
@ddysher Why do you want reverse label lookup? The node controller needn't be aware of pod controllers -- replication controller, job controller, etc.
It would be great to have something in kubectl that allowed you to mark a node to be pulled out of rotation (and then showed up with kubectl get minion) and then drained after being marked for deletion or maintenance.
I ran into an issue like this the other day where I needed to rotate out some hosts for maintenance and had to manually remove pods after stopping the servers.
@bgrant0607 The reason I'm trying to do so is the possible overlap in function between the node controller and the replication or job controller.
If all pods are started with a replication/job controller, then the node controller just needs to remove the pods. But in cases like this issue, where pods are started without any controller, the node controller would have to be responsible for removing them and recreating them elsewhere. The restart part seems to be a duplicated function, i.e. the node controller and the replication controller will both try to create a new pod.
If the node controller just removes the pod, then this seems to break our restart policy. A user would expect a pod with RestartAlways to always restart, even in case of node failure.
Did I interpret that correctly? I can't think of any other component that would claim 'ownership' of those pods.
@ddysher No, the node controller should never recreate pods elsewhere. That's not its job. Users that want that behavior need to run the pods under a pod controller, such as the replication controller. No, it doesn't invalidate restart policy -- separation of concerns. Individual pods are vulnerable to node failure -- that's reality and the model. See #3949 for more details.
The model makes the node controller much easier, and that's definitely a good thing. But from a user's perspective (not from how we design/simplify the system), node failure without pod restart is really confusing. We haven't stressed pod controllers enough; even our classic example creates a naked pod, as you mentioned in #1603.
Here I'm not saying we should recreate pods, just bringing up a potential issue if we don't do so :)
Yes, we should fix our broken examples.
I think you should remove the notion of a pod without a replication controller. It would simplify the possible scenarios.
@pires Been there, done that. Pod needs to be an available primitive. One reason is that we plan to support multiple types of controllers.
I think the ability to mark a node as deactivated/decommissioned using kubectl gives the user/admin flexibility to do node upgrades (security patches, software upgrades) and node evacuation/custom pod migration.
To support this use case, maybe we can add a new condition on the node, say 'NodeDeactivate'. When the 'NodeDeactivate' condition status is set (Full), irrespective of the NodeReady/NodeReachable condition status, the scheduler can ignore this node for new pod creation.
CLI can be:
Node deactivation: kubectl update nodes
Node activation: kubectl update nodes
I'm planning to implement this feature, let me know if you see any issues with this approach. @bgrant0607 @smarterclayton @ddysher @alex-mohr
@pravisankar Discussion on #1535, #2315, and #2726 is relevant.
Status must be completely reconstructable based on observations. In order to express that the desired state is "deactivated", there would need to be a field in NodeSpec that indicates this. There can additionally be a NodeCondition that reflects the status.
I imagine we'll eventually want several flavors of this:
Note that there's some amount of nuance in some of the above. We will eventually want to differentiate between different types of workloads, such as daemons vs. user pods.
I eventually want a custom control subresource to set whatever such fields we add, as discussed in #2726, but patch would work initially.
This issue has quite a bit of overlap with https://github.com/kubernetes/kubernetes/issues/5511 and https://github.com/kubernetes/kubernetes/issues/6080.
Is this satisfied by switching a kubelet to OperatorSwitchedOff=true or Accepting=false, if the node controller evicts pods from kubelets with that condition per #14140? If so, this will hopefully make it into 1.1.
FYI: Nomad is providing a draining = true/false status https://nomadproject.io/intro/getting-started/running.html
A Draining condition makes sense as long as it could be properly reset by the various controllers. For instance, OpenShift evacuation requires the unschedulable condition as a prerequisite, but it could easily set and clear a Draining condition.
See also discussion here: https://github.com/kubernetes/kubernetes/pull/14054#issuecomment-144229094
We also need to decide the division of draining responsibilities between kubelet, node controller, and rescheduler (#12140).
I think that the right way to do this is:
We can either do this in the node controller or in kubectl.
Starting with it in kubectl seems like the easiest approach.
@brendandburns
I'm ok with starting with a kubectl-based implementation and moving it to the server later #12143. We should use annotations to indicate what's going on.
Proposed commands:
kubectl lame <node>: set unschedulable to true
kubectl unlame <node>: set unschedulable to false (open to other command names)
kubectl drain <node>: do what you described
Related to #5511.
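For illustration only (the final command names and flags are still open), lame/unlame could start as thin wrappers around patching the existing unschedulable field:
kubectl patch node <node> -p '{"spec":{"unschedulable":true}}'    # lame
kubectl patch node <node> -p '{"spec":{"unschedulable":false}}'   # unlame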
@mikedanese I don't think this overlaps with #14054.
Note that setting unschedulable won't block DaemonSet nor other direct specification of nodeName. We'd need a new API field to disable admittance for that, perhaps like the disable enum I proposed in #14054: None, Scheduling, Admittance, Execution.
When moving to the server-side implementation for such planned drains, we should think about how to respect disruption policy along the lines of maxUnavailable and maxSurge: https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/experimental/v1alpha1/types.go#L256
We could just upstream the openshift cli command for evacuation - it does exactly what Brendan describes, but does have a few more bells and whistles for admins to work with pods.
@smarterclayton What command? https://github.com/openshift/origin/blob/master/pkg/cmd/cli/cli.go
@bgrant0607 part of oadm. That file is ocli. See comments starting here https://github.com/kubernetes/kubernetes/issues/6080#issuecomment-144870892
Sorry, I accidentally put some comments about this in #6080 a few days ago, but this issue is more appropriate. Here's the concatenation of what I had written:
Notes from a brief discussion with @brendandburns today:
Initial version (client-driven):
for (each node in sequence) {
  client marks node unschedulable
  client kills pods one at a time (triggering them to get their graceful deletion notice before being forcefully killed), doesn't kill next one until it reschedules
}
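A rough shell sketch of that loop for a single node, assuming pods are looked up by spec.nodeName and deleted with a grace period (the flags and the wait step are illustrative, not a final design):
node="$1"
# Mark the node unschedulable so the scheduler stops placing new pods on it.
kubectl patch node "$node" -p '{"spec":{"unschedulable":true}}'
# Kill the node's pods one at a time, letting each get its graceful deletion
# notice; a real implementation would wait for the owning controller to
# reschedule a replacement before moving on to the next pod.
kubectl get pods --all-namespaces --field-selector "spec.nodeName=$node" \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' |
while read -r ns name; do
  kubectl delete pod "$name" -n "$ns" --grace-period=30
  sleep 60   # crude stand-in for "don't kill the next pod until this one reschedules"
done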
Things we can do later to make it more sophisticated
@mikedanese mentioned that a form of this is already available in OpenShift: https://docs.openshift.com/enterprise/3.0/admin_guide/manage_nodes.html#evacuating-pods-on-nodes
(And the corresponding code is here: https://github.com/openshift/origin/blob/master/pkg/cmd/admin/node/evacuate.go)
Compared to what I described in the earlier comment, the main differences are
Note that setting unschedulable won't block DaemonSet nor other direct specification of nodeName.
We should make DaemonSet obey unschedulable, and we really should move away from allowing direct specification of nodeName by anything that isn't a scheduler (e.g. DaemonSet controller and the real scheduler). This issue is a good example of why allowing clients to set nodeName is a bad idea.
@davidopp quick question, these proposals don't mention anything of migration or forgiveness in the case where a pod could be using local storage for say something like HDFS.
Is the intent to defer till that time, or punt on data-gravity entirely?
Punting. I agree we need to consider those issues, but I assume that's not necessary until we have local volumes that are decoupled from pod lifetime (here is where I would cite the issue number for that feature if I had a half-decent memory).
Copying text from #14054:
There are a few behaviors we need to control:
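whether new pods get scheduled onto the node (Scheduling)
whether the kubelet accepts pods assigned to the node (Admittance)
whether pods already on the node continue to run (Execution)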
These are in a strict hierarchy. If pods shouldn't run, they also shouldn't be accepted, and if they aren't being accepted, they shouldn't be scheduled.
We'll need to control which pods are affected:
The behaviors will be triggered by different means:
We may need to distinguish graceful vs. immediate termination, also.
For now, we want to control whether any pods other than static pods can be run. I don't think we care whether termination is graceful or not, since we don't plan to flip the configuration dynamically.
We probably should represent the matrix, or at least the necessary subset of it, rather than just using a single condition and/or knob.
Configuration could look something like:
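ClusterPodSourceDisabled: {None, Scheduling, Admittance, Execution}
NodePodSourceDisabled: {None, Scheduling, Admittance, Execution}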
See also pkg/capabilities/capabilities.go.
@davidopp Making DaemonSet obey unschedulable is problematic. For instance, we'd like to use it to run kube-proxy in the future.
As the comment I just posted attempted to convey, we'll need more control than just a single bool.
I don't think we care whether termination is graceful or not, since we don't plan to flip the configuration dynamically.
Can you explain this more? Assuming we're talking about a procedure you'd use for things like kernel upgrades on bare metal (use case in first entry in this issue), it seems you would want graceful termination of the pods that are running on the machine.
We probably should represent the matrix, or at least the necessary subset of it, rather than just using a single condition and/or knob. Configuration could look something like: ClusterPodSourceDisabled: {None, Scheduling, Admittance, Execution} NodePodSourceDisabled: {None, Scheduling, Admittance, Execution}
I didn't fully understand this proposal.
(BTW, I am assuming all of these are set in the NodeSpec, i.e. they're separate from how the system indicates and responds to node health failure or kubelet config).
What is meant by "cluster" vs. "node" pod source? Is "node" static pods and "cluster" everything else?
Also, can you explain why you distinguish Scheduling vs. Admittance? While I agree that in general there may be situations where Kubelet might reject a pod the scheduler thought was OK, it seems that from the standpoint of explicitly setting schedulability top-down (i.e. via the API server), you'd always want them to be the same.
Making DaemonSet obey unschedulable is problematic. For instance, we'd like to use it to run kube-proxy in the future.
This is a good point but how do you generalize this? I have a hard time coming up with a reasonable name for "scheduling disabled except for daemons that should always be running"...
Isn't unschedulable implicitly about user requests for access to compute resources? Whereas daemon sets are about user requests to ensure hosts are running pods? Unschedulable doesn't stop nodeName being set today.
In an evacuation, unschedulable is really about ensuring that user compute isn't sent to this host, not about blocking an explicit request by an admin to, say, run a pod on that node that executes a command in the host pid namespace to kill a faulty daemon.
Isn't unschedulable implicitly about user requests for access to compute resources? Whereas daemon sets are about user requests to ensure hosts are running pods?
I see. So your argument is that daemon set should always ignore Unschedulable. That's reasonable. I was assuming that some things scheduled by daemon set should not ignore Unschedulable, but thinking about it more, I can't think of any example.
cc @mml
I have an implementation of the simplest version of this: drain a single node (starting with setting unschedulable), with an optional grace period, and a --force flag to force removal even if there are unreplicated pods. I think we should also include a convenient way to add a machine back to service when maintenance is done. kubectl undrain $node would be equivalent to kubectl patch node $node -p'{"spec":{"unschedulable":false}}'.
If an admin wants to operate on multiple nodes and they want to sleep between nodes (the crudest form of "safety"), I recommend
for node in $nodes; do
  kubectl drain --grace=900 $node
  sleep 300
done
Anyway, I don't have permission to edit assignee, but @mikedanese or @bgrant0607 can one of you assign this to me?
The rest of this comment might be outside the scope of this issue. It's the beginnings of a design for safety that's more sophisticated than sleep.
If we want more sophisticated "safety" than the sleep loop, we could offer two parameters: minimum shard strength and time between evictions. Eventually, these would be specified either as cluster policies or by the user when they create the pod (or pod template). However, we can probably get pretty far by simply exposing them as knobs to the cluster admin when they want to do maintenance. In this case, since kubectl needs to keep track of all the disruption it causes, we want all the nodes passed in at once:
kubectl drain \
  --min-shard-strength=0.75 \
  --min-seconds-between-evictions=900 \
  --grace=900 $nodes
--min-shard-strength is a value from 0 to 1. If the fraction of pods managed by a given RC with Ready=True drops below this value, we won't cause another eviction to that set. In addition, we always wait at least --min-seconds-between-evictions between subsequent evictions to the pods managed by a given RC.
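As an illustration only (the app=frontend label is made up), the shard strength for one RC's pods is roughly:
# Fraction of pods behind this RC's selector whose Ready condition is True.
total=$(kubectl get pods -l app=frontend --no-headers | wc -l)
ready=$(kubectl get pods -l app=frontend \
  -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' | grep -c True)
echo "shard strength: $ready/$total"   # skip further evictions if this ratio falls below --min-shard-strength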
Tried to assign this to @mml but couldn't do so.
@mml please accept your invite to the org at https://github.com/kubernetes so we can assign you issues
@mikedanese done thx
Re. shard strength, see maxUnavailable used by Deployment: https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/deployment.md
I would +1 some sort of way to mark a node as unschedulable except for daemons. Or perhaps marking a pod as something that can override the unschedulable flag.
:+1: This will really help with gracefully handling EC2 Spot Instance evictions. The "simplest" version @mml describes above is plenty for this use case. I would just do kubectl drain --grace=90 $this_node when the AWS API shows that eviction will happen in 2 minutes.
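A rough sketch of that, assuming the node polls the EC2 spot termination-time metadata endpoint (the hostname-to-node-name mapping and timings are assumptions):
node=$(hostname)
# Wait for the two-minute spot termination notice, then drain this node.
while ! curl -sf http://169.254.169.254/latest/meta-data/spot/termination-time >/dev/null; do
  sleep 5
done
kubectl drain --grace=90 "$node"   # flag spelled as in the drain proposal above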
For the next iteration, it would be nice to have a way to provide specific resources to drain. In the discussion about GPUs in #19049, I mentioned specific kernel driver ABIs. Draining a whole machine works, but if all we need is kicking out GPU users, maybe we could evict just the pods using the resource. Similar reasoning if you wanted to reformat attached SSDs, etc.
That could be achieved with taints and a rescheduler. At least taints will probably make it into 1.3. I think what we planned for 1.2 is complete.
ref/ #22217
When considering the maintenance mode use case, it'd be good to account for the ability to schedule the node drain. In this way, administrators may set a predefined maintenance window for specific nodes.
Are Jobs a good candidate to orchestrate a maintenance window once node drain is implemented?
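Possibly; as a rough sketch, once a drain command exists, a Job (or a CronJob for a recurring window) could simply run it from an image that ships kubectl, under a service account allowed to patch nodes and delete pods. The image name below is only an example:
kubectl create job drain-node-1 --image=bitnami/kubectl -- \
  kubectl drain node-1 --grace=900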
I haven't found a way of pausing/decommissioning a node, have all its containers stopped and recreated elsewhere in the cluster.
This would be great for node upgrades (hardware, OS, etc.).
Obviously, the node would have to be blacklisted so that no new containers are scheduled to it.
/cc @jmreicha