pires commented 9 years ago

I haven't found a way of pausing/decommissioning a node, have all its containers stopped and recreated elsewhere in the cluster.

This would be great for node upgrades (hardware, OS, etc.).

Obviously, the node would have to be blacklisted so that no new containers are scheduled to it.

/cc @jmreicha

erictune commented 9 years ago

The pods would have to have replication controllers for that to work. @ddysher can say if decommisioning is currently possible, or what is planned, and how that would interact with stopping of pods.

pires commented 9 years ago

Yes, having replication controllers is always the recommended way, but since Kubernetes allows for manual pod scheduling (with a pod descriptor instead of a replica controller descriptor) I think it would also be nice to support something like:

Node is marked as scheduled for decommission
Scheduler (?) is informed about this and reschedules pod(s) to other node(s)

ddysher commented 9 years ago

We are not there yet. The best way is to use replication controllers for now.

After PR #3733 gets landed, I'll work on node lifecycle management and decommission. Node has a Terminated phase which is when k8s tries to re-schedule pods. This is for your first point.

For the second point, it'll be a little subtle to have all other components to understand node phase, especially the interactions between node controller, replication controller, and scheduler. I think the best way is to have node controller marks all pods as unscheduled, which triggers scheduler to re-schedule them. From the perspective of replication controller, the pod never fails, so we don't have to distinguish manual pod vs. controller-managed pods. There are probably more involved here, like restart policy, volumn, etc, will think it more.

erictune commented 9 years ago

I don't think we want to do the second point. Pods should not come back to life on a different node after they have been stopped on one node. We want the phase state machine to be a DAG not a general graph.

alex-mohr commented 9 years ago

I personally think there's a reasonably-big usability gain from allowing pods to declare a restart policy that includes node failure and little downside. We already allow pods that are scheduled on a machine to restart as new containers if they crash, so the phase state machine already has cycles. And it seems strange to allow restarts only if containers on a node fail, but ban them when they'd start on a new node.

As a not-so-strawman example, setting "onNodeFailure: reschedule" seems like a reasonable user request.

ddysher commented 9 years ago

Do we have cycles in current state machine? Restarting containers seems to be a cycle, by containers are not primitive in k8s. We do create Pod on failure from replication controller, but the newly created Pods are not the same entities as previous ones, they have different UIDs at least.

What the second point (re-schedule) really sounds like is migrating the Pods from failed node to new node. This is tricky, not to mention volumn, sessions. I have no objections to the DAG approach, but I agree with @alex-mohr that we need to do something here for usability. To do this, we'll at least need to distinguish manually-created Pod and replication-created Pod, maybe reverse label lookup? I don't know if that ever works.

bgrant0607 commented 9 years ago

We need to be able to drain machines. We should use an approach consistent with other resources. See #1535 re. graceful termination.

bgrant0607 commented 9 years ago

@alex-mohr We've discussed this before. Pods are replaced by replication controllers. They are not rescheduled. In fact, I'd like the replication controller to be able to replace pods ahead of killing the ones they are replacing in the case of a planned move. Let's not derail this issue with the "forever pod" discussion.

At the moment, the system doesn't have reasonable behavior in the cases of either planned or unplanned node outages. Let's fix that -- #1366 covers the unplanned case. This issue can cover the planned case.

bgrant0607 commented 9 years ago

@ddysher Why do you want reverse label lookup? The node controller needn't be aware of pod controllers -- replication controller, job controller, etc.

jmreicha commented 9 years ago

It would be great to have something in kubectl that allowed you to mark a node to be pulled out of rotation (and then showed up with kubectl get minion) and then drained after being marked for deletion or maintenance.

I ran into an issue like this the other day where I needed to rotate out some hosts for maintenance and had to manually remove pods after stopping the servers.

ddysher commented 9 years ago

@bgrant0607 The reason I'm trying to do so is the possible overlapping functions between node controller and replication controller or job controller.

If all pods are started with replication/job controller, then node controller just needs to remove the pods. But cases like the issue where pods are started without any controller, node controller should be responsible to remove them and recreate them elsewhere. The restart part seems to be a duplicate function, ie. node controller and replication controller will alll try to create a new pod.

If node controller just removes the pod, then this seems to break our restart policy. A user would want a pod with restartalway to always restart even in case of node failure.

Did I interpret it correctly? I can't recall of any component that would claim 'ownership' of those pods.

bgrant0607 commented 9 years ago

@ddysher No, the node controller should never recreate pods elsewhere. That's not its job. Users that want that behavior need to run the pods under a pod controller, such as the replication controller. No, it doesn't invalidate restart policy -- separation of concerns. Individual pods are vulnerable to node failure -- that's reality and the model. See #3949 for more details.

ddysher commented 9 years ago

The model makes node controller much easier, and that's definitely a good thing. But from a user's perspective (not from how we design/simplify the system), node failure without Pod restart is really confusing. We haven't stressed enough about pod controller, even our classic example creates naked pod, as you mentioned in #1603.

Here I'm not saying we should recreate pods, just bring up a potential issue if we don't do so :)

bgrant0607 commented 9 years ago

Yes. we should fix our broken examples.

pires commented 9 years ago

I think you should remove the notion of a pod without a replication controller. It would simplify the possible scenarios. On Feb 4, 2015 4:47 AM, "Brian Grant" notifications@github.com wrote:

Yes. we should fix our broken examples.

— Reply to this email directly or view it on GitHub https://github.com/GoogleCloudPlatform/kubernetes/issues/3885#issuecomment-72785808 .

bgrant0607 commented 9 years ago

@pires Been there, done that. Pod needs to be an available primitive. One reason is that we plan to support multiple types of controllers.

pravisankar commented 9 years ago

I think ability to mark the node as deactived/decommissioned using kubectl gives flexibility to the user/admin to do node upgrades(security patches, software upgrades), node evacuation/custom pod migration. To support this use case, may be we can add a new condition on the node, say 'NodeDeactivate'. When 'NodeDeactivate' status is set(Full), irrespective of NodeReady/NodeReachable condition status, scheduler can ignore this node for new pod creation. CLI can be: Node Deactivation: kubectl update nodes --patch={'apiversion': , 'status': {'conditions': [{'kind': 'Deactivate', 'status': 'Full'}]}}

Node Activation: kubectl update nodes --patch={'apiversion': , 'status': {'conditions': [{'kind': 'Deactivate', 'status': 'None'}]}}

I'm planning to implement this feature, let me know if you see any issues with this approach. @bgrant0607 @smarterclayton @ddysher @alex-mohr

bgrant0607 commented 9 years ago

@pravisankar Discussion on #1535, #2315, and #2726 is relevant.

Status must be completely reconstructable based on observations. In order to express that the desired state is "deactivated", there would need to be a field in NodeSpec that indicates this. There can additionally be a NodeCondition that reflects the status.

I imagine we'll eventually want several flavors of this:

stop: gracefully terminate and remove the node -- we need to define what that means, but I'd like some reasonably consistent, useful definition of this for every object; we'll at least want to wait for pod pre-stop hooks to complete
unschedulable: simply don't schedule new pods/volumes, but don't delete existing ones
uninhabitable: evict existing pods/volumes
shutdown: actually do something physical to the machine, like invoking shutdown

Note that there's some amount of nuance in some of the above. We will eventually want to differentiate between different types of workloads, such as daemons vs. user pods.

I eventually want a custom control subresource to set whatever such fields we add, as discussed in #2726, but patch would work initially.

roberthbailey commented 9 years ago

This issue has quite a bit of overlap with https://github.com/kubernetes/kubernetes/issues/5511 and https://github.com/kubernetes/kubernetes/issues/6080.

mikedanese commented 9 years ago

Is this satisfied by switching a kubelet to OperatorSwitchedOff=true or Accepting=false if the node controller evicts pods from kubelet's with that condition per #14140? If so, this will hopefully make it into 1.1

sghosh151 commented 9 years ago

FYI: Nomad is providing a draining = true/false status https://nomadproject.io/intro/getting-started/running.html

smarterclayton commented 9 years ago

A Draining condition makes sense as long as it could be properly reset by the various controllers. For instance, OpenShift evacuation prereqs unschedulable condition, but could easily set and clear a Draining.

bgrant0607 commented 9 years ago

We also need to decide the division of draining responsibilities between kubelet, node controller, and rescheduler (#12140).

brendandburns commented 9 years ago

I think that the right way to do this is:

Mark a node as unschedulable
One by one delete all pods on that node, if the pod is part of a replication controller, wait for that replication controller to come back to full size.

We can either do this in the node controller or in kubectl.

Starting with it in kubectl seems like the easiest approach.

bgrant0607 commented 9 years ago

@brendandburns

I'm ok with starting with a kubectl-based implementation and moving it to the server later #12143. We should use annotations to indicate what's going on.

Proposed commands:

kubectl lame <node>: set unschedulable to true
kubectl unlame <node>: set unschedulable to false (open to other command names)
kubectl drain <node>: do what you described

Related to #5511.

@mikedanese I don't think this overlaps with #14054.

Note that setting unschedulable won't block DaemonSet nor other direct specification of nodeName. We'd need a new API field to disable admittance for that, perhaps like the disable enum I proposed in #14054: None, Scheduling, Admittance, Execution.

When moving the server-side implementation, for such planned drains, we should think about how to respect disruption policy along the lines of maxUnvailable and maxSurge: https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/experimental/v1alpha1/types.go#L256

smarterclayton commented 9 years ago

We could just upstream the openshift cli command for evacuation - it does exactly what Brendan describes, but does have a few more bells and whistles for admins to work with pods.

bgrant0607 commented 9 years ago

@smarterclayton What command? https://github.com/openshift/origin/blob/master/pkg/cmd/cli/cli.go

mikedanese commented 9 years ago

@bgrant0607 part of oadm. That file is ocli. See comments starting here https://github.com/kubernetes/kubernetes/issues/6080#issuecomment-144870892

davidopp commented 9 years ago

Sorry, I accidentally put some comments about this in #6080 a few days ago, but this issue is more appropriate. Here's the concatenation of what I had written:

Notes from a brief discussion with @brendandburns today:

Initial version (client-driven):

for (each node in sequence) {
    client marks node unschedulable
    client kills pods one at a time (triggering them to get their graceful deletion notice before being forcefully killed), doesn't kill next one until it reschedules
}

Things we can do later to make it more sophisticated

move it to server-side (i.e. somehow tell the server "I want you to shut down machine X" and it does the orchestration itself; the request could be represented by a new field in NodeSpec and handled by node controller, or whatever)
have the killing respect collection-level (e.g. per-service) SLOs rather than requiring each pod reschedule before moving on to the next one, which would allow us to kill more aggressively (see #12611)

@mikedanese mentioned that a form of this is already available in OpenShift: https://docs.openshift.com/enterprise/3.0/admin_guide/manage_nodes.html#evacuating-pods-on-nodes

(And the corresponding code is here: https://github.com/openshift/origin/blob/master/pkg/cmd/admin/node/evacuate.go)

Compared to what I described in the earlier comment, the main differences are

oadm takes list of nodes on the command line, rather that iterating through all nodes
oadm checks to make sure the node has already been marked unschedulable, rather than marking the node unschedulable itself
oadm inserts no delays between killing, either between killing pods on a machine or between machines
oadm skips (doesn't kill) pods that aren't managed by an RC, unless you use --force option

davidopp commented 9 years ago

Note that setting unschedulable won't block DaemonSet nor other direct specification of nodeName.

We should make DaemonSet obey unschedulable, and we really should move away from allowing direct specification of nodeName by anything that isn't a scheduler (e.g. DaemonSet controller and the real scheduler). This issue is a good example of why allowing clients to set nodeName is a bad idea.

timothysc commented 9 years ago

@davidopp quick question, these proposals don't mention anything of migration or forgiveness in the case where a pod could be using local storage for say something like HDFS.

Is the intent to defer till that time, or punt on data-gravity entirely?

davidopp commented 9 years ago

Punting. I agree we need to consider those issues, but I assume that's not necessary until we have local volumes that are decoupled from pod lifetime (here is where I would cite the issue number for that feature if I had a half-decent memory).

bgrant0607 commented 9 years ago

Copying text from #14054:

There are a few behaviors we need to control:

Scheduler placement of new pods
Kubelet acceptance/rejection of new pods
Executing/running pods will be killed (or not)

These are in a strict hierarchy. If pods shouldn't run, they also shouldn't be accepted, and if they aren't being accepted, they shouldn't be scheduled.

We'll need to control which pods are affected:

Scheduled pods
Specified using nodeName (by daemon controller, user, etc.)
Static pods
taking forgiveness into account
taking disruption quota/SLO into account
by some characterization of the workload ("priority", labels, etc.)
etc.

The behaviors will be triggered by different means:

NodeSpec (e.g., unschedulable)
Kubelet config
Node reachability / readiness / other health status/conditions (e.g., Ready)

We may need to distinguish graceful vs. immediate termination, also.

For now, we want to whether any pods other than static pods can be run. I don't think we care whether termination is graceful or not, since we don't plan to flip the configuration dynamically.

We probably should represent the matrix, or at least the necessary subset of it, rather than just using a single condition and/or knob.

Configuration could look something like:

ClusterPodSourceDisabled: {None, Scheduling, Admittance, Execution}
NodePodSourceDisabled: {None, Scheduling, Admittance, Execution}

See also pkg/capabilities/capabilities.go.

bgrant0607 commented 9 years ago

@davidopp Making DaemonSet obey unschedulable is problematic. For instance, we'd like to use it to run kube-proxy in the future.

As the comment I just posted attempted to convey, we'll need more control than just a single bool.

davidopp commented 9 years ago

I don't think we care whether termination is graceful or not, since we don't plan to flip the configuration dynamically.

Can you explain this more? Assuming we're talking about a procedure you'd use for things like kernel upgrades on bare metal (use case in first entry in this issue), it seems you would want graceful termination of the pods that are running on the machine.

We probably should represent the matrix, or at least the necessary subset of it, rather than just using a single condition and/or knob. Configuration could look something like: ClusterPodSourceDisabled: {None, Scheduling, Admittance, Execution} NodePodSourceDisabled: {None, Scheduling, Admittance, Execution}

I didn't fully understand this proposal.

(BTW, I am assuming all of these are set in the NodeSpec, i.e. they're separate from how the system indicates and responds to node health failure or kubelet config).

What is meant by "cluster" vs. "node" pod source? Is "node" static pods and "cluster" everything else?

Also, can you explain why you distinguish Scheduling vs. Admittance? While I agree that in general there may be situations where Kubelet might reject a pod the scheduler thought was OK, it seems that from the standpoint of explicitly setting schedulability top-down (i.e. via the API server), you'd always want them to be the same.

Making DaemonSet obey unschedulable is problematic. For instance, we'd like to use it to run kube-proxy in the future.

This is a good point but how do you generalize this? I have a hard time coming up with a reasonable name for "scheduling disabled except for daemons that should always be running"...

smarterclayton commented 9 years ago

Isn't unschedulable implicitly about user requests for access to compute resources? Whereas daemon sets are about user requests to ensure hosts are running pods? Unschedulable doesn't stop nodeName being set today.

In an evacuation, unschedulable is really about ensuring that user compute isn't sent to this host, but not that an explicit request by an admin, to say, run a pod on that node that executes a command in the host pid namespace to kill a faulty daemon.

On Oct 7, 2015, at 2:51 AM, David Oppenheimer notifications@github.com wrote:

I don't think we care whether termination is graceful or not, since we don't plan to flip the configuration dynamically.

Can you explain this more? Assuming we're talking about a procedure you'd use for things like kernel upgrades on bare metal (use case in first entry in this issue), it seems you would want graceful termination of the pods that are running on the machine.

We probably should represent the matrix, or at least the necessary subset of it, rather than just using a single condition and/or knob. Configuration could look something like: ClusterPodSourceDisabled: {None, Scheduling, Admittance, Execution} NodePodSourceDisabled: {None, Scheduling, Admittance, Execution}

I didn't fully understand this proposal.

(BTW, I am assuming all of these are set in the NodeSpec, i.e. they're separate from how the system indicates and responds to node health failure or kubelet config).

What is meant by "cluster" vs. "node" pod source? Is "node" static pods and "cluster" everything else?

Also, can you explain why you distinguish Scheduling vs. Admittance? While I agree that in general there may be situations where Kubelet might reject a pod the scheduler thought was OK, it seems that from the standpoint of explicitly setting schedulability top-down (i.e. via the API server), you'd always want them to be the same.

Making DaemonSet obey unschedulable is problematic. For instance, we'd like to use it to run kube-proxy in the future.

This is a good point but how do you generalize this? I have a hard time coming up with a reasonable name for "scheduling disabled except for daemons that should always be running"...

— Reply to this email directly or view it on GitHub https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146095365 .

davidopp commented 9 years ago

Isn't unschedulable implicitly about user requests for access to compute resources? Whereas daemon sets are about user requests to ensure hosts are running pods?

I see. So your argument is that daemon set should always ignore Unschedulable. That's reasonable. I was assuming that some things scheduled by daemon set should not ignore Unscheduleable, but thinking about it more, I can't think of any example.

bgrant0607 commented 9 years ago

cc @mml

mml commented 9 years ago

I have an implementation of the simplest version of this: drain a single node (starting with setting unschedulable), with optional grace period, and a --force flag to force removal even if there are unreplicated pods. I think we should also include a convenient way to add a machine back to service when maintenance is done. kubectl undrain $node would be equivalent to kubectl patch node $node -p'{"spec":{"unschedulable":false}}'.

If an admin wants to operate on multiple nodes and they want to sleep between nodes (the crudest form of "safety"), I recommend

  for node in $nodes; do
    kubectl drain --grace=900 $node                                                                                                                                                                    
    sleep 300 
  done

Anyway, I don't have permission to edit assignee, but @mikedanese or @bgrant0607 can one of you assign this to me?

The rest of this comment might be outside the scope of this issue. It's the beginnings of a design for safety that's more sophisticated than sleep.

If we want more sophisticated "safety" than the sleep loop, we could offer two parameters: minimum shard strength and time between evictions. Eventually, these would be specified either as cluster policies or specified by the user when they create the pod (or pod template). However, we can probably get pretty far by simply exposing them as knobs to the cluster admin when they want to do maintenance. In this case, since kubectl needs to keep track of all the disruption it causes, we want all the nodes passed in at once:

  kubectl drain \
    --min-shard-strength=0.75 \
    --min-seconds-between-evictions=900 \
    --grace=900 $nodes

--min-shard-strength is a value from 0 to 1. If the fraction of pods managed by a given RC with Ready=True drops below this value, we won't cause another eviction to that set. In addition, we always wait at least --min-seconds-between-evictions between subsequent evictions to the pods managed by a given RC.

janetkuo commented 9 years ago

Trying to assign this to @mml but couldn't do so.

mikedanese commented 9 years ago

@mml please accept your invite to the org at https://github.com/kubernetes so we can assign you issues

mml commented 9 years ago

@mikedanese done thx

bgrant0607 commented 9 years ago

Re. shard strength, see maxUnavailable used by Deployment: https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/deployment.md

12611

paralin commented 9 years ago

I would +1 some sort of way to mark a node as unscheduleable except for daemons. Or perhaps, marking a pod as something that can override the unschedulable flag.

dnelson commented 9 years ago

:+1: This will really help with gracefully handling EC2 Spot Instance evictions. The "simplest" version @mml describes above is plenty for this use case. I would just do kubectl drain --grace=90 $this_node when the AWS API shows that eviction will happen in 2 minutes.

therc commented 8 years ago

For the next iteration, it would be nice to have a way to provide specific resources to drain. In the discussion about GPUs in #19049, I mentioned specific kernel driver ABIs. Draining a whole machine works, but if all we need is kicking out GPU users, maybe we could evict just the pods using the resource. Similar reasoning if you wanted to reformat attached SSDs, etc.

mikedanese commented 8 years ago

That could be achieved with taints and a rescheduler. At least taints will make probably it into 1.3. I think what we planned for 1.2 is complete.

https://github.com/kubernetes/kubernetes/issues/17190

davidopp commented 8 years ago

ref/ #22217

leecalcote commented 8 years ago

When considering the maintenance mode use case, it'd be good to account for the ability to schedule the node drain. In this way, administrators may set a predefined maintenance window for specific nodes.

Are Jobs a good candidate to orchestrate a maintenance window once node drain is implemented?

kubernetes / kubernetes

Mark node to be decommissioned and act accordingly #3885

12611