planetlabs / draino

Automatically cordon and drain Kubernetes nodes based on node conditions
Apache License 2.0

pods left in "unknown" state #44

Open trondhindenes opened 5 years ago

trondhindenes commented 5 years ago

We run draino with the following condition, among others:

- Ready=Unknown,3m  # Drain node if unavailable for 3 minutes
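For readers unfamiliar with the condition syntax, here is a minimal sketch that splits such a string into its parts. It assumes the format is `Type=Status,Duration`, as the comment above implies; `parse_condition` is a hypothetical helper written for illustration and is not Draino's own parser.

```shell
# Hypothetical helper: split "Type=Status,Duration" into its three parts.
# Runs in a subshell so its variables do not leak into the caller.
parse_condition() (
  spec=$1
  cond=${spec%%,*}                  # everything before the first comma
  case $spec in
    *,*) duration=${spec#*,} ;;     # everything after the first comma
    *)   duration= ;;               # no duration given
  esac
  type=${cond%%=*}                  # everything before the first "="
  case $cond in
    *=*) status=${cond#*=} ;;       # everything after the first "="
    *)   status= ;;                 # no status given
  esac
  printf 'type=%s status=%s duration=%s\n' "$type" "$status" "$duration"
)

parse_condition "Ready=Unknown,3m"
```

Read this way, the condition above targets nodes whose Ready condition has been Unknown for three minutes.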

I simulated a node failure by stopping and disabling the kubelet on a node. Draino cordons the node as expected, but we're noticing that pods on that node are left in an "Unknown" state since the kubelet is gone:

$ kubectl get pods
NAME                                                   READY   STATUS    RESTARTS   AGE
rikstv-api-recommendation-v1-uat-main-6d9b46bf-9gj44   4/4     Running   1          20m
rikstv-api-recommendation-v1-uat-main-6d9b46bf-k89hz   4/4     Unknown   0          149m
rikstv-api-recommendation-v1-uat-main-6d9b46bf-t84ns   4/4     Running   1          20m
rikstv-api-recommendation-v1-uat-main-6d9b46bf-vdj2g   4/4     Unknown   0          149m

This in turn seems to trick cluster-autoscaler into not removing the failed node. I'm not sure Draino should be responsible for cleaning up pod objects on crashed nodes, but I thought I'd ask here anyway, since it's probably a typical situation Draino users can get into.

Does Draino have, or are there plans for, functionality to clean up failed nodes so that cluster-autoscaler can delete them cleanly?

negz commented 5 years ago

Is there any chance you could reproduce this and dump the YAML (i.e. kubectl get -o yaml) of one of the pods in an Unknown state? I'm wondering if there's a finalizer or something keeping them around.
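One possible way to collect that, sketched under the assumption that the default `kubectl get pods` table layout is used (NAME in column 1, STATUS in column 3). `unknown_pods` is a hypothetical helper name, not part of kubectl or Draino.

```shell
# Hypothetical helper: read `kubectl get pods` table output on stdin and
# print the names of pods whose STATUS column is "Unknown".
unknown_pods() {
  # NR > 1 skips the header row; $3 is the STATUS column.
  awk 'NR > 1 && $3 == "Unknown" { print $1 }'
}
```

Piping it as `kubectl get pods | unknown_pods | xargs -n1 kubectl get pod -o yaml` would dump each stuck pod. In the output, `metadata.finalizers` and `metadata.deletionTimestamp` are the fields to check for the finalizer theory.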

We don't have anything planned around this, but I do think it's a use case we should address if there's a clean way to do so.

trondhindenes commented 5 years ago

Nice! I think this just happens when the kubelet is stopped. Replacement pods are spun up on new nodes, but the "orphaned" pods are still stuck. We're doing some testing around this over the coming days, and I will update this issue with an example pod when I get a chance.
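Until there is a cleaner answer, a manual workaround sometimes used for pods stuck like this is a force delete. The sketch below only prints the commands so they can be reviewed first. `print_force_delete` is a hypothetical helper, and nothing in this thread confirms that force deletion is the right fix here.

```shell
# Hypothetical helper: read pod names on stdin (one per line) and print,
# without running, a force-delete command for each.
print_force_delete() {
  while IFS= read -r pod; do
    [ -n "$pod" ] || continue   # skip blank lines
    printf 'kubectl delete pod %s --grace-period=0 --force\n' "$pod"
  done
}
```

A force delete removes the pod object from the API server without waiting for the kubelet to confirm termination, so it is only reasonable when the node really is gone.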

prabhatnagpal commented 5 years ago

Please help me out; I am not able to make Draino work. Even when a node is in the Unknown state, Draino doesn't drain it. I used kops to spin up a cluster with 1 master and 2 nodes. I push a node into the Unknown state by running this script on it:

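#!/bin/bash
# Loop forever, launching run.sh in the background on each pass.
for (( ; ; ))
do
    echo "Press CTRL+C to stop............................................."
    # nohup keeps each copy running after the launching shell exits.
    nohup ./run.sh &
done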

My draino.yaml is:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels: {component: draino}
  name: draino
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels: {component: draino}
  name: draino
rules:
- apiGroups: ['']
  resources: [events]
  verbs: [create, patch, update]
- apiGroups: ['']
  resources: [nodes]
  verbs: [get, watch, list, update]
- apiGroups: ['']
  resources: [nodes/status]
  verbs: [patch]
- apiGroups: ['']
  resources: [pods]
  verbs: [get, watch, list]
- apiGroups: ['']
  resources: [pods/eviction]
  verbs: [create]
- apiGroups: [extensions]
  resources: [daemonsets]
  verbs: [get, watch, list]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels: {component: draino}
  name: draino
roleRef: {apiGroup: rbac.authorization.k8s.io, kind: ClusterRole, name: draino}
subjects:
- {kind: ServiceAccount, name: draino, namespace: kube-system}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels: {component: draino}
  name: draino
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels: {component: draino}
  template:
    metadata:
      labels: {component: draino}
      name: draino
      namespace: kube-system
    spec:
      containers:
      - name: draino
        image: planetlabs/draino:5e07e93
        command:
        - /draino
        - --debug
        - Ready=Unknown
        livenessProbe:
          httpGet: {path: /healthz, port: 10002}
          initialDelaySeconds: 30
      serviceAccountName: draino

negz commented 5 years ago

@prabhatnagpal I suggest raising a separate GitHub issue for your problems, and including any logs and metrics Draino emits so we can try to help you out.