kubecost / cluster-turndown

Automated turndown of Kubernetes clusters on specific schedules.
Apache License 2.0

Turndown fails on non-autoscaling clusters when PodDisruptionBudget(s) exist and eviction is possible #38

Open michaelmdresser opened 2 years ago

michaelmdresser commented 2 years ago

Description

On non-autoscaling clusters where eviction is available, cluster-turndown attempts to evict Pods as part of the "Drain" process. After draining is finished, the node pool is supposed to be scaled down. If a PDB exists in the cluster with a minAvailable > 0, there will be at least one un-evictable pod, meaning draining will never finish.

The eviction logic has an infinite loop that continuously retries any eviction failing with a non-nil error that is neither IsNotFound nor IsTooManyRequests. I added a log statement and got this error on my dev cluster from the PolicyV1beta1().Evictions().Evict() call:

I0216 16:12:35.055635       1 draininator.go:396] Evicting in namespace 'guestbook-with-pdb' pod 'frontend-6b6c9c585d-chsvc' failed: Cannot evict pod as it would violate the pod's disruption budget.

"Cannot evict pod as it would violate the pod's disruption budget" is the error being returned from .Evict().
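
For illustration, here is a minimal sketch (a hypothetical Go helper, not the actual draininator code) of what a bounded retry around the eviction call could look like, roughly following the upstream kubectl pattern of retrying only while the API answers 429 because a PDB blocks the eviction. It assumes a recent client-go where Evict takes a context:

package drainsketch

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictWithRetryLimit is a hypothetical, bounded variant of an eviction retry
// loop: it retries only while the API keeps answering 429 (the PDB currently
// blocks the eviction) and gives up after maxRetries instead of spinning forever.
func evictWithRetryLimit(ctx context.Context, client kubernetes.Interface, pod corev1.Pod, maxRetries int) error {
	eviction := &policyv1beta1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
	}
	var lastErr error
	for i := 0; i < maxRetries; i++ {
		err := client.PolicyV1beta1().Evictions(pod.Namespace).Evict(ctx, eviction)
		switch {
		case err == nil, apierrors.IsNotFound(err):
			return nil // evicted, or the pod is already gone
		case apierrors.IsTooManyRequests(err):
			lastErr = err // PDB rejected the eviction; wait and retry
			time.Sleep(5 * time.Second)
		default:
			return err // any other error is fatal for this pod
		}
	}
	return fmt.Errorf("gave up evicting %s/%s after %d attempts: %w", pod.Namespace, pod.Name, maxRetries, lastErr)
}

A cap like maxRetries is essentially the "at the very least" retry limit suggested under Possible solutions below.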

Reproduce

Create GKE cluster

gcloud container clusters create \
    turndown-pdb-bug \
    --zone "us-central1-b" \
    --project "---PROJECTIDHERE---" \
    --num-nodes 3

Create a deployment and a PDB with a non-zero minAvailable. (With replicas: 3 and minAvailable: 2, only one pod may be unavailable at a time, so once evicted pods can no longer be rescheduled, further evictions are rejected.)

SETUPYAML=$(cat <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nginx
EOF
)

echo "$SETUPYAML" | kubectl apply -f -

kubectl get deployment

NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3/3     3            3           42s

kubectl get pdb

NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
nginx-pdb   2               N/A               1                     43s

Put turndown in the cluster

bash ./scripts/gke-create-service-key.sh <yourproject> <servicekeyname>

kubectl get secret -n turndown

NAME                           TYPE                                  DATA   AGE
cluster-turndown-service-key   Opaque                                1      27s
default-token-vspf9            kubernetes.io/service-account-token   3      28s

kubectl apply -f ./artifacts/cluster-turndown-full.yaml

kubectl get pod -n turndown

NAME                                READY   STATUS    RESTARTS   AGE
cluster-turndown-7c7c7bcc74-2k5g4   1/1     Running   0          20s

Create a turndown schedule that will trigger soon

SCHEDULE=$(cat <<EOF
apiVersion: kubecost.k8s.io/v1alpha1
kind: TurndownSchedule
metadata:
  name: turndown-pdb-bug-test-schedule
  finalizers:
  - "finalizer.kubecost.k8s.io"
spec:
  start: 2022-02-16T17:35:00Z
  end: 2022-02-16T18:00:00Z
  repeat: daily
EOF
)

echo "$SCHEDULE" | kubectl apply -f -

kubectl get tds

NAME                             STATE             NEXT TURNDOWN          NEXT TURN UP
turndown-pdb-bug-test-schedule   ScheduleSuccess   2022-02-16T17:35:00Z   2022-02-16T18:00:00Z

Wait for turndown to start and finish. Note in the logs that at least one node never finishes draining.

date -u --rfc-3339=seconds
echo

kubectl logs -n turndown -l app=cluster-turndown --tail=-1 | \
    grep -i 'Draininator' | \
    grep 'Draining Node\|Cordoning Node\|Drained Successfully'

2022-02-16 18:04:53+00:00

I0216 17:37:30.466494       1 namedlogger.go:24] [Draininator] Draining Node: gke-turndown-pdb-bug-default-pool-75d619d4-2wt5
I0216 17:37:30.466827       1 namedlogger.go:32]   [Draininator] Cordoning Node: gke-turndown-pdb-bug-default-pool-75d619d4-2wt5
I0216 17:37:46.155634       1 namedlogger.go:24] [Draininator] Node: gke-turndown-pdb-bug-default-pool-75d619d4-2wt5 was Drained Successfully
I0216 17:37:46.155643       1 namedlogger.go:24] [Draininator] Draining Node: gke-turndown-pdb-bug-default-pool-75d619d4-4msm
I0216 17:37:46.155647       1 namedlogger.go:32]   [Draininator] Cordoning Node: gke-turndown-pdb-bug-default-pool-75d619d4-4msm
I0216 17:38:11.992618       1 namedlogger.go:24] [Draininator] Node: gke-turndown-pdb-bug-default-pool-75d619d4-4msm was Drained Successfully
I0216 17:38:11.992975       1 namedlogger.go:24] [Draininator] Draining Node: gke-turndown-pdb-bug-default-pool-75d619d4-s2h9
I0216 17:38:11.993227       1 namedlogger.go:32]   [Draininator] Cordoning Node: gke-turndown-pdb-bug-default-pool-75d619d4-s2h9

Note in “kubectl get pods” that 2 of the deployment pods are still running and one is unschedulable.

kubectl get pods

NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-66b6c48dd5-9ct6l   0/1     Pending   0          26m
nginx-deployment-66b6c48dd5-lzs74   1/1     Running   0          26m
nginx-deployment-66b6c48dd5-wnqpp   1/1     Running   0          27m

Note in “kubectl get nodes” that, after scale-up should have happened, we still have the turndown node and the 3 regular nodes are sitting around cordoned (Ready,SchedulingDisabled).

date -u --rfc-3339=seconds
echo

kubectl get nodes

2022-02-16 18:05:16+00:00

NAME                                                  STATUS                     ROLES    AGE   VERSION
gke-turndown-pdb-bug-cluster-turndown-2ecd8eb5-pp8t   Ready                      <none>   29m   v1.21.6-gke.1500
gke-turndown-pdb-bug-default-pool-75d619d4-2wt5       Ready,SchedulingDisabled   <none>   58m   v1.21.6-gke.1500
gke-turndown-pdb-bug-default-pool-75d619d4-4msm       Ready,SchedulingDisabled   <none>   58m   v1.21.6-gke.1500
gke-turndown-pdb-bug-default-pool-75d619d4-s2h9       Ready,SchedulingDisabled   <none>   58m   v1.21.6-gke.1500

Possible solutions

At the very least, we should have a retry limit on evictions so there isn't an infinite loop that makes the turndown pod hang.

Real solutions could involve some sort of "force deletion" or notifying the user of the PDB's presence and asking them to make a modification.
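
As a rough illustration of the "notify the user" option, a hypothetical pre-drain check (not something cluster-turndown does today) could list PDBs and warn about any that currently allow zero disruptions before draining starts:

package drainsketch

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// warnAboutBlockingPDBs lists every PodDisruptionBudget in the cluster and logs
// the ones whose status currently allows no disruptions, since evictions for the
// pods they cover will be rejected during a drain.
func warnAboutBlockingPDBs(ctx context.Context, client kubernetes.Interface) error {
	pdbs, err := client.PolicyV1().PodDisruptionBudgets(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, pdb := range pdbs.Items {
		if pdb.Status.DisruptionsAllowed == 0 {
			log.Printf("PDB %s/%s currently allows 0 disruptions; turndown may hang until it is relaxed or removed",
				pdb.Namespace, pdb.Name)
		}
	}
	return nil
}

Note that a PDB which still allows some disruptions can also end up blocking a full drain on a non-autoscaling cluster (as in this repro), so a check like this can only warn, not guarantee the drain will complete.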

mbolt35 commented 2 years ago

Ok, when I wrote the drain functionality, I referenced the kubectl source code for cordoning a node. My guess is that the code just needs updating to support the newer Kubernetes features/functionality. For reference, here's the latest kubectl drain code that would be relevant for this fix (hint: it looks like there were some changes to handle an order of magnitude more error conditions).

https://github.com/kubernetes/kubectl/blob/a4aec62157e9fd73a038c9aab36822707277f00c/pkg/drain/drain.go#L272-L377
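
For a sense of the shape of that code (a loose sketch only, not a copy of kubectl's implementation): each pod is evicted in its own goroutine and everything shares one context with a global timeout, so a single PDB-blocked pod can't hang the drain forever. The evictPod parameter below is a hypothetical per-pod helper that is expected to respect context cancellation:

package drainsketch

import (
	"context"
	"fmt"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

// evictPodsWithGlobalTimeout loosely mirrors the structure of kubectl drain's
// evictPods: one goroutine per pod, all sharing a context with a global timeout,
// so the overall drain fails cleanly instead of looping forever.
func evictPodsWithGlobalTimeout(parent context.Context, pods []corev1.Pod, globalTimeout time.Duration,
	evictPod func(context.Context, corev1.Pod) error) error {

	ctx, cancel := context.WithTimeout(parent, globalTimeout)
	defer cancel()

	errCh := make(chan error, len(pods))
	var wg sync.WaitGroup
	for _, pod := range pods {
		wg.Add(1)
		go func(pod corev1.Pod) {
			defer wg.Done()
			errCh <- evictPod(ctx, pod)
		}(pod)
	}
	wg.Wait()
	close(errCh)

	var errs []error
	for err := range errCh {
		if err != nil {
			errs = append(errs, err)
		}
	}
	if ctx.Err() == context.DeadlineExceeded {
		errs = append(errs, fmt.Errorf("drain did not finish within the global timeout of %s", globalTimeout))
	}
	return utilerrors.NewAggregate(errs)
}

kubectl additionally waits for each evicted pod to actually terminate before reporting success; that part is omitted here.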

michaelmdresser commented 2 years ago

A related problem, which explains why drain loops forever instead of timing out like it's supposed to (nice find, Bolt):

https://github.com/kubecost/cluster-turndown/blob/578ef725bde8620e1beb08ebab684a7127a260e8/pkg/cluster/draininator.go#L50

mbolt35 commented 2 years ago

I think the solution here is two-fold:

I believe the globalTimeout addition was based on the older kubectl evict logic that didn't work properly. At least, that's what I'm going to tell myself 😞

Adam-Stack-PM commented 1 year ago

This issue has been marked as stale because it has not had recent activity. It will be closed if no further action occurs.

AjayTripathy commented 1 year ago

This still feels relevant, if nothing else as documentation.