keikoproj / lifecycle-manager

Graceful AWS scaling event on Kubernetes using lifecycle hooks
Apache License 2.0
93 stars 28 forks source link

extending lifecycle event forever #23

Closed vgunapati closed 4 years ago

vgunapati commented 4 years ago

Is this a BUG REPORT or FEATURE REQUEST?: BUG What happened:

time="2019-11-26T23:38:01Z" level=warning msg="failed to drain node: All attempts fail:\n#1: command execution timed out"
time="2019-11-26T23:38:01Z" level=debug msg="publishing event: EventReasonNodeDrainFailed"
time="2019-11-26T23:38:01Z" level=error msg="event c****-**** has failed processing: All attempts fail:\n#1: command execution timed out"
time="2019-11-26T23:38:01Z" level=debug msg="publishing event: EventLifecycleHookFailed"
time="2019-11-26T23:38:01Z" level=warning msg="abandoning instance i-4431341241343141341f"
time="2019-11-26T23:38:01Z" level=info msg="setting lifecycle event as completed with result: ABANDON"
time="2019-11-26T23:38:01Z" level=info msg="sending heartbeat for i-4431341241343141341f"
time="2019-11-26T23:38:01Z" level=debug msg="extending lifecycle event for i-4431341241343141341f"
time="2019-11-26T23:38:02Z" level=debug msg="deleting message with receipt ID adasdasdad...........jk"
time="2019-11-26T23:38:11Z" level=debug msg="no messages received in interval"
time="2019-11-26T23:38:11Z" level=debug msg="polling for messages from queue"
time="2019-11-26T23:38:11Z" level=debug msg="active goroutines: 9"
time="2019-11-26T23:38:21Z" level=debug msg="no messages received in interval"
time="2019-11-26T23:38:21Z" level=debug msg="polling for messages from queue"
time="2019-11-26T23:38:21Z" level=debug msg="active goroutines: 9"
time="2019-11-26T23:38:31Z" level=debug msg="no messages received in interval"
time="2019-11-26T23:38:31Z" level=debug msg="polling for messages from queue"
time="2019-11-26T23:38:31Z" level=debug msg="active goroutines: 9"
time="2019-11-26T23:38:32Z" level=info msg="sending heartbeat for i-4431341241343141341f"
time="2019-11-26T23:38:32Z" level=debug msg="extending lifecycle event for i-4431341241343141341f"
time="2019-11-26T23:38:41Z" level=debug msg="no messages received in interval"
time="2019-11-26T23:38:41Z" level=debug msg="polling for messages from queue"
time="2019-11-26T23:38:41Z" level=debug msg="active goroutines: 9"
time="2019-11-26T23:38:51Z" level=debug msg="no messages received in interval"
time="2019-11-26T23:38:51Z" level=debug msg="polling for messages from queue"
time="2019-11-26T23:38:51Z" level=debug msg="active goroutines: 9"
time="2019-11-26T23:39:01Z" level=debug msg="no messages received in interval"
time="2019-11-26T23:39:01Z" level=debug msg="polling for messages from queue"
time="2019-11-26T23:39:01Z" level=debug msg="active goroutines: 9"
time="2019-11-26T23:39:02Z" level=info msg="sending heartbeat for i-4431341241343141341f"
time="2019-11-26T23:39:02Z" level=debug msg="extending lifecycle event for i-4431341241343141341f"
time="2019-11-26T23:39:11Z" level=debug msg="no messages received in interval"
time="2019-11-26T23:39:11Z" level=debug msg="polling for messages from queue"
time="2019-11-26T23:39:11Z" level=debug msg="active goroutines: 7"
time="2019-11-26T23:39:21Z" level=debug msg="no messages received in interval"
time="2019-11-26T23:39:21Z" level=debug msg="polling for messages from queue"
time="2019-11-26T23:39:21Z" level=debug msg="active goroutines: 7"
time="2019-11-26T23:39:31Z" level=debug msg="no messages received in interval"
time="2019-11-26T23:39:31Z" level=debug msg="polling for messages from queue"
time="2019-11-26T23:39:31Z" level=debug msg="active goroutines: 7"
time="2019-11-26T23:39:32Z" level=info msg="sending heartbeat for i-4431341241343141341f"
time="2019-11-26T23:39:32Z" level=debug msg="extending lifecycle event for i-4431341241343141341f"
time="2019-11-26T23:39:41Z" level=debug msg="no messages received in interval"
time="2019-11-26T23:39:41Z" level=debug msg="polling for messages from queue"
time="2019-11-26T23:39:41Z" level=debug msg="active goroutines: 6"
time="2019-11-26T23:39:51Z" level=debug msg="no messages received in interval"
time="2019-11-26T23:39:51Z" level=debug msg="polling for messages from queue"
time="2019-11-26T23:39:51Z" level=debug msg="active goroutines: 6"
time="2019-11-26T23:40:01Z" level=debug msg="no messages received in interval"
time="2019-11-26T23:40:01Z" level=debug msg="polling for messages from queue"

What you expected to happen: When result is ABANDON lifecycle-manager should mark lifecycle event completed

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

clientVersion:
  buildDate: "2019-10-15T23:41:55Z"
  compiler: gc
  gitCommit: c97fe5036ef3df2967d086711e6c0c405941e14b
  gitTreeState: clean
  gitVersion: v1.16.2
  goVersion: go1.12.10
  major: "1"
  minor: "16"
  platform: darwin/amd64
serverVersion:
  buildDate: "2019-10-15T12:02:12Z"
  compiler: gc
  gitCommit: 211047e9a1922595eaa3a1127ed365e9299a6c23
  gitTreeState: clean
  gitVersion: v1.14.8
  goVersion: go1.12.10
  major: "1"
  minor: "14"
  platform: linux/amd64

Other debugging information (if applicable):

kubectl logs <lifecycle-manager-pod>
eytan-avisror commented 4 years ago

Seems like the problem here is with FailEvent() func which does not do event.SetEventCompleted(true) when it should actually mark the event completed also when it fails. Otherwise you get an infinite loop of extending heartbeat.

Outside of fixing this we should probably add some maximum number of extensions to heartbeat to avoid this from reoccuring.

eytan-avisror commented 4 years ago

Testing this should be easy, you can nuke a node by running something like:

kind: Pod
metadata:
  name: node-nuke
  namespace: default
spec:
  hostNetwork: true
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node-10-10-10-10    <<<< CHANGE TO NODE NAME
  containers:
  - name: busybox
    securityContext:
      privileged: true
    image: busybox
    args:
    - ip
    - link
    - set
    - dev
    - eth0
    - down

This will make draining it fail with All attempts fail:#1: command execution timed out and thus reproduce this issue.

After fixing this we should verify this is no longer happening