Closed by vgunapati 4 years ago
The problem appears to be in FailEvent(), which does not call event.SetEventCompleted(true).
It should mark the event completed when the event fails as well.
Otherwise the heartbeat keeps being extended in an infinite loop.
Apart from fixing this, we should probably cap the number of heartbeat extensions to keep this from recurring.
Testing this should be easy. You can nuke a node by running something like:
apiVersion: v1
kind: Pod
metadata:
  name: node-nuke
  namespace: default
spec:
  hostNetwork: true
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node-10-10-10-10  # <<<< CHANGE TO NODE NAME
  containers:
  - name: busybox
    securityContext:
      privileged: true
    image: busybox
    args:
    - ip
    - link
    - set
    - dev
    - eth0
    - down
This takes eth0 down on the node, so draining it will fail with "All attempts fail:#1: command execution timed out"
and thus reproduce this issue.
After fixing this, we should verify that it no longer happens.
Is this a BUG REPORT or FEATURE REQUEST?: BUG
What happened:
What you expected to happen: When the result is ABANDON, lifecycle-manager should mark the lifecycle event completed.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
Other debugging information (if applicable):