kubereboot / kured

Kubernetes Reboot Daemon
https://kured.dev
Apache License 2.0

Kured is not uncordoning / removing label after reboot #955

Open ggggut opened 1 month ago

ggggut commented 1 month ago

Hello there, I recently discovered an issue with kured. Out of 10 reboots, 2-3 nodes won't be uncordoned after the reboot (different ones each time, not the same ones).

The post-reboot label is also not reset (I need to change it manually). The uncordoning step just gets skipped.

I also can't see an "Unable to uncordon ..." error. So it seems that kured doesn't even know that the node is cordoned?
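For reference, this is roughly what I run to fix an affected node by hand (a sketch only; the node name is an example, and the label key/value matches the preRebootNodeLabels/postRebootNodeLabels from my chart values below):

```shell
#!/bin/sh
# Manually uncordon the node that kured skipped after the reboot
kubectl uncordon node1.example.com

# Reset the post-reboot node label that kured failed to apply;
# --overwrite replaces the stale "inprogress" value
kubectl label node node1.example.com kuredreboot=idle --overwrite
```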

Here is a log after an unsuccessful reboot:

time="2024-07-17T07:07:05Z" level=info msg="Binding node-id command flag to environment variable: KURED_NODE_ID"
{"level":"info","msg":"Kubernetes Reboot Daemon: 1.15.1","time":"2024-07-17T07:07:05Z"}
{"level":"info","msg":"Node ID: node1.example.com","time":"2024-07-17T07:07:05Z"}
{"level":"info","msg":"Lock Annotation: node-remediation/kured:weave.works/kured-node-lock","time":"2024-07-17T07:07:05Z"}
{"level":"info","msg":"Lock TTL not set, lock will remain until being released","time":"2024-07-17T07:07:05Z"}
{"level":"info","msg":"Lock release delay not set, lock will be released immediately after rebooting","time":"2024-07-17T07:07:05Z"}
{"level":"info","msg":"PreferNoSchedule taint: ","time":"2024-07-17T07:07:05Z"}
{"level":"info","msg":"Blocking Pod Selectors: []","time":"2024-07-17T07:07:05Z"}
{"level":"info","msg":"Reboot schedule: ---MonTueWedThu------ between 08:00 and 18:00 Europe/Berlin","time":"2024-07-17T07:07:05Z"}
{"level":"info","msg":"Reboot check command: [test -f /sentinel/reboot-required] every 20s","time":"2024-07-17T07:07:05Z"}
{"level":"info","msg":"Concurrency: 1","time":"2024-07-17T07:07:05Z"}
{"level":"info","msg":"Reboot method: command","time":"2024-07-17T07:07:05Z"}
{"level":"info","msg":"Reboot signal: 39","time":"2024-07-17T07:07:05Z"}
{"level":"info","msg":"Will annotate nodes during kured reboot operations","time":"2024-07-17T07:07:05Z"}
{"level":"info","msg":"Holding lock","time":"2024-07-17T07:08:24Z"}
{"level":"info","msg":"Deleting node node1.example.com annotation weave.works/kured-reboot-in-progress","time":"2024-07-17T07:08:24Z"}
{"level":"info","msg":"Releasing lock","time":"2024-07-17T07:08:24Z"}
{"level":"info","msg":"Reboot not required","time":"2024-07-17T07:08:40Z"}
{"level":"info","msg":"Reboot not required","time":"2024-07-17T07:09:00Z"}
{"level":"info","msg":"Reboot not required","time":"2024-07-17T07:09:20Z"}
{"level":"info","msg":"Reboot not required","time":"2024-07-17T07:09:40Z"}
{"level":"info","msg":"Reboot not required","time":"2024-07-17T07:10:00Z"}
{"level":"info","msg":"Reboot not required","time":"2024-07-17T07:10:20Z"}
{"level":"info","msg":"Reboot not required","time":"2024-07-17T07:10:40Z"}
{"level":"info","msg":"Reboot not required","time":"2024-07-17T07:11:00Z"}

These are my chart values:

  values:
    tolerations:
      - operator: "Exists"
    maxUnavailable: 2
    metrics:
      create: true
    service:
      create: true
    configuration:
      lockAnnotation: "weave.works/kured-node-lock"
      period: 20s
      rebootDelay: 60s
      rebootDays: [mo,tu,we,th]
      startTime: "8:00"
      endTime: "18:00"
      timeZone: "Europe/Berlin"
      logFormat: "json"
      annotateNodes: true
      drainDelay: 30s
      preRebootNodeLabels:
        - kuredreboot=inprogress
      postRebootNodeLabels:
        - kuredreboot=idle

I noticed that once I recreate the DS, it works for a while before breaking again. This could be a coincidence, though.

I tried looking through the code myself but can't really make out why this happens.

Maybe one of you has an idea of what might be happening 🤔 Maybe it's a simple thing I'm overlooking.

ggggut commented 1 month ago

I also noticed that if it happens to one host, it will happen to every subsequent host as well, unless you recreate the DaemonSet.

Here is the DS manifest:

Name:           kured
Selector:       app.kubernetes.io/instance=kured,app.kubernetes.io/name=kured
Node-Selector:  kubernetes.io/os=linux
Labels:         app.kubernetes.io/instance=kured
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=kured
                helm.sh/chart=kured-5.4.5
                helm.toolkit.fluxcd.io/name=kured
                helm.toolkit.fluxcd.io/namespace=node-remediation
Annotations:    deprecated.daemonset.template.generation: 1
                meta.helm.sh/release-name: kured
                meta.helm.sh/release-namespace: node-remediation
                weave.works/kured-node-lock:
                  {"nodeID":"node.example.com","metadata":{"unschedulable":false},"created":"2024-07-17T12:40:44.917242595Z","TTL":0}
Desired Number of Nodes Scheduled: 20
Current Number of Nodes Scheduled: 20
Number of Nodes Scheduled with Up-to-date Pods: 20
Number of Nodes Scheduled with Available Pods: 20
Number of Nodes Misscheduled: 0
Pods Status:  20 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app.kubernetes.io/instance=kured
                    app.kubernetes.io/managed-by=Helm
                    app.kubernetes.io/name=kured
                    helm.sh/chart=kured-5.4.5
  Service Account:  kured
  Containers:
   kured:
    Image:      ghcr.io/kubereboot/kured:1.15.1
    Port:       8080/TCP
    Host Port:  0/TCP
    Command:
      /usr/bin/kured
    Args:
      --ds-name=kured
      --ds-namespace=node-remediation
      --metrics-port=8080
      --end-time=18:00
      --lock-annotation=weave.works/kured-node-lock
      --period=20s
      --drain-delay=30s
      --reboot-days=mo
      --reboot-days=tu
      --reboot-days=we
      --reboot-days=th
      --pre-reboot-node-labels=kuredreboot=inprogress
      --post-reboot-node-labels=kuredreboot=idle
      --reboot-sentinel=/sentinel/reboot-required
      --reboot-command=/bin/systemctl reboot
      --reboot-delay=60s
      --start-time=8:00
      --time-zone=Europe/Berlin
      --annotate-nodes=true
      --log-format=json
      --concurrency=1
    Liveness:   http-get http://:metrics/metrics delay=10s timeout=5s period=30s #success=1 #failure=5
    Readiness:  http-get http://:metrics/metrics delay=10s timeout=5s period=30s #success=1 #failure=5
    Environment:
      KURED_NODE_ID:    (v1:spec.nodeName)
    Mounts:
      /sentinel from sentinel (ro)
  Volumes:
   sentinel:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run
    HostPathType:  Directory
  Node-Selectors:  kubernetes.io/os=linux
  Tolerations:     op=Exists
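Since recreating the DaemonSet temporarily resolves it, I suspect the lock annotation visible on the DS above is left in a stale state. If that's the cause, the annotation can be inspected and cleared without recreating the DS (a sketch, assuming my namespace and lock annotation key; removing the lock should only be done when no reboot is actually in progress):

```shell
#!/bin/sh
# Show the current kured lock annotation on the DaemonSet
kubectl -n node-remediation get ds kured -o yaml | grep kured-node-lock

# Remove the lock annotation; the trailing '-' tells kubectl to delete it
kubectl -n node-remediation annotate ds kured weave.works/kured-node-lock-
```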