kubereboot / kured

Kubernetes Reboot Daemon
https://kured.dev
Apache License 2.0

Kured not rebooting node with example `var/run/reboot-required` file #952

Open chawleejay opened 4 months ago

chawleejay commented 4 months ago

Hello

I am trying to get Kured back up and running. The logs show:

time="2024-07-09T04:31:18Z" level=info msg="Reboot not required"
time="2024-07-09T05:31:18Z" level=info msg="Reboot not required"
time="2024-07-09T06:31:18Z" level=info msg="Reboot not required"

but there is a reboot-required file on the node.

Not sure why this is happening. I'm using the Kured Helm chart v5.4.0.

Thanks

ckotzbauer commented 4 months ago

Hi @chawleejay, can you please post your current Kured configuration and your installation method here? Otherwise we can't figure out what's happening, thanks.

chawleejay commented 4 months ago

Kured is installed and the pods are up and running. The pod logs show "Reboot not required".

The reboot-required file was placed on the node with the command touch /var/run/reboot-required

  template:
    metadata:
      name: 'kured-{{name}}'
    spec:
      project: kured
      source:
        chart: kured
        helm:
          valueFiles:
            - values.yaml
          releaseName: kured
          values: |
            tolerations:
              - key: node-role.kubernetes.io/master
                effect: NoSchedule
              - key: workload-type
                value: confluent
                effect: NoSchedule       
            updateStrategy: RollingUpdate
            maxUnavailable: 1
            configuration:
              period: 5h0m0s    
              rebootDays: {{rebootDays}}    
              lockTtl: 30m    
              timeZone: America/Phoenix
              notifyUrl: {{notifyUrl}}
        repoURL: 'https://kubereboot.github.io/charts'
        targetRevision: 5.4.0
      destination:
        server: '{{address}}'
        namespace: '{{namespace}}'

@ckotzbauer

ckotzbauer commented 4 months ago

Okay, I'm still not sure how kured is configured in your installation; the YAML is not clear about that. Can you please post the output of kubectl get daemonset -n <namespace> kured -o yaml here?

chawleejay commented 4 months ago

  creationTimestamp: "2022-09-07T17:14:52Z"
  generation: 16
  labels:
    app.kubernetes.io/instance: kured-devops
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kured
    helm.sh/chart: kured-5.4.0
    k8slens-edit-resource-version: v1
  name: kured
  namespace: kube-system
  resourceVersion: "3775521503"
  uid: b07427ea-5345-4bd0-bbaa-be3d4da149eb
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: kured
      app.kubernetes.io/name: kured
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: kured
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: kured
        helm.sh/chart: kured-5.4.0
    spec:
      containers:
      - args:
        - --ds-name=kured
        - --ds-namespace=kube-system
        - --metrics-port=8080
        - --lock-ttl=30m
        - --period=0h0m30s
        - --force-reboot=true
        - --reboot-command=/bin/systemctl reboot
        - --notify-url=slack://KuredDevOps@ourtoken
        - --time-zone=America/Phoenix
        - --log-format=text
        - --concurrency=1
        command:
        - /usr/bin/kured
        env:
        - name: KURED_NODE_ID
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: ghcr.io/kubereboot/kured:1.15.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: metrics
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        name: kured
        ports:
        - containerPort: 8080
          hostPort: 8080
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: metrics
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        resources: {}
        securityContext:
          privileged: true
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      hostPID: true
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: kured
      serviceAccountName: kured
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: workload-type
        value: confluent
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 3
  desiredNumberScheduled: 3
  numberAvailable: 3
  numberMisscheduled: 0
  numberReady: 3
  observedGeneration: 16
  updatedNumberScheduled: 3

I just added --force-reboot=true today and still nothing. Thank you.

jackfrancis commented 4 months ago

@chawleejay do you see this in the logs:

"sentinel command ended with unexpected exit code"...

If not, then based on your config it seems that test -f /var/run/reboot-required returned exit code 1, indicating that the file doesn't exist.
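
For illustration, the mapping from the `test -f` exit code to the log message can be sketched in shell. This is not kured's actual implementation, only a rough model of the behaviour described above. The function name `report_sentinel` and the temporary path are made up for the example, and the "Reboot required" message is included as the assumed counterpart of the "Reboot not required" line seen in the logs.

```shell
#!/bin/sh
# Hypothetical helper (not part of kured): maps the exit code of
# `test -f <path>` to a log-style message.
report_sentinel() {
    if test -f "$1"; then
        # Exit code 0: the sentinel file exists.
        echo "Reboot required"
    else
        # Non-zero exit code: the file is not visible from here.
        echo "Reboot not required"
    fi
}

# Demonstrate with a throwaway file instead of the real sentinel path.
tmpdir=$(mktemp -d)
report_sentinel "$tmpdir/reboot-required"   # file absent
touch "$tmpdir/reboot-required"
report_sentinel "$tmpdir/reboot-required"   # file present
rm -rf "$tmpdir"
```

The key point is that the result depends entirely on which filesystem view the check runs against, not on whether the file exists somewhere on the machine.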

ryayon commented 3 months ago

Hello,

I have the same issue on Ubuntu nodes.

If I check the existence of the file directly on the node, I get:

$ test -f /var/run/reboot-required
$ echo $?
0

However, if I run the same command from the pod on the same node, I get:

# test -f /var/run/reboot-required
# echo $?
1

In addition, here is the content of /var/run in the pod:

# ls /var/run/
secrets

github-actions[bot] commented 1 month ago

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

evrardjp commented 3 weeks ago

Our CI works in the following way:

However, this should work by default: if you don't pass a sentinel command, kured should check for /var/run/reboot-required by using nsenter to enter PID 1's mount namespace.

Did you try running the command /usr/bin/nsenter -m/proc/1/ns/mnt -- test -f /var/run/reboot-required and checking its result?
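
To compare the two views side by side, something like the following could be run from a shell inside the kured pod. It is only a sketch: the `check_with` helper and the `SENTINEL` variable are names invented here, and the `nsenter` call assumes a privileged container with hostPID: true, as in the DaemonSet posted earlier in this thread.

```shell
#!/bin/sh
# Sentinel path; override by setting SENTINEL in the environment.
SENTINEL="${SENTINEL:-/var/run/reboot-required}"

# Hypothetical helper: run `test -f` on the sentinel, optionally behind a
# wrapper command passed as arguments (for example nsenter), and print the
# resulting exit code.
check_with() {
    rc=0
    "$@" test -f "$SENTINEL" || rc=$?
    echo "exit code: $rc"
}

# View from the container's own mount namespace.
check_with

# View from the host's mount namespace. Left commented out because it needs
# a privileged container with hostPID: true:
# check_with /usr/bin/nsenter -m/proc/1/ns/mnt --
```

If the first call prints a non-zero exit code while the nsenter variant prints 0, the file is visible only in the host's mount namespace, which narrows the problem down to how the check is being executed.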

urbaman commented 3 weeks ago

Hi

I have the same problem in a MicroK8s deployment on Ubuntu 24.04 (node mk8s1, kured 1.16.0 deployed with manifests). The file was clearly present at the time of the logs:

# ls -la /var/run/reboot*
-rw-r--r-- 1 root root 32 Oct 17 06:36 /var/run/reboot-required
-rw-r--r-- 1 root root 40 Oct 17 06:36 /var/run/reboot-required.pkgs
# kubectl get pods -n kube-system -o wide
NAME                                       READY   STATUS    RESTARTS      AGE   IP             NODE    NOMINATED NODE   READINESS GATES
kured-6zxvf                                1/1     Running   6 (9d ago)    12d   10.1.217.210   mk8s3   <none>           <none>
kured-gss75                                1/1     Running   3 (9d ago)    12d   10.1.238.130   mk8s1   <none>           <none>
kured-z4wg8                                1/1     Running   2 (12d ago)   12d   10.1.115.130   mk8s2   <none>           <none>
# kubectl logs -n kube-system kured-gss75
time="2024-10-10T23:25:22Z" level=info msg="Binding node-id command flag to environment variable: KURED_NODE_ID"
time="2024-10-10T23:25:22Z" level=info msg="Kubernetes Reboot Daemon: 1.16.0"
time="2024-10-10T23:25:22Z" level=info msg="Node ID: mk8s1"
time="2024-10-10T23:25:22Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2024-10-10T23:25:22Z" level=info msg="Lock TTL not set, lock will remain until being released"
time="2024-10-10T23:25:22Z" level=info msg="Lock release delay not set, lock will be released immediately after rebooting"
time="2024-10-10T23:25:22Z" level=info msg="PreferNoSchedule taint: "
time="2024-10-10T23:25:22Z" level=info msg="Blocking Pod Selectors: []"
time="2024-10-10T23:25:22Z" level=info msg="Reboot schedule: ---MonTueWedThuFri--- between 10:00 and 17:00 Europe/Rome"
time="2024-10-10T23:25:22Z" level=info msg="Reboot check command: [test -f /var/run/reboot-required] every 1h0m0s"
time="2024-10-10T23:25:22Z" level=info msg="Concurrency: 1"
time="2024-10-10T23:25:22Z" level=info msg="Reboot method: command"
time="2024-10-10T23:25:22Z" level=info msg="Reboot signal: 39"
time="2024-10-11T08:12:16Z" level=info msg="Reboot not required"
time="2024-10-11T09:12:16Z" level=info msg="Reboot not required"
time="2024-10-11T10:12:16Z" level=info msg="Reboot not required"
time="2024-10-11T11:12:16Z" level=info msg="Reboot not required"
time="2024-10-11T12:12:16Z" level=info msg="Reboot not required"
time="2024-10-11T13:12:16Z" level=info msg="Reboot not required"
time="2024-10-11T14:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T08:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T09:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T10:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T11:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T12:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T13:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T14:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T08:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T09:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T10:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T11:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T12:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T13:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T14:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T08:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T09:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T10:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T11:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T12:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T13:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T14:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T08:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T09:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T10:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T11:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T12:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T13:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T14:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T08:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T09:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T10:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T11:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T12:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T13:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T14:12:16Z" level=info msg="Reboot not required"
# kubectl exec -ti -n kube-system kured-gss75 -- /bin/sh
/ # test -f /var/run/reboot-required
/ # echo $?
1
/ # /usr/bin/nsenter -m/proc/1/ns/mnt -- test -f /var/run/reboot-required
/ # echo $?
0
# test -f /var/run/reboot-required
# echo $?
0