kubereboot / kured

Kubernetes Reboot Daemon
https://kured.dev
Apache License 2.0

Kured not rebooting node with example `var/run/reboot-required` file #952

Open chawleejay opened 1 month ago

chawleejay commented 1 month ago

Hello

I am trying to get Kured back up and running. The logs show:

time="2024-07-09T04:31:18Z" level=info msg="Reboot not required"
time="2024-07-09T05:31:18Z" level=info msg="Reboot not required"
time="2024-07-09T06:31:18Z" level=info msg="Reboot not required"

but there is a reboot-required file on the node.

I'm not sure why this is happening. I'm using Kured v5.4.0.

Thanks

ckotzbauer commented 1 month ago

Hi @chawleejay, can you please post your current Kured configuration and your installation method here? Otherwise we can't figure out what's happening, thanks.

chawleejay commented 1 month ago

Kured is installed and the pods are up and running. The pod logs show "Reboot not required".

I created the reboot-required file on the node with the command touch /var/run/reboot-required

  template:
    metadata:
      name: 'kured-{{name}}'
    spec:
      project: kured
      source:
        chart: kured
        helm:
          valueFiles:
            - values.yaml
          releaseName: kured
          values: |
            tolerations:
              - key: node-role.kubernetes.io/master
                effect: NoSchedule
              - key: workload-type
                value: confluent
                effect: NoSchedule       
            updateStrategy: RollingUpdate
            maxUnavailable: 1
            configuration:
              period: 5h0m0s    
              rebootDays: {{rebootDays}}    
              lockTtl: 30m    
              timeZone: America/Phoenix
              notifyUrl: {{notifyUrl}}
        repoURL: 'https://kubereboot.github.io/charts'
        targetRevision: 5.4.0
      destination:
        server: '{{address}}'
        namespace: '{{namespace}}'

@ckotzbauer

ckotzbauer commented 1 month ago

Okay, I'm still not sure how kured is configured in your installation; the YAML doesn't make that clear. Can you please post the output of kubectl get daemonset -n <namespace> kured -o yaml here?

chawleejay commented 1 month ago

  creationTimestamp: "2022-09-07T17:14:52Z"
  generation: 16
  labels:
    app.kubernetes.io/instance: kured-devops
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kured
    helm.sh/chart: kured-5.4.0
    k8slens-edit-resource-version: v1
  name: kured
  namespace: kube-system
  resourceVersion: "3775521503"
  uid: b07427ea-5345-4bd0-bbaa-be3d4da149eb
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: kured
      app.kubernetes.io/name: kured
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: kured
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: kured
        helm.sh/chart: kured-5.4.0
    spec:
      containers:
      - args:
        - --ds-name=kured
        - --ds-namespace=kube-system
        - --metrics-port=8080
        - --lock-ttl=30m
        - --period=0h0m30s
        - --force-reboot=true
        - --reboot-command=/bin/systemctl reboot
        - --notify-url=slack://KuredDevOps@ourtoken
        - --time-zone=America/Phoenix
        - --log-format=text
        - --concurrency=1
        command:
        - /usr/bin/kured
        env:
        - name: KURED_NODE_ID
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: ghcr.io/kubereboot/kured:1.15.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: metrics
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        name: kured
        ports:
        - containerPort: 8080
          hostPort: 8080
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: metrics
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        resources: {}
        securityContext:
          privileged: true
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      hostPID: true
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: kured
      serviceAccountName: kured
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: workload-type
        value: confluent
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 3
  desiredNumberScheduled: 3
  numberAvailable: 3
  numberMisscheduled: 0
  numberReady: 3
  observedGeneration: 16
  updatedNumberScheduled: 3

I just added --force-reboot=true today and still nothing. Thank you.

jackfrancis commented 1 month ago

@chawleejay do you see this in the logs:

"sentinel command ended with unexpected exit code"...

If not, then based on your config it seems that test -f /var/run/reboot-required returned exit code 1, indicating that the file doesn't exist.
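
The exit-code reasoning in this comment can be sketched in shell. This is only an illustration of how the result of test -f maps to the two outcomes, not kured's actual implementation. The check_sentinel function name and the scratch directory are hypothetical, chosen so the sketch runs without touching /var/run.

```shell
#!/bin/sh
# Hypothetical helper (not part of kured): interpret the exit code of
# `test -f` as described above. Exit code 0 means the file exists,
# a non-zero exit code means it does not.
check_sentinel() {
    if test -f "$1"; then
        echo "reboot required"
    else
        echo "reboot not required"
    fi
}

# Use a scratch directory instead of the real /var/run so this is safe to run.
scratch=$(mktemp -d)
check_sentinel "$scratch/reboot-required"   # sentinel file absent
touch "$scratch/reboot-required"
check_sentinel "$scratch/reboot-required"   # sentinel file present
rm -rf "$scratch"
```

The same check against the real path only gives a meaningful answer if it runs against the filesystem where the file was created.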

ryayon commented 4 weeks ago

Hello,

I have the same issue on Ubuntu nodes.

If I check the existence of the file directly on the node, I get:

$ test -f /var/run/reboot-required
$ echo $?
0

However, if I run the same command from the pod on the same node, I get:

# test -f /var/run/reboot-required
# echo $?
1

In addition, here is the content of /var/run in the pod:

# ls /var/run/
secrets
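
The listing above shows that the pod's own /var/run does not contain the host's files, so a plain test -f inside the pod inspects the container filesystem rather than the node's. A rough way to compare the two views from inside the pod is sketched below. It assumes the pod runs with hostPID: true and privileged: true (as in the DaemonSet posted earlier), that nsenter is available in the image, and that this kured version reaches the host through PID 1's mount namespace; that last point is an assumption to verify against the kured source. The compare_views function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of kured). Intended to be run inside the
# kured container; assumes hostPID: true and privileged: true.
compare_views() {
    path="$1"
    # Plain check: resolves the path in the container's own filesystem.
    test -f "$path"
    echo "container view: $?"
    # Same check inside PID 1's mount namespace, i.e. the host's filesystem
    # when the pod shares the host PID namespace.
    nsenter -m/proc/1/ns/mnt -- test -f "$path"
    echo "host view: $?"
}

compare_views /var/run/reboot-required
```

If the two exit codes differ, the plain check from the pod shell is not looking at the same filesystem as the node, and only the host view reflects the file created with touch on the node.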