Open chawleejay opened 4 months ago
Hi @chawleejay, can you please post your current Kured configuration and your installation method here? Otherwise we can't figure out what's happening, thanks.
Kured is installed and the pods are up and running. The pod logs show "Reboot not required".
The reboot-required file was created on the node with the command touch /var/run/reboot-required
template:
  metadata:
    name: 'kured-{{name}}'
  spec:
    project: kured
    source:
      chart: kured
      helm:
        valueFiles:
          - values.yaml
        releaseName: kured
        values: |
          tolerations:
            - key: node-role.kubernetes.io/master
              effect: NoSchedule
            - key: workload-type
              value: confluent
              effect: NoSchedule
          updateStrategy: RollingUpdate
          maxUnavailable: 1
          configuration:
            period: 5h0m0s
            rebootDays: {{rebootDays}}
            lockTtl: 30m
            timeZone: America/Phoenix
            notifyUrl: {{notifyUrl}}
      repoURL: 'https://kubereboot.github.io/charts'
      targetRevision: 5.4.0
    destination:
      server: '{{address}}'
      namespace: '{{namespace}}'
@ckotzbauer
Okay, I'm still not sure how kured is configured in your installation; the YAML is not clear about that. Can you please post the output of kubectl get daemonset -n <namespace> kured -o yaml here?
  creationTimestamp: "2022-09-07T17:14:52Z"
  generation: 16
  labels:
    app.kubernetes.io/instance: kured-devops
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kured
    helm.sh/chart: kured-5.4.0
    k8slens-edit-resource-version: v1
  name: kured
  namespace: kube-system
  resourceVersion: "3775521503"
  uid: b07427ea-5345-4bd0-bbaa-be3d4da149eb
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: kured
      app.kubernetes.io/name: kured
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: kured
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: kured
        helm.sh/chart: kured-5.4.0
    spec:
      containers:
      - args:
        - --ds-name=kured
        - --ds-namespace=kube-system
        - --metrics-port=8080
        - --lock-ttl=30m
        - --period=0h0m30s
        - --force-reboot=true
        - --reboot-command=/bin/systemctl reboot
        - --notify-url=slack://KuredDevOps@ourtoken
        - --time-zone=America/Phoenix
        - --log-format=text
        - --concurrency=1
        command:
        - /usr/bin/kured
        env:
        - name: KURED_NODE_ID
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: ghcr.io/kubereboot/kured:1.15.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: metrics
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        name: kured
        ports:
        - containerPort: 8080
          hostPort: 8080
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: metrics
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        resources: {}
        securityContext:
          privileged: true
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      hostPID: true
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: kured
      serviceAccountName: kured
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: workload-type
        value: confluent
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 3
  desiredNumberScheduled: 3
  numberAvailable: 3
  numberMisscheduled: 0
  numberReady: 3
  observedGeneration: 16
  updatedNumberScheduled: 3
I just added --force-reboot=true today and still nothing. Thank you.
@chawleejay do you see this in the logs: "sentinel command ended with unexpected exit code"...
If not, then based on your config it seems that test -f /var/run/reboot-required returned a 1 exit code, indicating that the file doesn't exist.
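For reference, the exit-code convention behind that check can be reproduced on any machine. The sketch below is only an illustration: it uses a temporary file as a stand-in for /var/run/reboot-required, and the variable names are made up for the example.

```shell
#!/bin/sh
# Stand-in sentinel in a temp location; the real path is /var/run/reboot-required.
sentinel="$(mktemp)"

# File exists: test -f exits with 0.
present_rc=0
test -f "$sentinel" || present_rc=$?

# File removed: test -f exits with 1.
rm -f "$sentinel"
absent_rc=0
test -f "$sentinel" || absent_rc=$?

echo "present: $present_rc, absent: $absent_rc"
# prints "present: 0, absent: 1"
```

An exit code of 1 therefore only says that the file was not visible from wherever the command ran, which is not necessarily the same view as a shell on the node.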
Hello,
I have the same issue on Ubuntu nodes.
If I check the existence of the file directly on the node, I get:
$ test -f /var/run/reboot-required
$ echo $?
0
While, if I run the same command from the pod of the same node, I get:
# test -f /var/run/reboot-required
# echo $?
1
In addition, here is the content of /var/run in the pod:
# ls /var/run/
secrets
Our CI works in the following way:
- --reboot-sentinel=/sentinel/reboot-required
However, this should work by default: if you don't pass sentinel-command, it should watch for /var/run/reboot-required by using nsenter to enter PID 1's mount namespace.
Did you try running the command /usr/bin/nsenter -m/proc/1/ns/mnt -- test -f /var/run/reboot-required and checking its result?
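To compare the plain check with the nsenter variant, a small wrapper can help. This is a minimal sketch, not kured code: sentinel_status is a hypothetical helper, and the nsenter call in the comments assumes the pod runs privileged with hostPID: true, as in the DaemonSet posted earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of kured): run `test -f` on a sentinel path,
# optionally behind a wrapper command such as nsenter.
#   $1    = sentinel path
#   $2... = optional wrapper command and its arguments
sentinel_status() {
  path="$1"
  shift
  # With no wrapper, "$@" expands to nothing and plain `test -f` runs.
  if "$@" test -f "$path"; then
    echo "reboot required"
  else
    echo "reboot not required"
  fi
}

# Example calls, to be run inside the kured pod:
#   sentinel_status /var/run/reboot-required
#   sentinel_status /var/run/reboot-required /usr/bin/nsenter -m/proc/1/ns/mnt --
```

If the two calls disagree, the file exists in one mount namespace but not the other.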
Hi
I have the same problem in a MicroK8s deployment on Ubuntu 24.04 (node mk8s1, kured 1.16.0 deployed with manifests). The file was clearly present at the time of the logs:
# ls -la /var/run/reboot*
-rw-r--r-- 1 root root 32 Oct 17 06:36 /var/run/reboot-required
-rw-r--r-- 1 root root 40 Oct 17 06:36 /var/run/reboot-required.pkgs
# kubectl get pods -n kube-system -o wide
NAME          READY   STATUS    RESTARTS      AGE   IP             NODE    NOMINATED NODE   READINESS GATES
kured-6zxvf   1/1     Running   6 (9d ago)    12d   10.1.217.210   mk8s3   <none>           <none>
kured-gss75   1/1     Running   3 (9d ago)    12d   10.1.238.130   mk8s1   <none>           <none>
kured-z4wg8   1/1     Running   2 (12d ago)   12d   10.1.115.130   mk8s2   <none>           <none>
# kubectl logs -n kube-system kured-gss75
time="2024-10-10T23:25:22Z" level=info msg="Binding node-id command flag to environment variable: KURED_NODE_ID"
time="2024-10-10T23:25:22Z" level=info msg="Kubernetes Reboot Daemon: 1.16.0"
time="2024-10-10T23:25:22Z" level=info msg="Node ID: mk8s1"
time="2024-10-10T23:25:22Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2024-10-10T23:25:22Z" level=info msg="Lock TTL not set, lock will remain until being released"
time="2024-10-10T23:25:22Z" level=info msg="Lock release delay not set, lock will be released immediately after rebooting"
time="2024-10-10T23:25:22Z" level=info msg="PreferNoSchedule taint: "
time="2024-10-10T23:25:22Z" level=info msg="Blocking Pod Selectors: []"
time="2024-10-10T23:25:22Z" level=info msg="Reboot schedule: ---MonTueWedThuFri--- between 10:00 and 17:00 Europe/Rome"
time="2024-10-10T23:25:22Z" level=info msg="Reboot check command: [test -f /var/run/reboot-required] every 1h0m0s"
time="2024-10-10T23:25:22Z" level=info msg="Concurrency: 1"
time="2024-10-10T23:25:22Z" level=info msg="Reboot method: command"
time="2024-10-10T23:25:22Z" level=info msg="Reboot signal: 39"
time="2024-10-11T08:12:16Z" level=info msg="Reboot not required"
time="2024-10-11T09:12:16Z" level=info msg="Reboot not required"
time="2024-10-11T10:12:16Z" level=info msg="Reboot not required"
time="2024-10-11T11:12:16Z" level=info msg="Reboot not required"
time="2024-10-11T12:12:16Z" level=info msg="Reboot not required"
time="2024-10-11T13:12:16Z" level=info msg="Reboot not required"
time="2024-10-11T14:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T08:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T09:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T10:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T11:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T12:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T13:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T14:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T08:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T09:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T10:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T11:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T12:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T13:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T14:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T08:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T09:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T10:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T11:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T12:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T13:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T14:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T08:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T09:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T10:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T11:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T12:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T13:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T14:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T08:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T09:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T10:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T11:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T12:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T13:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T14:12:16Z" level=info msg="Reboot not required"
# kubectl exec -ti -n kube-system kured-gss75 -- /bin/sh
/ # test -f /var/run/reboot-required
/ # echo $?
1
/ # /usr/bin/nsenter -m/proc/1/ns/mnt -- test -f /var/run/reboot-required
/ # echo $?
0
# test -f /var/run/reboot-required
# echo $?
0
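The three results above (1 inside the pod, 0 through nsenter, 0 in the last check, which appears to have been run on the node itself) can be read side by side. The following is a rough sketch of that reasoning as a shell function; the function name and messages are hypothetical, and the interpretation is a suggestion rather than a confirmed diagnosis.

```shell
#!/bin/sh
# Hypothetical helper: interpret two exit codes of `test -f <sentinel>`.
#   $1 = exit code when run directly inside the kured container
#   $2 = exit code when run through nsenter in the host mount namespace
interpret_checks() {
  case "$1:$2" in
    0:0) echo "sentinel visible in both the container and the host namespace" ;;
    1:0) echo "sentinel exists on the host but is not visible inside the container" ;;
    0:1) echo "sentinel visible only inside the container" ;;
    1:1) echo "sentinel not found in either namespace" ;;
    *)   echo "unexpected exit codes: $1 $2" ;;
  esac
}

# The values reported above: 1 in the container, 0 through nsenter.
interpret_checks 1 0
```

In the 1:0 case the check listed in the startup log ("Reboot check command: [test -f /var/run/reboot-required]") would return 1 if it runs in the container's own mount namespace, which may explain the repeated "Reboot not required" lines.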
Hello
I am trying to get Kured back up and running. The logs show
time="2024-07-09T04:31:18Z" level=info msg="Reboot not required"
time="2024-07-09T05:31:18Z" level=info msg="Reboot not required"
time="2024-07-09T06:31:18Z" level=info msg="Reboot not required"
but there is a reboot-required file on the node.
I'm not sure why this is happening. I'm using the Kured Helm chart v5.4.0.
Thanks