Closed disi closed 11 months ago
Here is the CUE bit for timoni that I use:
"flux": {
    module: {
        url:     "oci://ghcr.io/stefanprodan/modules/flux-aio"
        version: "2.1.2"
    }
    namespace: "flux-system"
    values: {
        hostNetwork:     true
        securityProfile: "privileged"
        controllers: notification: enabled: false
    }
}
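Aside: the single-line field `controllers: notification: enabled: false` is CUE's shorthand for a nested struct; it is equivalent to writing:

```cue
controllers: {
	notification: {
		enabled: false
	}
}
```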
When the node running Flux goes down, the controllers are not rescheduled to other nodes.
Is the Kubernetes control plane still working? I would expect it to reschedule the pod on a different node. Maybe the toleration we set in Flux is too broad; I set it like this: https://github.com/stefanprodan/flux-aio/blob/aedf966e28a5f7170e3e737d7d52fa5815c8cfad/modules/flux-aio/templates/config.cue#L132
It may well be that since etcd has no quorum, the control plane will no longer schedule pods anywhere. I suggest creating a cluster with 2 worker nodes, deploying Flux on one of the workers, making that node fail, and seeing whether it gets rescheduled to the healthy node.
I can still schedule pods. Weave Dashboard, AWX, and the Kubernetes Dashboard are all rescheduled to other nodes. Only Flux is not, and it still shows "Running".
If you describe the Flux pod, is there any hint in the events about some blocker to rescheduling?
The events show this:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Created 39m kubelet Created container kustomize-controller
Normal Pulled 39m kubelet Container image "ghcr.io/fluxcd/helm-controller:v0.36.2" already present on machine
Normal Pulled 39m kubelet Container image "ghcr.io/fluxcd/source-controller:v1.1.2" already present on machine
Normal Created 39m kubelet Created container source-controller
Normal Started 39m kubelet Started container source-controller
Normal Pulled 39m kubelet Container image "ghcr.io/fluxcd/kustomize-controller:v1.1.1" already present on machine
Normal SandboxChanged 39m kubelet Pod sandbox changed, it will be killed and re-created.
Normal Started 39m kubelet Started container kustomize-controller
Normal Started 39m kubelet Started container helm-controller
Normal Created 39m kubelet Created container helm-controller
Warning Unhealthy 38m kubelet Liveness probe failed: Get "http://10.0.2.22:9794/healthz": dial tcp 10.0.2.22:9794: connect: connection refused
Warning Unhealthy 38m (x5 over 39m) kubelet Readiness probe failed: Get "http://10.0.2.22:9794/readyz": dial tcp 10.0.2.22:9794: connect: connection refused
Warning Unhealthy 38m kubelet Liveness probe failed: Get "http://10.0.2.22:9792/healthz": dial tcp 10.0.2.22:9792: connect: connection refused
Warning Unhealthy 38m (x8 over 39m) kubelet Readiness probe failed: Get "http://10.0.2.22:9790/": dial tcp 10.0.2.22:9790: connect: connection refused
Warning NodeNotReady 16m (x3 over 93m) node-controller Node is not ready
The full description:
[disi@vmalmakw1s ~]$ kubectl describe pod flux-57bd866b6d-zbrfc -n flux-system
Name: flux-57bd866b6d-zbrfc
Namespace: flux-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Service Account: flux
Node: vmalmakms.home/10.0.2.22
Start Time: Sat, 25 Nov 2023 22:11:15 +0000
Labels: app.kubernetes.io/name=flux
pod-template-hash=57bd866b6d
Annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: true
prometheus.io/scrape: true
Status: Running
IP: 10.0.2.22
IPs:
IP: 10.0.2.22
Controlled By: ReplicaSet/flux-57bd866b6d
Containers:
source-controller:
Container ID: containerd://f95a5ff962bb8b4697cc3b3b933b2c45499f594b65802f113aebb587ff822b61
Image: ghcr.io/fluxcd/source-controller:v1.1.2
Image ID: ghcr.io/fluxcd/source-controller@sha256:b776e085ac079bf22ed23afe2874aebd10efcfaa740ec25748774608bbc79932
Ports: 9790/TCP, 9791/TCP, 9792/TCP
Host Ports: 9790/TCP, 9791/TCP, 9792/TCP
SeccompProfile: RuntimeDefault
Args:
--watch-all-namespaces
--log-level=info
--log-encoding=json
--enable-leader-election=false
--metrics-addr=:9791
--health-addr=:9792
--storage-addr=:9790
--storage-path=/data
--storage-adv-addr=flux.$(RUNTIME_NAMESPACE).svc.cluster.local.
--concurrent=5
--requeue-dependency=30s
--watch-label-selector=!sharding.fluxcd.io/key
--helm-cache-max-size=10
--helm-cache-ttl=60m
--helm-cache-purge-interval=5m
State: Running
Started: Sun, 26 Nov 2023 08:33:56 +0000
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Sun, 26 Nov 2023 08:33:25 +0000
Finished: Sun, 26 Nov 2023 08:33:56 +0000
Ready: True
Restart Count: 15
Limits:
memory: 1Gi
Requests:
cpu: 100m
memory: 64Mi
Liveness: http-get http://:healthz-sc/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:http-sc/ delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
SOURCE_CONTROLLER_LOCALHOST: localhost:9790
RUNTIME_NAMESPACE: flux-system (v1:metadata.namespace)
TUF_ROOT: /tmp/.sigstore
NO_PROXY: .cluster.local.,.cluster.local,.svc
Mounts:
/data from data (rw)
/tmp from tmp (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nznjt (ro)
kustomize-controller:
Container ID: containerd://997b139a3e85abd2da13c1d95fbb585bf6cfe29967bcd241d95a885493213971
Image: ghcr.io/fluxcd/kustomize-controller:v1.1.1
Image ID: ghcr.io/fluxcd/kustomize-controller@sha256:e2b3c9e1292564bbfaa513f3cc6fa1df1194fae8ba9483fbe581099d0c585d94
Ports: 9793/TCP, 9794/TCP
Host Ports: 9793/TCP, 9794/TCP
SeccompProfile: RuntimeDefault
Args:
--watch-all-namespaces
--log-level=info
--log-encoding=json
--enable-leader-election=false
--metrics-addr=:9793
--health-addr=:9794
--watch-label-selector=!sharding.fluxcd.io/key
--concurrent=5
--requeue-dependency=30s
State: Running
Started: Sun, 26 Nov 2023 08:33:56 +0000
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Sun, 26 Nov 2023 08:33:25 +0000
Finished: Sun, 26 Nov 2023 08:33:56 +0000
Ready: True
Restart Count: 15
Limits:
memory: 1Gi
Requests:
cpu: 100m
memory: 64Mi
Liveness: http-get http://:healthz-kc/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:healthz-kc/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
SOURCE_CONTROLLER_LOCALHOST: localhost:9790
RUNTIME_NAMESPACE: flux-system (v1:metadata.namespace)
TUF_ROOT: /tmp/.sigstore
NO_PROXY: .cluster.local.,.cluster.local,.svc
Mounts:
/tmp from tmp (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nznjt (ro)
helm-controller:
Container ID: containerd://c6a9a4ec46740520acc2f70a763259469830add59b7904a4ee0b00e8e97d2dd1
Image: ghcr.io/fluxcd/helm-controller:v0.36.2
Image ID: ghcr.io/fluxcd/helm-controller@sha256:6ee7e590e57350ac91cfdeee4587d0e9e6f52e723c56d4b7878c59279bd36f00
Ports: 9795/TCP, 9796/TCP
Host Ports: 9795/TCP, 9796/TCP
SeccompProfile: RuntimeDefault
Args:
--watch-all-namespaces
--log-level=info
--log-encoding=json
--enable-leader-election=false
--metrics-addr=:9795
--health-addr=:9796
--watch-label-selector=!sharding.fluxcd.io/key
--concurrent=5
--requeue-dependency=30s
State: Running
Started: Sun, 26 Nov 2023 08:33:25 +0000
Last State: Terminated
Reason: Unknown
Exit Code: 255
Started: Sun, 26 Nov 2023 07:49:37 +0000
Finished: Sun, 26 Nov 2023 08:33:09 +0000
Ready: True
Restart Count: 13
Limits:
memory: 1Gi
Requests:
cpu: 100m
memory: 64Mi
Liveness: http-get http://:healthz-hc/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:healthz-hc/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
SOURCE_CONTROLLER_LOCALHOST: localhost:9790
RUNTIME_NAMESPACE: flux-system (v1:metadata.namespace)
TUF_ROOT: /tmp/.sigstore
NO_PROXY: .cluster.local.,.cluster.local,.svc
Mounts:
/tmp from tmp (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nznjt (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady True
PodScheduled True
Volumes:
data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
tmp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-nznjt:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Created 39m kubelet Created container kustomize-controller
Normal Pulled 39m kubelet Container image "ghcr.io/fluxcd/helm-controller:v0.36.2" already present on machine
Normal Pulled 39m kubelet Container image "ghcr.io/fluxcd/source-controller:v1.1.2" already present on machine
Normal Created 39m kubelet Created container source-controller
Normal Started 39m kubelet Started container source-controller
Normal Pulled 39m kubelet Container image "ghcr.io/fluxcd/kustomize-controller:v1.1.1" already present on machine
Normal SandboxChanged 39m kubelet Pod sandbox changed, it will be killed and re-created.
Normal Started 39m kubelet Started container kustomize-controller
Normal Started 39m kubelet Started container helm-controller
Normal Created 39m kubelet Created container helm-controller
Warning Unhealthy 38m kubelet Liveness probe failed: Get "http://10.0.2.22:9794/healthz": dial tcp 10.0.2.22:9794: connect: connection refused
Warning Unhealthy 38m (x5 over 39m) kubelet Readiness probe failed: Get "http://10.0.2.22:9794/readyz": dial tcp 10.0.2.22:9794: connect: connection refused
Warning Unhealthy 38m kubelet Liveness probe failed: Get "http://10.0.2.22:9792/healthz": dial tcp 10.0.2.22:9792: connect: connection refused
Warning Unhealthy 38m (x8 over 39m) kubelet Readiness probe failed: Get "http://10.0.2.22:9790/": dial tcp 10.0.2.22:9790: connect: connection refused
Warning NodeNotReady 16m (x3 over 93m) node-controller Node is not ready
Hmm, why is the status Running if the liveness probe fails? Can you also describe the ReplicaSet/flux-57bd866b6d and the flux Deployment, please?
Replicaset
Name: flux-57bd866b6d
Namespace: flux-system
Selector: app.kubernetes.io/name=flux,pod-template-hash=57bd866b6d
Labels: app.kubernetes.io/name=flux
pod-template-hash=57bd866b6d
Annotations: app.kubernetes.io/role: cluster-admin
deployment.kubernetes.io/desired-replicas: 1
deployment.kubernetes.io/max-replicas: 1
deployment.kubernetes.io/revision: 1
Controlled By: Deployment/flux
Replicas: 1 current / 1 desired
Pods Status: 1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app.kubernetes.io/name=flux
pod-template-hash=57bd866b6d
Annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: true
prometheus.io/scrape: true
Service Account: flux
Containers:
source-controller:
Image: ghcr.io/fluxcd/source-controller:v1.1.2
Ports: 9790/TCP, 9791/TCP, 9792/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
SeccompProfile: RuntimeDefault
Args:
--watch-all-namespaces
--log-level=info
--log-encoding=json
--enable-leader-election=false
--metrics-addr=:9791
--health-addr=:9792
--storage-addr=:9790
--storage-path=/data
--storage-adv-addr=flux.$(RUNTIME_NAMESPACE).svc.cluster.local.
--concurrent=5
--requeue-dependency=30s
--watch-label-selector=!sharding.fluxcd.io/key
--helm-cache-max-size=10
--helm-cache-ttl=60m
--helm-cache-purge-interval=5m
Limits:
memory: 1Gi
Requests:
cpu: 100m
memory: 64Mi
Liveness: http-get http://:healthz-sc/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:http-sc/ delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
SOURCE_CONTROLLER_LOCALHOST: localhost:9790
RUNTIME_NAMESPACE: (v1:metadata.namespace)
TUF_ROOT: /tmp/.sigstore
NO_PROXY: .cluster.local.,.cluster.local,.svc
Mounts:
/data from data (rw)
/tmp from tmp (rw)
kustomize-controller:
Image: ghcr.io/fluxcd/kustomize-controller:v1.1.1
Ports: 9793/TCP, 9794/TCP
Host Ports: 0/TCP, 0/TCP
SeccompProfile: RuntimeDefault
Args:
--watch-all-namespaces
--log-level=info
--log-encoding=json
--enable-leader-election=false
--metrics-addr=:9793
--health-addr=:9794
--watch-label-selector=!sharding.fluxcd.io/key
--concurrent=5
--requeue-dependency=30s
Limits:
memory: 1Gi
Requests:
cpu: 100m
memory: 64Mi
Liveness: http-get http://:healthz-kc/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:healthz-kc/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
SOURCE_CONTROLLER_LOCALHOST: localhost:9790
RUNTIME_NAMESPACE: (v1:metadata.namespace)
TUF_ROOT: /tmp/.sigstore
NO_PROXY: .cluster.local.,.cluster.local,.svc
Mounts:
/tmp from tmp (rw)
helm-controller:
Image: ghcr.io/fluxcd/helm-controller:v0.36.2
Ports: 9795/TCP, 9796/TCP
Host Ports: 0/TCP, 0/TCP
SeccompProfile: RuntimeDefault
Args:
--watch-all-namespaces
--log-level=info
--log-encoding=json
--enable-leader-election=false
--metrics-addr=:9795
--health-addr=:9796
--watch-label-selector=!sharding.fluxcd.io/key
--concurrent=5
--requeue-dependency=30s
Limits:
memory: 1Gi
Requests:
cpu: 100m
memory: 64Mi
Liveness: http-get http://:healthz-hc/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:healthz-hc/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
SOURCE_CONTROLLER_LOCALHOST: localhost:9790
RUNTIME_NAMESPACE: (v1:metadata.namespace)
TUF_ROOT: /tmp/.sigstore
NO_PROXY: .cluster.local.,.cluster.local,.svc
Mounts:
/tmp from tmp (rw)
Volumes:
data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
tmp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
Priority Class Name: system-cluster-critical
Events: <none>
Deployment
Name: flux
Namespace: flux-system
CreationTimestamp: Mon, 20 Nov 2023 15:16:57 +0000
Labels: app.kubernetes.io/managed-by=timoni
app.kubernetes.io/name=flux
app.kubernetes.io/part-of=flux
app.kubernetes.io/version=v2.1.2
instance.timoni.sh/name=flux
instance.timoni.sh/namespace=flux-system
Annotations: app.kubernetes.io/role: cluster-admin
deployment.kubernetes.io/revision: 1
Selector: app.kubernetes.io/name=flux
Replicas: 1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType: Recreate
MinReadySeconds: 0
Pod Template:
Labels: app.kubernetes.io/name=flux
Annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: true
prometheus.io/scrape: true
Service Account: flux
Containers:
source-controller:
Image: ghcr.io/fluxcd/source-controller:v1.1.2
Ports: 9790/TCP, 9791/TCP, 9792/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
SeccompProfile: RuntimeDefault
Args:
--watch-all-namespaces
--log-level=info
--log-encoding=json
--enable-leader-election=false
--metrics-addr=:9791
--health-addr=:9792
--storage-addr=:9790
--storage-path=/data
--storage-adv-addr=flux.$(RUNTIME_NAMESPACE).svc.cluster.local.
--concurrent=5
--requeue-dependency=30s
--watch-label-selector=!sharding.fluxcd.io/key
--helm-cache-max-size=10
--helm-cache-ttl=60m
--helm-cache-purge-interval=5m
Limits:
memory: 1Gi
Requests:
cpu: 100m
memory: 64Mi
Liveness: http-get http://:healthz-sc/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:http-sc/ delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
SOURCE_CONTROLLER_LOCALHOST: localhost:9790
RUNTIME_NAMESPACE: (v1:metadata.namespace)
TUF_ROOT: /tmp/.sigstore
NO_PROXY: .cluster.local.,.cluster.local,.svc
Mounts:
/data from data (rw)
/tmp from tmp (rw)
kustomize-controller:
Image: ghcr.io/fluxcd/kustomize-controller:v1.1.1
Ports: 9793/TCP, 9794/TCP
Host Ports: 0/TCP, 0/TCP
SeccompProfile: RuntimeDefault
Args:
--watch-all-namespaces
--log-level=info
--log-encoding=json
--enable-leader-election=false
--metrics-addr=:9793
--health-addr=:9794
--watch-label-selector=!sharding.fluxcd.io/key
--concurrent=5
--requeue-dependency=30s
Limits:
memory: 1Gi
Requests:
cpu: 100m
memory: 64Mi
Liveness: http-get http://:healthz-kc/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:healthz-kc/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
SOURCE_CONTROLLER_LOCALHOST: localhost:9790
RUNTIME_NAMESPACE: (v1:metadata.namespace)
TUF_ROOT: /tmp/.sigstore
NO_PROXY: .cluster.local.,.cluster.local,.svc
Mounts:
/tmp from tmp (rw)
helm-controller:
Image: ghcr.io/fluxcd/helm-controller:v0.36.2
Ports: 9795/TCP, 9796/TCP
Host Ports: 0/TCP, 0/TCP
SeccompProfile: RuntimeDefault
Args:
--watch-all-namespaces
--log-level=info
--log-encoding=json
--enable-leader-election=false
--metrics-addr=:9795
--health-addr=:9796
--watch-label-selector=!sharding.fluxcd.io/key
--concurrent=5
--requeue-dependency=30s
Limits:
memory: 1Gi
Requests:
cpu: 100m
memory: 64Mi
Liveness: http-get http://:healthz-hc/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:healthz-hc/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
SOURCE_CONTROLLER_LOCALHOST: localhost:9790
RUNTIME_NAMESPACE: (v1:metadata.namespace)
TUF_ROOT: /tmp/.sigstore
NO_PROXY: .cluster.local.,.cluster.local,.svc
Mounts:
/tmp from tmp (rw)
Volumes:
data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
tmp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
Priority Class Name: system-cluster-critical
Conditions:
Type Status Reason
---- ------ ------
Progressing True NewReplicaSetAvailable
Available False MinimumReplicasUnavailable
OldReplicaSets: <none>
NewReplicaSet: flux-57bd866b6d (1/1 replicas created)
Events: <none>
Really odd: the ReplicaSet says Pods Status: 1 Running, which is strange, while the Deployment says Replicas: 1 unavailable, yet it doesn't create a new ReplicaSet.
Later today I will start over and redeploy the entire cluster, then test again and see whether it shows the same behaviour.
I guess if you delete the pod it will get rescheduled; this looks like either a race condition in the Kubernetes scheduler or the toleration tripping it up.
Correct :) I did delete the pod.
flux-system flux-57bd866b6d-j6z7x 3/3 Running 0 56s 10.0.2.24 vmalmakw2s.home <none> <none>
flux-system flux-57bd866b6d-zbrfc 3/3 Terminating 43 (75m ago) 11h 10.0.2.22 vmalmakms.home <none> <none>
A new pod was created and it's working fine, but it did not happen automatically.
Hmm, so it looks like it got stuck in Terminating, but why wasn't this status reflected in the ReplicaSet, and why didn't it time out? I wonder if this is some bug in Kubernetes.
The Terminating status only appeared after I ran:
$ kubectl delete pod flux-57bd866b6d-zbrfc -n flux-system
FWIW, the Kubernetes Dashboard and other pods also linger in Terminating for some time. Is the default around 15 minutes before Kubernetes removes them? Edit: this removes it immediately:
$ kubectl delete pod --force flux-57bd866b6d-zbrfc -n flux-system
If you manage to reproduce this, it would be good to take snapshots of the Deployment and ReplicaSet and see what events are issued for them; I guess the earlier events expired, which is why none are listed now.
If you can reproduce this, please add a toleration like the one below with kubectl edit, and retest:
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 30
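For reference, this toleration goes under `spec.template.spec.tolerations` in the flux Deployment; a minimal sketch of the placement (field paths per the standard Kubernetes Pod spec):

```yaml
spec:
  template:
    spec:
      tolerations:
        # Evict the pod 30s after the node becomes unreachable,
        # instead of tolerating the taint forever.
        - effect: NoExecute
          key: node.kubernetes.io/unreachable
          operator: Exists
          tolerationSeconds: 30
```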
Here are some attempts to log events: Flux is now running on vmalmakw2s, and if I shut that node down, there is no event on the ReplicaSet or Deployment after ~15 minutes. I ran this:
$ watch "kubectl describe replicasets.apps -n flux-system flux-57bd866b6d | grep Events"
There were no events on the Deployment either. I then started the node again and still no events. Then I shut down the node running the awx-operator deployment and monitored. The only event on the new awx-operator ReplicaSet, which was deployed after ~6 minutes:
Normal SuccessfulCreate 89s replicaset-controller Created pod: awx-operator-controller-manager-5cd65bb78d-7wn64
I hope this helps.
Now, I'll edit Flux as you stated above and test again.
[disi@vmalmakw1s ~]$ kubectl edit deployments.apps -n flux-system flux
deployment.apps/flux edited
Still running fine. Tested sync with git and events. Deployment log:
Normal ScalingReplicaSet 3m12s deployment-controller Scaled down replica set flux-57bd866b6d to 0 from 1
Normal ScalingReplicaSet 3m12s deployment-controller Scaled up replica set flux-5c4dd674fc to 1
ReplicaSet log:
Normal SuccessfulCreate 6m11s replicaset-controller Created pod: flux-57bd866b6d-qj947
Normal SuccessfulDelete 3m49s replicaset-controller Deleted pod: flux-57bd866b6d-qj947
Running on node "vmalmakms". Monitoring:
$ watch "kubectl describe replicasets.apps -n flux-system flux-57bd866b6d | grep -A 6 Events"
$ watch "kubectl describe deployments.apps -n flux-system flux | grep -A 6 Events"
Now shutting down "vmalmakms"... It's working :) New pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 62s default-scheduler Successfully assigned flux-system/flux-5c4dd674fc-dfpqk to vmalmakw2s.home
Normal Pulled 62s kubelet Container image "ghcr.io/fluxcd/source-controller:v1.1.2" already present on machine
Normal Created 62s kubelet Created container source-controller
Normal Started 61s kubelet Started container source-controller
Normal Pulled 61s kubelet Container image "ghcr.io/fluxcd/kustomize-controller:v1.1.1" already present on machine
Normal Created 61s kubelet Created container kustomize-controller
Normal Started 61s kubelet Started container kustomize-controller
Normal Pulled 61s kubelet Container image "ghcr.io/fluxcd/helm-controller:v0.36.2" already present on machine
Normal Created 61s kubelet Created container helm-controller
Normal Started 61s kubelet Started container helm-controller
Old pod:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 9m36s default-scheduler Successfully assigned flux-system/flux-5c4dd674fc-bxx65 to vmalmakms.home
Normal Pulled 9m36s kubelet Container image "ghcr.io/fluxcd/source-controller:v1.1.2" already present on machine
Normal Created 9m36s kubelet Created container source-controller
Normal Started 9m36s kubelet Started container source-controller
Normal Pulled 9m36s kubelet Container image "ghcr.io/fluxcd/kustomize-controller:v1.1.1" already present on machine
Normal Created 9m36s kubelet Created container kustomize-controller
Normal Started 9m36s kubelet Started container kustomize-controller
Normal Pulled 9m36s kubelet Container image "ghcr.io/fluxcd/helm-controller:v0.36.2" already present on machine
Normal Created 9m36s kubelet Created container helm-controller
Normal Started 9m35s kubelet Started container helm-controller
Warning NodeNotReady 64s node-controller Node is not ready
No events on the Deployment; a new ReplicaSet was created:
flux-system flux-57bd866b6d 0 0 0 5d20h
flux-system flux-5c4dd674fc 1 1 1 15m
OK, so the tolerationSeconds: 30 made it reschedule? And without it, it stays dead on the failing node?
Hi, yes, without this parameter it just stays there forever in the Running state. I would probably change it to ~5 minutes, matching the Kubernetes default; right now it reschedules way ahead of other pods.
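For comparison, the DefaultTolerationSeconds admission plugin normally injects these tolerations (300 s, i.e. the ~5 minutes mentioned above) into pods that don't set their own:

```yaml
tolerations:
  # Added automatically by the API server unless the pod
  # already tolerates these taints itself.
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
```

The blanket `op=Exists` toleration shown in the pod description above tolerates every taint with no time limit, which is why the pod was never evicted.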
Thanks @disi for all the tests. I have published the fix; rerunning timoni bundle apply should set the right tolerations now.
In my testing, I created a cluster of three master nodes, all untainted and able to schedule normal pods. Flux only ever runs on the node it was originally deployed to via timoni. When that node goes down, the controllers are not rescheduled to other nodes.
flux events shows logs only up to when the node went down. The pods still show Running on the node that is down:
stream logs failed Get "https://10.0.2.22:10250/containerLogs/flux-system/flux-57bd866b6d-zbrfc/helm-controller?follow=true&sinceSeconds=300&tailLines=100&timestamps=true": dial tcp 10.0.2.22:10250: connect: