Open efowel opened 3 years ago
I am experiencing the same problem. We are running version 3.4 in two different clusters, but we only see the problem in one of them.
Name: kiam-server-9kn8g
Namespace: infrastructure
Priority: 90000000
Priority Class Name: node-critical-priority
Node: ip-xxx-xxx-xxx-xxx.eu-west-1.compute.internal/xxx.xxx.xxx.xxx
Start Time: Wed, 28 Jul 2021 10:51:42 +0200
Labels: app=kiam
component=server
controller-revision-hash=59768d7cb6
pod-template-generation=1
release=kiam
Annotations: fluentbit.io/exclude: true
kubernetes.io/psp: eks.privileged
Status: Running
IP: xxx.xxx.xxx.xxx
IPs:
IP: xxx.xxx.xxx.xxx
Controlled By: DaemonSet/kiam-server
Containers:
kiam-server:
Container ID: docker://a78002b145cf9cccec54a822536654581e15a0d24d863fe47f88219e3722809d
Image: quay.io/uswitch/kiam:v3.4
Image ID: docker-pullable://quay.io/uswitch/kiam@sha256:b24ef28b4a06371d10b6b9fea8a2d0a3b342dbf4928e798c07a208995e3945e3
Port: <none>
Host Port: <none>
Command:
/kiam
server
Args:
--json-log
--level=info
--bind=0.0.0.0:443
--cert=/etc/kiam/tls/cert
--key=/etc/kiam/tls/key
--ca=/etc/kiam/tls/ca
--role-base-arn-autodetect
--session-duration=15m
--sync=1m
--prometheus-listen-addr=0.0.0.0:9620
--prometheus-sync-interval=5s
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 28 Jul 2021 11:44:59 +0200
Finished: Wed, 28 Jul 2021 11:45:38 +0200
Ready: False
Restart Count: 21
Requests:
cpu: 100m
memory: 100Mi
Liveness: exec [/kiam health --cert=/etc/kiam/tls/cert --key=/etc/kiam/tls/key --ca=/etc/kiam/tls/ca --server-address=127.0.0.1:443 --server-address-refresh=2s --timeout=5s --gateway-timeout-creation=1s] delay=10s timeout=10s period=10s #success=1 #failure=3
Readiness: exec [/kiam health --cert=/etc/kiam/tls/cert --key=/etc/kiam/tls/key --ca=/etc/kiam/tls/ca --server-address=127.0.0.1:443 --server-address-refresh=2s --timeout=5s --gateway-timeout-creation=1s] delay=3s timeout=10s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/kiam/tls from tls (rw)
/etc/ssl/certs from ssl-certs (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kiam-server-token-7ztvt (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
tls:
Type: Secret (a volume populated by a Secret)
SecretName: kiam-server-tls
Optional: false
ssl-certs:
Type: HostPath (bare host directory volume)
Path: /etc/pki/ca-trust/extracted/pem
HostPathType:
kiam-server-token-7ztvt:
Type: Secret (a volume populated by a Secret)
SecretName: kiam-server-token-7ztvt
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: :NoSchedule op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 57m default-scheduler Successfully assigned infrastructure/kiam-server-9kn8g to ip-xxx.xxx.xxx.xxx.eu-west-1.compute.internal
Warning Unhealthy 57m kubelet Readiness probe failed: time="2021-07-28T08:51:54Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
Warning Unhealthy 57m kubelet Liveness probe failed: time="2021-07-28T08:51:58Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
Warning Unhealthy 57m kubelet Readiness probe failed: time="2021-07-28T08:52:04Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
Warning Unhealthy 57m kubelet Liveness probe failed: time="2021-07-28T08:52:08Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
Warning Unhealthy 57m kubelet Readiness probe failed: time="2021-07-28T08:52:14Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
Warning Unhealthy 57m kubelet Liveness probe failed: time="2021-07-28T08:52:18Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
Warning Unhealthy 57m kubelet Readiness probe failed: time="2021-07-28T08:52:24Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
Warning Unhealthy 57m kubelet Readiness probe failed: time="2021-07-28T08:52:34Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
Warning Unhealthy 56m kubelet Liveness probe failed: time="2021-07-28T08:52:38Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
Normal Pulled 56m (x3 over 57m) kubelet Container image "quay.io/uswitch/kiam:v3.4" already present on machine
Normal Killing 56m (x2 over 57m) kubelet Container kiam-server failed liveness probe, will be restarted
Normal Created 56m (x3 over 57m) kubelet Created container kiam-server
Normal Started 56m (x3 over 57m) kubelet Started container kiam-server
Warning Unhealthy 17m (x99 over 56m) kubelet (combined from similar events): Readiness probe failed: time="2021-07-28T09:31:54Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
Warning BackOff 2m51s (x198 over 53m) kubelet Back-off restarting failed container
The server logs always end with the following:
{"level":"info","msg":"stopping server","time":"2021-07-28T12:44:38Z"}
{"level":"info","msg":"stopping prometheus metric listener","time":"2021-07-28T12:44:38Z"}
{"level":"info","msg":"stopping credential manager process 0","time":"2021-07-28T12:44:38Z"}
{"level":"info","msg":"stopped","time":"2021-07-28T12:44:38Z"}
{"level":"info","msg":"stopping credential manager process 7","time":"2021-07-28T12:44:38Z"}
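For what it's worth, the probe settings and timestamps in the describe output above look consistent with the kubelet restarting the container after three failed liveness probes. A container that is asked to stop and shuts down cleanly can exit with code 0, which would match the "stopping server" lines and the Completed reason. Here is a rough check of the arithmetic. It assumes the first probe fires right after the initial delay and that each failed `kiam health` call returns after about its own 5s timeout. Both are assumptions, and the function names are only illustrative:

```python
from datetime import datetime

# Values copied from the describe output above.
LIVENESS_DELAY_S = 10    # delay=10s
LIVENESS_PERIOD_S = 10   # period=10s
FAILURE_THRESHOLD = 3    # #failure=3
HEALTH_TIMEOUT_S = 5     # --timeout=5s (assumed to bound each failed probe)


def earliest_liveness_kill(delay_s, period_s, failures, probe_duration_s):
    """Seconds after container start at which the last required failure is recorded."""
    last_probe_start = delay_s + (failures - 1) * period_s
    return last_probe_start + probe_duration_s


def observed_runtime(started, finished):
    """Seconds between the Started and Finished timestamps of the last run."""
    fmt = "%a, %d %b %Y %H:%M:%S %z"
    delta = datetime.strptime(finished, fmt) - datetime.strptime(started, fmt)
    return delta.total_seconds()


expected = earliest_liveness_kill(
    LIVENESS_DELAY_S, LIVENESS_PERIOD_S, FAILURE_THRESHOLD, HEALTH_TIMEOUT_S
)
observed = observed_runtime(
    "Wed, 28 Jul 2021 11:44:59 +0200",
    "Wed, 28 Jul 2021 11:45:38 +0200",
)
print(f"expected kill at ~{expected}s, container actually ran {observed:.0f}s")
```

The container ran for 39 seconds against an estimate of about 35, so the gap is a few seconds, which is plausible for probe scheduling and shutdown time. This only shows the numbers fit; it does not prove the cause.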
Even restarting the agent does not help; it will not start afterwards either.
Kiam agent logs:
{"level":"info","msg":"configuring iptables","time":"2021-07-28T12:47:51Z"}
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2021-07-28T12:47:51Z"}
{"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2021-07-28T12:47:52Z"}
Update 1: I tried destroying and reapplying the chart, no difference.
@efowel I fixed the issue. It probably has something to do with the TLS certificates; at least that was the cause in my case. I deleted the secrets, recreated them, then removed the KIAM chart and reapplied it. Now it is all functioning again! I hope this helps you and anyone else in the future!
@cwijnekus thanks for the update. I was about to ask about the certificates, as I suspected the same issue. In our case cert-manager auto-renews the kiam certificate, which would explain why we hit this problem on a regular basis. Sadly, I have not been able to prove that in my tests yet. One improvement we made is to change the health check of the kiam agents from
path: /ping to path: /health?deep=anything
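To compare the two health check paths by hand from a node, something like the sketch below could help. As I understand it, the deep check is meant to also exercise the agent's connection to the server, but that is not confirmed in this thread. The function name `agent_health` and the address in the usage comment are hypothetical; only the two paths come from the comment above.

```python
import urllib.error
import urllib.request

# Bypass any proxy settings from the environment, since the agent is
# expected to be reachable directly from the node.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def agent_health(base_url, deep=True, timeout=5.0):
    """Return (status_code, body) for the agent's shallow or deep health path."""
    path = "/health?deep=anything" if deep else "/ping"
    try:
        with _OPENER.open(base_url + path, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a status code and a body.
        return err.code, err.read().decode("utf-8", "replace")


# Usage (hypothetical address, replace with your agent's host and port):
#   print(agent_health("http://127.0.0.1:8181", deep=False))
#   print(agent_health("http://127.0.0.1:8181", deep=True))
```

If the shallow path succeeds while the deep path fails, that would suggest the agent process is up but cannot reach the server, which is the situation the original /ping check would not catch.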
We have a working kiam-server and agent setup in our cluster, but every time the server restarts itself (we are not sure why), we get into trouble: our apps can't reach AWS resources unless we restart the kiam-agent pod as well. Why does kiam-server terminate with Completed status?
Image: quay.io/uswitch/kiam:v3.0