uswitch / kiam

Integrate AWS IAM with Kubernetes
Apache License 2.0

Kiam-Server Terminates with Completed Status #461

Open efowel opened 3 years ago

efowel commented 3 years ago

We have a working kiam-server and kiam-agent setup in our cluster, but every time the server restarts itself (we are not sure why), our apps can't reach AWS resources until we restart the kiam-agent pod as well. Why does kiam-server terminate with a Completed status?

Image: quay.io/uswitch/kiam:v3.0


Name:               kiam-server-fzp7s
Namespace:          kube-system
Priority:           0
PriorityClassName:  <none>
Start Time:         Thu, 10 Dec 2020 17:45:00 +0800
Labels:             app=kiam
                    controller-revision-hash=2751008331
                    pod-template-generation=5
                    role=server
Annotations:        prometheus.io/port: 9620
                    prometheus.io/scrape: true
Status:             Running
IP:                 100.107.202.68
Controlled By:      DaemonSet/kiam-server
Containers:
  kiam:
    Container ID:  docker://d0f3a9b576497d237863f4d4e471050dc2009913cae6e54fd33535a789bb7f35
    Image:         quay.io/uswitch/kiam:v3.0
    Image ID:      docker-pullable://quay.io/uswitch/kiam@sha256:0121c7d0af1c11480ae9bf267dfe6eb282532541c74ba885f436eb5f967acd67
    Port:          <none>
    Host Port:     <none>
    Command:
      /kiam
    Args:
      server
      --json-log
      --level=warn
      --bind=0.0.0.0:443
      --cert=/etc/kiam/tls/tls.crt
      --key=/etc/kiam/tls/tls.key
      --ca=/etc/kiam/tls/ca.crt
      --role-base-arn-autodetect
      --sync=1m
      --prometheus-listen-addr=0.0.0.0:9620
      --prometheus-sync-interval=5s
    State:          Running
      Started:      Mon, 08 Feb 2021 12:27:06 +0800
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 10 Dec 2020 17:45:01 +0800
      Finished:     Mon, 08 Feb 2021 12:27:06 +0800
    Ready:          True
    Restart Count:  1
    Liveness:       exec [/kiam health --cert=/etc/kiam/tls/tls.crt --key=/etc/kiam/tls/tls.key --ca=/etc/kiam/tls/ca.crt --server-address=127.0.0.1:443 --gateway-timeout-creation=1s --timeout=5s] delay=10s timeout=10s period=10s #success=1 #failure=3
    Readiness:      exec [/kiam health --cert=/etc/kiam/tls/tls.crt --key=/etc/kiam/tls/tls.key --ca=/etc/kiam/tls/ca.crt --server-address=127.0.0.1:443 --gateway-timeout-creation=1s --timeout=5s] delay=3s timeout=10s period=10s #success=1 #failure=5
    Environment:    <none>
    Mounts:
      /etc/kiam/tls from tls (rw)
      /etc/ssl/certs from ssl-certs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kiam-server-token-mz9cj (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  ssl-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/ssl/certs
    HostPathType:  
  tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kiam-server-cert
    Optional:    false
  kiam-server-token-mz9cj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kiam-server-token-mz9cj
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  kubernetes.io/role=master
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:          <none>

cwijnekus commented 3 years ago

I am experiencing the same problem. We are running version 3.4. We have kiam in two different clusters, but only see this in one of them.

Name:                 kiam-server-9kn8g
Namespace:            infrastructure
Priority:             90000000
Priority Class Name:  node-critical-priority
Node:                 ip-xxx-xxx-xxx-xxx.eu-west-1.compute.internal/xxx.xxx.xxx.xxx
Start Time:           Wed, 28 Jul 2021 10:51:42 +0200
Labels:               app=kiam
                      component=server
                      controller-revision-hash=59768d7cb6
                      pod-template-generation=1
                      release=kiam
Annotations:          fluentbit.io/exclude: true
                      kubernetes.io/psp: eks.privileged
Status:               Running
IP:                   xxx.xxx.xxx.xxx
IPs:
  IP:           xxx.xxx.xxx.xxx
Controlled By:  DaemonSet/kiam-server
Containers:
  kiam-server:
    Container ID:  docker://a78002b145cf9cccec54a822536654581e15a0d24d863fe47f88219e3722809d
    Image:         quay.io/uswitch/kiam:v3.4
    Image ID:      docker-pullable://quay.io/uswitch/kiam@sha256:b24ef28b4a06371d10b6b9fea8a2d0a3b342dbf4928e798c07a208995e3945e3
    Port:          <none>
    Host Port:     <none>
    Command:
      /kiam
      server
    Args:
      --json-log
      --level=info
      --bind=0.0.0.0:443
      --cert=/etc/kiam/tls/cert
      --key=/etc/kiam/tls/key
      --ca=/etc/kiam/tls/ca
      --role-base-arn-autodetect
      --session-duration=15m
      --sync=1m
      --prometheus-listen-addr=0.0.0.0:9620
      --prometheus-sync-interval=5s
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 28 Jul 2021 11:44:59 +0200
      Finished:     Wed, 28 Jul 2021 11:45:38 +0200
    Ready:          False
    Restart Count:  21
    Requests:
      cpu:        100m
      memory:     100Mi
    Liveness:     exec [/kiam health --cert=/etc/kiam/tls/cert --key=/etc/kiam/tls/key --ca=/etc/kiam/tls/ca --server-address=127.0.0.1:443 --server-address-refresh=2s --timeout=5s --gateway-timeout-creation=1s] delay=10s timeout=10s period=10s #success=1 #failure=3
    Readiness:    exec [/kiam health --cert=/etc/kiam/tls/cert --key=/etc/kiam/tls/key --ca=/etc/kiam/tls/ca --server-address=127.0.0.1:443 --server-address-refresh=2s --timeout=5s --gateway-timeout-creation=1s] delay=3s timeout=10s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/kiam/tls from tls (rw)
      /etc/ssl/certs from ssl-certs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kiam-server-token-7ztvt (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kiam-server-tls
    Optional:    false
  ssl-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/pki/ca-trust/extracted/pem
    HostPathType:
  kiam-server-token-7ztvt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kiam-server-token-7ztvt
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     :NoSchedule op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  57m                    default-scheduler  Successfully assigned infrastructure/kiam-server-9kn8g to ip-xxx.xxx.xxx.xxx.eu-west-1.compute.internal
  Warning  Unhealthy  57m                    kubelet            Readiness probe failed: time="2021-07-28T08:51:54Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
  Warning  Unhealthy  57m                    kubelet            Liveness probe failed: time="2021-07-28T08:51:58Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
  Warning  Unhealthy  57m                    kubelet            Readiness probe failed: time="2021-07-28T08:52:04Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
  Warning  Unhealthy  57m                    kubelet            Liveness probe failed: time="2021-07-28T08:52:08Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
  Warning  Unhealthy  57m                    kubelet            Readiness probe failed: time="2021-07-28T08:52:14Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
  Warning  Unhealthy  57m                    kubelet            Liveness probe failed: time="2021-07-28T08:52:18Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
  Warning  Unhealthy  57m                    kubelet            Readiness probe failed: time="2021-07-28T08:52:24Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
  Warning  Unhealthy  57m                    kubelet            Readiness probe failed: time="2021-07-28T08:52:34Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
  Warning  Unhealthy  56m                    kubelet            Liveness probe failed: time="2021-07-28T08:52:38Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
  Normal   Pulled     56m (x3 over 57m)      kubelet            Container image "quay.io/uswitch/kiam:v3.4" already present on machine
  Normal   Killing    56m (x2 over 57m)      kubelet            Container kiam-server failed liveness probe, will be restarted
  Normal   Created    56m (x3 over 57m)      kubelet            Created container kiam-server
  Normal   Started    56m (x3 over 57m)      kubelet            Started container kiam-server
  Warning  Unhealthy  17m (x99 over 56m)     kubelet            (combined from similar events): Readiness probe failed: time="2021-07-28T09:31:54Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
  Warning  BackOff    2m51s (x198 over 53m)  kubelet            Back-off restarting failed container

The server logs always end with the following:

{"level":"info","msg":"stopping server","time":"2021-07-28T12:44:38Z"}
{"level":"info","msg":"stopping prometheus metric listener","time":"2021-07-28T12:44:38Z"}
{"level":"info","msg":"stopping credential manager process 0","time":"2021-07-28T12:44:38Z"}
{"level":"info","msg":"stopped","time":"2021-07-28T12:44:38Z"}
{"level":"info","msg":"stopping credential manager process 7","time":"2021-07-28T12:44:38Z"}

Even restarting the agent does not help; the agent will not start afterwards either.

Kiam agent logs:

{"level":"info","msg":"configuring iptables","time":"2021-07-28T12:47:51Z"}
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2021-07-28T12:47:51Z"}
{"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2021-07-28T12:47:52Z"}
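Since both components run with --json-log, the fatal entries can be pulled out of a log dump mechanically. The following is a minimal Python sketch, not part of kiam; the function name `fatal_entries` and the `AGENT_LOG` variable are made up for illustration, and the sample text is the agent log quoted above.

```python
import json


def fatal_entries(log_text):
    """Return the parsed JSON log records whose level is "fatal"."""
    records = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if isinstance(record, dict) and record.get("level") == "fatal":
            records.append(record)
    return records


# Sample input: the three agent log lines quoted in this comment.
AGENT_LOG = """\
{"level":"info","msg":"configuring iptables","time":"2021-07-28T12:47:51Z"}
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2021-07-28T12:47:51Z"}
{"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2021-07-28T12:47:52Z"}
"""

for record in fatal_entries(AGENT_LOG):
    print(record["time"], record["msg"])
```

The same function could be fed the output of `kubectl logs` for the server or agent pod.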

Update 1: I tried destroying and reapplying the chart; it made no difference.

cwijnekus commented 3 years ago

@efowel I fixed the issue. It probably has something to do with the TLS certificates; at least that was the cause in my case. I deleted the secrets and recreated them, then removed the kiam chart and reapplied it. Now it is all functioning again! I hope this helps you and anyone else in the future!
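For anyone who wants to rule out an expired certificate before recreating the secrets, here is a rough Python sketch that reads the expiry date through the `openssl` CLI. It assumes `openssl` is installed wherever the script runs and that the certificate file has been copied out of the secret; the helper names are hypothetical and nothing here is part of kiam. It only checks expiry, not whether the certificate matches the CA or the server's address.

```python
import ssl
import subprocess
import time


def parse_not_after(line):
    """Convert an `openssl x509 -enddate` line to seconds since the epoch."""
    prefix = "notAfter="
    line = line.strip()
    if not line.startswith(prefix):
        raise ValueError(f"unexpected openssl output: {line!r}")
    # ssl.cert_time_to_seconds expects e.g. "Jan  5 09:34:43 2018 GMT"
    return ssl.cert_time_to_seconds(line[len(prefix):])


def seconds_until_expiry(line, now=None):
    """Seconds left before the certificate expires (negative if already expired)."""
    now = time.time() if now is None else now
    return parse_not_after(line) - now


def read_not_after(cert_path):
    """Ask the openssl CLI for the notAfter line of the first certificate in the file."""
    result = subprocess.run(
        ["openssl", "x509", "-enddate", "-noout", "-in", cert_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

With a local copy of the certificate, `seconds_until_expiry(read_not_after("tls.crt"))` returns the remaining lifetime in seconds; the file name `tls.crt` is only an example.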

efowel commented 3 years ago

@cwijnekus thanks for the update. I was about to ask about the certificates, as I suspected the same cause. In our case cert-manager auto-renews the kiam certificate, which would explain why we hit this problem on a regular basis. Sadly, I have not been able to prove that in my tests yet. One improvement we made is to change the health check of the kiam agents from

path: /ping to path: /health?deep=anything
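To try the two probe paths by hand, a small Python sketch like the one below may help. The host and port are assumptions (8181 is used only as a placeholder for whatever port the agent listens on in your setup), and the helper names are hypothetical. The deep variant is presumably meant to check more than the agent's own HTTP listener, so it should fail when the agent cannot reach the server.

```python
import http.client
import urllib.request


def health_url(host, port, deep=True):
    """Build the agent probe URL; deep=True uses the path from this thread."""
    path = "/health?deep=anything" if deep else "/ping"
    return f"http://{host}:{port}{path}"


def is_healthy(url, timeout=5.0):
    """Return True only if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and non-2xx responses
        # (urllib raises HTTPError, an OSError subclass, for those).
        return False
```

For example, `is_healthy(health_url("127.0.0.1", 8181))` could be run from the node where the agent runs, and compared against `is_healthy(health_url("127.0.0.1", 8181, deep=False))` while the server is down.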