uswitch / kiam

Integrate AWS IAM with Kubernetes
Apache License 2.0

v2.4: readiness probe failing for kiam-server #92

Closed: tasdikrahman closed this issue 6 years ago

tasdikrahman commented 6 years ago

Greetings, I witnessed something similar to https://github.com/uswitch/kiam/issues/52

Back then it was a problem with our service endpoints, which we fixed. We have since brought up a new cluster with the same configuration, and now the kiam-server's readiness probe is failing (last time the kiam-agent was unable to communicate with the server; this time the kiam-server itself isn't coming up).

For debugging, I removed the liveness and readiness probes on the kiam-server and ran the health binary with the gRPC debug env vars set, so it would dump any relevant logs. This is what I got:

$ kpssh kiam-server-5tjd5 kube-system  
/ # GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 /health --cert=/etc/kiam/tls/kiam-server.crt --key=/etc/kiam/tls/kiam-server.key --ca=/etc/kiam/tls/kiam-ca.crt --server-address=localhost:443 --server-address-refresh=2s --timeout=5s
INFO: 2018/06/15 01:33:36 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc4202ed320
INFO: 2018/06/15 01:33:36 dialing to target with scheme: ""
INFO: 2018/06/15 01:33:36 could not get resolver for scheme: ""
INFO: 2018/06/15 01:33:36 balancerWrapper: is pickfirst: false
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available
...
WARN[0004] error checking health: rpc error: code = Unavailable desc = there is no address available
FATA[0005] error retrieving health: rpc error: code = Unavailable desc = there is no address available

For reference, here are the kube-dns and kiam-server services:

$ kubectl get svc -n kube-system                     
NAME          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
kiam-server   ClusterIP   None         <none>        443/TCP         10h
kube-dns      ClusterIP   10.3.0.10    <none>        53/UDP,53/TCP   11h

We are using kube-dns along with kube-router (with kube-bridge)
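
Since "there is no address available" means the gRPC resolver handed back nothing at all, DNS from inside the server pod seems like the first thing to check; something along these lines (busybox-style tooling assumed to be in the kiam image, so treat it as a sketch):

$ kpssh kiam-server-5tjd5 kube-system
/ # cat /etc/resolv.conf
/ # nslookup kiam-server

If the headless service resolves, nslookup should return the server pod IPs.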

Please let me know if you need any other logs.

Pasting the manifests here for sanity's sake.

kiam-server service

$ kubectl get svc kiam-server -oyaml --export          
apiVersion: v1
kind: Service
metadata:
  name: kiam-server
  selfLink: /api/v1/namespaces/kube-system/services/kiam-server
spec:
  clusterIP: None
  ports:
  - name: grpc
    port: 443
    protocol: TCP
    targetPort: 443
  selector:
    k8s-app: kiam-server
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
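
Since clusterIP: None makes this a headless service, DNS answers for kiam-server come straight from the endpoints behind it, so it's worth confirming those endpoints are populated (plain kubectl, nothing kiam-specific):

$ kubectl get endpoints kiam-server -n kube-system

This should list the kiam-server pod IPs on port 443.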

kiam-server-ds.yaml

$ kubectl get ds kiam-server -oyaml --export                         
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  annotations:
  creationTimestamp: null
  generation: 1
  labels:
    k8s-app: kiam-server
  name: kiam-server
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/daemonsets/kiam-server
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: kiam-server
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: kiam-server
    spec:
      containers:
      - args:
        - --json-log
        - --bind=0.0.0.0:443
        - --cert=/etc/kiam/tls/kiam-server.crt
        - --key=/etc/kiam/tls/kiam-server.key
        - --ca=/etc/kiam/tls/kiam-ca.crt
        - --role-base-arn=arn:aws:iam::<accountnumber>:role/
        - --sync=1m
        command:
        - /server
        image: quay.io/uswitch/kiam:v2.4
        imagePullPolicy: Always
        name: kiam
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/ssl/certs
          name: ssl-certs
        - mountPath: /etc/kiam/tls
          name: tls
          readOnly: true
      dnsPolicy: ClusterFirst
      nodeSelector:
        node-role.kubernetes.io/master: ""
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      volumes:
      - hostPath:
          path: /usr/share/ca-certificates
          type: ""
        name: ssl-certs
      - name: tls
        secret:
          defaultMode: 420
          secretName: kiam-server-tls
  templateGeneration: 1
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
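
For completeness, the readiness probe I removed was an exec probe wrapping the same /health invocation used above; roughly this shape, with the flag values mirrored from that command (the probe timings here are approximate, so treat it as a sketch rather than the exact manifest we had; the liveness probe was identical apart from timings):

        readinessProbe:
          exec:
            command:
            - /health
            - --cert=/etc/kiam/tls/kiam-server.crt
            - --key=/etc/kiam/tls/kiam-server.key
            - --ca=/etc/kiam/tls/kiam-ca.crt
            - --server-address=localhost:443
            - --server-address-refresh=2s
            - --timeout=5s
          initialDelaySeconds: 3
          periodSeconds: 10
          timeoutSeconds: 10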

kiam-agent-ds.yaml

$ kubectl get ds kiam-agent -oyaml --export         
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  creationTimestamp: null
  generation: 1
  labels:
    k8s-app: kiam-agent
  name: kiam-agent
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/daemonsets/kiam-agent
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: kiam-agent
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: kiam-agent
    spec:
      containers:
      - args:
        - --iptables
        - --host-interface=kube-bridge
        - --json-log
        - --port=8181
        - --cert=/etc/kiam/tls/kiam-agent.crt
        - --key=/etc/kiam/tls/kiam-agent.key
        - --ca=/etc/kiam/tls/kiam-ca.crt
        - --server-address=kiam-server:443
        command:
        - /agent
        env:
        - name: HOST_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        image: quay.io/uswitch/kiam:v2.4
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /ping
            port: 8181
            scheme: HTTP
          initialDelaySeconds: 3
          periodSeconds: 3
          successThreshold: 1
          timeoutSeconds: 1
        name: kiam-agent
        ports:
        - containerPort: 8181
          hostPort: 8181
          protocol: TCP
        resources:
          limits:
            cpu: 100m
            memory: 64Mi
          requests:
            cpu: 100m
            memory: 64Mi
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/ssl/certs
          name: ssl-certs
        - mountPath: /etc/kiam/tls
          name: tls
          readOnly: true
        - mountPath: /var/run/xtables.lock
          name: xtables
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: kiam
      serviceAccountName: kiam
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /usr/share/ca-certificates
          type: ""
        name: ssl-certs
      - name: tls
        secret:
          defaultMode: 420
          secretName: kiam-agent-tls
      - hostPath:
          path: /run/xtables.lock
          type: ""
        name: xtables
  templateGeneration: 2
  updateStrategy:
    type: OnDelete

kube-router-ds.yaml

$ kubectl get ds kube-router -n kube-system -oyaml --export                                                                                                (kluster-api/kube-system)
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  creationTimestamp: null
  generation: 1
  labels:
    k8s-app: kube-router
    tier: node
  name: kube-router
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/daemonsets/kube-router
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: kube-router
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      creationTimestamp: null
      labels:
        k8s-app: kube-router
        tier: node
    spec:
      containers:
      - args:
        - --run-router=true
        - --run-firewall=true
        - --run-service-proxy=true
        - --metrics-path=/kube-router/metrics
        - --metrics-port=63330
        - --bgp-graceful-restart
        - --kubeconfig=/etc/kubernetes/kubeconfig
        - --cluster-cidr=10.2.0.0/16
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: cloudnativelabs/kube-router:v0.1.0
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 20244
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 15
        name: kube-router
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
        - mountPath: /etc/cni/net.d
          name: cni-conf-dir
        - mountPath: /etc/kubernetes/kubeconfig
          name: kubeconfig
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      initContainers:
      - command:
        - /bin/sh
        - -c
        - set -e -x; if [ ! -f /etc/cni/net.d/10-kuberouter.conf ]; then cp /etc/kube-router/cni-conf.json
          /etc/cni/net.d/.tmp-kuberouter-cfg; mv /etc/cni/net.d/.tmp-kuberouter-cfg
          /etc/cni/net.d/10-kuberouter.conf; fi
        image: busybox
        imagePullPolicy: Always
        name: install-cni
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/cni/net.d
          name: cni-conf-dir
        - mountPath: /etc/kube-router
          name: kube-router-cfg
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - operator: Exists
      volumes:
      - hostPath:
          path: /lib/modules
          type: ""
        name: lib-modules
      - hostPath:
          path: /etc/kubernetes/cni/net.d
          type: ""
        name: cni-conf-dir
      - configMap:
          defaultMode: 420
          name: kube-router-cfg
        name: kube-router-cfg
      - hostPath:
          path: /etc/kubernetes/kubeconfig
          type: ""
        name: kubeconfig
  templateGeneration: 1
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
---
apiVersion: v1
data:
  cni-conf.json: |
    {
      "name":"kubernetes",
      "type":"bridge",
      "bridge":"kube-bridge",
      "isDefaultGateway":true,
      "ipam": {
        "type":"host-local"
      }
    }
kind: ConfigMap
metadata:
  labels:
    k8s-app: kube-router
    tier: node
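
One thing worth noting: the bridge name in cni-conf.json, kube-bridge, is what the agent's --host-interface flag points at, so the two have to stay in sync. Confirming the bridge actually exists is cheap (run on the node itself, assuming iproute2 is installed there):

$ ip addr show kube-bridge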
tasdikrahman commented 6 years ago

If I try to ping the kiam-server from the agent, it works. That wouldn't necessarily mean everything is working, since the health check is there for a reason, but it should rule out the earlier issue from https://github.com/uswitch/kiam/issues/52, if I am not wrong?

$ kpssh kiam-agent-2jgnl kube-system                        
/ # ping kiam-server
PING kiam-server (10.2.0.11): 56 data bytes
64 bytes from 10.2.0.11: seq=0 ttl=63 time=0.763 ms
64 bytes from 10.2.0.11: seq=1 ttl=63 time=0.887 ms
64 bytes from 10.2.0.11: seq=2 ttl=63 time=0.893 ms
64 bytes from 10.2.0.11: seq=3 ttl=63 time=0.870 ms
^C
--- kiam-server ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 0.763/0.853/0.893 ms
/ # exit

I tried checking whether this was a cert issue by changing the server address from localhost to 127.0.0.1, which gave this trace log:

$ kpssh kiam-server-5tjd5 kube-system                                                                                                                      (kluster-api/kube-system)
/ # GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 /health --cert=/etc/kiam/tls/kiam-server.crt --key=/etc/kiam/tls/kiam-server.key --ca=/etc/kiam/tls/kiam-ca.crt --server-address=127.0.0.1:443 --server-address-refresh=2s --timeout=5s
INFO: 2018/06/15 01:53:40 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc42006e9c0
INFO: 2018/06/15 01:53:40 dialing to target with scheme: ""
INFO: 2018/06/15 01:53:40 could not get resolver for scheme: ""
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available
INFO: 2018/06/15 01:53:40 balancerWrapper: is pickfirst: false
INFO: 2018/06/15 01:53:40 balancerWrapper: got update addr from Notify: [{127.0.0.1:443 <nil>}]
INFO: 2018/06/15 01:53:40 ccBalancerWrapper: new subconn: [{127.0.0.1:443 0  <nil>}]
INFO: 2018/06/15 01:53:40 balancerWrapper: handle subconn state change: 0xc420404aa0, CONNECTING
INFO: 2018/06/15 01:53:40 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc42006e9c0
INFO: 2018/06/15 01:53:40 balancerWrapper: handle subconn state change: 0xc420404aa0, TRANSIENT_FAILURE
INFO: 2018/06/15 01:53:40 ccBalancerWrapper: updating state and picker called by balancer: TRANSIENT_FAILURE, 0xc42006e9c0
INFO: 2018/06/15 01:53:40 balancerWrapper: handle subconn state change: 0xc420404aa0, CONNECTING
INFO: 2018/06/15 01:53:40 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc42006e9c0
WARNING: 2018/06/15 01:53:40 Failed to dial 127.0.0.1:443: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost:443, kiam-server:443, localhost:9610, not 127.0.0.1:443"; please retry.
INFO: 2018/06/15 01:53:40 balancerWrapper: handle subconn state change: 0xc420404aa0, SHUTDOWN
INFO: 2018/06/15 01:53:40 ccBalancerWrapper: updating state and picker called by balancer: TRANSIENT_FAILURE, 0xc42006e9c0
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
....
....
WARN[0004] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0004] error checking health: rpc error: code = Unavailable desc = there is no connection available
FATA[0005] error retrieving health: rpc error: code = Unavailable desc = there is no connection available

Since the connection to 127.0.0.1 gets as far as the TLS handshake (and fails only because 127.0.0.1 isn't among the names in the cert), I could deduce that the certs are not the issue here, if I am not wrong(?)
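
The handshake error above actually lists the names the cert is valid for (localhost:443, kiam-server:443, localhost:9610), which backs that up. To double-check independently, the SANs can be read straight off the cert (assuming openssl is available wherever the certs live):

$ openssl x509 -in kiam-server.crt -noout -text | grep -A1 'Subject Alternative Name'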

Doing a netstat inside one of the kiam-server pods gave the following:

/ # netstat -plant
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 10.2.0.11:36144         10.3.0.1:443            ESTABLISHED 1/server
tcp        0      0 :::443                  :::*                    LISTEN      1/server
tcp        0      0 ::ffff:10.2.0.11:443    ::ffff:10.1.4.13:39832  ESTABLISHED 1/server
tcp        0      0 ::ffff:10.2.0.11:443    ::ffff:10.1.4.12:52018  ESTABLISHED 1/server
tcp        0      0 ::ffff:10.2.0.11:443    ::ffff:10.1.4.21:47526  ESTABLISHED 1/server

I was cross-checking whether the IPs showing up belong to running agents, which they do in this case:

$ kgpn kube-system | grep kiam               
kiam-agent-2jgnl                                                  1/1       Running   0          11h       10.1.4.12   ip-10-1-4-12.ap-south-1.compute.internal
kiam-agent-qd84s                                                  1/1       Running   0          11h       10.1.4.13   ip-10-1-4-13.ap-south-1.compute.internal
kiam-agent-qk4z9                                                  1/1       Running   0          11h       10.1.4.21   ip-10-1-4-21.ap-south-1.compute.internal
kiam-server-5tjd5                                                 1/1       Running   0          10h       10.2.0.11   ip-10-1-8-22.ap-south-1.compute.internal
kiam-server-6dhm4                                                 1/1       Running   0          10h       10.2.5.8    ip-10-1-8-12.ap-south-1.compute.internal
kiam-server-7c5zd                                                 1/1       Running   0          10h       10.2.6.8    ip-10-1-8-10.ap-south-1.compute.internal
kiam-server-fksqw                                                 1/1       Running   0          10h       10.2.9.10   ip-10-1-8-28.ap-south-1.compute.internal
kiam-server-ftvpm                                                 1/1       Running   0          10h       10.2.7.8    ip-10-1-8-4.ap-south-1.compute.internal

I tried the same thing in an older cluster, with the exact same manifest files, where we have kiam running fine:

$ kpssh kiam-server-2fhhq kube-system               (kluster-qa/kube-system)
/ # GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 /health --cert=/etc/kiam/tls/kiam-server.crt --key=/etc/kiam/tls/kiam-server.key --ca=/etc/kiam/tls/kiam-ca.crt --server-address=localhost:443 --server-address-refresh=2s --timeout=5s
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc4203f7320
INFO: 2018/06/15 01:52:13 dialing to target with scheme: ""
INFO: 2018/06/15 01:52:13 could not get resolver for scheme: ""
INFO: 2018/06/15 01:52:13 balancerWrapper: is pickfirst: false
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available
INFO: 2018/06/15 01:52:13 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.localhost on 10.3.0.10:53: no such host.
INFO: 2018/06/15 01:52:13 balancerWrapper: got update addr from Notify: [{127.0.0.1:443 <nil>} {[::1]:443 <nil>}]
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: new subconn: [{127.0.0.1:443 0  <nil>}]
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: new subconn: [{[::1]:443 0  <nil>}]
INFO: 2018/06/15 01:52:13 balancerWrapper: handle subconn state change: 0xc4202c22d0, CONNECTING
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203f7320
WARNING: 2018/06/15 01:52:13 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp [::1]:443: connect: cannot assign requested address"; Reconnecting to {[::1]:443 0  <nil>}
INFO: 2018/06/15 01:52:13 balancerWrapper: handle subconn state change: 0xc4202c2350, CONNECTING
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203f7320
INFO: 2018/06/15 01:52:13 balancerWrapper: handle subconn state change: 0xc4202c2350, TRANSIENT_FAILURE
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203f7320
INFO: 2018/06/15 01:52:13 balancerWrapper: handle subconn state change: 0xc4202c2350, CONNECTING
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203f7320
INFO: 2018/06/15 01:52:13 balancerWrapper: handle subconn state change: 0xc4202c2350, TRANSIENT_FAILURE
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203f7320
INFO: 2018/06/15 01:52:13 balancerWrapper: handle subconn state change: 0xc4202c22d0, TRANSIENT_FAILURE
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: updating state and picker called by balancer: TRANSIENT_FAILURE, 0xc4203f7320
INFO: 2018/06/15 01:52:13 balancerWrapper: handle subconn state change: 0xc4202c22d0, CONNECTING
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203f7320
INFO: 2018/06/15 01:52:13 balancerWrapper: handle subconn state change: 0xc4202c22d0, READY
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: updating state and picker called by balancer: READY, 0xc4203f7320
INFO[0000] healthy: ok
/ #
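
The telling difference is that here the resolver actually reaches kube-dns on 10.3.0.10:53 (the failed SRV lookup is harmless, the client just falls back) and localhost resolves to 127.0.0.1/[::1], whereas in the new cluster no address ever comes back. A quick way to test whether a server pod can reach kube-dns at all (busybox nslookup assumed):

/ # nslookup kubernetes.default.svc.cluster.local 10.3.0.10

In the new cluster I would expect this to hang or time out if DNS traffic isn't getting through.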

Please let me know if you want me to provide any other logs. :)

tasdikrahman commented 6 years ago

I figured it out: it was an issue with kube-router itself. Opening up ingress in the security groups between the workers and masters fixed it.
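
(For anyone who lands here: kiam-server is pinned to the masters via the nodeSelector, while the kube-dns pods presumably ran on the workers, so with that traffic blocked the server pods could never reach kube-dns, which matches the "no address available" errors above. A rough shape of the fix with the AWS CLI, with placeholder security group IDs:)

$ aws ec2 authorize-security-group-ingress --group-id sg-MASTERS \
    --ip-permissions 'IpProtocol=-1,UserIdGroupPairs=[{GroupId=sg-WORKERS}]'
$ aws ec2 authorize-security-group-ingress --group-id sg-WORKERS \
    --ip-permissions 'IpProtocol=-1,UserIdGroupPairs=[{GroupId=sg-MASTERS}]'

(IpProtocol=-1 opens all protocols between the two groups; scope it tighter, e.g. DNS on port 53 and BGP on TCP 179 for kube-router, if allow-all is too broad.)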