Closed tasdikrahman closed 6 years ago
If I try to ping the kiam-server from the agent, it works. That would not necessarily mean everything is fine, since the health check is there for a reason, but it should rule out the earlier issue from https://github.com/uswitch/kiam/issues/52, if I am not wrong?
$ kpssh kiam-agent-2jgnl kube-system
/ # ping kiam-server
PING kiam-server (10.2.0.11): 56 data bytes
64 bytes from 10.2.0.11: seq=0 ttl=63 time=0.763 ms
64 bytes from 10.2.0.11: seq=1 ttl=63 time=0.887 ms
64 bytes from 10.2.0.11: seq=2 ttl=63 time=0.893 ms
64 bytes from 10.2.0.11: seq=3 ttl=63 time=0.870 ms
^C
--- kiam-server ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 0.763/0.853/0.893 ms
/ # exit
I tried checking whether this was a cert issue by changing the server address from localhost to 127.0.0.1, which gave the following trace log:
$ kpssh kiam-server-5tjd5 kube-system (kluster-api/kube-system)
/ # GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 /health --cert=/etc/kiam/tls/kiam-server.crt --key=/etc/kiam/tls/kiam-server.key --ca=/etc/kiam/tls/kiam-ca.crt --server-address=127.0.0.1:443 --server-address-refresh=2s --timeout=5s
INFO: 2018/06/15 01:53:40 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc42006e9c0
INFO: 2018/06/15 01:53:40 dialing to target with scheme: ""
INFO: 2018/06/15 01:53:40 could not get resolver for scheme: ""
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available
INFO: 2018/06/15 01:53:40 balancerWrapper: is pickfirst: false
INFO: 2018/06/15 01:53:40 balancerWrapper: got update addr from Notify: [{127.0.0.1:443 <nil>}]
INFO: 2018/06/15 01:53:40 ccBalancerWrapper: new subconn: [{127.0.0.1:443 0 <nil>}]
INFO: 2018/06/15 01:53:40 balancerWrapper: handle subconn state change: 0xc420404aa0, CONNECTING
INFO: 2018/06/15 01:53:40 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc42006e9c0
INFO: 2018/06/15 01:53:40 balancerWrapper: handle subconn state change: 0xc420404aa0, TRANSIENT_FAILURE
INFO: 2018/06/15 01:53:40 ccBalancerWrapper: updating state and picker called by balancer: TRANSIENT_FAILURE, 0xc42006e9c0
INFO: 2018/06/15 01:53:40 balancerWrapper: handle subconn state change: 0xc420404aa0, CONNECTING
INFO: 2018/06/15 01:53:40 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc42006e9c0
WARNING: 2018/06/15 01:53:40 Failed to dial 127.0.0.1:443: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost:443, kiam-server:443, localhost:9610, not 127.0.0.1:443"; please retry.
INFO: 2018/06/15 01:53:40 balancerWrapper: handle subconn state change: 0xc420404aa0, SHUTDOWN
INFO: 2018/06/15 01:53:40 ccBalancerWrapper: updating state and picker called by balancer: TRANSIENT_FAILURE, 0xc42006e9c0
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
....
....
WARN[0004] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0004] error checking health: rpc error: code = Unavailable desc = there is no connection available
FATA[0005] error retrieving health: rpc error: code = Unavailable desc = there is no connection available
Since the handshake fails only because 127.0.0.1 is not among the certificate's SANs (it is valid for localhost:443, kiam-server:443, localhost:9610, which is expected), I could deduce that the certs are not the issue here, if I am not wrong(?)
Doing a netstat inside one of the kiam-server pods gave the following:
/ # netstat -plant
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 10.2.0.11:36144 10.3.0.1:443 ESTABLISHED 1/server
tcp 0 0 :::443 :::* LISTEN 1/server
tcp 0 0 ::ffff:10.2.0.11:443 ::ffff:10.1.4.13:39832 ESTABLISHED 1/server
tcp 0 0 ::ffff:10.2.0.11:443 ::ffff:10.1.4.12:52018 ESTABLISHED 1/server
tcp 0 0 ::ffff:10.2.0.11:443 ::ffff:10.1.4.21:47526 ESTABLISHED 1/server
I was cross-checking whether the foreign IPs showing up belong to running agents, which is true in this case:
$ kgpn kube-system | grep kiam
kiam-agent-2jgnl 1/1 Running 0 11h 10.1.4.12 ip-10-1-4-12.ap-south-1.compute.internal
kiam-agent-qd84s 1/1 Running 0 11h 10.1.4.13 ip-10-1-4-13.ap-south-1.compute.internal
kiam-agent-qk4z9 1/1 Running 0 11h 10.1.4.21 ip-10-1-4-21.ap-south-1.compute.internal
kiam-server-5tjd5 1/1 Running 0 10h 10.2.0.11 ip-10-1-8-22.ap-south-1.compute.internal
kiam-server-6dhm4 1/1 Running 0 10h 10.2.5.8 ip-10-1-8-12.ap-south-1.compute.internal
kiam-server-7c5zd 1/1 Running 0 10h 10.2.6.8 ip-10-1-8-10.ap-south-1.compute.internal
kiam-server-fksqw 1/1 Running 0 10h 10.2.9.10 ip-10-1-8-28.ap-south-1.compute.internal
kiam-server-ftvpm 1/1 Running 0 10h 10.2.7.8 ip-10-1-8-4.ap-south-1.compute.internal
I tried the same thing in an older cluster, with the exact same manifest files, where we have kiam running fine:
$ kpssh kiam-server-2fhhq kube-system (kluster-qa/kube-system)
/ # GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 /health --cert=/etc/kiam/tls/kiam-server.crt --key=/etc/kiam/tls/kiam-server.key --ca=/etc/kiam/tls/kiam-ca.crt --server-address=localhost:443 --server-address-refresh=2s --timeout=5s
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc4203f7320
INFO: 2018/06/15 01:52:13 dialing to target with scheme: ""
INFO: 2018/06/15 01:52:13 could not get resolver for scheme: ""
INFO: 2018/06/15 01:52:13 balancerWrapper: is pickfirst: false
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available
INFO: 2018/06/15 01:52:13 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.localhost on 10.3.0.10:53: no such host.
INFO: 2018/06/15 01:52:13 balancerWrapper: got update addr from Notify: [{127.0.0.1:443 <nil>} {[::1]:443 <nil>}]
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: new subconn: [{127.0.0.1:443 0 <nil>}]
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: new subconn: [{[::1]:443 0 <nil>}]
INFO: 2018/06/15 01:52:13 balancerWrapper: handle subconn state change: 0xc4202c22d0, CONNECTING
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203f7320
WARNING: 2018/06/15 01:52:13 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp [::1]:443: connect: cannot assign requested address"; Reconnecting to {[::1]:443 0 <nil>}
INFO: 2018/06/15 01:52:13 balancerWrapper: handle subconn state change: 0xc4202c2350, CONNECTING
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203f7320
INFO: 2018/06/15 01:52:13 balancerWrapper: handle subconn state change: 0xc4202c2350, TRANSIENT_FAILURE
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203f7320
INFO: 2018/06/15 01:52:13 balancerWrapper: handle subconn state change: 0xc4202c2350, CONNECTING
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203f7320
INFO: 2018/06/15 01:52:13 balancerWrapper: handle subconn state change: 0xc4202c2350, TRANSIENT_FAILURE
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203f7320
INFO: 2018/06/15 01:52:13 balancerWrapper: handle subconn state change: 0xc4202c22d0, TRANSIENT_FAILURE
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: updating state and picker called by balancer: TRANSIENT_FAILURE, 0xc4203f7320
INFO: 2018/06/15 01:52:13 balancerWrapper: handle subconn state change: 0xc4202c22d0, CONNECTING
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4203f7320
INFO: 2018/06/15 01:52:13 balancerWrapper: handle subconn state change: 0xc4202c22d0, READY
INFO: 2018/06/15 01:52:13 ccBalancerWrapper: updating state and picker called by balancer: READY, 0xc4203f7320
INFO[0000] healthy: ok
/ #
Please let me know if you want me to provide any other logs. :)
I figured it's an issue with kube-router itself; opening up ingress in the security groups between the worker and master nodes worked.
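For anyone hitting the same thing, the fix amounts to a security-group ingress rule. A hypothetical AWS CLI equivalent (the group IDs are placeholders; adjust them to your worker/master groups, and the port to whatever your kiam-server service listens on, 443 here):

```shell
# Placeholder IDs: sg-WORKERS is the workers' security group,
# sg-MASTERS the masters'. Allows master->worker traffic on 443.
aws ec2 authorize-security-group-ingress \
  --group-id sg-WORKERS \
  --protocol tcp --port 443 \
  --source-group sg-MASTERS
```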
Greetings, I witnessed something similar to https://github.com/uswitch/kiam/issues/52.
Back then it was a problem with our service endpoints, which we fixed. We brought up a new cluster with the same configuration, after which we are facing the issue of the kiam-server's readiness probe failing (the issue last time was that the kiam-agent was not able to communicate; this time the kiam-server isn't coming up).
For debugging, I removed the liveness and readiness probes on the kiam-server and tried adding the GRPC env vars for the health binary to dump any relevant logs, which is what I got above.
The service endpoint for kube-dns would be
We are using kube-dns along with kube-router (with kube-bridge).
Please do let me know if you need anything else for logs.
Pasting the manifests here for sanity:
kiam-server service
kiam-server-ds.yaml
kiam-agent-ds.yaml
kube-router-ds.yaml