uswitch / kiam

Integrate AWS IAM with Kubernetes
Apache License 2.0

Health check fails #40

kevtaylor closed this issue 6 years ago

kevtaylor commented 6 years ago

The kiam-server health check is failing with the following output:

/etc/kiam/tls # /health --key server-key.pem --ca ca.pem --cert server.pem --server-address=localhost:443
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
FATA[0001] error retrieving health: rpc error: code = Unavailable desc = there is no connection available

Is there something else required to make the server work?

kevtaylor commented 6 years ago

/etc/kiam/tls # GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 /health --key server-key.pem --ca ca.pem --cert server.pem --server-address="localhost:443"
INFO: 2018/02/28 17:00:06 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc42011cde0
INFO: 2018/02/28 17:00:06 dialing to target with scheme: ""
INFO: 2018/02/28 17:00:06 could not get resolver for scheme: ""
INFO: 2018/02/28 17:00:06 balancerWrapper: is pickfirst: false
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available
INFO: 2018/02/28 17:00:06 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.localhost on 10.29.18.54:53: no such host.
INFO: 2018/02/28 17:00:06 balancerWrapper: got update addr from Notify: [{127.0.0.1:443 <nil>} {[::1]:443 <nil>}]
INFO: 2018/02/28 17:00:06 ccBalancerWrapper: new subconn: [{127.0.0.1:443 0  <nil>}]
INFO: 2018/02/28 17:00:06 ccBalancerWrapper: new subconn: [{[::1]:443 0  <nil>}]
INFO: 2018/02/28 17:00:06 balancerWrapper: handle subconn state change: 0xc42035aa80, CONNECTING
INFO: 2018/02/28 17:00:06 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc42011cde0
INFO: 2018/02/28 17:00:06 balancerWrapper: handle subconn state change: 0xc42035aad0, CONNECTING
INFO: 2018/02/28 17:00:06 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc42011cde0
INFO: 2018/02/28 17:00:06 balancerWrapper: handle subconn state change: 0xc42035aad0, TRANSIENT_FAILURE
WARNING: 2018/02/28 17:00:06 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp [::1]:443: connect: cannot assign requested address"; Reconnecting to {[::1]:443 0  <nil>}
INFO: 2018/02/28 17:00:06 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc42011cde0
INFO: 2018/02/28 17:00:06 balancerWrapper: handle subconn state change: 0xc42035aad0, CONNECTING
INFO: 2018/02/28 17:00:06 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc42011cde0
INFO: 2018/02/28 17:00:06 balancerWrapper: handle subconn state change: 0xc42035aad0, TRANSIENT_FAILURE
INFO: 2018/02/28 17:00:06 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc42011cde0
INFO: 2018/02/28 17:00:06 balancerWrapper: handle subconn state change: 0xc42035aa80, TRANSIENT_FAILURE
INFO: 2018/02/28 17:00:06 ccBalancerWrapper: updating state and picker called by balancer: TRANSIENT_FAILURE, 0xc42011cde0
INFO: 2018/02/28 17:00:06 balancerWrapper: handle subconn state change: 0xc42035aa80, CONNECTING
INFO: 2018/02/28 17:00:06 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc42011cde0
WARNING: 2018/02/28 17:00:06 Failed to dial 127.0.0.1:443: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for kiam, not localhost:443"; please retry.
INFO: 2018/02/28 17:00:06 balancerWrapper: handle subconn state change: 0xc42035aa80, SHUTDOWN
INFO: 2018/02/28 17:00:06 ccBalancerWrapper: updating state and picker called by balancer: TRANSIENT_FAILURE, 0xc42011cde0
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
FATA[0001] error retrieving health: rpc error: code = Unavailable desc = there is no connection available

pingles commented 6 years ago

D'oh... same problem. Sorry that this isn't documented or implemented well enough to help you avoid it:

WARNING: 2018/02/28 17:00:06 Failed to dial 127.0.0.1:443: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for kiam, not localhost:443"; please retry.

Most likely the JSON used to generate the certs is missing the subject alternative names (SANs).

kevtaylor commented 6 years ago

Oh, okay. I generated the certs by hand from our Vault installation, so I'll check that out. Thanks. At least I know how to do a debug trace now, which is very useful.

pingles commented 6 years ago

Yeah, unfortunately the error message isn't particularly good at explaining the reason. I'll probably rewrite this issue to cover making cert setup less error-prone (i.e. using cert-manager) and improving the error messages. For now I'll leave this open to remind me to do that :) Thanks for reporting!

kevtaylor commented 6 years ago

Thanks. When I used your detailed tooling to generate the certs and compared them with mine, I could see all the alternative names that I hadn't put in my certs. I had also used an EC key type instead of RSA, so I'm changing my setup to match.

kevtaylor commented 6 years ago

I managed to get past the host issue, but now I get:

WARNING: 2018/03/01 16:05:17 Failed to dial 127.0.0.1:443: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.

Is there anything I can do to validate the cert and find out what might be wrong with it?

kevtaylor commented 6 years ago

I think I still have issues with my cert chain (unable to get local issuer certificate). This isn't your problem, so I am closing this issue. I tried it with the generated certs from your docs page and it works, so I'll revisit my broken chain.