uswitch / kiam

Integrate AWS IAM with Kubernetes
Apache License 2.0

Agent stuck with log "resolver returned no addresses" #425

Closed: avdhoot closed this issue 3 years ago

avdhoot commented 4 years ago

Some of the kiam-agent pods do not return metadata and emit the logs below. We also monitored the agent's network activity; it never makes a new DNS query.

Agent logs: {"level":"warning","msg":"error finding role for pod: rpc error: code = Unavailable desc = resolver returned no addresses","pod.ip":"100.121.119.112","time":"2020-09-16T09:51:07Z"}

To fix this we delete the kiam-agent pod.
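
For reference, the "fix" is just deleting the affected agent pods so the DaemonSet recreates them. A rough sketch, assuming the agents run in kube-system with an app=kiam-agent label (both are assumptions, adjust to your install):

➜ kubectl -n kube-system delete pod -l app=kiam-agent    # assumed namespace/label; the DaemonSet schedules fresh pods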

BartoszZawadzki commented 3 years ago

We are facing the same issue as above:

kiam-agent-2vb9x kiam-agent {"addr":"100.96.1.231:44236","level":"error","method":"GET","msg":"error processing request: rpc error: code = Unavailable desc = resolver returned no addresses","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2020-10-14T12:48:06Z"}

kiam v3.6 kubernetesVersion: 1.18.9 kops 1.18.1 (git-453d7d96be)

And yes, killing the agent pods seemed to resolve the issue (for now).

pingles commented 3 years ago

The error means that the agent pod's gRPC client is unable to find any available server pods to connect to. It uses gRPC's client-side load balancer. There are some gRPC environment variables you can add for it to report debugging information about what's going on, but the underlying problem is that there were no running server pods the agent could find in time.
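
The variables in question are grpc-go's logging controls, GRPC_GO_LOG_SEVERITY_LEVEL and GRPC_GO_LOG_VERBOSITY_LEVEL (the same ones used in the test command later in this thread). A sketch of setting them on the agent container, assuming you edit the DaemonSet env directly:

env:
  - name: GRPC_GO_LOG_SEVERITY_LEVEL
    value: "info"
  - name: GRPC_GO_LOG_VERBOSITY_LEVEL
    value: "8"    # very verbose; expect noisy agent logs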

dza89 commented 3 years ago

We are facing this issue as well. All kiam agents became unresponsive until a restart. kiam v3.6 kubernetesVersion: 1.17.eks.3

@pingles That makes sense, but shouldn't it lead to an application exit if it's 'stuck'?

pingles commented 3 years ago

It depends where it happens. Most operations are set to retry in the event of failures (assuming intermittent pod issues or DNS resolution problems, for example) up to a certain amount of time, after which the operation is cancelled.

Other issues have highlighted similar problems and you can enable some env vars to increase the verbosity of gRPC's logging. https://github.com/uswitch/kiam/issues/217#issuecomment-521933917

Could you try that and see what it shows, please?

dagvl commented 3 years ago

I am also observing this problem. Here are some detailed logs:

Logs from kiam-agent before problem:

➜ k logs -f kiam-agent-5n76g
{"level":"info","msg":"configuring iptables","time":"2020-10-20T21:58:12Z"}
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2020-10-20T21:58:12Z"}
INFO: 2020/10/20 21:58:12 parsed scheme: "dns"
INFO: 2020/10/20 21:58:12 ccResolverWrapper: sending update to cc: {[{10.121.64.56:443  <nil> 0 <nil>} {10.121.10.175:443  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/10/20 21:58:12 base.baseBalancer: got new ClientConn state:  {{[{10.121.64.56:443  <nil> 0 <nil>} {10.121.10.175:443  <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/10/20 21:58:12 base.baseBalancer: handle SubConn state change: 0xc0000aed70, CONNECTING
INFO: 2020/10/20 21:58:12 base.baseBalancer: handle SubConn state change: 0xc0000aed90, CONNECTING
INFO: 2020/10/20 21:58:12 base.baseBalancer: handle SubConn state change: 0xc0000aed70, READY

As you can see, it has started and picked up 10.121.64.56 and 10.121.10.175 as the kiam-server pod IPs. This matches reality:

➜ kgp -o wide | grep kiam-server
kiam-server-569cbfc89f-pc6t6                                 1/1     Running     0          3m26s   10.121.64.56     ip-10-121-105-32.eu-north-1.compute.internal    <none>           <none>
kiam-server-569cbfc89f-s2n88                                 1/1     Running     0          3m26s   10.121.10.175    ip-10-121-17-163.eu-north-1.compute.internal    <none>           <none>

I then scale down kiam-server to 0, and get these logs from the same kiam-agent:

INFO: 2020/10/20 21:59:50 base.baseBalancer: handle SubConn state change: 0xc0000aed70, CONNECTING
INFO: 2020/10/20 21:59:50 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2020/10/20 21:59:50 roundrobinPicker: newPicker called with info: {map[0xc0000aed90:{{10.121.10.175:443  <nil> 0 <nil>}}]}
WARNING: 2020/10/20 21:59:50 grpc: addrConn.createTransport failed to connect to {10.121.64.56:443  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.121.64.56:443: connect: connection refused". Reconnecting...
INFO: 2020/10/20 21:59:50 base.baseBalancer: handle SubConn state change: 0xc0000aed70, TRANSIENT_FAILURE
INFO: 2020/10/20 21:59:50 ccResolverWrapper: sending update to cc: {[{10.121.64.56:443  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/10/20 21:59:50 base.baseBalancer: got new ClientConn state:  {{[{10.121.64.56:443  <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/10/20 21:59:50 base.baseBalancer: handle SubConn state change: 0xc0000aed90, SHUTDOWN
INFO: 2020/10/20 21:59:50 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2020/10/20 21:59:51 base.baseBalancer: handle SubConn state change: 0xc0000aed70, CONNECTING
INFO: 2020/10/20 21:59:51 roundrobinPicker: newPicker called with info: {map[]}
WARNING: 2020/10/20 22:00:01 grpc: addrConn.createTransport failed to connect to {10.121.64.56:443  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.121.64.56:443: connect: no route to host". Reconnecting...
INFO: 2020/10/20 22:00:01 base.baseBalancer: handle SubConn state change: 0xc0000aed70, TRANSIENT_FAILURE
INFO: 2020/10/20 22:00:03 base.baseBalancer: handle SubConn state change: 0xc0000aed70, CONNECTING
INFO: 2020/10/20 22:00:03 roundrobinPicker: newPicker called with info: {map[]}
WARNING: 2020/10/20 22:00:06 grpc: addrConn.createTransport failed to connect to {10.121.64.56:443  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.121.64.56:443: connect: no route to host". Reconnecting...
INFO: 2020/10/20 22:00:06 base.baseBalancer: handle SubConn state change: 0xc0000aed70, TRANSIENT_FAILURE
INFO: 2020/10/20 22:00:08 base.baseBalancer: handle SubConn state change: 0xc0000aed70, CONNECTING
INFO: 2020/10/20 22:00:08 roundrobinPicker: newPicker called with info: {map[]}
WARNING: 2020/10/20 22:00:09 grpc: addrConn.createTransport failed to connect to {10.121.64.56:443  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.121.64.56:443: connect: no route to host". Reconnecting...
INFO: 2020/10/20 22:00:09 base.baseBalancer: handle SubConn state change: 0xc0000aed70, TRANSIENT_FAILURE
INFO: 2020/10/20 22:00:13 base.baseBalancer: handle SubConn state change: 0xc0000aed70, CONNECTING
INFO: 2020/10/20 22:00:13 roundrobinPicker: newPicker called with info: {map[]}
WARNING: 2020/10/20 22:00:15 grpc: addrConn.createTransport failed to connect to {10.121.64.56:443  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.121.64.56:443: connect: no route to host". Reconnecting...
INFO: 2020/10/20 22:00:15 base.baseBalancer: handle SubConn state change: 0xc0000aed70, TRANSIENT_FAILURE
INFO: 2020/10/20 22:00:20 ccResolverWrapper: sending update to cc: {[] <nil> <nil>}
INFO: 2020/10/20 22:00:20 base.baseBalancer: got new ClientConn state:  {{[] <nil> <nil>} <nil>}
INFO: 2020/10/20 22:00:20 base.baseBalancer: handle SubConn state change: 0xc0000aed70, SHUTDOWN

After the SHUTDOWN message, no more detailed logs like this appear.

I then scale kiam-server up again; note the new pod IPs:

➜ kgp -o wide | grep kiam-server
kiam-server-569cbfc89f-bq7nx                                 1/1     Running     0          43s     10.121.40.62     ip-10-121-17-163.eu-north-1.compute.internal    <none>           <none>
kiam-server-569cbfc89f-qz2pl                                 1/1     Running     0          43s     10.121.103.122   ip-10-121-105-32.eu-north-1.compute.internal    <none>           <none>

Port-forwarding to the kiam-agent and hitting its health endpoint produces the following HTTP response and log error:

➜ curl -i "http://localhost:8181/health?deep=1"
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 20 Oct 2020 22:01:11 GMT
Content-Length: 148

rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.121.64.56:443: connect: no route to host"
{"addr":"127.0.0.1:50808","level":"error","method":"GET","msg":"error processing request: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.121.64.56:443: connect: no route to host\"","path":"/health","status":500,"time":"2020-10-20T22:01:11Z"}
{"addr":"127.0.0.1:50808","duration":5000,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/health","status":500,"time":"2020-10-20T22:01:11Z"}

Note that the pod IP in the error message is one of the original IPs from before the kiam-server scale-down.

To me it looks like gRPC stops receiving or polling for updated endpoint IPs after about 30 seconds of failures.

Versions: k8s v1.15.11-eks-065dce, kiam quay.io/uswitch/kiam:v3.6

pingles commented 3 years ago

Interesting, thanks for adding the detailed info.

pingles commented 3 years ago

I'll try and take a look when I can but if anyone else can help I'm happy for us to take a look at PRs too.

dagvl commented 3 years ago

So I don't have a smoking gun, but I have some observations that might be helpful in zeroing in.

kiam uses gRPC's DNS resolver. In grpc-go < 1.28.0 the behaviour seems to have been that DNS was re-resolved on a connection error and otherwise every 30 minutes (I don't have a hard reference for this, just a lot of snippets from discussions on different sites). When all of a connection's subconnections were in the SHUTDOWN state, no state transition was ever triggered to cause a re-resolution. This was a bug reported against 1.26 (https://github.com/grpc/grpc-go/issues/3353).

This bug was never resolved per se, but in 1.28 (https://github.com/grpc/grpc-go/pull/3165) they removed DNS polling altogether and made the re-resolution on state transitions more robust.

I tried updating this dependency locally (go get google.golang.org/grpc@v1.28.0) and I can NOT consistently reproduce the same issue now. I've tried my reproduction routine about 5 times and I've only seen it once. The other 4 times the SHUTDOWN message never came and kiam-agent continued trying to connect to the old pod IPs.
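
For reference, that local change just amounts to bumping the grpc pin in go.mod, roughly (kiam's other requirements omitted):

require (
    google.golang.org/grpc v1.28.0 // v1.33.1 also behaves well, see below
)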

I also tried with the newest grpc (1.33.1) and it seems to behave better there also. Here's an annotated example:

INFO: 2020/10/21 16:37:05 [core] parsed scheme: "dns"
INFO: 2020/10/21 16:37:05 [core] ccResolverWrapper: sending update to cc: {[{10.121.40.62:443  <nil> 0 <nil>} {10.121.96.28:443  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/10/21 16:37:05 [balancer] base.baseBalancer: got new ClientConn state:  {{[{10.121.40.62:443  <nil> 0 <nil>} {10.121.96.28:443  <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/10/21 16:37:05 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:05 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:05 [core] Subchannel picks a new address "10.121.96.28:443" to connect
INFO: 2020/10/21 16:37:05 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, CONNECTING
INFO: 2020/10/21 16:37:05 [core] Channel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:05 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337590, CONNECTING
INFO: 2020/10/21 16:37:05 [core] Subchannel picks a new address "10.121.40.62:443" to connect
INFO: 2020/10/21 16:37:05 [core] CPU time info is unavailable on non-linux or appengine environment.
INFO: 2020/10/21 16:37:05 [core] Subchannel Connectivity change to READY
INFO: 2020/10/21 16:37:05 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337590, READY
INFO: 2020/10/21 16:37:05 [roundrobin] roundrobinPicker: newPicker called with info: {map[0xc000337590:{{10.121.96.28:443  <nil> 0 <nil>}}]}
INFO: 2020/10/21 16:37:05 [core] Channel Connectivity change to READY
INFO[0000] listening :8181
INFO: 2020/10/21 16:37:05 [core] Subchannel Connectivity change to READY
INFO: 2020/10/21 16:37:05 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, READY
INFO: 2020/10/21 16:37:05 [roundrobin] roundrobinPicker: newPicker called with info: {map[0xc000337530:{{10.121.40.62:443  <nil> 0 <nil>}} 0xc000337590:{{10.121.96.28:443  <nil> 0 <nil>}}]}
INFO: 2020/10/21 16:37:19 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:19 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, CONNECTING
INFO: 2020/10/21 16:37:19 [core] Subchannel picks a new address "10.121.40.62:443" to connect
INFO: 2020/10/21 16:37:19 [roundrobin] roundrobinPicker: newPicker called with info: {map[0xc000337590:{{10.121.96.28:443  <nil> 0 <nil>}}]}

Here I remove the kiam server pods

INFO: 2020/10/21 16:37:19 [transport] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
WARNING: 2020/10/21 16:37:19 [core] grpc: addrConn.createTransport failed to connect to {10.121.40.62:443 kiam-server.ops <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
INFO: 2020/10/21 16:37:19 [core] Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:19 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:20 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:20 [core] Subchannel picks a new address "10.121.40.62:443" to connect
INFO: 2020/10/21 16:37:20 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, CONNECTING
WARNING: 2020/10/21 16:37:23 [core] grpc: addrConn.createTransport failed to connect to {10.121.40.62:443 kiam-server.ops <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
INFO: 2020/10/21 16:37:23 [core] Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:23 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:24 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:24 [core] Subchannel picks a new address "10.121.40.62:443" to connect
INFO: 2020/10/21 16:37:24 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, CONNECTING
WARNING: 2020/10/21 16:37:26 [core] grpc: addrConn.createTransport failed to connect to {10.121.40.62:443 kiam-server.ops <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
INFO: 2020/10/21 16:37:26 [core] Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:26 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:26 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:26 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337590, CONNECTING
INFO: 2020/10/21 16:37:26 [roundrobin] roundrobinPicker: newPicker called with info: {map[]}
INFO: 2020/10/21 16:37:26 [core] Subchannel picks a new address "10.121.96.28:443" to connect
INFO: 2020/10/21 16:37:26 [core] Channel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:26 [transport] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
WARNING: 2020/10/21 16:37:26 [core] grpc: addrConn.createTransport failed to connect to {10.121.96.28:443 kiam-server.ops <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
INFO: 2020/10/21 16:37:26 [core] Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:26 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337590, TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:26 [core] Channel Connectivity change to TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:27 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:27 [core] Subchannel picks a new address "10.121.96.28:443" to connect
INFO: 2020/10/21 16:37:27 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337590, CONNECTING
INFO: 2020/10/21 16:37:28 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:28 [core] Subchannel picks a new address "10.121.40.62:443" to connect
INFO: 2020/10/21 16:37:28 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, CONNECTING
WARNING: 2020/10/21 16:37:30 [core] grpc: addrConn.createTransport failed to connect to {10.121.40.62:443 kiam-server.ops <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
INFO: 2020/10/21 16:37:30 [core] Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:30 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, TRANSIENT_FAILURE
WARNING: 2020/10/21 16:37:30 [core] grpc: addrConn.createTransport failed to connect to {10.121.96.28:443 kiam-server.ops <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
INFO: 2020/10/21 16:37:30 [core] Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:30 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337590, TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:31 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:31 [core] Subchannel picks a new address "10.121.96.28:443" to connect
INFO: 2020/10/21 16:37:31 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337590, CONNECTING
INFO: 2020/10/21 16:37:33 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:33 [core] Subchannel picks a new address "10.121.40.62:443" to connect
INFO: 2020/10/21 16:37:33 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, CONNECTING
WARNING: 2020/10/21 16:37:35 [core] grpc: addrConn.createTransport failed to connect to {10.121.40.62:443 kiam-server.ops <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
INFO: 2020/10/21 16:37:35 [core] Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:35 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:35 [core] ccResolverWrapper: sending update to cc: {[] <nil> <nil>}
INFO: 2020/10/21 16:37:35 [balancer] base.baseBalancer: got new ClientConn state:  {{[] <nil> <nil>} <nil>}
INFO: 2020/10/21 16:37:35 [core] Subchannel Connectivity change to SHUTDOWN
INFO: 2020/10/21 16:37:35 [core] Subchannel Connectivity change to SHUTDOWN
INFO: 2020/10/21 16:37:35 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, SHUTDOWN
INFO: 2020/10/21 16:37:35 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337590, SHUTDOWN
WARNING: 2020/10/21 16:37:35 [core] grpc: addrConn.createTransport failed to connect to {10.121.96.28:443 kiam-server.ops <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: context canceled". Reconnecting...

Here I start the kiam-server pods again, at about 16:38:00

INFO: 2020/10/21 16:38:05 [core] ccResolverWrapper: sending update to cc: {[] <nil> <nil>}
INFO: 2020/10/21 16:38:05 [balancer] base.baseBalancer: got new ClientConn state:  {{[] <nil> <nil>} <nil>}
INFO: 2020/10/21 16:38:35 [core] ccResolverWrapper: sending update to cc: {[] <nil> <nil>}
INFO: 2020/10/21 16:38:35 [balancer] base.baseBalancer: got new ClientConn state:  {{[] <nil> <nil>} <nil>}

At this point kiam-agent has been "down" for about 30-60 seconds after the kiam-server pods started. Now the kiam-agent gRPC client seems to refresh from DNS:

INFO: 2020/10/21 16:39:06 [core] ccResolverWrapper: sending update to cc: {[{10.121.36.179:443  <nil> 0 <nil>} {10.121.64.56:443  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/10/21 16:39:06 [balancer] base.baseBalancer: got new ClientConn state:  {{[{10.121.36.179:443  <nil> 0 <nil>} {10.121.64.56:443  <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/10/21 16:39:06 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:39:06 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:39:06 [core] Subchannel picks a new address "10.121.64.56:443" to connect
INFO: 2020/10/21 16:39:06 [core] Subchannel picks a new address "10.121.36.179:443" to connect
INFO: 2020/10/21 16:39:06 [balancer] base.baseBalancer: handle SubConn state change: 0xc0001170e0, CONNECTING
INFO: 2020/10/21 16:39:06 [core] Channel Connectivity change to CONNECTING
INFO: 2020/10/21 16:39:06 [balancer] base.baseBalancer: handle SubConn state change: 0xc0001171d0, CONNECTING
INFO: 2020/10/21 16:39:06 [core] Subchannel Connectivity change to READY
INFO: 2020/10/21 16:39:06 [balancer] base.baseBalancer: handle SubConn state change: 0xc0001171d0, READY
INFO: 2020/10/21 16:39:06 [roundrobin] roundrobinPicker: newPicker called with info: {map[0xc0001171d0:{{10.121.64.56:443  <nil> 0 <nil>}}]}
INFO: 2020/10/21 16:39:06 [core] Channel Connectivity change to READY
INFO: 2020/10/21 16:39:06 [core] Subchannel Connectivity change to READY
INFO: 2020/10/21 16:39:06 [balancer] base.baseBalancer: handle SubConn state change: 0xc0001170e0, READY
INFO: 2020/10/21 16:39:06 [roundrobin] roundrobinPicker: newPicker called with info: {map[0xc0001170e0:{{10.121.36.179:443  <nil> 0 <nil>}} 0xc0001171d0:{{10.121.64.56:443  <nil> 0 <nil>}}]}

In short, it seems that grpc 1.28 or newer at least doesn't fully hang when all kiam-servers disappear at the same time, though recovery does take some time.

pingles commented 3 years ago

Fantastic!

I had started looking at refactoring some of the server code so it's easier to test, and then putting in a test that reliably reproduces this situation. I'll try that and see if it does, and then look into using the updated gRPC lib.

Thanks for helping to figure it out!

pingles commented 3 years ago

I've managed to get a few refactorings done (work in progress in #432). I wanted to get to a place where it's easier to write some tests and prove the behaviour.

While doing that I've found that the retry behaviour in the gRPC interceptor that's used by the client gateway doesn't set a number of retries, so the underlying operations aren't retried. https://github.com/uswitch/kiam/pull/432/commits/4d7c245d3ca74aa80658f8bd95f9814aa5ce695b#diff-947f40a8d50a4ea75b9feffc9c0b859661bdf9372582240a4b2e4be20f144368R120 is the relevant line.
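
For anyone following along: with that style of retry interceptor the maximum retry count defaults to zero, so an explicit option is needed for retries to happen at all. A hedged sketch, assuming the grpc-ecosystem go-grpc-middleware/retry package and placeholder values throughout; this is illustrative, not kiam's actual wiring:

// Sketch: setting a max-retry count on a gRPC client dial via the
// grpc-ecosystem retry interceptor. Not kiam's real configuration.
package main

import (
	"log"
	"time"

	grpc_retry "github.com/grpc-ecosystem/go-grpc-middleware/retry"
	"google.golang.org/grpc"
)

func main() {
	// "dns:///kiam-server:443" is a placeholder target; kiam's real address and TLS setup differ.
	conn, err := grpc.Dial("dns:///kiam-server:443",
		grpc.WithInsecure(), // kiam uses mutual TLS; omitted to keep the sketch short
		grpc.WithUnaryInterceptor(grpc_retry.UnaryClientInterceptor(
			grpc_retry.WithMax(3), // the package default is 0, i.e. no retries at all
			grpc_retry.WithPerRetryTimeout(200*time.Millisecond), // made-up value
		)),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}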

I don't think it's the same issue you've identified, but it may be connected.

I'm going to see if I can extend the tests to replicate the same situation you've documented before upgrading the gRPC release, to confirm the upgrade fixes it.

Not finished but thought it was worth sharing progress and what I've learned this evening 😄

pingles commented 3 years ago

Adding a comment here to help track my notes.

% GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 go test github.com/uswitch/kiam/pkg/... -run TestRetriesUntilServerAvailable

Running the tests to try and mimic the behaviour (shutting down a server after making a request) shows the following:

time="2020-10-22T22:26:31+01:00" level=info msg="started cache controller"
time="2020-10-22T22:26:31+01:00" level=info msg="started namespace cache controller"
WARNING: 2020/10/22 22:26:31 Adjusting keepalive ping interval to minimum period of 10s
INFO: 2020/10/22 22:26:31 parsed scheme: "dns"
time="2020-10-22T22:26:31+01:00" level=info msg="started cache controller"
INFO: 2020/10/22 22:26:31 ccResolverWrapper: sending update to cc: {[{127.0.0.1:8899  <nil> 0 <nil>} {[::1]:8899  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/10/22 22:26:31 base.baseBalancer: got new ClientConn state:  {{[{127.0.0.1:8899  <nil> 0 <nil>} {[::1]:8899  <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/10/22 22:26:31 base.baseBalancer: handle SubConn state change: 0xc0003023b0, CONNECTING
INFO: 2020/10/22 22:26:31 base.baseBalancer: handle SubConn state change: 0xc000302450, CONNECTING
INFO: 2020/10/22 22:26:31 CPU time info is unavailable on non-linux or appengine environment.
WARNING: 2020/10/22 22:26:31 grpc: addrConn.createTransport failed to connect to {[::1]:8899  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:8899: connect: connection refused". Reconnecting...
INFO: 2020/10/22 22:26:31 base.baseBalancer: handle SubConn state change: 0xc000302450, TRANSIENT_FAILURE
time="2020-10-22T22:26:31+01:00" level=info msg="started namespace cache controller"
time="2020-10-22T22:26:32+01:00" level=info msg=listening
INFO: 2020/10/22 22:26:32 base.baseBalancer: handle SubConn state change: 0xc0003023b0, READY
INFO: 2020/10/22 22:26:32 roundrobinPicker: newPicker called with info: {map[0xc0003023b0:{{127.0.0.1:8899  <nil> 0 <nil>}}]}
INFO: 2020/10/22 22:26:32 base.baseBalancer: handle SubConn state change: 0xc0003023b0, CONNECTING
INFO: 2020/10/22 22:26:32 roundrobinPicker: newPicker called with info: {map[]}
INFO: 2020/10/22 22:26:32 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2020/10/22 22:26:32 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
time="2020-10-22T22:26:32+01:00" level=info msg="started cache controller"
WARNING: 2020/10/22 22:26:32 grpc: addrConn.createTransport failed to connect to {127.0.0.1:8899  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:8899: connect: connection refused". Reconnecting...
INFO: 2020/10/22 22:26:32 base.baseBalancer: handle SubConn state change: 0xc0003023b0, TRANSIENT_FAILURE
time="2020-10-22T22:26:32+01:00" level=info msg="started namespace cache controller"
time="2020-10-22T22:26:32+01:00" level=info msg="started cache controller"
time="2020-10-22T22:26:32+01:00" level=info msg="started namespace cache controller"
time="2020-10-22T22:26:32+01:00" level=info msg=listening
INFO: 2020/10/22 22:26:32 base.baseBalancer: handle SubConn state change: 0xc000302450, CONNECTING
INFO: 2020/10/22 22:26:32 roundrobinPicker: newPicker called with info: {map[]}
WARNING: 2020/10/22 22:26:32 grpc: addrConn.createTransport failed to connect to {[::1]:8899  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:8899: connect: connection refused". Reconnecting...
INFO: 2020/10/22 22:26:32 base.baseBalancer: handle SubConn state change: 0xc000302450, TRANSIENT_FAILURE
INFO: 2020/10/22 22:26:33 base.baseBalancer: handle SubConn state change: 0xc0003023b0, CONNECTING
INFO: 2020/10/22 22:26:33 roundrobinPicker: newPicker called with info: {map[]}
INFO: 2020/10/22 22:26:33 base.baseBalancer: handle SubConn state change: 0xc0003023b0, READY
INFO: 2020/10/22 22:26:33 roundrobinPicker: newPicker called with info: {map[0xc0003023b0:{{127.0.0.1:8899  <nil> 0 <nil>}}]}
INFO: 2020/10/22 22:26:34 base.baseBalancer: handle SubConn state change: 0xc000302450, CONNECTING
WARNING: 2020/10/22 22:26:34 grpc: addrConn.createTransport failed to connect to {[::1]:8899  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:8899: connect: connection refused". Reconnecting...
INFO: 2020/10/22 22:26:34 base.baseBalancer: handle SubConn state change: 0xc000302450, TRANSIENT_FAILURE
INFO: 2020/10/22 22:26:36 base.baseBalancer: handle SubConn state change: 0xc000302450, CONNECTING
WARNING: 2020/10/22 22:26:36 grpc: addrConn.createTransport failed to connect to {[::1]:8899  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:8899: connect: connection refused". Reconnecting...
INFO: 2020/10/22 22:26:36 base.baseBalancer: handle SubConn state change: 0xc000302450, TRANSIENT_FAILURE
INFO: 2020/10/22 22:26:40 base.baseBalancer: handle SubConn state change: 0xc000302450, CONNECTING
WARNING: 2020/10/22 22:26:40 grpc: addrConn.createTransport failed to connect to {[::1]:8899  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:8899: connect: connection refused". Reconnecting...
INFO: 2020/10/22 22:26:40 base.baseBalancer: handle SubConn state change: 0xc000302450, TRANSIENT_FAILURE
INFO: 2020/10/22 22:26:46 base.baseBalancer: handle SubConn state change: 0xc000302450, CONNECTING
WARNING: 2020/10/22 22:26:46 grpc: addrConn.createTransport failed to connect to {[::1]:8899  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:8899: connect: connection refused". Reconnecting...
INFO: 2020/10/22 22:26:46 base.baseBalancer: handle SubConn state change: 0xc000302450, TRANSIENT_FAILURE
INFO: 2020/10/22 22:26:58 base.baseBalancer: handle SubConn state change: 0xc000302450, CONNECTING
WARNING: 2020/10/22 22:26:58 grpc: addrConn.createTransport failed to connect to {[::1]:8899  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:8899: connect: connection refused". Reconnecting...
INFO: 2020/10/22 22:26:58 base.baseBalancer: handle SubConn state change: 0xc000302450, TRANSIENT_FAILURE
INFO: 2020/10/22 22:27:01 ccResolverWrapper: sending update to cc: {[{127.0.0.1:8899  <nil> 0 <nil>} {[::1]:8899  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/10/22 22:27:01 base.baseBalancer: got new ClientConn state:  {{[{127.0.0.1:8899  <nil> 0 <nil>} {[::1]:8899  <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/10/22 22:27:02 base.baseBalancer: handle SubConn state change: 0xc0003023b0, CONNECTING
INFO: 2020/10/22 22:27:02 roundrobinPicker: newPicker called with info: {map[]}
INFO: 2020/10/22 22:27:02 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2020/10/22 22:27:02 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
WARNING: 2020/10/22 22:27:02 grpc: addrConn.createTransport failed to connect to {127.0.0.1:8899  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:8899: connect: connection refused". Reconnecting...
INFO: 2020/10/22 22:27:02 base.baseBalancer: handle SubConn state change: 0xc0003023b0, TRANSIENT_FAILURE
--- FAIL: TestRetriesUntilServerAvailable (30.62s)
    server_integration_test.go:85: didn't complete in 30 seconds

pingles commented 3 years ago

I'm reading through the links you shared @dagvl (thanks again). This looks like a relevant one also: https://github.com/grpc/grpc/issues/12295

It reads like there was a bug that was introduced, but there's also a recommendation to change the server to set a maximum connection age; once connections are closed, the names are re-resolved. Maybe it's worth us adding some configuration to the server for that?

pingles commented 3 years ago

This also highlights the same issue.

pingles commented 3 years ago

So I don't forget: I think the remaining work for this issue is to get a failing test written (server_integration_test.go was the starting point) with the current v1.27 gRPC release and see if it improves with v1.28. Additionally, the gRPC advice is to configure the keep-alive setting.

As described in https://github.com/pomerium/pomerium/pull/509

As of v0.26, go-grpc no longer polls DNS regularly for endpoint updates. Discovery is only performed when a client detects a state change. This is triggered during error conditions, but routine scaling operations in an otherwise functioning environment may leave endpoints underutilized. To ensure we occasionally re-discover endpoints even when no connection errors have been encountered, the recommended pattern is to set server side options which gracefully terminate client connections at regular intervals.
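
In gRPC-Go terms, that recommendation maps to the server-side keepalive parameters. A minimal sketch; the durations are placeholders, not values kiam ships with:

// Sketch of the server-side option the gRPC team recommends for periodic
// client re-resolution; durations are illustrative only.
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":443")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      2 * time.Minute,  // close client connections periodically so they re-resolve DNS
		MaxConnectionAgeGrace: 30 * time.Second, // let in-flight RPCs finish before the close
	}))
	// kiam's gRPC services would be registered here before serving.
	log.Fatal(srv.Serve(lis))
}

Clients whose connections are closed this way go back through the resolver and pick up the current set of server addresses, even when no connection errors have occurred.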

pingles commented 3 years ago

https://github.com/grpc/grpc-go/issues/3353#issuecomment-582040107

QuantumBJump commented 3 years ago

As an addition to this (different circumstances but I'm pretty sure it's the same thing happening), my company has recently been doing some chaos testing in which we manually partition a given AZ (by adding DENY * to the NetACLs), and this same problem seems to occur for all kiam-agent pods on instances in the affected AZ. I'm assuming this is caused similarly by the lack of a SHUTDOWN message being received, and would be fixed by configuring the keep-alive?

vishalmaheswari commented 3 years ago

I seem to get the same issue on the cluster. Redeploying agent.yaml fixes the issue.

{"level":"warning","msg":"error finding role for pod: rpc error: code = Unavailable desc = resolver returned no addresses","pod.ip":"10.202.24.186","time":"2020-12-16T04:07:49Z"}

Has anyone found a workaround for this one? Any recommendation is really appreciated.

pingles commented 3 years ago

@QuinnStevens that's my expectation, and it's also the advice from the gRPC team in their project.

We're close to releasing v4 (I'm going to make #449 the only thing we do prior to release), which will go out in another beta for us to test against, but it should be production-ready. There are a few breaking changes in v4 noted in an upgrade doc, but if you're keen to give it a go, the current beta would be a good option. Indeed, it'd be great to see how it behaves in your chaos experiments.

pingles commented 3 years ago

We've updated gRPC, and the changes in #433 should make it possible to mitigate the situation according to the advice from the gRPC team: keeping the connection age short forces clients to re-poll for servers frequently.

I'm going to close this issue for now; people should follow that advice to manage it. If we see other problems we can re-open.

Thanks to everyone for contributing and helping us close this down. It was also reported in #217.

limianwang commented 3 years ago

We are currently running into this issue on v3.6. Just to confirm: it seems the issue has been resolved, but the fix is only in v4, is that correct?