uswitch / kiam

Integrate AWS IAM with Kubernetes
Apache License 2.0
1.15k stars 238 forks

Agent pod fails to start #243

Open Lokicity opened 5 years ago

Lokicity commented 5 years ago

Hi,

I am experimenting with a manual setup of KIAM. I am using image

Image:         quay.io/uswitch/kiam:master
    Image ID:      docker-pullable://quay.io/uswitch/kiam@sha256:8e44945f82449d321f03ec4d3c6577d7244c50401e10e03f719c4376a34de1dc

My server is running properly:

root@master-c23b88ce-6091-11e9-a4fb-06093d98c8ca [ ~ ]# kubectl logs kiam-server-jncc4 -n vke-system
{"level":"info","msg":"starting server","time":"2019-04-22T05:34:53Z"}
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2019-04-22T05:34:53Z"}
{"level":"info","msg":"will serve on 0.0.0.0:443","time":"2019-04-22T05:34:53Z"}
{"level":"info","msg":"starting credential manager process 0","time":"2019-04-22T05:34:53Z"}
{"level":"info","msg":"starting credential manager process 1","time":"2019-04-22T05:34:53Z"}
{"level":"info","msg":"starting credential manager process 2","time":"2019-04-22T05:34:53Z"}
{"level":"info","msg":"starting credential manager process 3","time":"2019-04-22T05:34:53Z"}
{"level":"info","msg":"starting credential manager process 4","time":"2019-04-22T05:34:53Z"}
{"level":"info","msg":"starting credential manager process 5","time":"2019-04-22T05:34:53Z"}
{"level":"info","msg":"starting credential manager process 6","time":"2019-04-22T05:34:53Z"}
{"level":"info","msg":"starting credential manager process 7","time":"2019-04-22T05:34:53Z"}
{"level":"info","msg":"started cache controller","time":"2019-04-22T05:34:53Z"}
{"level":"info","msg":"started namespace cache controller","time":"2019-04-22T05:34:53Z"}
{"credentials.access.key":"REDACTED","credentials.expiration":"2019-04-22T05:49:53Z","credentials.role":"arn:aws:iam::12345:role/kiam-customer-1","level":"info","msg":"requested new credentials","time":"2019-04-22T05:34:53Z"}
{"credentials.access.key":"<REDACTED>","credentials.expiration":"2019-04-22T05:49:53Z","credentials.role":"arn:aws:iam::12345:role/kiam-customer-1","generation.metadata":0,"level":"info","msg":"fetched credentials","pod.iam.role":"arn:aws:iam::12345:role/kiam-customer-1","pod.name":"aws-provider-controller-manager-0","pod.namespace":"aws-provider-system-1","pod.status.ip":"10.2.1.21","pod.status.phase":"Running","resource.version":"1292494","time":"2019-04-22T05:34:53Z"}
{"credentials.access.key":"REDACTED","credentials.expiration":"2019-04-22T05:49:53Z","credentials.role":"arn:aws:iam::12345:role/kiam-customer-1","level":"info","msg":"notified credentials expire soon","time":"2019-04-22T05:45:53Z"}
{"credentials.access.key":"REDACTED","credentials.expiration":"2019-04-22T05:49:53Z","credentials.role":"arn:aws:iam::12345:role/kiam-customer-1","level":"info","msg":"expiring credentials, fetching updated","time":"2019-04-22T05:45:53Z"}
{"credentials.access.key":"REDACTED","credentials.expiration":"2019-04-22T06:00:53Z","credentials.role":"arn:aws:iam::12345:role/kiam-customer-1","level":"info","msg":"requested new credentials","time":"2019-04-22T05:45:53Z"}

I can manually ssh into the KIAM server pod, install the AWS CLI, create an AWS config file with the proper assume-role profile, and assume the "arn:aws:iam::12345:role/kiam-customer-1" role successfully. That means my master/KIAM server is wired up properly.
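
For reference, the manual check inside the server pod looked roughly like this (the profile name is made up; the account ID and role are the redacted values from the logs above):

cat <<'EOF' > ~/.aws/config
[profile kiam-customer-1]
role_arn = arn:aws:iam::12345:role/kiam-customer-1
credential_source = Ec2InstanceMetadata
EOF

# confirm the node's instance profile can actually assume the pod role
aws sts get-caller-identity --profile kiam-customer-1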

However, when I try to run the agent on the worker nodes, I set - --server-address=kiam-server:443, and I can confirm that the DNS resolves to the kiam-server pod IP. However, the agent errors out saying:

{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2019-04-22T05:03:55Z"}
{"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2019-04-22T05:03:56Z"}

Just to test around, I set the server-address to the master LB's DNS and tried again. This time, the error is:

{"addr":"10.2.1.18:35626","level":"error","method":"GET","msg":"error processing request: rpc error: code = Internal desc = transport: received the unexpected content-type \"application/json\"","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2019-04-22T04:17:41Z"}
{"addr":"10.2.1.18:35626","duration":4992,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2019-04-22T04:17:41Z"}

I am guessing it is because I pointed the agent at the actual Kubernetes API server; since that speaks REST, the KIAM agent complains that it doesn't understand the REST response.

Another thing I have tried: in the example YAML, I see that the KIAM service is intentionally not given a clusterIP. I tried giving it a clusterIP and pointing the agent at that endpoint, but it goes back to the error {"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2019-04-22T05:03:56Z"}.
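
For reference, the headless Service I am comparing against looks roughly like this (the names, labels, and namespace here are my assumptions, not copied from the example manifests):

apiVersion: v1
kind: Service
metadata:
  name: kiam-server
  namespace: kube-system
spec:
  clusterIP: None   # headless, so the agent's gRPC DNS resolver gets per-pod addresses
  selector:
    app: kiam
    role: server
  ports:
  - name: grpclb
    port: 443
    targetPort: 443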

I am kind of stuck at this point. My KIAM server is running properly, and on my agent, with "- --server-address=kiam-server:443" set, DNS resolves properly. What would cause the agent to fail with "{"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2019-04-22T05:03:56Z"}"?
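
One sanity check I can run from a worker node (assuming kiam-server resolves there; otherwise substitute a server pod IP) is to confirm a TLS handshake completes and inspect the certificate the agent actually sees:

openssl s_client -connect kiam-server:443 -showcerts </dev/null \
  | openssl x509 -noout -text | grep -A 1 "Subject Alternative Name"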

Thanks.

pinkavaj commented 5 years ago

I have hit an error with identical symptoms when installing kiam using helm. I have found the error is triggered when the helm release name is different than kiam. Extremely strange but reproducible.

I can send you the manifests generated by helm, so you can adapt them and try to deploy with those; maybe it will fix (work around) your problem.

ballerabdude commented 5 years ago

@pinkavaj is right. This only happens when you use a different release name than kiam.

jhohertz commented 4 years ago

Running into this on 1.17.0-rc.2, not quite sure yet what's going on, or what about 1.17.x is different from 1.16.x, where this seems to work just fine. Nothing jumps out at me in the changelogs.

Agents just can't reach the server endpoints, with the cited "error dialing grpc server" message emitted before going into crashloop.

Update: I have tried increasing the timeout to 5s to no avail. Service/endpoints look the same for both versions, with no clusterIP, where I was hoping to see a difference... Masters seem fine: they see the annotated pods and collect/cache credentials. It's just the agents that seem to have issues on 1.17.
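
(The timeout I mean is the agent's gateway creation timeout; as far as I can tell the flag is along these lines, so treat the exact name as an assumption:)

- --gateway-timeout-creation=5s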

jhohertz commented 4 years ago

Just a note that there still appears to be an incompatibility with 1.17 as of the current 1.17.2. The odd thing is... I generally will see ONE agent pod that does in fact work, but the rest will end up with gRPC errors.

glaurungg commented 4 years ago

Also using helm, and I think the issue comes from the _helpers.tpl code trying to determine the kiam.server.fullname value here.

If you name the server something besides "server" (in my case, "kiam-server"), the subject alternative names come out with some extra prefixes and then TLS won't work between the agent and server:

-->kubectl get secret -n kube-system kiam-server -o json | jq -r .data.cert  | base64 -D > cert && openssl x509 -in cert -text | grep -A 1 "Subject Alternative Name"
            X509v3 Subject Alternative Name:
                DNS:kiam-kiam-server, DNS:kiam-kiam-server:443, DNS:127.0.0.1:443, IP Address:127.0.0.1
-->kubectl get service -n kube-system | grep kiam
kiam-agent    ClusterIP   None         <none>        9620/TCP           8m25s
kiam-server   ClusterIP   None         <none>        9620/TCP,443/TCP   8m25s

I was able to fix this by specifying the following chart values:

fullnameOverride: kiam
agent:
  fullnameOverride: kiam-agent
server:
  fullnameOverride: kiam-server

and then the alternative names went back to the expected service names in the server cert:

-->kubectl get secret -n kube-system kiam-server -o json | jq -r .data.cert  | base64 -D > cert && openssl x509 -in cert -text | grep -A 1 "Subject Alternative Name"
            X509v3 Subject Alternative Name:
                DNS:kiam-server, DNS:kiam-server:443, DNS:127.0.0.1:443, IP Address:127.0.0.1
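
For reference, applying those overrides might look something like this (the chart reference and namespace are placeholders for wherever you install kiam from):

helm upgrade --install kiam <repo>/kiam \
  --namespace kube-system \
  --set fullnameOverride=kiam \
  --set agent.fullnameOverride=kiam-agent \
  --set server.fullnameOverride=kiam-server
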
wolfpack94 commented 4 years ago

We were getting the same error here and it turned out that our TLS certificates had expired. Once those were regenerated, the error went away.
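
A quick way to check for that (reusing the secret layout from the earlier comments; use base64 -D on macOS) is to look at the certificate's expiry date:

kubectl get secret -n kube-system kiam-server -o json | jq -r .data.cert \
  | base64 -d | openssl x509 -noout -enddate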

jkassis commented 4 years ago

not working for me... even with fullnameOverride

{"level":"info","msg":"configuring iptables","time":"2020-07-29T17:15:43Z"}
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2020-07-29T17:15:43Z"}
INFO: 2020/07/29 17:15:43 parsed scheme: "dns"
INFO: 2020/07/29 17:15:43 ccResolverWrapper: got new service config: 
INFO: 2020/07/29 17:15:43 ccResolverWrapper: sending new addresses to cc: [{10.0.201.249:443 1 10-0-201-249.kiam-server.kiam.svc.cluster.local. <nil>} {10.0.148.190:443 1 10-0-148-190.kiam-server.kiam.svc.cluster.local. <nil>} {10.0.188.91:443 1 10-0-188-91.kiam-server.kiam.svc.cluster.local. <nil>} {10.0.148.190:443 0  <nil>} {10.0.188.91:443 0  <nil>} {10.0.201.249:443 0  <nil>}]
INFO: 2020/07/29 17:15:43 base.baseBalancer: got new ClientConn state:  {{[{10.0.148.190:443 0  <nil>} {10.0.188.91:443 0  <nil>} {10.0.201.249:443 0  <nil>}] <nil>} <nil>}
INFO: 2020/07/29 17:15:43 base.baseBalancer: handle SubConn state change: 0xc000406d40, CONNECTING
INFO: 2020/07/29 17:15:43 base.baseBalancer: handle SubConn state change: 0xc000406d60, CONNECTING
INFO: 2020/07/29 17:15:43 base.baseBalancer: handle SubConn state change: 0xc000406d80, CONNECTING
{"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2020-07-29T17:15:48Z"}
wolfpack94 commented 4 years ago

@jkassis this is how we solved it.

  1. kubectl delete -f server.yaml -f agent.yaml -f kiam-certs.yaml
  2. kubectl -n kube-system delete secret kiam-server-tls kiam-agent-tls Bring everything back up in this order:
  3. kubectl apply -f kiam-certs.yaml
  4. kubectl apply -f server.yaml
  5. kubectl apply -f agent.yaml

I hope that helps

jkassis commented 4 years ago

thanks @wolfpack94 still no luck. i can delete the agent pods after the server comes up and verify that the cert is installed...

[I] jkassis@Jeremys-MBP ~/c/c/live> kubectl get secret -n kiam kiam-server -o json | jq -r .data.cert | base64 -D > cert && openssl x509 -in cert -text | grep -A 1 "Subject Alternative Name"
            X509v3 Subject Alternative Name:
                DNS:*.kiam-server.kiam.svc.cluster.local, DNS:kiam-server.kiam.svc.cluster.local, DNS:localhost, URI:kiam-server.kiam.svc.cluster.local:9520, URI:kiam-server:443, URI:localhost:443, URI:localhost:9520

i wonder if somehow my cert's alternative name isn't matching?!? i notice that all the kiam-server.kiam.svc.cluster.local addresses are prefixed (e.g. 10-0-201-249.kiam-server.kiam.svc.cluster.local) in the log output i posted above. i added a wildcard to my hosts for the cert.
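
one way i can check that directly (assuming -checkhost is available in the local openssl build) is to test the extracted cert against one of the per-pod names from the gRPC log:

openssl x509 -in cert -noout -checkhost 10-0-201-249.kiam-server.kiam.svc.cluster.local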

jkassis commented 4 years ago

on the kiam-server side i see this...

WARNING: 2020/07/29 19:08:36 grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: read tcp [::1]:443->[::1]:43056: read: connection reset by peer"
INFO: 2020/07/29 19:08:36 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
WARNING: 2020/07/29 19:08:39 grpc: Server.Serve failed to complete security handshake from "127.0.0.1:39554": EOF
INFO: 2020/07/29 19:08:39 transport: loopyWriter.run returning. connection error: desc = "transport is closing"

jkassis commented 4 years ago

treating this as a security / auth issue... i tested a version of kiam built with security disabled... https://github.com/uswitch/kiam/commit/424ce1273cc24ecb9c69d367fe990765166387b9. i'm still getting this... on the agent...

[screenshot: agent error output]

On the server i continue to see this...


WARNING: 2020/07/30 15:20:20 grpc: Server.Serve failed to complete security handshake from "127.0.0.1:46922": EOF
INFO: 2020/07/30 15:20:20 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
WARNING: 2020/07/30 15:20:30 transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:443->127.0.0.1:47036: read: connection reset by peer

jkassis commented 4 years ago

i shut down all of the agents and still see these logs...

WARNING: 2020/07/30 19:34:45 grpc: Server.Serve failed to complete security handshake from "[::1]:47950": EOF
INFO: 2020/07/30 19:34:47 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
WARNING: 2020/07/30 19:34:47 grpc: Server.Serve failed to complete security handshake from "[::1]:48008": EOF
WARNING: 2020/07/30 19:34:55 grpc: Server.Serve failed to complete security handshake from "127.0.0.1:60980": EOF
INFO: 2020/07/30 19:34:55 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
WARNING: 2020/07/30 19:34:57 grpc: Server.Serve failed to create ServerTransport: EOF

so this is on the server exclusively.

jkassis commented 4 years ago

i'm looking at my underlying CNI provider. In OpenShift v4.5 it's OVN-Kubernetes out of the box. The support matrix indicates there are some support gaps. Could this be a problem? [screenshot: CNI support matrix]

jkassis commented 4 years ago

i believe my problems were related to the default OpenShift CNI plugins... both the SDN and OVN networking layers failed at this. Switching to Calico networking has solved what is essentially the same problem with kube2iam. i'm pretty confident that kiam would work with this as well, since the issue was related to the hardcoded firewall rule that blocks access to AWS and prevents reconfiguring iptables.
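
for context, the rule the agent tries to program is a nat PREROUTING DNAT on the metadata address, roughly along these lines (the interface match and port are assumptions that depend on your CNI and agent flags):

# illustrative only: redirect pod traffic for the EC2 metadata API to the local kiam agent
iptables -t nat -A PREROUTING -d 169.254.169.254/32 -i cali+ \
  -p tcp -m tcp --dport 80 -j DNAT --to-destination ${HOST_IP}:8181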