uswitch / kiam

Integrate AWS IAM with Kubernetes
Apache License 2.0

Liveness probe failing with 404 #341

Open nuwang opened 4 years ago

nuwang commented 4 years ago

I've been trying to set up kiam and got to a point where the server appears to start correctly and obtain the relevant credentials. However, the agent keeps restarting because of a failing health check. Neither the server nor the agent logs any errors; the only symptom is a 404 on the agent's /health endpoint.

Command used to install

helm install uswitch/kiam --set agent.host.iptables=true --set agent.log.level=debug --set server.log.level=debug --set server.useHostNetwork=true --set server.service.port=7443 --set server.service.targetPort=7443 --set agent.extraEnv[0].name=GRPC_GO_LOG_SEVERITY_LEVEL --set agent.extraEnv[0].value=debug --set agent.extraEnv[1].name=GRPC_GO_LOG_VERBOSITY_LEVEL --set agent.extraEnv[1].value=\'10\' --set agent.host.interface=\!eth0 --set server.sslCertHostPath=/usr/share/ca-certificates/mozilla --set server.assumeRoleArn=arn:aws:iam::12345678012:role/kiam_server --set agent.gatewayTimeoutCreation=3s --set agent.deepLivenessProbe=true

Server logs (partially sanitized)

{"level":"info","msg":"starting server","time":"2019-12-13T20:41:32Z"} 
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2019-12-13T20:41:32Z"} 
{"level":"info","msg":"detecting arn prefix","time":"2019-12-13T20:41:32Z"} 
{"level":"info","msg":"using detected prefix: arn:aws:iam::12345678012:role/","time":"2019-12-13T20:41:32Z"} 
{"level":"info","msg":"will serve on 0.0.0.0:7443","time":"2019-12-13T20:41:32Z"} 
{"level":"info","msg":"starting credential manager process 0","time":"2019-12-13T20:41:32Z"} 
{"level":"info","msg":"starting credential manager process 1","time":"2019-12-13T20:41:32Z"} 
{"level":"info","msg":"starting credential manager process 2","time":"2019-12-13T20:41:32Z"} 
{"level":"info","msg":"starting credential manager process 3","time":"2019-12-13T20:41:32Z"} 
{"level":"info","msg":"starting credential manager process 4","time":"2019-12-13T20:41:32Z"} 
{"level":"info","msg":"starting credential manager process 5","time":"2019-12-13T20:41:32Z"} 
{"level":"info","msg":"starting credential manager process 6","time":"2019-12-13T20:41:32Z"} 
{"level":"info","msg":"starting credential manager process 7","time":"2019-12-13T20:41:32Z"} 
{"level":"info","msg":"started cache controller","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"canal-8vvlg","pod.namespace":"kube-system","pod.status.ip":"10.0.7.112","pod.status.phase":"Running","resource.version":"12600","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"coredns-5678df9bcc-svxq5","pod.namespace":"kube-system","pod.status.ip":"10.42.0.89","pod.status.phase":"Running","resource.version":"12357","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"default-http-backend-97bf46cd4-f9br7","pod.namespace":"ingress-nginx","pod.status.ip":"10.42.0.90","pod.status.phase":"Running","resource.version":"12401","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"nfs-provisioner-nfs-server-provisioner-0","pod.namespace":"cloudman","pod.status.ip":"10.42.0.111","pod.status.phase":"Running","resource.version":"12670","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"cm2-node-role","pod.name":"cloudman-cloudlaunchserver-celery-69b864dfcf-gtrmv","pod.namespace":"cloudman","pod.status.ip":"10.42.0.126","pod.status.phase":"Running","resource.version":"147309","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"announced pod","pod.iam.role":"cm2-node-role","pod.name":"cloudman-cloudlaunchserver-celery-69b864dfcf-gtrmv","pod.namespace":"cloudman","pod.status.ip":"10.42.0.126","pod.status.phase":"Running","resource.version":"147309","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"alertmanager-cloudman-prometheus-alertmanager-0","pod.namespace":"cloudman","pod.status.ip":"10.42.0.109","pod.status.phase":"Running","resource.version":"12646","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"cloudman-ui-595fcf7b9b-rn97s","pod.namespace":"cloudman","pod.status.ip":"10.42.0.96","pod.status.phase":"Running","resource.version":"12660","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"cloudman-keycloak-0","pod.namespace":"cloudman","pod.status.ip":"10.42.0.95","pod.status.phase":"Running","resource.version":"12795","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"rke-ingress-controller-deploy-job-8c6h6","pod.namespace":"kube-system","pod.status.ip":"10.0.7.112","pod.status.phase":"Succeeded","resource.version":"562","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"rke-network-plugin-deploy-job-4n65j","pod.namespace":"kube-system","pod.status.ip":"10.0.7.112","pod.status.phase":"Succeeded","resource.version":"344","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"vocal-lemur-kiam-agent-cjbjw","pod.namespace":"cloudman","pod.status.ip":"10.0.7.112","pod.status.phase":"Running","resource.version":"159387","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"cloudman-postgresql-0","pod.namespace":"cloudman","pod.status.ip":"10.42.0.112","pod.status.phase":"Running","resource.version":"12770","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"metrics-server-784769f887-t4lx5","pod.namespace":"kube-system","pod.status.ip":"10.42.0.92","pod.status.phase":"Running","resource.version":"12293","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"cert-manager-cainjector-54c4796c5d-9nrpq","pod.namespace":"cert-manager","pod.status.ip":"10.42.0.94","pod.status.phase":"Running","resource.version":"12582","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"cloudman-prometheus-operator-7b78b44dbb-8w9ww","pod.namespace":"cloudman","pod.status.ip":"10.42.0.102","pod.status.phase":"Running","resource.version":"12538","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"cloudman-rabbitmq-0","pod.namespace":"cloudman","pod.status.ip":"10.42.0.97","pod.status.phase":"Running","resource.version":"12656","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"prometheus-cloudman-prometheus-prometheus-0","pod.namespace":"cloudman","pod.status.ip":"10.42.0.108","pod.status.phase":"Running","resource.version":"12551","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"cert-manager-665898448d-8f8kw","pod.namespace":"cert-manager","pod.status.ip":"10.42.0.99","pod.status.phase":"Running","resource.version":"12586","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"cloudman-grafana-ccc54dc69-ndwcx","pod.namespace":"cloudman","pod.status.ip":"10.42.0.107","pod.status.phase":"Running","resource.version":"12569","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"romping-wallaby-kubernetes-dashboard-5f4c75cccf-hfgpp","pod.namespace":"kube-system","pod.status.ip":"10.42.0.103","pod.status.phase":"Running","resource.version":"12622","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"tiller-deploy-659c6788f5-cnt8s","pod.namespace":"kube-system","pod.status.ip":"10.42.0.105","pod.status.phase":"Running","resource.version":"12634","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"csi-cvmfsplugin-provisioner-0","pod.namespace":"cvmfs","pod.status.ip":"10.42.0.104","pod.status.phase":"Running","resource.version":"12575","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"csi-cvmfsplugin-wkkzl","pod.namespace":"cvmfs","pod.status.ip":"10.0.7.112","pod.status.phase":"Running","resource.version":"12638","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"nginx-ingress-controller-ft74r","pod.namespace":"ingress-nginx","pod.status.ip":"10.0.7.112","pod.status.phase":"Running","resource.version":"12619","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"coredns-autoscaler-57bc9c9bd-4bptv","pod.namespace":"kube-system","pod.status.ip":"10.42.0.91","pod.status.phase":"Running","resource.version":"12390","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"rke-coredns-addon-deploy-job-bv4wt","pod.namespace":"kube-system","pod.status.ip":"10.0.7.112","pod.status.phase":"Succeeded","resource.version":"407","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"cloudman-kube-state-metrics-64cb69656c-sgzp9","pod.namespace":"cloudman","pod.status.ip":"10.42.0.93","pod.status.phase":"Running","resource.version":"12626","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"cloudman-prometheus-node-exporter-78sbj","pod.namespace":"cloudman","pod.status.ip":"10.0.7.112","pod.status.phase":"Running","resource.version":"12559","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"cm2-node-role","pod.name":"cloudman-cloudlaunchserver-75795b47d5-rhfb8","pod.namespace":"cloudman","pod.status.ip":"10.42.0.125","pod.status.phase":"Running","resource.version":"140497","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"announced pod","pod.iam.role":"cm2-node-role","pod.name":"cloudman-cloudlaunchserver-75795b47d5-rhfb8","pod.namespace":"cloudman","pod.status.ip":"10.42.0.125","pod.status.phase":"Running","resource.version":"140497","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"rke-metrics-addon-deploy-job-tvz4m","pod.namespace":"kube-system","pod.status.ip":"10.0.7.112","pod.status.phase":"Succeeded","resource.version":"475","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"vocal-lemur-kiam-server-q5tkk","pod.namespace":"cloudman","pod.status.ip":"10.0.7.112","pod.status.phase":"Pending","resource.version":"159515","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"cloudman-influxdb-0","pod.namespace":"cloudman","pod.status.ip":"10.42.0.110","pod.status.phase":"Running","resource.version":"12530","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"csi-cvmfsplugin-attacher-0","pod.namespace":"cvmfs","pod.status.ip":"10.42.0.106","pod.status.phase":"Running","resource.version":"12544","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"cattle-cluster-agent-8577486bc7-j75sz","pod.namespace":"cattle-system","pod.status.ip":"10.42.0.101","pod.status.phase":"Running","resource.version":"12706","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"added pod","pod.iam.role":"","pod.name":"cattle-node-agent-75xhc","pod.namespace":"cattle-system","pod.status.ip":"10.0.7.112","pod.status.phase":"Running","resource.version":"12567","time":"2019-12-13T20:41:32Z"} 
{"level":"info","msg":"started namespace cache controller","time":"2019-12-13T20:41:32Z"} 
{"credentials.access.key":"ASIAVBVEAZ4BTSANTIZED","credentials.expiration":"2019-12-13T20:56:32Z","credentials.role":"cm2-node-role","level":"info","msg":"requested new credentials","time":"2019-12-13T20:41:32Z"} 
{"credentials.access.key":"ASIAVBVEAZ4BTSANTIZED","credentials.expiration":"2019-12-13T20:56:32Z","credentials.role":"cm2-node-role","generation.metadata":0,"level":"info","msg":"fetched credentials","pod.iam.role":"cm2-node-role","pod.name":"cloudman-cloudlaunchserver-75795b47d5-rhfb8","pod.namespace":"cloudman","pod.status.ip":"10.42.0.125","pod.status.phase":"Running","resource.version":"140497","time":"2019-12-13T20:41:32Z"} 
{"credentials.access.key":"ASIAVBVEAZ4BTSANTIZED","credentials.expiration":"2019-12-13T20:56:32Z","credentials.role":"cm2-node-role","generation.metadata":0,"level":"info","msg":"fetched credentials","pod.iam.role":"cm2-node-role","pod.name":"cloudman-cloudlaunchserver-celery-69b864dfcf-gtrmv","pod.namespace":"cloudman","pod.status.ip":"10.42.0.126","pod.status.phase":"Running","resource.version":"147309","time":"2019-12-13T20:41:32Z"} 
{"level":"debug","msg":"added namespace","namespace":"cattle-system","namespace.permitted":"","time":"2019-12-13T20:41:32Z"} 
{"level":"debug","msg":"added namespace","namespace":"cloudman","namespace.permitted":".*","time":"2019-12-13T20:41:32Z"} 
{"level":"debug","msg":"added namespace","namespace":"cvmfs","namespace.permitted":"","time":"2019-12-13T20:41:32Z"} 
{"level":"debug","msg":"added namespace","namespace":"default","namespace.permitted":"","time":"2019-12-13T20:41:32Z"} 
{"level":"debug","msg":"added namespace","namespace":"kube-public","namespace.permitted":"","time":"2019-12-13T20:41:32Z"} 
{"level":"debug","msg":"added namespace","namespace":"kube-system","namespace.permitted":"","time":"2019-12-13T20:41:32Z"} 
{"level":"debug","msg":"added namespace","namespace":"cert-manager","namespace.permitted":"","time":"2019-12-13T20:41:32Z"} 
{"level":"debug","msg":"added namespace","namespace":"ingress-nginx","namespace.permitted":"","time":"2019-12-13T20:41:32Z"} 
{"level":"debug","msg":"added namespace","namespace":"kube-node-lease","namespace.permitted":"","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"updated pod","pod.iam.role":"","pod.name":"vocal-lemur-kiam-server-q5tkk","pod.namespace":"cloudman","pod.status.ip":"10.0.7.112","pod.status.phase":"Running","resource.version":"159524","time":"2019-12-13T20:41:32Z"} 
{"generation.metadata":0,"level":"debug","msg":"updated pod","pod.iam.role":"","pod.name":"vocal-lemur-kiam-server-q5tkk","pod.namespace":"cloudman","pod.status.ip":"10.0.7.112","pod.status.phase":"Running","resource.version":"159536","time":"2019-12-13T20:41:36Z"} 

Agent log

{"level":"info","msg":"configuring iptables","time":"2019-12-13T20:42:15Z"}
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2019-12-13T20:42:15Z"}
{"level":"info","msg":"listening :8181","time":"2019-12-13T20:42:15Z"}
{"level":"info","msg":"stopped","time":"2019-12-13T20:42:25Z"}
{"level":"info","msg":"starting server shutdown","time":"2019-12-13T20:42:25Z"}
{"level":"info","msg":"gracefully shutdown server","time":"2019-12-13T20:42:25Z"}
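The agent's log lines are JSON objects, so the time between startup and shutdown can be read off programmatically. Below is a minimal Python sketch using two lines copied from the agent log above; the helper name `seconds_between` is made up for illustration and is not part of kiam.

```python
import json
from datetime import datetime

# Two lines copied verbatim from the agent log above.
AGENT_LOG = """\
{"level":"info","msg":"listening :8181","time":"2019-12-13T20:42:15Z"}
{"level":"info","msg":"stopped","time":"2019-12-13T20:42:25Z"}
"""


def seconds_between(log_text, start_msg, end_msg):
    """Return the seconds elapsed between the first log entries
    whose "msg" field equals start_msg and end_msg."""
    stamps = {}
    for line in log_text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        when = datetime.strptime(entry["time"], "%Y-%m-%dT%H:%M:%SZ")
        # Keep only the first occurrence of each message.
        stamps.setdefault(entry["msg"], when)
    return (stamps[end_msg] - stamps[start_msg]).total_seconds()


elapsed = seconds_between(AGENT_LOG, "listening :8181", "stopped")
print(elapsed)  # 10.0
```

A ten-second gap between "listening" and "stopped" is consistent with the container being restarted after a failed liveness probe, although the log itself does not state why the agent stopped.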

Liveness endpoint status

/ # curl localhost:8181/health
/ # curl localhost:8181/ping
/ # curl localhost:8181/

all return

<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>openresty/1.15.8.1</center>
</body>
</html>
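The footer of the error page names the software that generated it. Here it reads openresty, which suggests (but does not prove) that something other than the kiam agent answered on port 8181. A small Python sketch for pulling that banner out of such a page, assuming the nginx-style footer layout shown above; `server_banner` is a hypothetical helper name:

```python
import re

# The 404 body copied from the curl output above.
ERROR_PAGE = """\
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>openresty/1.15.8.1</center>
</body>
</html>
"""


def server_banner(html):
    """Return the 'name/version' text from an nginx-style error page
    footer, or None if the page has no such footer."""
    match = re.search(r"<hr><center>([^<]+)</center>", html)
    return match.group(1) if match else None


print(server_banner(ERROR_PAGE))  # openresty/1.15.8.1
```

The same information is usually available in the `Server` response header, which `curl -i` prints along with the body.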

Environment

Single-node cluster, Rancher v2.1.7. Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T19:18:23Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:07:57Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}

Things I've tried

  1. The agent can curl the server on port 7443.
  2. Tried changing the agent interface to docker0, cali+, etc., but that didn't change the issue.
  3. The main issue seems to be that all URLs on port 8181 return a 404.
  4. Enabling agent.deepLivenessProbe changed nothing.
  5. The server is correctly identifying the annotated namespaces and obtaining its role.
  6. Disabling the liveness probe and trying to assume a role doesn't seem to work either.
  7. Downgrading to an older version of the chart/images didn't help.

nuwang commented 4 years ago

It started working after I specified agent.host.port, so it looks like something else was running on 8181. Should this be filed as a bug, since the agent gives no indication that something else is already using 8181?

This is the final working install command:

helm install uswitch/kiam --set agent.host.iptables=true --set server.useHostNetwork=true --set server.service.port=7443 --set server.service.targetPort=7443 --set server.log.level=debug --set server.extraEnv[0].name=GRPC_GO_LOG_SEVERITY_LEVEL --set server.extraEnv[0].value=debug --set server.extraEnv[1].name=GRPC_GO_LOG_VERBOSITY_LEVEL --set server.extraEnv[1].value=\'10\' --set agent.log.level=debug --set agent.extraEnv[0].name=GRPC_GO_LOG_SEVERITY_LEVEL --set agent.extraEnv[0].value=debug --set agent.extraEnv[1].name=GRPC_GO_LOG_VERBOSITY_LEVEL --set agent.extraEnv[1].value=\'10\' --set agent.host.interface=\!eth0 --set server.sslCertHostPath=/usr/share/ca-certificates/mozilla --set server.assumeRoleArn=arn:aws:iam::123456789012:role/kiam_server --set agent.gatewayTimeoutCreation=1s --set agent.host.port=9021

calebklahre commented 4 years ago

I ran into this same issue. In my case the other process was CoreDNS, which serves its readiness HTTP endpoint on port 8181. Kiam doesn't seem to verify that it actually bound the port, so it never reports an error that the port is already in use.
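Since the agent gives no warning here, one way to check for a conflict before installing is to try binding the port yourself. The following is a minimal Python sketch, not anything kiam provides: `port_in_use` is a hypothetical helper, and it only gives a meaningful answer when run in the same network namespace as the agent (for example, directly on the node).

```python
import errno
import socket


def port_in_use(port, host="0.0.0.0"):
    """Return True if binding a TCP socket to host:port fails with
    EADDRINUSE, i.e. some other socket already holds that address."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # Permission problems etc. are not "in use".
        return False


if __name__ == "__main__":
    # 8181 is the agent port discussed in this thread.
    print(port_in_use(8181))
```

If this reports the port as taken, setting agent.host.port to a free port, as in the working install command above, avoids the clash.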

robvadai commented 3 years ago

Port 8181 was working fine for me yesterday; today it started reporting that something was occupying the port. I changed the agent port and the issue went away. I have the feeling it's a bit flaky, although I was deploying and removing it many times for testing.