Interface conversion fails, causing pods not to get credentials.

kinghajj commented 5 years ago

Sometimes, pods in my cluster fail with errors like "fatal error: Unable to locate credentials". Checking the kube2iam logs for the instance on the same node, I see entries like this:

time="2018-11-20T18:40:26Z" level=error msg="PANIC error processing request: interface conversion: interface {} is nil, not *v1.Pod" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=172.19.97.48 res.status=500

This occurs on an EKS cluster running Kubernetes 1.10.3.

rifelpet commented 5 years ago

I've seen this same issue on Kubernetes 1.10.9 on a Kops-provisioned cluster. Using the latest 0.10.4 image. Very intermittent, I haven't been able to create a guaranteed repro case.

There are only 2 cases of *v1.Pod type assertion: https://github.com/jtblin/kube2iam/blob/master/k8s/k8s.go#L100-L106

The k8s.io/client-go library is pinned to a version over 20 months old and there have been many major releases since then. Perhaps its worth updating?

tmedford commented 5 years ago

We have seen this error as well. We upvoted to have this pulled in as well.

admssa commented 5 years ago

getting this on EKS 1.11.5

jrnt30 commented 5 years ago

I will look at updating tomorrow

tmedford commented 5 years ago

@jrnt30 Can we make sure to get a new build to? We will start experimenting whenever you release.

admssa commented 5 years ago

rolled back to 0.10.4 error is disappeared. i won't be using "latest" i won't be using "latest" repaet 10 times :)

jrnt30 commented 5 years ago

You only started seeing this is 0.10.6? Logs would help. Are the failures for running or deleted pods?

On Sat, Jan 26, 2019 at 15:26 Sergii Shapovalenko notifications@github.com wrote:

rolled back to 0.10.4 error is disappeared. i won't be using "latest" i won't be using "latest" repaet 10 times :)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jtblin/kube2iam/issues/178#issuecomment-457867339, or mute the thread https://github.com/notifications/unsubscribe-auth/AAWanESWzlvvkiCszNJB3ewhKPaocHUjks5vHMf9gaJpZM4YrwBM .

admssa commented 5 years ago

@jrnt30 Yes, I used "latest" tag for docker image. this only affected the pods on the nodes that were created after 0.10.6 release. Then i set tag 0.10.4 and error disappeared. \ I was getting

aws sts get-caller-identity

Unable to locate credentials. You can configure credentials by running "aws configure".

tmedford commented 5 years ago

We were on 10.4 and were having the error where pod could not get creds after term. After the update to 10.6 that went away for the application that has very variable scaling and has been very stable. Another application, that uses kubernetes jobs on static non-autoscaling boxes was having intermittent errors "unable to sign request without credentials set" with error in kube2iam. We had seen this in 10.4 though.

{"level":"error","msg":"PANIC error processing request: interface conversion: interface {} is nil, not *v1.Pod","req.method":"GET","req.path":"/latest/meta-data/iam/security-credentials/","req.remote":"100.96.192.4","res.status":500,"time":"2019-01-26T21:30:33Z"}.

This would not be related to deleted pods in our case for the fix in 10.6. Running with kops deploy on v1.10.11

hgranillo commented 5 years ago

rolled back to 0.10.4 error is disappeared. i won't be using "latest" i won't be using "latest" repaet 10 times :)

Found out the same today, thanks to your comment I realized that I did the same.

lumixen commented 5 years ago

Saw this as well in ours EKS.

thealmightygrant commented 5 years ago

Ditto. EKS 1.10.11

cnelson commented 5 years ago

Also seeing this on EKS Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.5-eks-6bad6d", GitCommit:"6bad6d9c768dc0864dab48a11653aa53b5a47043", GitTreeState:"clean", BuildDate:"2018-12-06T23:13:14Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}, roll back to 0.10.4 fixes it.

yevhenvolchenko commented 5 years ago

The same on AWS EKS Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.11-eks", GitCommit:"6bf27214b7e3e1e47dce27dcbd73ee1b27adadd0", GitTreeState:"clean", BuildDate:"2018-12-04T13:33:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

tprakash17 commented 5 years ago

Did we fix this in the latest image? We are seeing this in EKS.

jrnt30 commented 5 years ago

Apologies for the delay, I have not provisioned nor have a current cluster where I have reproduced this. Is this emergent behavior JUST from the 0.10.6 upgrade or was there also a cluster upgrade to 1.11 from a previous version?

Can anyone send a set of debug level logs that have this issue captured? What is not totally clear is if this interface conversation is occurring for newly deleted pod, new pods coming into the cluster, etc.

The PR that may be causing some of this #173 changed the semantics of what an active pod is. There was some work done to:

Update the K8 go client library
Add debugging statements to potentially identify where/why the conversion is failing

Simply adjusting these and "hoping" that this fixes it seems somewhat lazy and dangerous to me, so I any additional information we could get about the state of the cluster from people actually running into the issue would be extremely helpful.

jrnt30 commented 5 years ago

I released an RC tag to Docker Hub as 0.10.7-rc2 with the changes referenced in #189 (a merge of several different efforts).

As noted in my previous comment, I am unsure if this will address our interface conversion errors. They seem to be happening intermittently on 0.10.4 and more frequently for EKS users on 0.10.6. If we could get a few people to run the RC version to determine if it alleviates the issue that would help determine the best path forward.

JBOClara commented 5 years ago

I've just run the RC version and always seeing intermittent error:

botocore.exceptions.NoCredentialsError: Unable to locate credentials


time="2019-02-05T15:46:25Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:25Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:26Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:26Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-hxvqg pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:26Z" level=debug msg="Pod OnDelete" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-hxvqg pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:26Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:27Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:27Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:28Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:28Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2173819697.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.XXX.XXX.XXX res.duration=2.173819697e+09 res.status=404
time="2019-02-05T15:46:29Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:29Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:30Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:30Z" level=debug msg="Pod OnUpdate" pod.iam.role= pod.name=kube-downscaler-1549307100-rnj4x pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Pending
time="2019-02-05T15:46:31Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:31Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:31Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:31Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2207612285.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.XXX.XXX.XXX res.duration=2.207612285e+09 res.status=404
time="2019-02-05T15:46:44Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:46Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:46Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:46Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:47Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:47Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:48Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:48Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2023309635.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.XXX.XXX.XXX res.duration=2.023309635e+09 res.status=404```

At your disposal for further information.

jrnt30 commented 5 years ago

Thanks for taking the time to collect those logs! If you have found a way to replicate this it would be very much appreciated.

A few questions:

What platform and OS are running into this on?
Would you mind masking the pod IPs in a way that we can see a relationship between the log entries?
Are the pods that are failing "stable" or are they undergoing lifecycle changes?
Are all of these GETs for the same pod?
Are the Pod OnUpdate statements related to that pod?

Also if you have the /debug endpoint enabled it might help us cross-reference the issue.

I will try and dig into the pod indexer logic to see if I can come up with anything but realistically it will be a few days.

Thanks again!

JBOClara commented 5 years ago

What platform and OS are running into this on? Kubernetes v1.10.12 Debian GNU/Linux 9 (stretch) 4.9.0-8-amd64 docker://17.3.2 Provisioned with KOPS 1.11
Would you mind masking the pod IPs in a way that we can see a relationship between the log entries?

kube2iam's log when not working:

time="2019-02-05T21:34:43Z" level=debug msg="Pod OnAdd" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-rwzj9 pod.namespace=default pod.status.ip= pod.status.phase=Pending
time="2019-02-05T21:34:43Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-rwzj9 pod.namespace=default pod.status.ip= pod.status.phase=Pending
time="2019-02-05T21:34:45Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-rwzj9 pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T21:34:51Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:51Z" level=error msg="pods[1] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:51Z" level=error msg="pods[2] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:51Z" level=error msg="pods[4] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:51Z" level=error msg="pods[5] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[1] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[2] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[4] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[5] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-rwzj9 pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T21:34:53Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:53Z" level=error msg="pods[1] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:53Z" level=error msg="pods[2] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:53Z" level=error msg="pods[4] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:53Z" level=error msg="pods[5] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[1] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[2] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[4] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[5] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2653312418.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.YYY.0.1 res.duration=2.653312418e+09 res.status=404

my pod's log when not working:

kubectl logs -n default -l run=refresh-creds -p
2019-02-05 21:34:45 INFO     Loading configuration from cluster: kubernetes.default
2019-02-05 21:34:51 INFO     doing something in a loop ...
2019-02-05 21:34:51 INFO     Test logging #1
2019-02-05 21:34:52 INFO     Client STS #1: <botocore.client.STS object at 0x7f1c1c36efd0>
2019-02-05 21:34:52 INFO     ending loop ...
Traceback (most recent call last):
  File "/app/main.py", line 246, in <module>
    loop_controller()
  File "/app/main.py", line 238, in loop_controller
    raise e
  File "/app/main.py", line 220, in loop_controller
    docker, expire_at = ecr_login()
  File "/app/main.py", line 85, in ecr_login
    session = assume_role(ROLE_ARN, "default") # type: Session
  File "/app/main.py", line 67, in assume_role
    account_id = clt.get_caller_identity()["Account"]
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 648, in _make_api_call
    operation_model, request_dict, request_context)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 667, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 102, in make_request
    return self._send_request(request_dict, operation_model)
  File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 132, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 116, in create_request
    operation_name=operation_model.name)
  File "/usr/local/lib/python3.7/site-packages/botocore/hooks.py", line 356, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/signers.py", line 90, in handler
    return self.sign(operation_name, request)
  File "/usr/local/lib/python3.7/site-packages/botocore/signers.py", line 157, in sign
    auth.add_auth(request)
  File "/usr/local/lib/python3.7/site-packages/botocore/auth.py", line 357, in add_auth
    raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials

debug/store endpoint when not working:

curl -s http://127.0.0.1:8181/debug/store | jq '.'
{
  "namespaceByIP": {},
  "rolesByIP": {},
  "rolesByNamespace": {
    "222-iop-correction-qual-ksl": null,
    .....
    "weave": null
  }
}

kube2iam log when working:

time="2019-02-05T18:00:13Z" level=debug msg="Pod OnAdd" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip= pod.status.phase=Pending
time="2019-02-05T18:00:13Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip= pod.status.phase=Pending
time="2019-02-05T18:00:16Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T18:00:22Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (200) took 36243.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.YYY.0.1 res.duration=36243 res.status=200
time="2019-02-05T18:00:22Z" level=debug msg="retrieved credentials from sts endpoint: sts.amazonaws.com" ns.name=default pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" req.method=GET req.path=/latest/meta-data/iam/security-credentials/my-role req.remote=XXX.YYY.0.1
time="2019-02-05T18:00:22Z" level=info msg="GET /latest/meta-data/iam/security-credentials/my-role (200) took 427731917.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/my-role req.remote=XXX.YYY.0.1 res.duration=4.27731917e+08 res.status=200
time="2019-02-05T18:00:36Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-bmmzw pod.namespace=default pod.status.ip=XXX.YYY.0.3 pod.status.phase=Running
time="2019-02-05T18:00:36Z" level=debug msg="Pod OnDelete" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-bmmzw pod.namespace=default pod.status.ip=XXX.YYY.0.3 pod.status.phase=Running
time="2019-02-05T18:01:50Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T18:01:52Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T18:01:58Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (200) took 32940.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.YYY.0.1 res.duration=32940 res.status=200
time="2019-02-05T18:01:58Z" level=debug msg="retrieved credentials from sts endpoint: sts.amazonaws.com" ns.name=default pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" req.method=GET req.path=/latest/meta-data/iam/security-credentials/my-role req.remote=XXX.YYY.0.1
time="2019-02-05T18:01:58Z" level=info msg="GET /latest/meta-data/iam/security-credentials/my-role (200) took 75346.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/my-role req.remote=XXX.YYY.0.1 res.duration=75346 res.status=200

my pod log when working:

kubectl logs -n default -l run=refresh-creds -p
2019-02-05 18:05:42 INFO     Loading configuration from cluster: kubernetes.default
2019-02-05 18:05:48 INFO     doing something in a loop ...
2019-02-05 18:05:48 INFO     Test logging #1
2019-02-05 18:05:48 INFO     Client STS #1: <botocore.client.STS object at 0x7fd671f182b0>
2019-02-05 18:06:13 INFO     Account ID #1: ACCOUNT_ID
2019-02-05 18:06:39 INFO     Account ID #2: ACCOUNT_ID_2
2019-02-05 18:06:39 INFO     ECR CLIENT: <botocore.client.ECR object at 0x7fd671d7e240>

debug/store endpoint when working:

curl -s http://127.0.0.1:8181/debug/store | jq '.'
{
  "namespaceByIP": {
    "XXX.YYY.0.1": "default"
  },
  "rolesByIP": {
    "XXX.YYY.0.1": "arn:aws:iam::ACCOUNT_ID:role/my-role"
  },
  "rolesByNamespace": {
    "222-iop-correction-qual-ksl": null,
    ....
    "weave": null
  }
}

Are the pods that are failing "stable" or are they undergoing lifecycle changes? They are transiting from Pending to Running. But that also seems to fail when running.
Are all of these GETs for the same pod? The logs in this comment are captured from a tainted node. There is only few daemonset running on this node and only my app's pod running why a kube2iam annotation role.
Are the Pod OnUpdate statements related to that pod? Yes, for this comment.

JessieAMorris commented 5 years ago

I'm getting the same error on 2 different EKS clusters. I had this error on k8s nodes 1.10.3 and now 1.10.11. This seems to work fine for some period of time and then all newly created pods fail. I.e. after 24 hours or so of a node running then newly fetched credentials fail.

garo commented 5 years ago

We had one incident today with the 0.10.6, downgrading back to 0.10.4. Running EKS with 1.10.3.

I have an entrypoint script in the pod which polls the kube2iam metadata api until the credentials are associated and executes the actual pod command only after that. In our single case this bootstrap script never finished, so the bug was there "right from the pod startup" from my point of view.

JBOClara commented 5 years ago

I've implement a test self-destruction logic in my application POD (for testing). If it's not able to get credentials, it self-destructs.

And some times, this works...

But most of the time, it works immediately after kube2iam pod himself.

Unfortunately, I have to consider to switch for kiam.

wmorgan6796 commented 5 years ago

We've began noticing issues with 0.10.6 as well. Downgrading to 0.10.4 seems to work.

lanmalkieri commented 5 years ago

Can also confirm in EKS seeing this issue, going from tag:latest to tag:0.10.4 100% fixed this for us.

krjackso commented 5 years ago

Just experienced this on kube2iam 0.10.6, NOT running in EKS but with a kops cluster using weave.

Most of our deployments were okay, but two of them were getting 500's from kube2iam, kube2iam logs had a similar message for these 2 pods. What I noticed is that these are our deployments that are deployed with strategy Recreate and only 1 replica, which happens to keep the same pod IP address on the new pods. Is there a chance the pod IP lookup is broken? We ran for almost 10 minutes this way, and I was able to recover by deleting the kube2iam pods on the node of my affected pods (which would refresh whatever pod cache exists). Then I downgraded kube2iam to 0.10.4 and did another release without issues.

jrnt30 commented 5 years ago

I released an 0.10.7-rc3 release that reverts the logic that was introduced in #173 which I believe is the main culprit.

The observation about the Recreate may be a good lead and something to try however I am at a wedding this weekend and unable to invest enough time to test some of those scenarios (forgot my laptop charger). I'm going to push the next RC release to see if that resolves some of people's issues.

JBOClara commented 5 years ago

I've the same error with 0.10.7-rc3. I can't even get it works intermittently.

I've notice that error

curl http://169.254.169.254/latest/meta-data/iam/security-credentials/my-role
pod with specificed IP not found

When enabling verbose DEBUG log on my application, I've notice socket timeout.

2019-02-09 10:51:52 DEBUG    Looking for credentials via: boto-config
2019-02-09 10:51:52 DEBUG    Looking for credentials via: container-role
2019-02-09 10:51:52 DEBUG    Looking for credentials via: iam-role
2019-02-09 10:51:53 DEBUG    Caught retryable HTTP exception while making metadata service request to http://169.254.169.254/latest/meta-data/iam/security-credentials/: Read timeout on endpoint URL: "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.7/http/client.py", line 1321, in getresponse
    response.begin()
  File "/usr/local/lib/python3.7/http/client.py", line 296, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.7/http/client.py", line 257, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/botocore/httpsession.py", line 258, in send
    decode_content=False,
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 343, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 686, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: AWSHTTPConnectionPool(host='169.254.169.254', port=80): Read timed out. (read timeout=1)

During handling of the above exception, another exception occurred:

The response time is pretty hight:

time curl http://169.254.169.254/latest/meta-data/iam/security-credentials/my-role
pod with specificed IP not found

real    0m2.278s
user    0m0.004s
sys 0m0.000s

And the debug/store:

curl -s http://127.0.0.1:8181/debug/store | jq '.'
{
  "namespaceByIP": {},
  "rolesByIP": {},
  "rolesByNamespace": {

But, with kube2iam version 0.10.4:

debug/store is correctly populate:

{
  "namespaceByIP": {
    "100.114.0.1": "default",
    "100.114.0.2": "default"
  },
  "rolesByIP": {
    "100.114.0.1": "my-role",
    "100.114.0.2": ""
  },
  "rolesByNamespace": {

My app retreive the role's creds:


2019-02-09 10:56:52 DEBUG    Found credentials from IAM Role: my-role
2019-02-09 10:56:52 DEBUG    Loading JSON file: /usr/local/lib/python3.7/site-packages/botocore/data/endpoints.json
2019-02-09 10:56:52 DEBUG    Event choose-service-name: calling handler <function handle_service_name_alias at 0x7f537d039598>
2019-02-09 10:56:52 DEBUG    Loading JSON file: /usr/local/lib/python3.7/site-packages/botocore/data/sts/2011-06-15/service-2.json
2019-02-09 10:56:52 DEBUG    Event creating-client-class.sts: calling handler <function add_generate_presigned_url at 0x7f537d072e18>
2019-02-09 10:56:52 DEBUG    The s3 config key is not a dictionary type, ignoring its value of: None
2019-02-09 10:56:52 DEBUG    Setting sts timeout as (60, 60)
2019-02-09 10:56:52 DEBUG    Loading JSON file: /usr/local/lib/python3.7/site-packages/botocore/data/_retry.json
2019-02-09 10:56:52 DEBUG    Registering retry handlers for service: sts
2019-02-09 10:56:52 INFO     Client STS #1: <botocore.client.STS object at 0x7f537b11c198>
2019-02-09 10:56:52 DEBUG    Event before-parameter-build.sts.GetCallerIdentity: calling handler <function generate_idempotent_uuid at 0x7f537d055b70>
2019-02-09 10:56:52 DEBUG    Event before-call.sts.GetCallerIdentity: calling handler <function inject_api_version_header_if_needed at 0x7f537cfd9a60>
2019-02-09 10:56:52 DEBUG    Making request for OperationModel(name=GetCallerIdentity) with params: {'url_path': '/', 'query_string': '', 'method': 'POST', 'headers': {'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8', 'User-Agent': 'Boto3/1.9.86 Python/3.7.2 Linux/4.9.0-8-amd64 Botocore/1.12.86'}, 'body': {'Action': 'GetCallerIdentity', 'Version': '2011-06-15'}, 'url': 'https://sts.amazonaws.com/', 'context': {'client_region': 'aws-global', 'client_config': <botocore.config.Config object at 0x7f537b11c470>, 'has_streaming_input': False, 'auth_type': None}}
2019-02-09 10:56:52 DEBUG    Event request-created.sts.GetCallerIdentity: calling handler <bound method RequestSigner.handler of <botocore.signers.RequestSigner object at 0x7f537b11c438>>
2019-02-09 10:56:52 DEBUG    Event choose-signer.sts.GetCallerIdentity: calling handler <function set_operation_specific_signer at 0x7f537d055a60>
2019-02-09 10:56:52 DEBUG    Calculating signature using v4 auth.
2019-02-09 10:56:52 DEBUG    CanonicalRequest:
POST

metacyclic commented 5 years ago

Saw this today on EKS GitVersion:"v1.10.11-eks". The kube2iam pod was running 0.10.4. I killed the pod, and the daemonset brought another back, this time with 0.10.6 and it is doing what it is supposed to.

All the other worker nodes running kube2iam are "older" than the one that had the issue. It almost feels like AWS is ripping out something from under kube2iam and it gets in this state?

conversicachrisr commented 5 years ago

We are experiencing this issue as well: EKS 1.10.11, aws-vpc-cni-k8s 1.30 0.10.4 seems to resolve the issue. Occurred on 0.10.6

lbernail commented 5 years ago

I found this issue because we encountered the issue solved by #173 (no credentials for pods in the terminating phase). Since the original issue has been hard on us, I think I'll try to cherry-pick #173 on 0.10.4. I'll let you know if it works or not for us (at least we could get a confirmation that the root cause is #173)

lbernail commented 5 years ago

OK we tried 0.10.4 with #173 The fix for terminating pods works great but we've also started to see the interface condition issue. It's hard to correlate it with other events from debug logs (I see a "Pod OnUpdate" event and 8 minutes later I start getting the error.

I'm dumping the content of /debug/store every 5mn on a node where I know the problem happens regularly and I'll see if I can associate the store becoming empty to a specific event.

If it's not enough, I'll probably do a custom build with additional debug info (if anyone has suggestions, feel free to mention them)

lbernail commented 5 years ago

The error is happening very rarely with 0.10.4 and the termination patch I also got the error on the debug/store route at one point but only for a single pod IP at a time so it seems the index sometimes has invalid IP keys. I wonder if there is a guarantee that Indexers and Informer content are in sync

bpineau commented 5 years ago

Seems we should remove the DeletionTimestamp tests in isPodActive()? :

Pod's DeletionTimestamp is set when the pod deletion process begins, and expires (podDeleted=true in isPodActive()) when the pod is effectively deleted, after terminationGracePeriodSeconds.

The deletion event is received by our informer on pod's deletion final step, when DeletionTimestamp is expired.

So the timestamp tests in isPodActive() are of no use during most of pod's lifecycle, and have an adverse effect on deletions events, which are ignored (due to the !podDeleted test) and leave rogue index entries behind. Which would cause panics when acceded (eg. by reaching /debug/store).

rifelpet commented 5 years ago

My reasoning for checking the value of the timestamp rather than just ignoring the field altogether was to avoid this potential security risk:

A pod begins the deletion process. Its terminationGracePeriodSeconds field is set. kube2iam is informed of this event.
kube2iam loses its connection to the apiserver for some reason
The pod finishes its deletion process but kube2iam does not get informed of this
Another pod is scheduled and is assigned the same IP as the old pod
The new pod requests credentials and kube2iam incorrectly gives it the old pod's credentials

I'm not familiar enough with the informer code to know how viable of a situation this is, and I suppose this scenario could still happen with #173 as-is if the old pod finished deleting before its DeletionTimestamp and a new pod was scheduled before the old pod's deletion timestamp.

If we think this situation is not worth handling then I suppose just dropping any DeletionTimestamp check altogether in isPodActive would be the way to go.

Can you explain how #173 is causing rogue index entries to be left behind? Shouldn't the informer be receiving the pod deletion event and removing it from the index altogether regardless of the return value of isPodActive?

lbernail commented 5 years ago

@rifelpet : after looking into the code with @bpineau and performing a few tests, it appears that you need the indexFunc to return a key to delete an item from the index. If the indexFunc returns nil on a delete (which will happen every time a pod is deleted after the graceful period in #173), the entry matching the former pod ip in the index will never be removed. So when a new pod is created with the same IP the index will include the old pod in the set but this pod is not in the cache anymore.

Regarding disconnections from the apiserver, I believe the informer handles this situation well and makes sure that the cache remains consistent (by performing a re-list if needed). The index will remain consistent as long as deletes can modify it accordingly (which requires the indexFunc to return the correct key).

We are currently testing a custom build that removes any test on the deletionTimestamp and things look good so far. We are not completely sure that on delete events the pod object will always contains its IP but we are monitoring this and will give you feedback. An easy way to confirm this would be to dump the index after deletions but there is no way to do that with client-go today. We may create an issue to add the ability to retrieve the full index (or just to a custom build for debugging purposes).

The deletion logic in client-go is there: https://github.com/kubernetes/client-go/blob/master/tools/cache/thread_safe_store.go#L278-L298

If indexFunc returns nil, indexValues is nil and the key is never removed from the set.

lbernail commented 5 years ago

Quick update: we have been running a forked of the code with this for isPodActive:

    return p.Status.PodIP != "" &&
        v1.PodSucceeded != p.Status.Phase &&
        v1.PodFailed != p.Status.Phase

We haven't had any issue so far.

We still need to test for PodSucceeded and PodFailed because pods in this status keep their IP in etcd but it can be reallocated by the runtime/cni because the kubelet has deleted the sandbox. In this case, we catch an update with the phase change which means the IP from the previous status will be deleted from the index (https://github.com/kubernetes/client-go/blob/master/tools/cache/thread_safe_store.go#L253-L255) and since indexFunc will return nil nothing will be added back.

Similarly, we've seen some update to the pod status where the IP is removed before the deletion event. Since indexFunc also returns nil forpodIP=="" this means this update will also remove the IP the IP from the index.

As an extra precaution, we could add some logic in PodByIP to check that the pod is not nil (which could happen if the index end up in an inconsistent state, which should not happen except maybe in some edge cases such as out of order events) and that it is not Succeeded or Failed. We would have to do that by filtering over all pods in the set (in case we have a former Succeeded pod and the Running one indexed at the same IP for instance).

lbernail commented 5 years ago

@rifelpet I think I know why #173 creates this issue:

Before 173

Controller or user deletes pod
kubelet update the pod object to set deletion timestamp and starts graceful period
kube2iam get the update event
ispodactive returns nil (deletion timestamp set)
indexFunc returns nil
updateindices (Indexer code) deletes the old pod from the index (indexFunc still returns an IP for the old resource) and does not add the new one (indexFunc has returned nil). After this the pod is no longer in the index

After 173

Controller or user deletes pod
kubelet update the pod object to set deletion timestamp and starts graceful period
kube2iam get the update event
ispodactive returns ip (deletion timestamp set but after now())
indexFunc returns ip
updateIndices remove old object and add the new one (we still have an IP)
when the grace period expires, the kubelet deletes the pod and triggers a delete event
kube2iam gets the delete event
ispodactive returns nil (now() > deletion timestamp)
indexFunc returns nil
deleteFromIndices can't remove the pod from the index because the indexFunc has returned nil so the stale object remains in the index

This means this issue should only happen when the kubelet delete event is triggered after graceful termination expires (which probably means only for pods which were forcefully deleted because they were still running after the graceful period)

rifelpet commented 5 years ago

That makes sense to me, thanks for the investigation @lbernail.

I opened #203 to match the change you suggested. I will test it on our clusters as well.

One thing worth noting is that users were experiencing this nil pod issue before #173, albeit much less frequency. Do we have any idea why that might have been happening?

lbernail commented 5 years ago

@rifelpet Thank you for the PR. I can't easily think of a scenario where this would happen before #173 : it would be the pod has been deleted from the cache and the index is stale. I wonder if there could be a race condition in the deletion event:

indexer deletion is triggered
pod is deleted from the cache
we receive a get credential request
indexer deletion finishes

I'll need to dig into client-go to this if this can happen

A few ideas to work around the issue:

update client-go to the latest version (a few fixes/improvements in the indexer in more recent version)
verify that the pod from returned by the index is not nil. This could help for race conditions when we try to get a pod during its deletion (the scenario I described above) but would not solve the problem is the index is out of sync
since there is no way to force an index resync (once it has invalid information from an IP it will never be cleaned) another option would be to change the recover strategy in the http handler (simplest option: not recover at all, if kube2iam panics because of an out of sync index it will crash and restart with a clean one)

lbernail commented 5 years ago

@rifelpet I have an idea for the issue before #173 I think the issue could happen on force deletes (I haven't tested it though)

cdenneen commented 5 years ago

EKS v1.11.5

with 0.10.6 I was getting:

time="2019-04-08T15:46:45Z" level=error msg="PANIC error processing request: interface conversion: interface {} is nil, not *v1.Pod" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=10.100.10.101 res.status=500
time="2019-04-08T15:46:53Z" level=error msg="PANIC error processing request: interface conversion: interface {} is nil, not *v1.Pod" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=10.100.10.101 res.status=500
time="2019-04-08T15:47:18Z" level=error msg="PANIC error processing request: interface conversion: interface {} is nil, not *v1.Pod" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=10.100.10.101 res.status=500

with 0.10.7-rc2 I was getting:

time="2019-04-08T15:52:32Z" level=error msg="Error assuming role AccessDenied: Access denied\n\tstatus code: 403, request id: 52181169-5a16-11e9-94f6-cb571d5b30f7" ns.name=kube-system pod.iam.role="arn:aws:iam::XXXXXXXXXXXXX:role/k8scluster-alb-controller" req.method=GET req.path=/latest/meta-data/iam/security-credentials/k8scluster-alb-controller req.remote=10.100.10.101
time="2019-04-08T15:52:32Z" level=info msg="GET /latest/meta-data/iam/security-credentials/k8scluster-alb-controller (500) took 10697711.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/k8scluster-alb-controller req.remote=10.100.10.101 res.duration=1.0697711e+07 res.status=500

with 0.10.7-rc3 I am getting:

time="2019-04-08T15:54:42Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2805099146.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=10.240.63.122 res.duration=2.805099146e+09 res.status=404

so rc3 is looking more promising but getting 404 response

lbernail commented 5 years ago

@cdenneen we need #203 to be merged to get the complete fix I can provide a version with the patch (which we've been running for a while) but not based on an official release

cdenneen commented 5 years ago

@lbernail thanks... I thought the 2 rc's (rc2 and rc3) were to test this. Is there an image pushed to hub @jtblin @jrnt30 that i can test for this prior to merge/release?

lbernail commented 5 years ago

@cdenneen you can test with lbernail/kube2iam:0.10.6-patched It's based on 0.10.6 with the patch for graceful termination (built from https://github.com/DataDog/kube2iam/commits/0.10.6-dd-1)

cdenneen commented 5 years ago

@lbernail thanks... working perfectly... look forward to #203 merge

mbelang commented 4 years ago

Is this really fixed? I still have issue with 0.10.7 version. I had no issues with 0.10.4....

owengo commented 4 years ago

If someone still has the error:

  File "/usr/local/lib/python3.7/site-packages/botocore/signers.py", line 157, in sign
    auth.add_auth(request)
  File "/usr/local/lib/python3.7/site-packages/botocore/auth.py", line 357, in add_auth
    raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials

I did set the env variable AWS_METADATA_SERVICE_TIMEOUT=3 in the pod and it solves the problem. What happens sometimes is that with kube2iam the metadata connexion takes more than 1 second to answer, so the default timeout for metadata service is reached and since there is no other authentication method we have this error.

djackson-exo-inc commented 3 years ago

Any updates on this? This is still a pretty serious issue for us. We already have the AWS_METADATA_SERVICE_TIMEOUT=3 set for the relevant pods, but still get this issue pretty often. We are using 0.10.7

jtblin / kube2iam

Interface conversion fails, causing pods not to get credentials. #178

Before 173

After 173