Open kinghajj opened 5 years ago
I've seen this same issue on Kubernetes 1.10.9 on a Kops-provisioned cluster. Using the latest 0.10.4 image. Very intermittent, I haven't been able to create a guaranteed repro case.
There are only 2 cases of *v1.Pod type assertion:
https://github.com/jtblin/kube2iam/blob/master/k8s/k8s.go#L100-L106
The k8s.io/client-go library is pinned to a version over 20 months old and there have been many major releases since then. Perhaps it's worth updating?
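For context on the panic itself: an unchecked type assertion on a nil interface value panics, while the two-value ("comma ok") form does not. A minimal sketch of the Go semantics only, not the actual kube2iam code:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	var obj interface{} // e.g. a nil entry handed back by a cache/index lookup

	// Two-value assertion: ok is false for nil or mismatched types, no panic.
	if pod, ok := obj.(*v1.Pod); ok {
		fmt.Println("pod:", pod.Name)
	} else {
		fmt.Println("not a *v1.Pod:", obj)
	}

	// Unchecked assertion: panics with
	// "interface conversion: interface {} is nil, not *v1.Pod".
	_ = obj.(*v1.Pod)
}
```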
We have seen this error as well and have upvoted to have this pulled in.
getting this on EKS 1.11.5
I will look at updating tomorrow
@jrnt30 Can we make sure to get a new build too? We will start experimenting whenever you release.
Rolled back to 0.10.4 and the error disappeared. I won't be using "latest", I won't be using "latest"... (repeat 10 times) :)
You only started seeing this in 0.10.6? Logs would help. Are the failures for running or deleted pods?
@jrnt30 Yes, I used the "latest" tag for the docker image. This only affected the pods on nodes that were created after the 0.10.6 release. Then I set the tag to 0.10.4 and the error disappeared. I was getting:
aws sts get-caller-identity
Unable to locate credentials. You can configure credentials by running "aws configure".
We were on 0.10.4 and were having the error where a pod could not get creds after termination. After the update to 0.10.6 that went away for the application that has very variable scaling, and it has been very stable. Another application, which uses Kubernetes jobs on static non-autoscaling boxes, was having intermittent "unable to sign request without credentials set" errors, with this error in kube2iam (we had seen this in 0.10.4 though):
{"level":"error","msg":"PANIC error processing request: interface conversion: interface {} is nil, not *v1.Pod","req.method":"GET","req.path":"/latest/meta-data/iam/security-credentials/","req.remote":"100.96.192.4","res.status":500,"time":"2019-01-26T21:30:33Z"}.
In our case this would not be related to deleted pods (the case addressed by the fix in 0.10.6). Running with a kops deployment on v1.10.11.
Rolled back to 0.10.4 and the error disappeared. I won't be using "latest"... (repeat 10 times) :)

Found out the same today; thanks to your comment I realized that I had done the same.
Saw this as well in our EKS.
Ditto. EKS 1.10.11
Also seeing this on EKS Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.5-eks-6bad6d", GitCommit:"6bad6d9c768dc0864dab48a11653aa53b5a47043", GitTreeState:"clean", BuildDate:"2018-12-06T23:13:14Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}; rolling back to 0.10.4 fixes it.
The same on AWS EKS Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.11-eks", GitCommit:"6bf27214b7e3e1e47dce27dcbd73ee1b27adadd0", GitTreeState:"clean", BuildDate:"2018-12-04T13:33:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Did we fix this in the latest image? We are seeing this in EKS.
Apologies for the delay; I have not provisioned, nor do I currently have, a cluster where I have reproduced this. Is this emergent behavior JUST from the 0.10.6 upgrade, or was there also a cluster upgrade to 1.11 from a previous version?
Can anyone send a set of debug-level logs that have this issue captured? What is not totally clear is whether this interface conversion is occurring for newly deleted pods, new pods coming into the cluster, etc.
The PR that may be causing some of this, #173, changed the semantics of what an active pod is. There was some work done to:
Simply adjusting these and "hoping" that this fixes it seems somewhat lazy and dangerous to me, so any additional information we could get about the state of the cluster from people actually running into the issue would be extremely helpful.
I released an RC tag to Docker Hub as 0.10.7-rc2 with the changes referenced in #189 (a merge of several different efforts).
As noted in my previous comment, I am unsure if this will address our interface conversion errors. They seem to be happening intermittently on 0.10.4 and more frequently for EKS users on 0.10.6. If we could get a few people to run the RC version to determine if it alleviates the issue, that would help determine the best path forward.
I've just run the RC version and am still seeing the intermittent error:
botocore.exceptions.NoCredentialsError: Unable to locate credentials
time="2019-02-05T15:46:25Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:25Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:26Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:26Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-hxvqg pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:26Z" level=debug msg="Pod OnDelete" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-hxvqg pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:26Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:27Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:27Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:28Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:28Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2173819697.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.XXX.XXX.XXX res.duration=2.173819697e+09 res.status=404
time="2019-02-05T15:46:29Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:29Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:30Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:30Z" level=debug msg="Pod OnUpdate" pod.iam.role= pod.name=kube-downscaler-1549307100-rnj4x pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Pending
time="2019-02-05T15:46:31Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:31Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:31Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:31Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2207612285.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.XXX.XXX.XXX res.duration=2.207612285e+09 res.status=404
time="2019-02-05T15:46:44Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:46Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:46Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:46Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:47Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:47Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/kube2iam-cluster-aws-client-france-fr-allow-ecr" pod.name=refresh-creds-7465d56d76-qccvt pod.namespace=default pod.status.ip=XXX.XXX.XXX.XXX pod.status.phase=Running
time="2019-02-05T15:46:48Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T15:46:48Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2023309635.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.XXX.XXX.XXX res.duration=2.023309635e+09 res.status=404```
I'm at your disposal for further information.
Thanks for taking the time to collect those logs! If you have found a way to replicate this, sharing it would be very much appreciated.
A few questions:
Are the pods that are failing "stable" or are they undergoing lifecycle changes?
Are all of these GETs for the same pod?
Are the Pod OnUpdate statements related to that pod?
Also, if you have the /debug endpoint enabled, its output might help us cross-reference the issue.
I will try and dig into the pod indexer logic to see if I can come up with anything but realistically it will be a few days.
Thanks again!
What platform and OS are you running into this on?
Kubernetes v1.10.12, Debian GNU/Linux 9 (stretch), 4.9.0-8-amd64, docker://17.3.2, provisioned with kops 1.11.
Would you mind masking the pod IPs in a way that we can see a relationship between the log entries?
kube2iam's log when not working:
time="2019-02-05T21:34:43Z" level=debug msg="Pod OnAdd" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-rwzj9 pod.namespace=default pod.status.ip= pod.status.phase=Pending
time="2019-02-05T21:34:43Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-rwzj9 pod.namespace=default pod.status.ip= pod.status.phase=Pending
time="2019-02-05T21:34:45Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-rwzj9 pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T21:34:51Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:51Z" level=error msg="pods[1] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:51Z" level=error msg="pods[2] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:51Z" level=error msg="pods[4] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:51Z" level=error msg="pods[5] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[1] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[2] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[4] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=error msg="pods[5] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:52Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-rwzj9 pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T21:34:53Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:53Z" level=error msg="pods[1] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:53Z" level=error msg="pods[2] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:53Z" level=error msg="pods[4] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:53Z" level=error msg="pods[5] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[0] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[1] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[2] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[4] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=error msg="pods[5] is not *v1.Pod! <nil>"
time="2019-02-05T21:34:54Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2653312418.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.YYY.0.1 res.duration=2.653312418e+09 res.status=404
my pod's log when not working:
kubectl logs -n default -l run=refresh-creds -p
2019-02-05 21:34:45 INFO Loading configuration from cluster: kubernetes.default
2019-02-05 21:34:51 INFO doing something in a loop ...
2019-02-05 21:34:51 INFO Test logging #1
2019-02-05 21:34:52 INFO Client STS #1: <botocore.client.STS object at 0x7f1c1c36efd0>
2019-02-05 21:34:52 INFO ending loop ...
Traceback (most recent call last):
File "/app/main.py", line 246, in <module>
loop_controller()
File "/app/main.py", line 238, in loop_controller
raise e
File "/app/main.py", line 220, in loop_controller
docker, expire_at = ecr_login()
File "/app/main.py", line 85, in ecr_login
session = assume_role(ROLE_ARN, "default") # type: Session
File "/app/main.py", line 67, in assume_role
account_id = clt.get_caller_identity()["Account"]
File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 648, in _make_api_call
operation_model, request_dict, request_context)
File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 667, in _make_request
return self._endpoint.make_request(operation_model, request_dict)
File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 102, in make_request
return self._send_request(request_dict, operation_model)
File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 132, in _send_request
request = self.create_request(request_dict, operation_model)
File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 116, in create_request
operation_name=operation_model.name)
File "/usr/local/lib/python3.7/site-packages/botocore/hooks.py", line 356, in emit
return self._emitter.emit(aliased_event_name, **kwargs)
File "/usr/local/lib/python3.7/site-packages/botocore/hooks.py", line 228, in emit
return self._emit(event_name, kwargs)
File "/usr/local/lib/python3.7/site-packages/botocore/hooks.py", line 211, in _emit
response = handler(**kwargs)
File "/usr/local/lib/python3.7/site-packages/botocore/signers.py", line 90, in handler
return self.sign(operation_name, request)
File "/usr/local/lib/python3.7/site-packages/botocore/signers.py", line 157, in sign
auth.add_auth(request)
File "/usr/local/lib/python3.7/site-packages/botocore/auth.py", line 357, in add_auth
raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials
debug/store endpoint when not working:
curl -s http://127.0.0.1:8181/debug/store | jq '.'
{
"namespaceByIP": {},
"rolesByIP": {},
"rolesByNamespace": {
"222-iop-correction-qual-ksl": null,
.....
"weave": null
}
}
kube2iam log when working:
time="2019-02-05T18:00:13Z" level=debug msg="Pod OnAdd" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip= pod.status.phase=Pending
time="2019-02-05T18:00:13Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip= pod.status.phase=Pending
time="2019-02-05T18:00:16Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T18:00:22Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (200) took 36243.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.YYY.0.1 res.duration=36243 res.status=200
time="2019-02-05T18:00:22Z" level=debug msg="retrieved credentials from sts endpoint: sts.amazonaws.com" ns.name=default pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" req.method=GET req.path=/latest/meta-data/iam/security-credentials/my-role req.remote=XXX.YYY.0.1
time="2019-02-05T18:00:22Z" level=info msg="GET /latest/meta-data/iam/security-credentials/my-role (200) took 427731917.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/my-role req.remote=XXX.YYY.0.1 res.duration=4.27731917e+08 res.status=200
time="2019-02-05T18:00:36Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-bmmzw pod.namespace=default pod.status.ip=XXX.YYY.0.3 pod.status.phase=Running
time="2019-02-05T18:00:36Z" level=debug msg="Pod OnDelete" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-bmmzw pod.namespace=default pod.status.ip=XXX.YYY.0.3 pod.status.phase=Running
time="2019-02-05T18:01:50Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T18:01:52Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" pod.name=refresh-creds-6c6cb8b588-44zvk pod.namespace=default pod.status.ip=XXX.YYY.0.1 pod.status.phase=Running
time="2019-02-05T18:01:58Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (200) took 32940.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=XXX.YYY.0.1 res.duration=32940 res.status=200
time="2019-02-05T18:01:58Z" level=debug msg="retrieved credentials from sts endpoint: sts.amazonaws.com" ns.name=default pod.iam.role="arn:aws:iam::ACCOUNT_ID:role/my-role" req.method=GET req.path=/latest/meta-data/iam/security-credentials/my-role req.remote=XXX.YYY.0.1
time="2019-02-05T18:01:58Z" level=info msg="GET /latest/meta-data/iam/security-credentials/my-role (200) took 75346.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/my-role req.remote=XXX.YYY.0.1 res.duration=75346 res.status=200
my pod log when working:
kubectl logs -n default -l run=refresh-creds -p
2019-02-05 18:05:42 INFO Loading configuration from cluster: kubernetes.default
2019-02-05 18:05:48 INFO doing something in a loop ...
2019-02-05 18:05:48 INFO Test logging #1
2019-02-05 18:05:48 INFO Client STS #1: <botocore.client.STS object at 0x7fd671f182b0>
2019-02-05 18:06:13 INFO Account ID #1: ACCOUNT_ID
2019-02-05 18:06:39 INFO Account ID #2: ACCOUNT_ID_2
2019-02-05 18:06:39 INFO ECR CLIENT: <botocore.client.ECR object at 0x7fd671d7e240>
debug/store endpoint when working:
curl -s http://127.0.0.1:8181/debug/store | jq '.'
{
"namespaceByIP": {
"XXX.YYY.0.1": "default"
},
"rolesByIP": {
"XXX.YYY.0.1": "arn:aws:iam::ACCOUNT_ID:role/my-role"
},
"rolesByNamespace": {
"222-iop-correction-qual-ksl": null,
....
"weave": null
}
}
Are the pods that are failing "stable" or are they undergoing lifecycle changes? They are transitioning from Pending to Running. But it also seems to fail when running.
Are all of these GETs for the same pod? The logs in this comment are captured from a tainted node. There are only a few daemonsets running on this node, and only my app's pod is running with a kube2iam role annotation.
Are the Pod OnUpdate statements related to that pod? Yes, for this comment.
I'm getting the same error on 2 different EKS clusters. I had this error on k8s nodes 1.10.3 and now 1.10.11. This seems to work fine for some period of time and then all newly created pods fail, i.e. after a node has been running for 24 hours or so, newly fetched credentials fail.
We had one incident today with 0.10.6, downgrading back to 0.10.4. Running EKS with 1.10.3.
I have an entrypoint script in the pod which polls the kube2iam metadata API until the credentials are associated, and executes the actual pod command only after that. In our single case this bootstrap script never finished, so from my point of view the bug was there right from pod startup.
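For anyone wanting to do something similar, here is a rough Go equivalent of that kind of bootstrap wait loop. The URL is the standard metadata path that kube2iam intercepts; the retry count and timeouts are arbitrary illustrative choices, not taken from the script described above:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	const credsURL = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
	client := &http.Client{Timeout: 2 * time.Second}

	// Give kube2iam up to ~1 minute to index this pod before giving up.
	for attempt := 0; attempt < 30; attempt++ {
		resp, err := client.Get(credsURL)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				fmt.Println("credentials available, starting the real entrypoint")
				return
			}
		}
		time.Sleep(2 * time.Second)
	}
	fmt.Fprintln(os.Stderr, "timed out waiting for kube2iam credentials")
	os.Exit(1)
}
```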
I've implemented test self-destruction logic in my application pod (for testing): if it's not able to get credentials, it self-destructs.
Sometimes this works, but most of the time it only works right after the kube2iam pod itself has started.
Unfortunately, I have to consider switching to kiam.
We've begun noticing issues with 0.10.6 as well. Downgrading to 0.10.4 seems to work.
Can also confirm we're seeing this issue in EKS; going from tag:latest to tag:0.10.4 100% fixed this for us.
Just experienced this on kube2iam 0.10.6, NOT running in EKS but with a kops cluster using weave.
Most of our deployments were okay, but two of them were getting 500's from kube2iam, kube2iam logs had a similar message for these 2 pods. What I noticed is that these are our deployments that are deployed with strategy Recreate and only 1 replica, which happens to keep the same pod IP address on the new pods. Is there a chance the pod IP lookup is broken? We ran for almost 10 minutes this way, and I was able to recover by deleting the kube2iam pods on the node of my affected pods (which would refresh whatever pod cache exists). Then I downgraded kube2iam to 0.10.4 and did another release without issues.
I released a 0.10.7-rc3 release that reverts the logic that was introduced in #173, which I believe is the main culprit.
The observation about the Recreate strategy may be a good lead and something to try; however, I am at a wedding this weekend and unable to invest enough time to test some of those scenarios (forgot my laptop charger). I'm going to push the next RC release to see if that resolves some of people's issues.
I have the same error with 0.10.7-rc3. I can't even get it to work intermittently.
I've noticed this error:
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/my-role
pod with specificed IP not found
When enabling verbose DEBUG logs in my application, I noticed a socket timeout:
2019-02-09 10:51:52 DEBUG Looking for credentials via: boto-config
2019-02-09 10:51:52 DEBUG Looking for credentials via: container-role
2019-02-09 10:51:52 DEBUG Looking for credentials via: iam-role
2019-02-09 10:51:53 DEBUG Caught retryable HTTP exception while making metadata service request to http://169.254.169.254/latest/meta-data/iam/security-credentials/: Read timeout on endpoint URL: "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "/usr/local/lib/python3.7/http/client.py", line 1321, in getresponse
response.begin()
File "/usr/local/lib/python3.7/http/client.py", line 296, in begin
version, status, reason = self._read_status()
File "/usr/local/lib/python3.7/http/client.py", line 257, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/botocore/httpsession.py", line 258, in send
decode_content=False,
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 343, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 686, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 386, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: AWSHTTPConnectionPool(host='169.254.169.254', port=80): Read timed out. (read timeout=1)
During handling of the above exception, another exception occurred:
The response time is pretty high:
time curl http://169.254.169.254/latest/meta-data/iam/security-credentials/my-role
pod with specificed IP not found
real 0m2.278s
user 0m0.004s
sys 0m0.000s
And the debug/store:
curl -s http://127.0.0.1:8181/debug/store | jq '.'
{
"namespaceByIP": {},
"rolesByIP": {},
"rolesByNamespace": {
But with kube2iam version 0.10.4, debug/store is correctly populated:
{
"namespaceByIP": {
"100.114.0.1": "default",
"100.114.0.2": "default"
},
"rolesByIP": {
"100.114.0.1": "my-role",
"100.114.0.2": ""
},
"rolesByNamespace": {
My app retrieves the role's creds:
2019-02-09 10:56:52 DEBUG Found credentials from IAM Role: my-role
2019-02-09 10:56:52 DEBUG Loading JSON file: /usr/local/lib/python3.7/site-packages/botocore/data/endpoints.json
2019-02-09 10:56:52 DEBUG Event choose-service-name: calling handler <function handle_service_name_alias at 0x7f537d039598>
2019-02-09 10:56:52 DEBUG Loading JSON file: /usr/local/lib/python3.7/site-packages/botocore/data/sts/2011-06-15/service-2.json
2019-02-09 10:56:52 DEBUG Event creating-client-class.sts: calling handler <function add_generate_presigned_url at 0x7f537d072e18>
2019-02-09 10:56:52 DEBUG The s3 config key is not a dictionary type, ignoring its value of: None
2019-02-09 10:56:52 DEBUG Setting sts timeout as (60, 60)
2019-02-09 10:56:52 DEBUG Loading JSON file: /usr/local/lib/python3.7/site-packages/botocore/data/_retry.json
2019-02-09 10:56:52 DEBUG Registering retry handlers for service: sts
2019-02-09 10:56:52 INFO Client STS #1: <botocore.client.STS object at 0x7f537b11c198>
2019-02-09 10:56:52 DEBUG Event before-parameter-build.sts.GetCallerIdentity: calling handler <function generate_idempotent_uuid at 0x7f537d055b70>
2019-02-09 10:56:52 DEBUG Event before-call.sts.GetCallerIdentity: calling handler <function inject_api_version_header_if_needed at 0x7f537cfd9a60>
2019-02-09 10:56:52 DEBUG Making request for OperationModel(name=GetCallerIdentity) with params: {'url_path': '/', 'query_string': '', 'method': 'POST', 'headers': {'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8', 'User-Agent': 'Boto3/1.9.86 Python/3.7.2 Linux/4.9.0-8-amd64 Botocore/1.12.86'}, 'body': {'Action': 'GetCallerIdentity', 'Version': '2011-06-15'}, 'url': 'https://sts.amazonaws.com/', 'context': {'client_region': 'aws-global', 'client_config': <botocore.config.Config object at 0x7f537b11c470>, 'has_streaming_input': False, 'auth_type': None}}
2019-02-09 10:56:52 DEBUG Event request-created.sts.GetCallerIdentity: calling handler <bound method RequestSigner.handler of <botocore.signers.RequestSigner object at 0x7f537b11c438>>
2019-02-09 10:56:52 DEBUG Event choose-signer.sts.GetCallerIdentity: calling handler <function set_operation_specific_signer at 0x7f537d055a60>
2019-02-09 10:56:52 DEBUG Calculating signature using v4 auth.
2019-02-09 10:56:52 DEBUG CanonicalRequest:
POST
Saw this today on EKS GitVersion:"v1.10.11-eks". The kube2iam pod was running 0.10.4. I killed the pod, and the daemonset brought another back, this time with 0.10.6, and it is doing what it is supposed to.
All the other worker nodes running kube2iam are "older" than the one that had the issue. It almost feels like AWS is ripping out something from under kube2iam and it gets in this state?
We are experiencing this issue as well: EKS 1.10.11, aws-vpc-cni-k8s 1.30. 0.10.4 seems to resolve the issue; it occurred on 0.10.6.
I found this issue because we encountered the issue solved by #173 (no credentials for pods in the terminating phase). Since the original issue has been hard on us, I think I'll try to cherry-pick #173 on 0.10.4. I'll let you know if it works or not for us (at least we could get a confirmation that the root cause is #173)
OK, we tried 0.10.4 with #173. The fix for terminating pods works great, but we've also started to see the interface conversion issue. It's hard to correlate it with other events from debug logs (I see a "Pod OnUpdate" event and 8 minutes later I start getting the error).
I'm dumping the content of /debug/store every 5 minutes on a node where I know the problem happens regularly, and I'll see if I can associate the store becoming empty with a specific event.
If it's not enough, I'll probably do a custom build with additional debug info (if anyone has suggestions, feel free to mention them)
The error is happening very rarely with 0.10.4 and the termination patch. I also got the error on the debug/store route at one point, but only for a single pod IP at a time, so it seems the index sometimes has invalid IP keys. I wonder if there is a guarantee that Indexer and Informer content are in sync.
It seems we should remove the DeletionTimestamp tests in isPodActive():
A pod's DeletionTimestamp is set when the pod deletion process begins, and expires (podDeleted=true in isPodActive()) when the pod is effectively deleted, after terminationGracePeriodSeconds.
The deletion event is received by our informer at the final step of the pod's deletion, when DeletionTimestamp has expired.
So the timestamp tests in isPodActive() are of no use during most of the pod's lifecycle, and have an adverse effect on deletion events, which are ignored (due to the !podDeleted test) and leave rogue index entries behind. Those would cause panics when accessed (e.g. by reaching /debug/store).
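To make that concrete, here is a rough reconstruction of the shape of the check being discussed, based only on the description in this thread (the actual #173 code may differ):

```go
package sketch

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// isPodActive, roughly as described above: a pod with a DeletionTimestamp is
// still treated as active until its deletion deadline has passed, so by the
// time the final delete event arrives this already returns false.
func isPodActive(p *v1.Pod) bool {
	podDeleted := p.DeletionTimestamp != nil &&
		time.Now().After(p.DeletionTimestamp.Time)
	return p.Status.PodIP != "" &&
		!podDeleted &&
		p.Status.Phase != v1.PodSucceeded &&
		p.Status.Phase != v1.PodFailed
}
```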
My reasoning for checking the value of the timestamp rather than just ignoring the field altogether was to avoid this potential security risk:
I'm not familiar enough with the informer code to know how viable of a situation this is, and I suppose this scenario could still happen with #173 as-is if the old pod finished deleting before its DeletionTimestamp and a new pod was scheduled before the old pod's deletion timestamp.
If we think this situation is not worth handling then I suppose just dropping any DeletionTimestamp check altogether in isPodActive would be the way to go.
Can you explain how #173 is causing rogue index entries to be left behind? Shouldn't the informer be receiving the pod deletion event and removing it from the index altogether, regardless of the return value of isPodActive?
@rifelpet: after looking into the code with @bpineau and performing a few tests, it appears that you need the indexFunc to return a key to delete an item from the index. If the indexFunc returns nil on a delete (which will happen every time a pod is deleted after the graceful period in #173), the entry matching the former pod IP in the index will never be removed. So when a new pod is created with the same IP, the index will include the old pod in the set, but that pod is not in the cache anymore.
Regarding disconnections from the apiserver, I believe the informer handles this situation well and makes sure that the cache remains consistent (by performing a re-list if needed). The index will remain consistent as long as deletes can modify it accordingly (which requires the indexFunc to return the correct key).
We are currently testing a custom build that removes any test on the deletionTimestamp, and things look good so far. We are not completely sure that on delete events the pod object will always contain its IP, but we are monitoring this and will give you feedback. An easy way to confirm this would be to dump the index after deletions, but there is no way to do that with client-go today. We may create an issue to add the ability to retrieve the full index (or just do it in a custom build for debugging purposes).
The deletion logic in client-go is here: https://github.com/kubernetes/client-go/blob/master/tools/cache/thread_safe_store.go#L278-L298
If indexFunc returns nil, indexValues is nil and the key is never removed from the set.
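To illustrate the failure mode end to end, here is a small self-contained sketch against client-go's cache.Indexer. The index function, names, and timings are illustrative rather than kube2iam's actual code, but the mechanics follow the explanation above: once the index func returns no keys for the stored object, Delete cannot clean up the old IP entry, and ByIndex starts returning nil entries.

```go
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

// Illustrative index func: stops returning a key once the pod's deletion
// deadline has passed, standing in for the isPodActive-gated indexing above.
func byIP(obj interface{}) ([]string, error) {
	pod, ok := obj.(*v1.Pod)
	if !ok {
		return nil, fmt.Errorf("expected *v1.Pod, got %T", obj)
	}
	if pod.Status.PodIP == "" ||
		(pod.DeletionTimestamp != nil && time.Now().After(pod.DeletionTimestamp.Time)) {
		return nil, nil // no keys: a later Delete cannot clean up the old IP entry
	}
	return []string{pod.Status.PodIP}, nil
}

func main() {
	idx := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{"byIP": byIP})

	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "demo", Namespace: "default"},
		Status:     v1.PodStatus{PodIP: "10.0.0.1", Phase: v1.PodRunning},
	}
	_ = idx.Add(pod) // indexed under 10.0.0.1

	// Graceful deletion starts: deadline 100ms away, the pod stays indexed.
	deleting := pod.DeepCopy()
	deadline := metav1.NewTime(time.Now().Add(100 * time.Millisecond))
	deleting.DeletionTimestamp = &deadline
	_ = idx.Update(deleting)

	// The delete event arrives after the deadline: byIP now returns no keys,
	// so "10.0.0.1" stays in the index even though the object is removed.
	time.Sleep(200 * time.Millisecond)
	_ = idx.Delete(deleting)

	objs, _ := idx.ByIndex("byIP", "10.0.0.1")
	fmt.Printf("entries for 10.0.0.1 after delete: %d, first is nil: %v\n",
		len(objs), len(objs) == 1 && objs[0] == nil)
	// An unchecked objs[0].(*v1.Pod) here panics with
	// "interface conversion: interface {} is nil, not *v1.Pod".
}
```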
Quick update: we have been running a fork of the code with this for isPodActive:
return p.Status.PodIP != "" &&
v1.PodSucceeded != p.Status.Phase &&
v1.PodFailed != p.Status.Phase
We haven't had any issue so far.
We still need to test for PodSucceeded and PodFailed because pods in this status keep their IP in etcd but it can be reallocated by the runtime/cni because the kubelet has deleted the sandbox. In this case, we catch an update with the phase change which means the IP from the previous status will be deleted from the index (https://github.com/kubernetes/client-go/blob/master/tools/cache/thread_safe_store.go#L253-L255) and since indexFunc will return nil nothing will be added back.
Similarly, we've seen some updates to the pod status where the IP is removed before the deletion event. Since indexFunc also returns nil for podIP == "", this means such an update will also remove the IP from the index.
As an extra precaution, we could add some logic in PodByIP to check that the pod is not nil (which could happen if the index ends up in an inconsistent state, which should not happen except maybe in some edge cases such as out-of-order events) and that it is not Succeeded or Failed. We would have to do that by filtering over all pods in the set (in case we have a former Succeeded pod and the Running one indexed at the same IP, for instance).
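A sketch of that extra precaution (the function name, signature, and index key are illustrative, not kube2iam's actual API):

```go
package sketch

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// podByIP filters the indexed set instead of blindly type-asserting: nil
// entries (stale index keys) and Succeeded/Failed pods are skipped, so a
// reused IP resolves to the currently running pod.
func podByIP(indexer cache.Indexer, ip string) (*v1.Pod, error) {
	objs, err := indexer.ByIndex("byIP", ip)
	if err != nil {
		return nil, err
	}
	for _, obj := range objs {
		pod, ok := obj.(*v1.Pod) // comma-ok: never panics on nil entries
		if !ok || pod == nil {
			continue // stale entry whose object is gone from the cache
		}
		if pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed {
			continue // the IP may already have been reallocated to another pod
		}
		return pod, nil
	}
	return nil, fmt.Errorf("no active pod found for IP %s", ip)
}
```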
@rifelpet I think I know why #173 creates this issue: when the final delete event arrives after the graceful period has expired, isPodActive returns false, so the indexFunc returns no key and the pod's IP entry is never removed from the index.
This means this issue should only happen when the kubelet delete event is triggered after graceful termination expires (which probably means only for pods which were forcefully deleted because they were still running after the graceful period).
That makes sense to me, thanks for the investigation @lbernail.
I opened #203 to match the change you suggested. I will test it on our clusters as well.
One thing worth noting is that users were experiencing this nil pod issue before #173, albeit much less frequently. Do we have any idea why that might have been happening?
@rifelpet Thank you for the PR. I can't easily think of a scenario where this would happen before #173: it would require the pod to have been deleted from the cache while the index is stale. I wonder if there could be a race condition in the deletion event:
I'll need to dig into client-go to see if this can happen.
A few ideas to work around the issue:
@rifelpet I have an idea for the issue before #173: I think it could happen on force deletes (I haven't tested it, though).
EKS v1.11.5: with 0.10.6 I was getting:
time="2019-04-08T15:46:45Z" level=error msg="PANIC error processing request: interface conversion: interface {} is nil, not *v1.Pod" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=10.100.10.101 res.status=500
time="2019-04-08T15:46:53Z" level=error msg="PANIC error processing request: interface conversion: interface {} is nil, not *v1.Pod" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=10.100.10.101 res.status=500
time="2019-04-08T15:47:18Z" level=error msg="PANIC error processing request: interface conversion: interface {} is nil, not *v1.Pod" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=10.100.10.101 res.status=500
with 0.10.7-rc2 I was getting:
time="2019-04-08T15:52:32Z" level=error msg="Error assuming role AccessDenied: Access denied\n\tstatus code: 403, request id: 52181169-5a16-11e9-94f6-cb571d5b30f7" ns.name=kube-system pod.iam.role="arn:aws:iam::XXXXXXXXXXXXX:role/k8scluster-alb-controller" req.method=GET req.path=/latest/meta-data/iam/security-credentials/k8scluster-alb-controller req.remote=10.100.10.101
time="2019-04-08T15:52:32Z" level=info msg="GET /latest/meta-data/iam/security-credentials/k8scluster-alb-controller (500) took 10697711.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/k8scluster-alb-controller req.remote=10.100.10.101 res.duration=1.0697711e+07 res.status=500
with 0.10.7-rc3 I am getting:
time="2019-04-08T15:54:42Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2805099146.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=10.240.63.122 res.duration=2.805099146e+09 res.status=404
So rc3 is looking more promising, but I'm getting a 404 response.
@cdenneen we need #203 to be merged to get the complete fix. I can provide a version with the patch (which we've been running for a while), but not one based on an official release.
@lbernail thanks... I thought the 2 RCs (rc2 and rc3) were to test this. Is there an image pushed to Docker Hub, @jtblin @jrnt30, that I can test for this prior to merge/release?
@cdenneen you can test with lbernail/kube2iam:0.10.6-patched. It's based on 0.10.6 with the patch for graceful termination (built from https://github.com/DataDog/kube2iam/commits/0.10.6-dd-1).
@lbernail thanks... working perfectly... look forward to #203 merge
Is this really fixed? I still have the issue with version 0.10.7. I had no issues with 0.10.4...
If someone still has the error:
File "/usr/local/lib/python3.7/site-packages/botocore/signers.py", line 157, in sign
auth.add_auth(request)
File "/usr/local/lib/python3.7/site-packages/botocore/auth.py", line 357, in add_auth
raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials
I set the env variable AWS_METADATA_SERVICE_TIMEOUT=3 in the pod and it solves the problem. What happens sometimes is that with kube2iam the metadata connection takes more than 1 second to answer, so the default timeout for the metadata service is reached, and since there is no other authentication method we get this error.
Any updates on this? This is still a pretty serious issue for us. We already have AWS_METADATA_SERVICE_TIMEOUT=3 set for the relevant pods, but still get this issue pretty often. We are using 0.10.7.
Sometimes, pods in my cluster fail with errors like "fatal error: Unable to locate credentials". Checking the kube2iam logs for the instance on the same node, I see entries like this:
This occurs on an EKS cluster running Kubernetes 1.10.3.