Thanks for raising this.
Yep- most likely due to the delay between the pod starting and it being picked up by the watcher. You should have seen some more retry messages from the server/agent as it attempted to keep checking for the pod? These should keep happening until the client disconnects.
The prefetch buffer controls the number of new pods that, having been started, will be passed to the prefetcher to request credentials- the error you've reported would be before the client requests credentials so I don't think it would help.
How many pods are being scheduled at the same time? The other thing you could do is turn on debug logging (just as you schedule everything- there'll be a lot of noise): you should see a load of messages with "processing delta" and the pod details. If you get the same error, could you check whether it's just a timing issue with receiving the pod data?
Could you also give me some details about the way you've got kiam deployed (i.e. how many server processes you run, whether they're also on nodes deployed as spot instances, etc.)?
Thank you.
Yeah, you're right, the server tries multiple times to find the pod, until eventually the error is propagated to the agent. So I have multiple server-side errors followed by a single agent-side error. After digging a little bit more in our Kibana, I found that the problematic pod actually got its credentials assigned by kiam:
{
"credentials.access.key": "***",
"credentials.expiration": "2018-03-15T19:06:48Z",
"credentials.role": "***",
"generation.metadata": 0,
"level": "info",
"msg": "fetched credentials",
"pod.iam.role": "***",
"pod.name": "***",
"pod.namespace": "default",
"pod.status.ip": "",
"pod.status.phase": "Pending",
"resource.version": "13076366",
"time": "2018-03-15T18:58:48Z"
}
But it had no IP address yet so it was not written to the cache.
Each node runs ~50 pods, so once K8s starts draining it, these 50 pods have to be rescheduled elsewhere. The issue happens on our production cluster, so I have to act with extra care, but I'll try to change the log level to debug and see if I can get any insights.
We have 3 K8s master nodes that are on-demand instances and are running the kiam-server process, which uses the default --fetchers value (8).
The K8s nodes are a mix of on-demand and spot instances. The agent is running on all nodes, including the masters (we have DaemonSets that require kiam functionality).
Cool... so that "fetched credentials" message means that it was able to retrieve the credentials from AWS (most likely caused by the prefetcher given the pod is pending).
Were you able to turn the debug logging on? I'd like to see all entries with the message "processing delta" for the pods that have problems. Every time a pod changes state it'd be tracked with that- if we don't see any messages for the pod(s) in question then something's being dropped for some reason. If we do get log messages then it's potentially a timing issue (i.e. the pod state updates are processed too slowly relative to how quickly pods start and request credentials).
It's good to know that the server processes are running on the on-demand instances- when you next see the problem, could you check that the server processes aren't being restarted/killed and that they all have a nice long uptime, please?
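If it helps with the timing check, here's a quick sketch (untested, not part of kiam) that pulls the entries for a single pod out of JSON logs like the ones you've shared and orders them by timestamp, so the gap between the watcher updates and the "pod not found" errors is easy to see:

```python
# Rough, illustrative helper: print a per-pod timeline from JSON log lines.
# Field names ("time", "msg", "pod.name", "pod.ip", "pod.status.ip") match the
# log entries quoted in this thread.
import json
import sys

def timeline(path, pod_name, pod_ip):
    events = []
    with open(path) as f:
        for line in f:
            try:
                entry = json.loads(line)
            except ValueError:
                continue
            if entry.get("pod.name") == pod_name or entry.get("pod.ip") == pod_ip:
                events.append((entry["time"], entry["msg"], entry.get("pod.status.ip", "")))
    for ts, msg, ip in sorted(events):
        print(ts, msg, ip or "<no ip>")

if __name__ == "__main__":
    # usage: python timeline.py kiam.log <pod name> <pod ip>
    timeline(sys.argv[1], sys.argv[2], sys.argv[3])
```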
I plan to enable debug logging this Sunday and then try to reproduce the issue. Regarding server pods' restarts- they've never restarted since we started using kiam. The agent pods have restarted/crashed multiple times, although I'm not sure these crashes correlate with the errors. But I'll double-check.
Ok cool- it sounds most likely that there's a race between the pod starting (and requesting credentials) and how quickly the pod data arrives from the watcher.
Please let me know what you find and post back here- really keen to improve this.
Just to let you know that I've set kiam-server's log level to debug and terminated multiple K8s nodes, but the issue didn't occur. I even tried to terminate multiple nodes at once, but this didn't help either. It seems to be related either to overall cluster load (traffic we get on Sundays is usually low) or to kiam-server pods' uptime. I'll try to reproduce this again once we reach our peak traffic and will update with the findings.
Brilliant, thank you.
So here we go (this particular pod was not scheduled due to a K8s node drain, but probably due to the horizontal pod autoscaler kicking in to scale the deployment):
[
{
"cache.delta.type": "Added",
"cache.object": "pod",
"generation.metadata": 0,
"level": "debug",
"msg": "processing delta",
"pod.iam.role": "***",
"pod.name": "***-worker-556fcbd4b6-44kss",
"pod.namespace": "default",
"pod.status.ip": "",
"pod.status.phase": "Pending",
"resource.version": "17510380",
"time": "2018-03-25T15:38:09Z"
},
{
"credentials.access.key": "ASI***QKQ",
"credentials.expiration": "2018-03-25T15:48:52Z",
"credentials.role": "***",
"generation.metadata": 0,
"level": "info",
"msg": "fetched credentials",
"pod.iam.role": "***",
"pod.name": "***-worker-556fcbd4b6-44kss",
"pod.namespace": "default",
"pod.status.ip": "",
"pod.status.phase": "Pending",
"resource.version": "17510380",
"time": "2018-03-25T15:38:09Z"
},
{
"cache.delta.type": "Added",
"cache.object": "pod",
"generation.metadata": 0,
"level": "debug",
"msg": "announced pod",
"pod.iam.role": "***",
"pod.name": "***-worker-556fcbd4b6-44kss",
"pod.namespace": "default",
"pod.status.ip": "",
"pod.status.phase": "Pending",
"resource.version": "17510380",
"time": "2018-03-25T15:38:09Z"
},
{
"credentials.access.key": "ASI***QKQ",
"credentials.expiration": "2018-03-25T15:48:52Z",
"credentials.role": "***",
"generation.metadata": 0,
"level": "info",
"msg": "fetched credentials",
"pod.iam.role": "***",
"pod.name": "***-worker-556fcbd4b6-44kss",
"pod.namespace": "default",
"pod.status.ip": "",
"pod.status.phase": "Pending",
"resource.version": "17510389",
"time": "2018-03-25T15:38:10Z"
},
{
"cache.delta.type": "Updated",
"cache.object": "pod",
"generation.metadata": 0,
"level": "debug",
"msg": "announced pod",
"pod.iam.role": "***",
"pod.name": "***-worker-556fcbd4b6-44kss",
"pod.namespace": "default",
"pod.status.ip": "",
"pod.status.phase": "Pending",
"resource.version": "17510389",
"time": "2018-03-25T15:38:10Z"
},
{
"cache.delta.type": "Updated",
"cache.object": "pod",
"generation.metadata": 0,
"level": "debug",
"msg": "processing delta",
"pod.iam.role": "***",
"pod.name": "***-worker-556fcbd4b6-44kss",
"pod.namespace": "default",
"pod.status.ip": "",
"pod.status.phase": "Pending",
"resource.version": "17510389",
"time": "2018-03-25T15:38:10Z"
},
{
"credentials.access.key": "ASI***QKQ",
"credentials.expiration": "2018-03-25T15:48:52Z",
"credentials.role": "***",
"generation.metadata": 0,
"level": "info",
"msg": "fetched credentials",
"pod.iam.role": "***",
"pod.name": "***-worker-556fcbd4b6-44kss",
"pod.namespace": "default",
"pod.status.ip": "",
"pod.status.phase": "Pending",
"resource.version": "17510393",
"time": "2018-03-25T15:38:10Z"
},
{
"cache.delta.type": "Updated",
"cache.object": "pod",
"generation.metadata": 0,
"level": "debug",
"msg": "announced pod",
"pod.iam.role": "***",
"pod.name": "***-worker-556fcbd4b6-44kss",
"pod.namespace": "default",
"pod.status.ip": "",
"pod.status.phase": "Pending",
"resource.version": "17510393",
"time": "2018-03-25T15:38:10Z"
},
{
"cache.delta.type": "Updated",
"cache.object": "pod",
"generation.metadata": 0,
"level": "debug",
"msg": "processing delta",
"pod.iam.role": "***",
"pod.name": "***-worker-556fcbd4b6-44kss",
"pod.namespace": "default",
"pod.status.ip": "",
"pod.status.phase": "Pending",
"resource.version": "17510393",
"time": "2018-03-25T15:38:10Z"
},
{
"level": "error",
"msg": "error finding pod: pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:18Z"
},
{
"level": "error",
"msg": "error finding pod: pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:21Z"
},
{
"credentials.access.key": "ASI***QKQ",
"credentials.expiration": "2018-03-25T15:48:52Z",
"credentials.role": "***",
"generation.metadata": 0,
"level": "info",
"msg": "fetched credentials",
"pod.iam.role": "***",
"pod.name": "***-worker-556fcbd4b6-44kss",
"pod.namespace": "default",
"pod.status.ip": "100.115.1.182",
"pod.status.phase": "Running",
"resource.version": "17510549",
"time": "2018-03-25T15:38:24Z"
},
{
"cache.delta.type": "Updated",
"cache.object": "pod",
"generation.metadata": 0,
"level": "debug",
"msg": "announced pod",
"pod.iam.role": "***",
"pod.name": "***-worker-556fcbd4b6-44kss",
"pod.namespace": "default",
"pod.status.ip": "100.115.1.182",
"pod.status.phase": "Running",
"resource.version": "17510549",
"time": "2018-03-25T15:38:24Z"
},
{
"cache.delta.type": "Updated",
"cache.object": "pod",
"generation.metadata": 0,
"level": "debug",
"msg": "processing delta",
"pod.iam.role": "***",
"pod.name": "***-worker-556fcbd4b6-44kss",
"pod.namespace": "default",
"pod.status.ip": "100.115.1.182",
"pod.status.phase": "Running",
"resource.version": "17510549",
"time": "2018-03-25T15:38:24Z"
}
]
I omitted most of the "Pod not found" error messages, leaving only the first and the last ones.

Interesting- so there's a handful of messages at the end where the phase goes from Pending to Running, which are recorded after the "error finding pod: pod not found" errors. The server returned two errors to the agent (the last at 2018-03-25T15:38:21Z) and the server was informed of the pod's IP at 2018-03-25T15:38:24Z.
It's interesting that there are only two errors- it suggests that the agent only attempted to find the pod twice, between 2018-03-25T15:38:18Z and 2018-03-25T15:38:21Z, which is a pretty short retry period (I think the Java SDK has a 5 second read timeout).
I don't suppose you have the agent logs from around the same time? I'd like to know when the client first requests credentials from the agent (causing it to fetch from the server) and when it errors- I'm expecting it's only twice, but it'd be useful to confirm.
Is there anything in particular you could share about this client process please- which language/platform/AWS SDK does it use?
I'll also try and do some more digging/tests and see if there's anything else that can be done to speed things up.
Thanks for all your help collecting the data!
Thanks a lot for your time and effort.
So first of all, just to correct you, and as I've already mentioned, I omitted most of the "Pod not found" error messages, leaving only the first and the last ones just to give you a sense of the timeline. There were ~30 such errors in total.
To your question, this app runs on the ruby:2.2.7-slim Docker image and has aws-sdk v2.10.32 installed. The app initializes a Kinesis client on startup and then starts publishing events to a Kinesis stream, which is what resulted in the errors.
Here're the agent logs:
[
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:18Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:18Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:18Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:18Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:18Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:18Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:18Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:18Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:18Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:18Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:19Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:19Z"
},
{
"addr": "100.115.1.182:60826",
"level": "error",
"method": "GET",
"msg": "error processing request: rpc error: code = Unknown desc = pod not found",
"path": "/latest/meta-data/iam/security-credentials/",
"status": 500,
"time": "2018-03-25T15:38:19Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T16:38:19Z"
},
{
"addr": "100.115.1.182:60826",
"headers": {
"Content-Type": [
"text/plain; charset=utf-8"
],
"X-Content-Type-Options": [
"nosniff"
]
},
"level": "info",
"method": "GET",
"msg": "processed request",
"path": "/latest/meta-data/iam/security-credentials/",
"status": 500,
"time": "2018-03-25T15:38:19Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:19Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:19Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:19Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:19Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:19Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:19Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:19Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:19Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:19Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:20Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:20Z"
},
{
"addr": "100.115.1.182:32818",
"level": "error",
"method": "GET",
"msg": "error processing request: rpc error: code = Unknown desc = pod not found",
"path": "/latest/meta-data/iam/security-credentials/",
"status": 500,
"time": "2018-03-25T15:38:20Z"
},
{
"addr": "100.115.1.182:32818",
"headers": {
"Content-Type": [
"text/plain; charset=utf-8"
],
"X-Content-Type-Options": [
"nosniff"
]
},
"level": "info",
"method": "GET",
"msg": "processed request",
"path": "/latest/meta-data/iam/security-credentials/",
"status": 500,
"time": "2018-03-25T15:38:20Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:20Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:20Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:20Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:20Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:20Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:20Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:20Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:20Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:20Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:20Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:21Z"
},
{
"level": "warning",
"msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
"pod.ip": "100.115.1.182",
"time": "2018-03-25T15:38:21Z"
},
{
"addr": "100.115.1.182:32998",
"level": "error",
"method": "GET",
"msg": "error processing request: rpc error: code = Unknown desc = pod not found",
"path": "/latest/meta-data/iam/security-credentials/",
"status": 500,
"time": "2018-03-25T15:38:21Z"
},
{
"addr": "100.115.1.182:32998",
"headers": {
"Content-Type": [
"text/plain; charset=utf-8"
],
"X-Content-Type-Options": [
"nosniff"
]
},
"level": "info",
"method": "GET",
"msg": "processed request",
"path": "/latest/meta-data/iam/security-credentials/",
"status": 500,
"time": "2018-03-25T15:38:21Z"
}
]
And just in case, here's the daemon.log from the K8s node itself:
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: I0325 15:38:10.036847 2582 kubelet.go:1837] SyncLoop (ADD, "api"): "***-worker-556fcbd4b6-44kss_default(85738579-3042-11e8-bace-0a2c434faade)"
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: I0325 15:38:10.107776 2582 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-5fw7n" (UniqueName: "kubernetes.io/secret/85738579-3042-11e8-bace-0a2c434faade-default-token-5fw7n") pod "***-worker-556fcbd4b6-44kss" (UID: "85738579-3042-11e8-bace-0a2c434faade")
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: I0325 15:38:10.208308 2582 reconciler.go:257] operationExecutor.MountVolume started for volume "default-token-5fw7n" (UniqueName: "kubernetes.io/secret/85738579-3042-11e8-bace-0a2c434faade-default-token-5fw7n") pod "***-worker-556fcbd4b6-44kss" (UID: "85738579-3042-11e8-bace-0a2c434faade")
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: I0325 15:38:10.218310 2582 operation_generator.go:484] MountVolume.SetUp succeeded for volume "default-token-5fw7n" (UniqueName: "kubernetes.io/secret/85738579-3042-11e8-bace-0a2c434faade-default-token-5fw7n") pod "***-worker-556fcbd4b6-44kss" (UID: "85738579-3042-11e8-bace-0a2c434faade")
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: I0325 15:38:10.345331 2582 kuberuntime_manager.go:374] No sandbox for pod "***-worker-556fcbd4b6-44kss_default(85738579-3042-11e8-bace-0a2c434faade)" can be found. Need to start a new one
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.664 [INFO][18673] calico.go 87: Extracted identifiers ContainerID="ef6125e79469671926340aca08dbd640cbe72bd93531654573f2114d715fb657" Node="kube-v186-nodes-default-04c1" Orchestrator="k8s" Workload="default.***-worker-556fcbd4b6-44kss"
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: { Labels []struct { Key string "json:\"key\""; Value string "json:\"value\"" } "json:\"labels,omitempty\"" }{Labels:[]struct { Key string "json:\"key\""; Value string "json:\"value\"" }(nil)}}}}, EtcdScheme:"", EtcdKeyFile:"", EtcdCertFile:"", EtcdCaCertFile:""} Workload="default.***-worker-556fcbd4b6-44kss"
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.664 [INFO][18673] utils.go 250: Configured environment: [LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOME=/root LOGNAME=root USER=root SHELL=/bin/sh DAEMON_ARGS=--allow-privileged=true --cgroup-root=/ --cloud-provider=aws --cluster-dns=100.64.0.10 --cluster-domain=cluster.local --enable-custom-metrics=true --enable-debugging-handlers=true --eviction-hard=memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<10%,imagefs.inodesFree<5% --feature-gates=ExperimentalCriticalPodAnnotation=true --hostname-override=ip-172-16-119-196.fiverr.internal --kubeconfig=/var/lib/kubelet/kubeconfig --network-plugin=cni --node-labels=kops.k8s.io/instancegroup=nodes,kubernetes.io/role=node,node-role.kubernetes.io/node= --non-masquerade-cidr=100.64.0.0/10 --pod-infra-container-image=gcr.io/google_containers/pause-amd64:3.0 --pod-manifest-path=/etc/kubernetes/manifests --register-schedulable=true --require-kubeconfig=true --v=2 --cloud-config=/etc/kubernetes/cloud.config --cni-bin-dir=/opt/cni/bin/ --cni-conf-dir=/etc/cni/net.d/ CNI_COMMAND=ADD CNI_CONTAINERID=ef6125e79469671926340aca08dbd640cbe72bd93531654573f2114d715fb657 CNI_NETNS=/proc/18588/ns/net CNI_ARGS=IgnoreUnknown=1;IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=***-worker-556fcbd4b6-44kss;K8S_POD_INFRA_CONTAINER_ID=ef6125e79469671926340aca08dbd640cbe72bd93531654573f2114d715fb657 CNI_IFNAME=eth0 CNI_PATH=/opt/calico/bin:/opt/cni/bin/ ETCD_ENDPOINTS=http://etcd-c.internal.kube-v186.fiverr.internal:4001,http://etcd-d.internal.kube-v186.fiverr.internal:4001,http://etcd-e.internal.kube-v186.fiverr.internal:4001 KUBECONFIG=/etc/cni/net.d/calico-kubeconfig K8S_API_TOKEN=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJjYWxpY28tdG9rZW4ta21sejIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIj
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.674 [INFO][18673] k8s.go 67: Extracted identifiers for CmdAddK8s Node="kube-v186-nodes-default-04c1" Orchestrator="k8s" Workload="default.***-worker-556fcbd4b6-44kss"
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: I0325 15:38:10.710480 2582 kubelet.go:1871] SyncLoop (PLEG): "***-worker-556fcbd4b6-44kss_default(85738579-3042-11e8-bace-0a2c434faade)", event: &pleg.PodLifecycleEvent{ID:"85738579-3042-11e8-bace-0a2c434faade", Type:"ContainerStarted", Data:"ef6125e79469671926340aca08dbd640cbe72bd93531654573f2114d715fb657"}
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.802 [INFO][18748] utils.go 250: Configured environment: [LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOME=/root LOGNAME=root USER=root SHELL=/bin/sh DAEMON_ARGS=--allow-privileged=true --cgroup-root=/ --cloud-provider=aws --cluster-dns=100.64.0.10 --cluster-domain=cluster.local --enable-custom-metrics=true --enable-debugging-handlers=true --eviction-hard=memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<10%,imagefs.inodesFree<5% --feature-gates=ExperimentalCriticalPodAnnotation=true --hostname-override=ip-172-16-119-196.fiverr.internal --kubeconfig=/var/lib/kubelet/kubeconfig --network-plugin=cni --node-labels=kops.k8s.io/instancegroup=nodes,kubernetes.io/role=node,node-role.kubernetes.io/node= --non-masquerade-cidr=100.64.0.0/10 --pod-infra-container-image=gcr.io/google_containers/pause-amd64:3.0 --pod-manifest-path=/etc/kubernetes/manifests --register-schedulable=true --require-kubeconfig=true --v=2 --cloud-config=/etc/kubernetes/cloud.config --cni-bin-dir=/opt/cni/bin/ --cni-conf-dir=/etc/cni/net.d/ CNI_COMMAND=ADD CNI_CONTAINERID=ef6125e79469671926340aca08dbd640cbe72bd93531654573f2114d715fb657 CNI_NETNS=/proc/18588/ns/net CNI_ARGS=IgnoreUnknown=1;IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=***-worker-556fcbd4b6-44kss;K8S_POD_INFRA_CONTAINER_ID=ef6125e79469671926340aca08dbd640cbe72bd93531654573f2114d715fb657 CNI_IFNAME=eth0 CNI_PATH=/opt/calico/bin:/opt/cni/bin/ ETCD_ENDPOINTS=http://etcd-c.internal.kube-v186.fiverr.internal:4001,http://etcd-d.internal.kube-v186.fiverr.internal:4001,http://etcd-e.internal.kube-v186.fiverr.internal:4001 KUBECONFIG=/etc/cni/net.d/calico-kubeconfig K8S_API_TOKEN=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJjYWxpY28tdG9rZW4ta21sejIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIj
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.803 [INFO][18748] calico-ipam.go 177: Auto assigning IP Workload="default.***-worker-556fcbd4b6-44kss" assignArgs=client.AutoAssignArgs{Num4:1, Num6:0, HandleID:(*string)(0xc4203139f0), Attrs:map[string]string(nil), Hostname:"kube-v186-nodes-default-04c1", IPv4Pools:[]net.IPNet{}, IPv6Pools:[]net.IPNet{}}
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.814 [INFO][18748] ipam.go 816: Creating new handle: default.***-worker-556fcbd4b6-44kss
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.823 [INFO][18748] calico-ipam.go 205: IPAM Result Workload="default.***-worker-556fcbd4b6-44kss" result.IPs=[]*current.IPConfig{(*current.IPConfig)(0xc4200ecba0)}
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.825 [INFO][18673] k8s.go 264: Populated endpoint Workload="default.***-worker-556fcbd4b6-44kss" endpoint=WorkloadEndpoint(Node=kube-v186-nodes-default-04c1, Orchestrator=k8s, Workload=default.***-worker-556fcbd4b6-44kss, Name=eth0)
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.862 [INFO][18673] k8s.go 288: Added Mac, interface name, and active container ID to endpoint Workload="default.***-worker-556fcbd4b6-44kss" endpoint=WorkloadEndpoint(Node=kube-v186-nodes-default-04c1, Orchestrator=k8s, Workload=default.***-worker-556fcbd4b6-44kss, Name=eth0, ActiveInstanceID=ef6125e79469671926340aca08dbd640cbe72bd93531654573f2114d715fb657)
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.866 [INFO][18673] k8s.go 296: Wrote updated endpoint to datastore Workload="default.***-worker-556fcbd4b6-44kss"
Mar 25 15:38:14 ip-172-16-119-196 kubelet[2582]: I0325 15:38:14.813701 2582 kubelet.go:1871] SyncLoop (PLEG): "***-worker-556fcbd4b6-44kss_default(85738579-3042-11e8-bace-0a2c434faade)", event: &pleg.PodLifecycleEvent{ID:"85738579-3042-11e8-bace-0a2c434faade", Type:"ContainerStarted", Data:"af7d7945a1d1420a6cbf33edf9ac036a384f37155f3c934d3ee6a5a01312175d"}
Thanks- sorry, I misunderstood about your first/last messages before. The warnings are all retries within the agent; the "processed request" messages would suggest that the client retried twice?
I can't see anything that immediately jumps out as strange- the time window over which the agent retries seems shorter than it should be (it's configured to retry for up to 5 seconds) and there are no messages about a cancelled context; maybe it's just the clocks in the log messages being a little out.
Do you have any Prometheus metrics around the API server response times? It'd be good to see whether there's any variability there (slow responses etc.) and whether something is limiting how quickly the watcher can forward data to the server's pod cache. We had quite high 95th percentiles from the apiservers at one point because we'd forgotten to set requests and limits- it was just picking up our cluster defaults- and increasing those improved performance quite significantly, both making it faster and more consistent.
I'll make a note to add some tests and benchmarks around the code that processes watch deltas to make sure nothing blocks while deltas are being processed- I don't think there is anything, but it'd be better to prove it.
I'll keep thinking about other things that could be causing slow updates- it definitely feels like that's the cause (either by kiam being slow to process, or the api server being slow to serve).
As per the logs, the client tried 3 times. After looking into the SDK documentation, it seems that this is the client's default. We don't have Prometheus yet, so our visibility into what happens under K8s' hood is somewhat limited, but if I get more data I'll update you. Meanwhile we will simply adjust the clients' timeout settings to remediate the issue. Thanks again for your help.
Hello again.
The good news is that we finally have Prometheus set up, running and collecting both system and kiam metrics. The "bad" news is that we rolled out the client timeout change to the most problematic microservices, so the issue's frequency decreased dramatically and it became much harder to reproduce. But once it occurs, are there any specific metrics you'd like me to provide?
It would also be greatly appreciated if you could share your Grafana dashboard for the kiam service and/or provide some explanation of the metrics it exposes.
> are there any specific metrics you'd like me to provide you?
The API server metrics would be really interesting- in particular to see whether times for gets/watches on Pods spike around the same time. We had problems before where our 95th percentile API server responses were all over the place because we hadn't set the requests/limits high enough- they dropped once we gave them enough. It'd be useful to see whether slow deltas being delivered to Kiam are because of something within the apiserver or within Kiam.
The Prometheus metrics for Kiam are not great at the moment- they export the same metrics reported with statsd. #35 is a reminder to replace the StatsD stuff with direct Prometheus metrics, and I'd probably instrument Kiam differently to make this an easier thing to debug.
For now, tracking PodCache.processDelta (Prometheus probably rewrites it with underscores, but you should be able to find it) would be useful- to see whether that flatlines around the same time, i.e. does Kiam have an upper bound that's under the number of deltas it should be processing, and how does this vary with the metrics tracking API server response times, number of running pods etc.
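If you want to pull that series programmatically, something along these lines against the Prometheus HTTP API should do it- note that both the Prometheus address and the metric name below are guesses (the name depends on how the statsd bridge rewrites PodCache.processDelta), so adjust them to whatever your setup actually exposes:

```python
# Illustrative sketch: run an instant query against the Prometheus HTTP API.
# PROM_URL and QUERY are placeholders/assumptions, not known-good values.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus:9090/api/v1/query"   # placeholder address
QUERY = "rate(PodCache_processDelta[5m])"          # assumed metric name

def instant_query(expr):
    url = PROM_URL + "?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

if __name__ == "__main__":
    for series in instant_query(QUERY)["data"]["result"]:
        print(series["metric"], series["value"])
```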
Does that help?
Sorry, I don't have any Grafana dashboards yet (we've historically used Datadog most) but will do soon- I'd love to get a Kiam dashboard defined; I'll try and address that with #35 also.
Hi @pingles. We are seeing these issues on our cluster. I will also try to dig up the metrics for the kube API. I am running a batch load test in which we simultaneously schedule about 100+ pods, and eventually all the containers exit, unable to retrieve credentials and hence failing to connect to S3. Any advice would be greatly appreciated.
To get around errors with the pod cache not being primed, I inject the following ENV variables into ANY container that's going to access AWS from my k8s cluster:
AWS_METADATA_SERVICE_TIMEOUT=5
AWS_METADATA_SERVICE_NUM_ATTEMPTS=20
https://boto3.readthedocs.io/en/latest/guide/configuration.html
boto3 and the AWS CLI obey these params, and basically this setting gives the app many more retries to get what it needs while the cache is being primed.
Before I added these env variables, many of my pods would always fail on first start, since many of them start by fetching a file or two from S3.
With this it was smooth sailing, and pods come up on the first attempt without errors or pod restarts.
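If you want to double-check that the SDK inside the container actually picks these up, a quick sanity check (botocore reads both variables from the environment) is something like:

```python
# Print the metadata-service settings botocore resolved from the environment;
# run inside the container that has the env vars set.
import botocore.session

session = botocore.session.get_session()
print("metadata_service_timeout:", session.get_config_variable("metadata_service_timeout"))
print("metadata_service_num_attempts:", session.get_config_variable("metadata_service_num_attempts"))
```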
I will try and post back, thank you
@roffe that is awesome, it actually worked. At some point I was thinking that the load was too high for the kube API to handle. When we were using kube2iam we were getting panics in kube-api. We are running some massive tests and will keep you guys posted if you are interested. Also, we would love to understand how to monitor this better.
@kumare3 there is a Grafana dashboard for kiam now; if you use Prometheus metrics and the latest 0.3 beta, it will give you visualization of the metrics: https://github.com/uswitch/kiam/blob/master/docs/METRICS.md
This issue resembles #46, and might be essentially the same one.
We switched to kiam from kube2iam a couple of weeks ago. We're running most of our Kubernetes nodes on spot instances, so they are replaced pretty frequently. When a node is drained before termination, its pods are rescheduled onto other nodes, and sometimes the application they're running starts throwing the following error:

On the agent side I see the following errors:

And these are the errors on the server side:

It looks like we have a delay between the time a pod is scheduled and the time information about it becomes available to kiam. If I delay the startup of the app for a couple of seconds, it fixes the problem. Deleting the problematic pod and letting Kubernetes reschedule it does the trick as well.
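For illustration, the kind of startup delay I mean looks roughly like this (a rough sketch for illustration, not our actual code): poll the agent's credentials endpoint until it stops returning errors before the app starts doing AWS work.

```python
# Rough sketch of the startup-delay workaround: wait until the kiam agent can
# serve a role for this pod before the application starts using AWS.
import time
import urllib.request

CREDS_URL = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"

def wait_for_role(timeout=30, interval=2):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(CREDS_URL, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except Exception:
            pass  # agent can't resolve the pod yet ("pod not found"); retry
        time.sleep(interval)
    return False

if __name__ == "__main__":
    if not wait_for_role():
        raise SystemExit("gave up waiting for IAM credentials")
```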
After looking into your code, it seems that increasing the prefetch-buffer-size value might help, because the issue mostly happens when many pods are scheduled at the same time. But maybe I'm missing something. Any advice would be greatly appreciated.
P.S.: We're using kiam:v2.5 and kubernetes v1.8.6.