uswitch / kiam

Integrate AWS IAM with Kubernetes

Unable to assign role due to "pod not found" error #47

Closed - ghost closed this issue 5 years ago

ghost commented 6 years ago

This issue resembles #46, and might be essentially the same one.

We switched from kube2iam to kiam a couple of weeks ago. We run most of our Kubernetes nodes on spot instances, so they are replaced pretty frequently. When a node is drained before termination, its pods are rescheduled on other nodes, and sometimes the application they're running starts throwing the following error:

unable to sign request without credentials set

On the agent side I see the following errors:

{
    "addr": "100.96.143.201:37862",
    "level": "error",
    "method": "GET",
    "msg": "error processing request: rpc error: code = Unknown desc = pod not found",
    "path": "/latest/meta-data/iam/security-credentials/",
    "status": 500,
    "time": "2018-03-15T18:59:02Z"
}

And these are the errors on the server side:

{
    "level": "error",
    "msg": "error finding pod: pod not found",
    "pod.ip": "100.96.143.201",
    "time": "2018-03-15T18:59:01Z"
}

It looks like there is a delay between the time a pod is scheduled and the time information about it becomes available to kiam. If I delay the app's startup by a couple of seconds, the problem goes away. Deleting the problematic pod and letting Kubernetes reschedule it does the trick as well.
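
For reference, the "delayed startup" workaround looks roughly like the sketch below (Python purely for illustration - our actual app is different, and the attempt/delay values are arbitrary): keep polling the credentials path served via the kiam agent until it answers, and only then start the application.

import time
import urllib.error
import urllib.request

# The metadata path the agent intercepts; the same path appears in the agent logs above.
CREDS_URL = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"

def wait_for_credentials(attempts=30, delay=1.0):
    # Poll until the agent can resolve this pod's role, or give up.
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(CREDS_URL, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass
        time.sleep(delay)
    return False

if __name__ == "__main__":
    if not wait_for_credentials():
        raise SystemExit("kiam did not return credentials in time")
    # ...start the real application here...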

After looking into your code, it seems that increasing the prefetch-buffer-size value might help, because the issue mostly happens when many pods are scheduled at the same time. But maybe I'm missing something.

Any advice would be greatly appreciated.

P.S.: We're using kiam:v2.5 and Kubernetes v1.8.6.

pingles commented 6 years ago

Thanks for raising this.

Yep - most likely it's due to the delay between the pod starting and it being picked up by the watcher. You should have seen some more retry messages from the server/agent as it kept checking for the pod? Those should keep happening until the client disconnects.

The prefetch buffer controls the number of newly started pods that will be passed to the prefetcher to request credentials - the error you've reported happens before the client requests credentials, so I don't think it would help.

How many pods are being scheduled at the same time? The other thing you could do is turn on debug logging (just as you schedule everything - there'll be a lot of noise); you should see a load of messages with "processing delta" and the pod details. If you get the same error, could you check whether it's just a timing issue with receiving the pod data?

Could you also give me some details about the way you've got kiam deployed (i.e. how many server processes you run, whether they are also on nodes deployed as spot instances, etc.)?

ghost commented 6 years ago

Thank you.

Yeah, you're right: the server tries multiple times to find the pod until eventually the error is propagated to the agent. So I have multiple server-side errors followed by a single agent-side error. After digging a little more in our Kibana, I found that the problematic pod actually did get its credentials assigned by kiam:

{
    "credentials.access.key": "***",
    "credentials.expiration": "2018-03-15T19:06:48Z",
    "credentials.role": "***",
    "generation.metadata": 0,
    "level": "info",
    "msg": "fetched credentials",
    "pod.iam.role": "***",
    "pod.name": "***",
    "pod.namespace": "default",
    "pod.status.ip": "",
    "pod.status.phase": "Pending",
    "resource.version": "13076366",
    "time": "2018-03-15T18:58:48Z"
}

But it had no IP address yet so it was not written to the cache.

Each node runs ~50 pods, so once K8s starts draining it, these 50 pods have to be rescheduled elsewhere. The issue happens on our production cluster, so I have to act with extra care, but I'll try to change the log level to debug and see if I can get any insights.

We have 3 K8s master nodes that are on-demand instances and run the kiam-server process with the default --fetchers value (8). The K8s worker nodes are a mix of on-demand and spot instances. The agent runs on all nodes, including the masters (we have DaemonSets that require kiam functionality).

pingles commented 6 years ago

Cool... so that "fetched credentials" message means that it was able to retrieve the credentials from AWS (most likely caused by the prefetcher given the pod is pending).

Were you able to turn debug logging on? I'd like to see all entries with the message "processing delta" for the pods that have problems. Every time a pod changes state it should be tracked with that message - if we don't see any messages for the pod(s) in question, then something's being dropped for some reason. If we do get log messages, then it's potentially a timing issue (i.e. the pod state updates are processed too slowly relative to how quickly pods start and request credentials).

It's good to know that the server processes are running on the on-demand instances - when you next see the problem, could you check that the server processes aren't being restarted/killed and that they all have a nice long uptime, please?

ghost commented 6 years ago

I plan to enable debug logging this Sunday and then try to reproduce the issue. Regarding server pod restarts - they've never restarted since we started using kiam. The agent pods have restarted/crashed multiple times, although I'm not sure those crashes correlate with the errors. But I'll double-check.

pingles commented 6 years ago

OK, cool - that makes it sound most likely that there's a race between the pod starting (and requesting credentials) and how quickly the pod data arrives from the watcher.

Please let me know what you find and post back here- really keen to improve this.

ghost commented 6 years ago

Just to let you know that I've set kiam-server's log level to debug and terminated multiple K8s nodes, but the issue didn't occur. I even tried terminating multiple nodes at once, but that didn't reproduce it either. It seems to be related either to overall cluster load (traffic we get on Sundays is usually low) or to the kiam-server pods' uptime. I'll try to reproduce this again once we reach our peak traffic and will update with the findings.

pingles commented 6 years ago

Brilliant, thank you.


ghost commented 6 years ago

So here we go (this particular pod was rescheduled not because of a K8s node drain, but probably because the horizontal pod autoscaler kicked in to scale the deployment):

[
    {
        "cache.delta.type": "Added",
        "cache.object": "pod",
        "generation.metadata": 0,
        "level": "debug",
        "msg": "processing delta",
        "pod.iam.role": "***",
        "pod.name": "***-worker-556fcbd4b6-44kss",
        "pod.namespace": "default",
        "pod.status.ip": "",
        "pod.status.phase": "Pending",
        "resource.version": "17510380",
        "time": "2018-03-25T15:38:09Z"
    },
    {
        "credentials.access.key": "ASI***QKQ",
        "credentials.expiration": "2018-03-25T15:48:52Z",
        "credentials.role": "***",
        "generation.metadata": 0,
        "level": "info",
        "msg": "fetched credentials",
        "pod.iam.role": "***",
        "pod.name": "***-worker-556fcbd4b6-44kss",
        "pod.namespace": "default",
        "pod.status.ip": "",
        "pod.status.phase": "Pending",
        "resource.version": "17510380",
        "time": "2018-03-25T15:38:09Z"
    },
    {
        "cache.delta.type": "Added",
        "cache.object": "pod",
        "generation.metadata": 0,
        "level": "debug",
        "msg": "announced pod",
        "pod.iam.role": "***",
        "pod.name": "***-worker-556fcbd4b6-44kss",
        "pod.namespace": "default",
        "pod.status.ip": "",
        "pod.status.phase": "Pending",
        "resource.version": "17510380",
        "time": "2018-03-25T15:38:09Z"
    },
    {
        "credentials.access.key": "ASI***QKQ",
        "credentials.expiration": "2018-03-25T15:48:52Z",
        "credentials.role": "***",
        "generation.metadata": 0,
        "level": "info",
        "msg": "fetched credentials",
        "pod.iam.role": "***",
        "pod.name": "***-worker-556fcbd4b6-44kss",
        "pod.namespace": "default",
        "pod.status.ip": "",
        "pod.status.phase": "Pending",
        "resource.version": "17510389",
        "time": "2018-03-25T15:38:10Z"
    },
    {
        "cache.delta.type": "Updated",
        "cache.object": "pod",
        "generation.metadata": 0,
        "level": "debug",
        "msg": "announced pod",
        "pod.iam.role": "***",
        "pod.name": "***-worker-556fcbd4b6-44kss",
        "pod.namespace": "default",
        "pod.status.ip": "",
        "pod.status.phase": "Pending",
        "resource.version": "17510389",
        "time": "2018-03-25T15:38:10Z"
    },
    {
        "cache.delta.type": "Updated",
        "cache.object": "pod",
        "generation.metadata": 0,
        "level": "debug",
        "msg": "processing delta",
        "pod.iam.role": "***",
        "pod.name": "***-worker-556fcbd4b6-44kss",
        "pod.namespace": "default",
        "pod.status.ip": "",
        "pod.status.phase": "Pending",
        "resource.version": "17510389",
        "time": "2018-03-25T15:38:10Z"
    },
    {
        "credentials.access.key": "ASI***QKQ",
        "credentials.expiration": "2018-03-25T15:48:52Z",
        "credentials.role": "***",
        "generation.metadata": 0,
        "level": "info",
        "msg": "fetched credentials",
        "pod.iam.role": "***",
        "pod.name": "***-worker-556fcbd4b6-44kss",
        "pod.namespace": "default",
        "pod.status.ip": "",
        "pod.status.phase": "Pending",
        "resource.version": "17510393",
        "time": "2018-03-25T15:38:10Z"
    },
    {
        "cache.delta.type": "Updated",
        "cache.object": "pod",
        "generation.metadata": 0,
        "level": "debug",
        "msg": "announced pod",
        "pod.iam.role": "***",
        "pod.name": "***-worker-556fcbd4b6-44kss",
        "pod.namespace": "default",
        "pod.status.ip": "",
        "pod.status.phase": "Pending",
        "resource.version": "17510393",
        "time": "2018-03-25T15:38:10Z"
    },
    {
        "cache.delta.type": "Updated",
        "cache.object": "pod",
        "generation.metadata": 0,
        "level": "debug",
        "msg": "processing delta",
        "pod.iam.role": "***",
        "pod.name": "***-worker-556fcbd4b6-44kss",
        "pod.namespace": "default",
        "pod.status.ip": "",
        "pod.status.phase": "Pending",
        "resource.version": "17510393",
        "time": "2018-03-25T15:38:10Z"
    },
    {
        "level": "error",
        "msg": "error finding pod: pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:18Z"
    },
    {
        "level": "error",
        "msg": "error finding pod: pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:21Z"
    },
    {
        "credentials.access.key": "ASI***QKQ",
        "credentials.expiration": "2018-03-25T15:48:52Z",
        "credentials.role": "***",
        "generation.metadata": 0,
        "level": "info",
        "msg": "fetched credentials",
        "pod.iam.role": "***",
        "pod.name": "***-worker-556fcbd4b6-44kss",
        "pod.namespace": "default",
        "pod.status.ip": "100.115.1.182",
        "pod.status.phase": "Running",
        "resource.version": "17510549",
        "time": "2018-03-25T15:38:24Z"
    },
    {
        "cache.delta.type": "Updated",
        "cache.object": "pod",
        "generation.metadata": 0,
        "level": "debug",
        "msg": "announced pod",
        "pod.iam.role": "***",
        "pod.name": "***-worker-556fcbd4b6-44kss",
        "pod.namespace": "default",
        "pod.status.ip": "100.115.1.182",
        "pod.status.phase": "Running",
        "resource.version": "17510549",
        "time": "2018-03-25T15:38:24Z"
    },
    {
        "cache.delta.type": "Updated",
        "cache.object": "pod",
        "generation.metadata": 0,
        "level": "debug",
        "msg": "processing delta",
        "pod.iam.role": "***",
        "pod.name": "***-worker-556fcbd4b6-44kss",
        "pod.namespace": "default",
        "pod.status.ip": "100.115.1.182",
        "pod.status.phase": "Running",
        "resource.version": "17510549",
        "time": "2018-03-25T15:38:24Z"
    }
]

pingles commented 6 years ago

Interesting - so there's a handful of messages at the end, where the phase goes from Pending to Running, which are recorded after the "error finding pod: pod not found" errors. The server returned two errors to the agent (the last at 2018-03-25T15:38:21Z) and was informed of the pod's IP at 2018-03-25T15:38:24Z.

It's interesting that there are only two errors - it suggests that the agent only attempted to find the pod twice, between 2018-03-25T15:38:18Z and 2018-03-25T15:38:21Z, which is a pretty short retry period (I think the Java SDK has a 5 second read timeout).

I don't suppose you have the agent logs from around the same time? I'd like to know when the client first requests credentials from the agent (causing it to fetch from the server) and when it errors. I'm expecting it's only twice, but it'd be useful to confirm.

Is there anything in particular you could share about this client process, please - which language/platform/AWS SDK does it use?

I'll also try and do some more digging/tests and see if there's anything else that can be done to speed things up.

Thanks for all your help collecting the data!

ghost commented 6 years ago

Thanks a lot for your time and effort. First of all, just to correct you (as I've already mentioned), I omitted most of the "pod not found" error messages, leaving only the first and the last ones to give you a sense of the timeline - there were ~30 such errors in total. To your question: this app runs on the ruby:2.2.7-slim Docker image and has aws-sdk v2.10.32 installed. The app initializes a Kinesis client on startup and then starts publishing events to a Kinesis stream, which is what produced the errors. Here are the agent logs:

[
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:18Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:18Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:18Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:18Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:18Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:18Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:18Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:18Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:18Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:18Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:19Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:19Z"
    },
    {
        "addr": "100.115.1.182:60826",
        "level": "error",
        "method": "GET",
        "msg": "error processing request: rpc error: code = Unknown desc = pod not found",
        "path": "/latest/meta-data/iam/security-credentials/",
        "status": 500,
        "time": "2018-03-25T15:38:19Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T16:38:19Z"
    },
    {
        "addr": "100.115.1.182:60826",
        "headers": {
            "Content-Type": [
                "text/plain; charset=utf-8"
            ],
            "X-Content-Type-Options": [
                "nosniff"
            ]
        },
        "level": "info",
        "method": "GET",
        "msg": "processed request",
        "path": "/latest/meta-data/iam/security-credentials/",
        "status": 500,
        "time": "2018-03-25T15:38:19Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:19Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:19Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:19Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:19Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:19Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:19Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:19Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:19Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:19Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:20Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:20Z"
    },
    {
        "addr": "100.115.1.182:32818",
        "level": "error",
        "method": "GET",
        "msg": "error processing request: rpc error: code = Unknown desc = pod not found",
        "path": "/latest/meta-data/iam/security-credentials/",
        "status": 500,
        "time": "2018-03-25T15:38:20Z"
    },
    {
        "addr": "100.115.1.182:32818",
        "headers": {
            "Content-Type": [
                "text/plain; charset=utf-8"
            ],
            "X-Content-Type-Options": [
                "nosniff"
            ]
        },
        "level": "info",
        "method": "GET",
        "msg": "processed request",
        "path": "/latest/meta-data/iam/security-credentials/",
        "status": 500,
        "time": "2018-03-25T15:38:20Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:20Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:20Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:20Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:20Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:20Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:20Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:20Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:20Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:20Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:20Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:21Z"
    },
    {
        "level": "warning",
        "msg": "error finding role for pod: rpc error: code = Unknown desc = pod not found",
        "pod.ip": "100.115.1.182",
        "time": "2018-03-25T15:38:21Z"
    },
    {
        "addr": "100.115.1.182:32998",
        "level": "error",
        "method": "GET",
        "msg": "error processing request: rpc error: code = Unknown desc = pod not found",
        "path": "/latest/meta-data/iam/security-credentials/",
        "status": 500,
        "time": "2018-03-25T15:38:21Z"
    },
    {
        "addr": "100.115.1.182:32998",
        "headers": {
            "Content-Type": [
                "text/plain; charset=utf-8"
            ],
            "X-Content-Type-Options": [
                "nosniff"
            ]
        },
        "level": "info",
        "method": "GET",
        "msg": "processed request",
        "path": "/latest/meta-data/iam/security-credentials/",
        "status": 500,
        "time": "2018-03-25T15:38:21Z"
    }
]

And just in case, here's the daemon.log from the K8s node itself:

Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: I0325 15:38:10.036847    2582 kubelet.go:1837] SyncLoop (ADD, "api"): "***-worker-556fcbd4b6-44kss_default(85738579-3042-11e8-bace-0a2c434faade)"
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: I0325 15:38:10.107776    2582 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-5fw7n" (UniqueName: "kubernetes.io/secret/85738579-3042-11e8-bace-0a2c434faade-default-token-5fw7n") pod "***-worker-556fcbd4b6-44kss" (UID: "85738579-3042-11e8-bace-0a2c434faade")
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: I0325 15:38:10.208308    2582 reconciler.go:257] operationExecutor.MountVolume started for volume "default-token-5fw7n" (UniqueName: "kubernetes.io/secret/85738579-3042-11e8-bace-0a2c434faade-default-token-5fw7n") pod "***-worker-556fcbd4b6-44kss" (UID: "85738579-3042-11e8-bace-0a2c434faade")
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: I0325 15:38:10.218310    2582 operation_generator.go:484] MountVolume.SetUp succeeded for volume "default-token-5fw7n" (UniqueName: "kubernetes.io/secret/85738579-3042-11e8-bace-0a2c434faade-default-token-5fw7n") pod "***-worker-556fcbd4b6-44kss" (UID: "85738579-3042-11e8-bace-0a2c434faade")
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: I0325 15:38:10.345331    2582 kuberuntime_manager.go:374] No sandbox for pod "***-worker-556fcbd4b6-44kss_default(85738579-3042-11e8-bace-0a2c434faade)" can be found. Need to start a new one
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.664 [INFO][18673] calico.go 87: Extracted identifiers ContainerID="ef6125e79469671926340aca08dbd640cbe72bd93531654573f2114d715fb657" Node="kube-v186-nodes-default-04c1" Orchestrator="k8s" Workload="default.***-worker-556fcbd4b6-44kss"
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: { Labels []struct { Key string "json:\"key\""; Value string "json:\"value\"" } "json:\"labels,omitempty\"" }{Labels:[]struct { Key string "json:\"key\""; Value string "json:\"value\"" }(nil)}}}}, EtcdScheme:"", EtcdKeyFile:"", EtcdCertFile:"", EtcdCaCertFile:""} Workload="default.***-worker-556fcbd4b6-44kss"
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.664 [INFO][18673] utils.go 250: Configured environment: [LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOME=/root LOGNAME=root USER=root SHELL=/bin/sh DAEMON_ARGS=--allow-privileged=true --cgroup-root=/ --cloud-provider=aws --cluster-dns=100.64.0.10 --cluster-domain=cluster.local --enable-custom-metrics=true --enable-debugging-handlers=true --eviction-hard=memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<10%,imagefs.inodesFree<5% --feature-gates=ExperimentalCriticalPodAnnotation=true --hostname-override=ip-172-16-119-196.fiverr.internal --kubeconfig=/var/lib/kubelet/kubeconfig --network-plugin=cni --node-labels=kops.k8s.io/instancegroup=nodes,kubernetes.io/role=node,node-role.kubernetes.io/node= --non-masquerade-cidr=100.64.0.0/10 --pod-infra-container-image=gcr.io/google_containers/pause-amd64:3.0 --pod-manifest-path=/etc/kubernetes/manifests --register-schedulable=true --require-kubeconfig=true --v=2 --cloud-config=/etc/kubernetes/cloud.config --cni-bin-dir=/opt/cni/bin/ --cni-conf-dir=/etc/cni/net.d/ CNI_COMMAND=ADD CNI_CONTAINERID=ef6125e79469671926340aca08dbd640cbe72bd93531654573f2114d715fb657 CNI_NETNS=/proc/18588/ns/net CNI_ARGS=IgnoreUnknown=1;IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=***-worker-556fcbd4b6-44kss;K8S_POD_INFRA_CONTAINER_ID=ef6125e79469671926340aca08dbd640cbe72bd93531654573f2114d715fb657 CNI_IFNAME=eth0 CNI_PATH=/opt/calico/bin:/opt/cni/bin/ ETCD_ENDPOINTS=http://etcd-c.internal.kube-v186.fiverr.internal:4001,http://etcd-d.internal.kube-v186.fiverr.internal:4001,http://etcd-e.internal.kube-v186.fiverr.internal:4001 KUBECONFIG=/etc/cni/net.d/calico-kubeconfig K8S_API_TOKEN=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJjYWxpY28tdG9rZW4ta21sejIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIj
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.674 [INFO][18673] k8s.go 67: Extracted identifiers for CmdAddK8s Node="kube-v186-nodes-default-04c1" Orchestrator="k8s" Workload="default.***-worker-556fcbd4b6-44kss"
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: I0325 15:38:10.710480    2582 kubelet.go:1871] SyncLoop (PLEG): "***-worker-556fcbd4b6-44kss_default(85738579-3042-11e8-bace-0a2c434faade)", event: &pleg.PodLifecycleEvent{ID:"85738579-3042-11e8-bace-0a2c434faade", Type:"ContainerStarted", Data:"ef6125e79469671926340aca08dbd640cbe72bd93531654573f2114d715fb657"}
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.802 [INFO][18748] utils.go 250: Configured environment: [LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOME=/root LOGNAME=root USER=root SHELL=/bin/sh DAEMON_ARGS=--allow-privileged=true --cgroup-root=/ --cloud-provider=aws --cluster-dns=100.64.0.10 --cluster-domain=cluster.local --enable-custom-metrics=true --enable-debugging-handlers=true --eviction-hard=memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<10%,imagefs.inodesFree<5% --feature-gates=ExperimentalCriticalPodAnnotation=true --hostname-override=ip-172-16-119-196.fiverr.internal --kubeconfig=/var/lib/kubelet/kubeconfig --network-plugin=cni --node-labels=kops.k8s.io/instancegroup=nodes,kubernetes.io/role=node,node-role.kubernetes.io/node= --non-masquerade-cidr=100.64.0.0/10 --pod-infra-container-image=gcr.io/google_containers/pause-amd64:3.0 --pod-manifest-path=/etc/kubernetes/manifests --register-schedulable=true --require-kubeconfig=true --v=2 --cloud-config=/etc/kubernetes/cloud.config --cni-bin-dir=/opt/cni/bin/ --cni-conf-dir=/etc/cni/net.d/ CNI_COMMAND=ADD CNI_CONTAINERID=ef6125e79469671926340aca08dbd640cbe72bd93531654573f2114d715fb657 CNI_NETNS=/proc/18588/ns/net CNI_ARGS=IgnoreUnknown=1;IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=***-worker-556fcbd4b6-44kss;K8S_POD_INFRA_CONTAINER_ID=ef6125e79469671926340aca08dbd640cbe72bd93531654573f2114d715fb657 CNI_IFNAME=eth0 CNI_PATH=/opt/calico/bin:/opt/cni/bin/ ETCD_ENDPOINTS=http://etcd-c.internal.kube-v186.fiverr.internal:4001,http://etcd-d.internal.kube-v186.fiverr.internal:4001,http://etcd-e.internal.kube-v186.fiverr.internal:4001 KUBECONFIG=/etc/cni/net.d/calico-kubeconfig K8S_API_TOKEN=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJjYWxpY28tdG9rZW4ta21sejIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIj
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.803 [INFO][18748] calico-ipam.go 177: Auto assigning IP Workload="default.***-worker-556fcbd4b6-44kss" assignArgs=client.AutoAssignArgs{Num4:1, Num6:0, HandleID:(*string)(0xc4203139f0), Attrs:map[string]string(nil), Hostname:"kube-v186-nodes-default-04c1", IPv4Pools:[]net.IPNet{}, IPv6Pools:[]net.IPNet{}}
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.814 [INFO][18748] ipam.go 816: Creating new handle: default.***-worker-556fcbd4b6-44kss
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.823 [INFO][18748] calico-ipam.go 205: IPAM Result Workload="default.***-worker-556fcbd4b6-44kss" result.IPs=[]*current.IPConfig{(*current.IPConfig)(0xc4200ecba0)}
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.825 [INFO][18673] k8s.go 264: Populated endpoint Workload="default.***-worker-556fcbd4b6-44kss" endpoint=WorkloadEndpoint(Node=kube-v186-nodes-default-04c1, Orchestrator=k8s, Workload=default.***-worker-556fcbd4b6-44kss, Name=eth0)
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.862 [INFO][18673] k8s.go 288: Added Mac, interface name, and active container ID to endpoint Workload="default.***-worker-556fcbd4b6-44kss" endpoint=WorkloadEndpoint(Node=kube-v186-nodes-default-04c1, Orchestrator=k8s, Workload=default.***-worker-556fcbd4b6-44kss, Name=eth0, ActiveInstanceID=ef6125e79469671926340aca08dbd640cbe72bd93531654573f2114d715fb657)
Mar 25 15:38:10 ip-172-16-119-196 kubelet[2582]: 2018-03-25 15:38:10.866 [INFO][18673] k8s.go 296: Wrote updated endpoint to datastore Workload="default.***-worker-556fcbd4b6-44kss"
Mar 25 15:38:14 ip-172-16-119-196 kubelet[2582]: I0325 15:38:14.813701    2582 kubelet.go:1871] SyncLoop (PLEG): "***-worker-556fcbd4b6-44kss_default(85738579-3042-11e8-bace-0a2c434faade)", event: &pleg.PodLifecycleEvent{ID:"85738579-3042-11e8-bace-0a2c434faade", Type:"ContainerStarted", Data:"af7d7945a1d1420a6cbf33edf9ac036a384f37155f3c934d3ee6a5a01312175d"}
pingles commented 6 years ago

Thanks - sorry, I misunderstood your point about the first/last messages. The warnings are all retries within the agent; the "processed request" messages would suggest that the client is retrying twice?

I can't see anything that immediately jumps out as strange - the time the agent retries for seems shorter than it should be (it's configured to retry for up to 5 seconds) and there are no messages about a cancelled context; maybe it's just the clocks in the log messages being a little out.

Do you have any Prometheus metrics around the API server response times? It'd be good to see whether there's any variability there (slow responses etc.) to check whether something is limiting how quickly the watcher can forward data to the server's pod cache. We had quite high 95th percentiles from the apiservers at one point because we'd forgotten to set requests and limits - it was just picking up our cluster defaults - and increasing those improved performance quite significantly, making it both faster and more consistent.

pingles commented 6 years ago

I'll make a note to add some tests and benchmarks around the code that processes watch deltas, to make sure there isn't anything that blocks during delta processing - I don't think there is, but it'd be better to prove it.

I'll keep thinking about other things that could be causing slow updates - it definitely feels like that's the cause (either kiam being slow to process them, or the API server being slow to serve them).

ghost commented 6 years ago

As per the logs, the client tried 3 times. After looking into the SDK documentation, it seems this is the client's default. We don't have Prometheus yet, so our visibility into what happens under the K8s hood is somewhat limited, but if I get more data I'll update you. Meanwhile we will simply adjust the clients' timeout settings to remediate the issue. Thanks again for your help.

ghost commented 6 years ago

Hello again.

The good news is that we finally have Prometheus set up, running and collecting both system and KIAM metrics. The "bad" news is that we rolled out the client timeout change to the most problematic microservices, so the issue's frequency decreased dramatically and it became much harder to reproduce. But once it occurs, are there any specific metrics you'd like me to provide?

It would also be greatly appreciated if you could share your Grafana dashboard for the KIAM service and/or provide some explanation of the metrics it exposes.

pingles commented 6 years ago

are there any specific metrics you'd like me to provide you?

The API server metrics would be really interesting - in particular to see if times for gets/watches on Pods spike around the same time. We had problems before where our 95th percentile API server response times were all over the place because we hadn't set the requests/limits high enough; they dropped once we gave them enough. It'd be useful to see whether slow deltas being delivered to Kiam are caused by something within the apiserver or within Kiam.

The Prometheus metrics for Kiam are not great at the moment - they export the same metrics reported with StatsD. #35 is a reminder to replace the StatsD stuff with direct Prometheus metrics, and I'd probably instrument Kiam differently to make this sort of thing easier to debug.

For now, tracking PodCache.processDelta (I think Prometheus probably rewrites it with underscores, but you should be able to find it) would be useful - to see whether it flatlines around the same time, i.e. does Kiam have an upper bound that's under the number of deltas it should be processing, and how does this vary with the metrics tracking API server response times, number of running pods etc.

Does that help?

Sorry, we don't have any Grafana dashboards yet (we've historically used Datadog the most) but will do soon - I'd love to get a Kiam dashboard defined; I'll try to address that with #35 as well.

kumare3 commented 6 years ago

Hi @pingles, we are seeing these issues on our cluster. I will also try to dig up the metrics for the kube API. I'm simulating a batch load test in which we simultaneously schedule 100+ pods, and eventually all the containers exit because they are unable to retrieve credentials and hence fail to connect to S3. Any advice would be greatly appreciated.

roffe commented 6 years ago

To get around errors caused by the pod cache not being primed, I inject the following env variables into ANY container that's going to access AWS from my k8s cluster:

AWS_METADATA_SERVICE_TIMEOUT=5
AWS_METADATA_SERVICE_NUM_ATTEMPTS=20

https://boto3.readthedocs.io/en/latest/guide/configuration.html

boto3 and the AWS CLI obey these parameters, and basically this gives the app many more retries to get what it needs while the cache is being primed.

Before I added these env variables, many of my pods would fail on first start, since many of them begin by fetching a file or two from S3.

With this it was smooth sailing - pods come up on the first attempt without errors or pod restarts.
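
For example, here's a minimal application-side sketch (Python/boto3; the bucket listing is just a stand-in for whatever the app does first). As long as the variables are set before the credential chain runs - normally via the container's env section rather than in code - botocore keeps retrying the metadata endpoint instead of failing on its first attempt:

import os

# Normally injected via the container's env; values mirror the ones above.
os.environ.setdefault("AWS_METADATA_SERVICE_TIMEOUT", "5")        # seconds per attempt
os.environ.setdefault("AWS_METADATA_SERVICE_NUM_ATTEMPTS", "20")  # attempts before giving up

import boto3

# With the settings above, this call rides out the window where the kiam
# server hasn't seen the pod's IP yet, instead of erroring immediately.
s3 = boto3.client("s3")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])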

kumare3 commented 6 years ago

I will try and post back, thank you

kumare3 commented 6 years ago

@roffe that is awesome - it actually worked. At some point I was thinking that the load was too high for the kube API to handle; when we were using kube2iam we were getting panics in kube-api. We are running some massive tests and will keep you posted if you are interested. Also, we would love to understand how to monitor this better.

roffe commented 6 years ago

@kumare3 there is a Grafana dashboard for kiam now; if you use Prometheus metrics and the latest 0.3 beta, it will give you a visualization of the metrics: https://github.com/uswitch/kiam/blob/master/docs/METRICS.md