uswitch / kiam

Integrate AWS IAM with Kubernetes
Apache License 2.0

Cronjob (cold start) fails to get role #293

Open jaygorrell opened 5 years ago

jaygorrell commented 5 years ago

Out of 200+ services in my system, the only jobs that fail are the cronjobs that use a role immediately on startup. My suspicion is that kiam doesn't know to cache the role when the job isn't running (these are weekly jobs).

Is there any way to force this, or even set a long timeout on keeping creds fresh?

Joseph-Irving commented 5 years ago

Kiam pre-fetches credentials as pods are scheduled. When the pod starts and requests creds, kiam will just return them if they're already in the cache; otherwise it goes and fetches them on request.

For most AWS SDKs I believe the timeout for this is around 5 seconds, so this would indicate that either your kiam server cannot respond in time, or your app is using a shorter timeout.
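If it does turn out to be a tight SDK-side timeout, one thing you could try (just a sketch; these env vars apply to botocore-based SDKs and the AWS CLI, other SDKs have their own settings) is giving the metadata credential lookup more time and retries on the job's container:

  env:
    - name: AWS_METADATA_SERVICE_TIMEOUT       # seconds allowed per metadata request
      value: "10"
    - name: AWS_METADATA_SERVICE_NUM_ATTEMPTS  # retries before the SDK gives up
      value: "5"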

It would be a good idea to have a look at the logs in your server/agents: can you see the kiam server fetching credentials for your pod? Is it timing out? It would also be worth looking at the metrics for your kiam-server, e.g. whether it is CPU starved.

lomkju commented 5 years ago

We have been facing the same problem. We have around 1.5k+ pods running inside our clusters, and new pods take time to get synced with the kiam-server. CronJobs are the most affected, as their pods take less than 2 seconds to get scheduled and request creds from the kiam-agent, which fails because the kiam-server hasn't discovered the pod yet. We noticed that the kiam-server takes time to sync the pod list. One workaround is adding a sleep in an init container, but this can't be used in production.
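For reference, the sleep workaround looks roughly like this (the container name, image, and 10-second value are arbitrary placeholders):

      initContainers:
        - name: wait-for-kiam                     # placeholder name
          image: busybox:1.32
          command: ["/bin/sh", "-c", "sleep 10"]  # give kiam-server time to discover the pod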

Joseph-Irving commented 5 years ago

Have you looked at things like CPU utilisation of your kiam-server? What requests/limits are you using? We have around 3k pods in one of our clusters and we don't see this issue. It's also worth checking your API server to see if it is struggling with the volume of requests.

jaygorrell commented 5 years ago

I think the key point being missed here is the cold start of cronjobs -- it doesn't seem to have anything to do with load. Since it's a cold start, the credentials are fetched right as they're needed, resulting in errors in the application. If I run a second job immediately after, it works fine.

I was looking for a way to adjust how long kiam keeps credentials fresh after the pod goes away, or a way to manually specify roles to keep fresh without them being on a current pod. I'm close to just running an nginx pod with the role I need so it's always fresh.
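For what it's worth, the kind of keep-warm pod I have in mind would look roughly like this (the pod name and role are placeholders, and it assumes the namespace already permits the role for kiam):

  apiVersion: v1
  kind: Pod
  metadata:
    name: keep-role-warm                       # placeholder name
    annotations:
      iam.amazonaws.com/role: my-cronjob-role  # placeholder: same role the cronjob uses
  spec:
    containers:
      - name: idle
        image: nginx:1.19
        resources:
          requests:
            cpu: 10m
            memory: 16Mi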

jaygorrell commented 5 years ago

@Joseph-Irving Do you have any thoughts here? Can I help explain something better?

The gist of this is that things like Argo workflows or cronjobs don't have a pod for kiam to be aware of until they run, so there's a race condition when they start up very quickly and try to make AWS calls.

Would it be feasible to add an env var of "extra" roles to maintain in cache, regardless of whether any visible pod is using them? That would solve this.

Edit: One more point of clarification: I'm using Argo to parallelize jobs, which means many steps may request creds for the same role at the same time -- this is where kube2iam always struggled. With kiam, if I run two jobs back to back, the first one populates the cache (and fails), but the second works great because my workflow is able to get the creds from cache.

pingles commented 5 years ago

Kiam will fetch credentials from STS as soon as a Pod that has a role goes pending. Once credentials have been fetched for that role, they'll be reused for other pods too. It's possible that, if you have lots and lots of jobs with lots of different roles all starting at the same time, the queue between the pod tracker and the credential fetcher fills up, causing things to get buffered.

There are probably two settings you could try adjusting to see if you get the behaviour you want:

https://github.com/uswitch/kiam/blob/master/cmd/kiam/server.go#L49 https://github.com/uswitch/kiam/blob/master/cmd/kiam/server.go#L50

One increases the number of concurrent STS requests; the other controls how many events to buffer in the channel between the pod watcher and the prefetchers. Increasing the latter will increase memory consumption but may avoid the back pressure that would result in what you're seeing.
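As a sketch, if I'm reading those linked lines right they correspond to the --fetchers and --prefetch-buffer-size server flags (double-check the names against the version you're running; the values below are only illustrative):

  # kiam-server container args
  args:
    - --fetchers=16                # more concurrent STS requests
    - --prefetch-buffer-size=5000  # buffer more pod events between the watcher and the prefetchers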

There are also prometheus metrics that should help you understand whether you're maxing out (kiam_sts_assumerole_current, for example).

Hope that helps

pasky commented 5 years ago

We have just 4 cronjobs, scheduled every minute and taking just a few seconds, and this is what these jobs permanently look like:

morgan-worker-autoscaling-fdc-1570143420-7v97n                0/1     Error                        0          10h
morgan-worker-autoscaling-fdc-1570143420-mkjhn                0/1     Error                        0          10h
morgan-worker-autoscaling-fdc-1570143420-mzqh4                0/1     Error                        0          10h
morgan-worker-autoscaling-fdc-1570143420-w6jjd                0/1     Error                        0          10h
morgan-worker-autoscaling-fdc-1570143420-wm6lj                0/1     Error                        0          10h
morgan-worker-autoscaling-fdc-1570143420-wmkn9                0/1     Error                        0          10h
morgan-worker-autoscaling-fdc-1570179720-ldtf4                0/1     Completed                    0          3m14s
morgan-worker-autoscaling-fdc-1570179780-44nh5                0/1     Completed                    0          119s
morgan-worker-autoscaling-fdc-1570179780-qlfpf                0/1     Error                        0          2m17s
morgan-worker-autoscaling-fdc-1570179780-zm42h                0/1     Error                        0          2m9s
morgan-worker-autoscaling-fdc-1570179840-9lqb8                0/1     Completed                    0          62s
morgan-worker-autoscaling-fdc-1570179840-lkp2m                0/1     Error                        0          70s
morgan-worker-autoscaling-fdc-1570179900-bnpr6                0/1     Error                        0          12s
morgan-worker-autoscaling-fdc-1570179900-n5hzw                1/1     Running                      0          4s

(Maybe having per-minute cronjobs isn't that great a design -- we are looking at changing that -- but the trouble is that we also get kiam errors any time we restart long-running pods: they routinely take 5 to 10 attempts to start before they finally get proper credentials and start running.)

Maybe we have some misconfiguration but if so, it's been really hard to debug.

QuantumBJump commented 4 years ago

We're also running into what I think is the same problem -- occasionally, when a weekly cronjob we run starts, it fails to get credentials, but on a second run everything works fine. It just happened to us today, and looking through the agent logs I see this from when the job ran (I've collapsed duplicate log lines for readability).

{"level":"warning","msg":"error finding role for pod: rpc error: code = Unknown desc = pod not found","pod.ip":"10.2.88.4","time":"2020-07-24T10:05:55Z"} (x 12)
{"addr":"10.2.88.4:49752","level":"error","method":"GET","msg":"error processing request: rpc error: code = Unknown desc = pod not found","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2020-07-24T10:05:56Z"}
{"addr":"10.2.88.4:49752","duration":1000,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2020-07-24T10:05:56Z"}
{"level":"warning","msg":"error finding role for pod: rpc error: code = Unknown desc = pod not found","pod.ip":"10.2.88.4","time":"2020-07-24T10:08:10Z"} (x 9)

I've just added a sleep to the beginning of the job to see if it works around the problem, although ideally we wouldn't have to do this. Any input on whether this is the same problem we're encountering, or on better ways to work around it until it's fixed?
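If anyone else needs a stopgap, a slightly less blind version of the sleep (a sketch; the curl image and the 60-second cap are my own choices) is an init container that polls the same credentials path that's failing in the logs above until it answers:

      initContainers:
        - name: wait-for-creds             # placeholder name
          image: curlimages/curl:7.72.0    # any small image with curl and a shell works
          command:
            - /bin/sh
            - -c
            # keep polling the role list endpoint until kiam can resolve this pod (give up after ~60s)
            - |
              for i in $(seq 1 60); do
                curl -sf http://169.254.169.254/latest/meta-data/iam/security-credentials/ && exit 0
                sleep 1
              done
              exit 1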

gladiatr72 commented 3 years ago

(Also using EKS here)

The easiest way I've found to mitigate this is to add a NOP init container. I reuse the main container's image:tag but amend the spec with

  template:
    spec:
      initContainers:
        - name: contemplation
          image: <same image:tag as the main container>
          command: ["/bin/sh", "-c", "echo"]
      containers:

Beats modifying the container image, anyway :)

I was originally thinking I would need to add a short sleep(1), but so far the extra latency the init container adds seems to be enough for the STS creds to be ready.

What clued me in: [...]pod.status.ip= pod.status.phase=Pending resource.version=3674841 in kiam-server logs.

Your what-would-be-etcd-in-other-Kubernetes-environments stream is AWS Kinesis. After last week's us-east-1 outage, EKS is obviously one of the many services using Kinesis for message dissemination. Like most non-dedicated cloud services, Kinesis offers a (reasonably) solid service profile but is notably pokey compared to a dedicated etcd store when playing in Kubeland. Since kiam (and similar operators) are entirely reliant on this event stream, it is not all that surprising that such errors occur.

EKS uses the aws-vpc CNI as well, so there are most likely some pod <-> CNI <-> cloud conversations that have to happen before a VPC IP is returned for assignment to the pod. (Yet more messages having to pass between and through yet more AWS services.)

Anyway, you get the idea -- these services are chugging away at their best eventual consistency, with the usual Big Cloud guarantee of there being no guarantee, except that putting ThingX in ThingZ will result in a ThingN being emitted somewhere... probably.

I do not doubt that the IP address is actually assigned before the pod starts running. I just don't think a controller such as kiam-server has more than an average chance of receiving the necessary data from the event bus in time to respond to fast-bootstrap, run-and-done jobs without a little help.