jtblin / kube2iam

kube2iam provides different AWS IAM roles for pods running on Kubernetes
BSD 3-Clause "New" or "Revised" License

Prefetch credentials and refresh periodically to better emulate the metadata service. #31

Open marshallbrekka opened 7 years ago

marshallbrekka commented 7 years ago

As it currently stands, credentials are only requested from STS on an as needed basis. Unfortunately this adds significant latency (I've seen > 100ms) compared to the metadata service (< 1ms).

This can cause issues for clients (like boto) that have short timeouts when probing the metadata service as part of the credential provider chain, leading them to conclude that no credentials are available.

This can be solved with three components:

  • Fetch credentials as soon as the pod is added
  • A goroutine per pod to refresh the credentials before they expire
  • Stop the goroutine when the pod is deleted

I've submitted a PR #30 that implements this solution, but I'm happy to discuss other solutions, or modifications to the existing PR.
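For illustration, here is a rough sketch of the shape of that approach. It is not the code from #30, and the names (Credentials, assumeRoleFunc, refresher) are placeholders:

```go
// Sketch only: fetch credentials when a pod is added, refresh them shortly
// before they expire, and stop the goroutine when the pod is deleted.
package credrefresh

import (
	"context"
	"sync"
	"time"
)

// Credentials is a placeholder for the STS credential payload.
type Credentials struct {
	AccessKeyID     string
	SecretAccessKey string
	Token           string
	Expiration      time.Time
}

// assumeRoleFunc stands in for the sts:AssumeRole call kube2iam already makes.
type assumeRoleFunc func(ctx context.Context, roleARN string) (Credentials, error)

type refresher struct {
	mu      sync.Mutex
	cancels map[string]context.CancelFunc // pod UID -> stop function
	cache   sync.Map                      // pod UID -> Credentials
	assume  assumeRoleFunc
}

func newRefresher(assume assumeRoleFunc) *refresher {
	return &refresher{
		cancels: make(map[string]context.CancelFunc),
		assume:  assume,
	}
}

// OnPodAdd starts a goroutine that keeps credentials warm for one pod.
func (r *refresher) OnPodAdd(podUID, roleARN string) {
	ctx, cancel := context.WithCancel(context.Background())
	r.mu.Lock()
	r.cancels[podUID] = cancel
	r.mu.Unlock()

	go func() {
		for {
			creds, err := r.assume(ctx, roleARN)
			wait := time.Minute // retry soon on error
			if err == nil {
				r.cache.Store(podUID, creds)
				// Refresh a few minutes before expiry.
				wait = time.Until(creds.Expiration.Add(-5 * time.Minute))
				if wait < time.Minute {
					wait = time.Minute
				}
			}
			select {
			case <-ctx.Done():
				return
			case <-time.After(wait):
			}
		}
	}()
}

// OnPodDelete stops the refresh goroutine and drops the cached credentials.
func (r *refresher) OnPodDelete(podUID string) {
	r.mu.Lock()
	if cancel, ok := r.cancels[podUID]; ok {
		cancel()
		delete(r.cancels, podUID)
	}
	r.mu.Unlock()
	r.cache.Delete(podUID)
}
```

The metadata handler would then serve credentials out of the cache and only fall back to a blocking STS call when an entry is missing.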

jtblin commented 7 years ago

As it currently stands, credentials are only requested from STS on an as needed basis.

This is not entirely correct: it will hit STS at most once every 15 minutes, as the credentials are cached thereafter. This seemed like an acceptable tradeoff.

This can be solved with three components

  • Fetch credentials as soon as the pod is added
  • A goroutine per pod to refresh the credentials before they expire
  • Stop the goroutine when the pod is deleted

I don't think this solution will scale to thousands of nodes, potentially hundreds of thousands of pods as:

  1. it would be hitting the AWS API limits pretty hard and potentially have knock-on effects on the entire AWS account. This is a show stopper imo.
  2. there could be hundreds of thousands of goroutines running for kube2iam, therefore consuming a lot of resources even if goroutines are cheap.

I would rather increase the caching duration to 60 minutes and/or make it configurable.
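If the configurable route is taken, a minimal sketch might be nothing more than a duration flag consulted by the existing cache; the flag name here is hypothetical, not an existing kube2iam option:

```go
// Sketch only: a configurable credential cache duration.
// The flag name is hypothetical and not an existing kube2iam option.
package credcache

import (
	"flag"
	"time"
)

var cacheTTL = flag.Duration("iam-cache-duration", 15*time.Minute,
	"how long assumed-role credentials are cached before STS is called again")

// fresh reports whether credentials fetched at fetchedAt are still usable
// under the configured cache duration.
func fresh(fetchedAt time.Time) bool {
	return time.Since(fetchedAt) < *cacheTTL
}
```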

marshallbrekka commented 7 years ago

I think there is some misunderstanding about the proposed solution, as well as the issue(s) it was meant to solve, so I'll try to explain it better below.

Issues to Resolve

1. Clients fail to get credentials on startup

This was the primary issue: if a client failed to find credentials from the metadata service within its default timeout (for some this is 1s, and I have seen STS take longer than that on occasion), it would assume that the metadata service is not available and move on down the credential provider chain.

This can essentially cause a Pod to fail on startup, and never recover, since the clients cache the credential provider decisions.

As it currently stands, credentials are only requested from STS on an as needed basis.

This is not entirely correct: it will hit STS at most once every 15 minutes, as the credentials are cached thereafter. This seemed like an acceptable tradeoff.

In light of the issue described above, I don't think this tradeoff is acceptable.

2. Performance Penalties

This is much less important than the first issue, since it doesn't necessarily cause an outright failure, but for services where latency matters, adding an extra ~100ms to a request is highly undesirable.

Performance Concerns

I don't think this solution will scale to thousands of nodes, potentially hundreds of thousands of pods as:

  1. it would be hitting the AWS API limits pretty hard and potentially have knock-on effects on the entire AWS account. This is a show stopper imo.
  2. there could be hundreds of thousands of goroutines running for kube2iam, therefore consuming a lot of resources even if goroutines are cheap.

I think both of these points are easily addressed by a few ideas.

If you limit the pods that kube2iam is monitoring to the ones that exist on the same worker, then you only need to scale the number of goroutines to the max number of pods that can run on a given worker (this is what the implementation in the PR was doing).

This also means that you should only be making AWS STS API calls 1x per pod per cache duration.
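For what it's worth, here is a minimal sketch of that per-node filtering, assuming client-go informers (kube2iam's actual watch wiring may differ); NODE_NAME is assumed to be injected via the downward API:

```go
// Sketch only: watch just the pods scheduled on this node, so goroutine and
// cache counts scale with pods-per-node rather than pods-per-cluster.
package nodewatch

import (
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

func newNodeLocalPodInformers(client kubernetes.Interface) informers.SharedInformerFactory {
	nodeName := os.Getenv("NODE_NAME") // injected via the downward API
	return informers.NewSharedInformerFactoryWithOptions(
		client,
		30*time.Second, // resync period
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			// Only list/watch pods bound to this node.
			opts.FieldSelector = "spec.nodeName=" + nodeName
		}),
	)
}
```

Handlers for pod add/delete events would then be registered on the factory's Core().V1().Pods().Informer().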

If you wanted to further limit the number of API calls being made, you could share credentials for all pods on a given worker with the same IAM role, which, if your pods were actively using their AWS credentials, would be exactly the same as the status quo. However, that reduces your ability to trace API calls back to a given pod for audit purposes, which I think is undesirable.
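To make the difference concrete, here is a rough sketch of how only the cache key would change under that variant (placeholder code, not from the PR):

```go
// Sketch only: under the "share per role" variant the credential cache is
// keyed by role ARN, so every pod on the node that uses the same role shares
// one cached entry (and one STS call per refresh). Keeping the pod in the key
// preserves per-pod credentials, which keeps API calls traceable for auditing.
package rolecache

func cacheKey(roleARN, podUID string, sharePerRole bool) string {
	if sharePerRole {
		return roleARN // one entry per role on this node: fewer STS calls
	}
	return roleARN + "/" + podUID // one entry per pod: per-pod audit trail
}
```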

Summary

In general, I don't think the performance of this implementation should be any worse than the status quo, with the worst-case scenario being pods that consistently use their credentials.

Thoughts?

emmanuel commented 7 years ago

We are experiencing timeouts in pods being served by kube2iam. The current blocking approach on first request causes this latency. Since it is triggering timeouts, and our goal for kube2iam is transparency, this makes the current approach unworkable from my perspective. I'm not sure that the proposed solution is ideal, but it seems like a good trade-off to me. Namely, the combination of:

  1. preemptive fetch (kube2iam calls sts:AssumeRole as a pod is placed on a node)
  2. limiting the cache to the pods on the current node (and only those with the kube2iam annotation)
  3. preemptive refresh as late in the expiry window as reasonable (e.g., a few minutes before expiry)

Seems like together that would provide a reasonable balance between observed credential fetch latency, AWS API limits and kube2iam memory/CPU load.

That said, I'm likely missing something important ;). What am I missing?

schlomo commented 7 years ago

IMHO kube2iam must guarantee to return valid IAM credentials within 1 second, as this is the default timeout of most AWS SDKs. I built something similar for the data center (https://github.com/ImmobilienScout24/afp-alppaca), which was also engineered along these lines.

SharpEdgeMarshall commented 6 years ago

I just finished setting up kube2iam, and pods using boto3 restarted 3 times before getting credentials. Is there any workaround or news about this issue?

UPDATE: Found a temporary fix for boto: there are two env vars, AWS_METADATA_SERVICE_TIMEOUT and AWS_METADATA_SERVICE_NUM_ATTEMPTS.
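For anyone else hitting this, the workaround is to raise the timeout and retry count on the pods themselves, for example in the container env (the values below are only examples):

```yaml
# Example only: give botocore more time/attempts to reach the kube2iam
# metadata endpoint before it gives up on the provider chain.
env:
  - name: AWS_METADATA_SERVICE_TIMEOUT
    value: "5"   # seconds per attempt (botocore default is 1)
  - name: AWS_METADATA_SERVICE_NUM_ATTEMPTS
    value: "5"   # attempts before falling through the chain (default is 1)
```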

gwkunze commented 6 years ago

Ran into this problem as well, honestly I can't imagine anyone NOT running into this issue when using kube2iam for anything serious. Would a PR that implements the suggestions made by @marshallbrekka be acceptable?

jrnt30 commented 6 years ago

Yes, a PR implementing these features would be welcome. There are a few other nuances with how this is done now with the informer, which may change the implementation a bit, but something that meets the following:

chrisghill commented 5 years ago

Any update on this? It's been over a year and I'm running into this issue now too. Going to try updating the ENV variables SharpEdgeMarshall mentioned, but would prefer a more permanent fix.

gabrielvcbessa commented 4 years ago

@chrisghill Also facing this issue. Even with the ENV variables setted up is not reliable enough.

chrisghill commented 4 years ago

@gabrielvcbessa what language are you using? I think those ENV variables only work in boto3 (python). The PHP library doesn't support them (at least the version we use doesn't support them).