Add a gevent pool for refreshing STS assumed credentials

lyft / metadataproxy

A proxy for AWS's metadata service that gives out scoped IAM credentials from STS


Add a gevent pool for refreshing STS assumed credentials #1

Closed ryan-lane closed 3 years ago

ryan-lane commented 8 years ago

The metadata proxy knows when IAM credentials are about to expire. We should add a gevent pool that runs periodically, checks whether any credentials need to be renewed, and renews them before they expire. The goal is to remove the STS assume-role call from the critical path of the application, since that call can be a bit slow.
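A minimal sketch of the refresh pass proposed here, not metadataproxy's actual code. The names `CachedCredentials`, `refresh_expiring`, `refresh_forever`, and `assume_role`, and the 300-second margin, are assumptions made for illustration. The pass is a plain function so it can be scheduled in whatever way the application prefers.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class CachedCredentials:
    """Hypothetical cache entry: a credential payload plus its expiry (epoch seconds)."""
    role: str
    payload: dict
    expires_at: float


def refresh_expiring(
    cache: Dict[str, CachedCredentials],
    assume_role: Callable[[str], CachedCredentials],
    margin: float = 300.0,
    now: Optional[float] = None,
) -> List[str]:
    """Renew every cached role that expires within `margin` seconds.

    `assume_role` stands in for the slow STS call. Returns the renewed roles.
    """
    now = time.time() if now is None else now
    renewed = []
    for role, entry in list(cache.items()):
        if entry.expires_at - now <= margin:
            try:
                cache[role] = assume_role(role)
                renewed.append(role)
            except Exception:
                # Keep the old entry; it stays usable until it actually expires.
                continue
    return renewed


def refresh_forever(cache, assume_role, interval=60.0, sleep=time.sleep):
    """Run the refresh pass every `interval` seconds; never returns."""
    while True:
        refresh_expiring(cache, assume_role)
        sleep(interval)
```

Under gevent, one option would be `gevent.spawn(refresh_forever, cache, assume_role, sleep=gevent.sleep)`, so the loop yields to request-serving greenlets between passes.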

schlomo commented 8 years ago

This is a very nice project!

When creating our Amazon Federation Proxy, which provides IAM credentials to all servers in an on-premise data center (and people too), we noticed that most AWS SDKs assume that they get IAM credentials from the EC2 metadata service within 1 second.

To comply with this requirement we created afp-alppaca as a sidecar service, which is very similar to your metadataproxy. However, it implements pre-fetching and caching of the IAM credentials in order to:

- guarantee a valid credential response within 1 second
- allow the backend server, afp-core, to have downtime; if the downtime is less than ~30 minutes, nobody is affected by it.

Maybe you can copy some ideas or code from there to solve this issue.

ryan-lane commented 8 years ago

The code does currently cache credentials, so once a role is fetched via STS it'll return well within 1s. Otherwise, this issue describes the same thing you describe: when credentials are about to expire, the proxy itself should renew them so that containers don't need to refetch.

Prefetching is... difficult, because you don't know what roles you'll need to fetch. If you know ahead of time which roles are going to be needed, it's definitely possible as an end-user to prewarm the cache by running a few containers that just curl the IAM endpoints before starting any other containers.
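As a rough illustration of the prewarming idea, the sketch below does in Python what the curl container would do: it asks the metadata endpoint for the role name, then for that role's credentials, using a generous timeout so the slow first STS call can finish. The function names `http_get` and `prewarm` and the 30-second timeout are assumptions; the URL is the standard EC2 metadata path, which the proxy is expected to intercept.

```python
import json
import urllib.request
from typing import Callable

# Standard EC2 instance metadata path for IAM role credentials.
BASE_URL = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"


def http_get(url: str, timeout: float = 30.0) -> str:
    """Fetch a URL with a generous timeout, since the first STS call can be slow."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def prewarm(get: Callable[[str], str] = http_get, base_url: str = BASE_URL) -> str:
    """Request the role name, then its credentials; return the expiry timestamp."""
    role = get(base_url).strip()
    creds = json.loads(get(base_url + role))
    return creds["Expiration"]
```

Calling `prewarm()` once from a throwaway container that carries the same role configuration should leave the role cached before the real workload starts.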

schlomo commented 8 years ago

The 1-second limit will hit you exactly on the first request. That is also the request by which the SDK decides whether to use EC2 metadata service credentials or to try other sources of credentials. I am not sure the SDKs would make that check and decision more than once if the first attempt failed.

IMHO one can assume that the target-role does not change at run-time. In our use-case the target role depends on the IP of a server (similar to your Docker IP lookup). In your case the IAM_ROLE environment variable should stay the same while a Docker container runs.

I guess you could iterate over all the Docker containers and fetch the IAM credentials for them even before a container asks for credentials. If it asks for them you can reply from the cache. That approach would also allow you to skip the IP lookup on access as you could rely on the cached data from when you iterated over all containers.
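The iteration suggested here could look roughly like the sketch below. It assumes each container is available as a dict shaped like `docker inspect` output (`Config.Env` and `NetworkSettings.IPAddress`), and that `assume_role` is a hypothetical stand-in for the STS call; `role_from_env` and `prewarm_by_ip` are illustrative names, not part of metadataproxy.

```python
from typing import Callable, Dict, Iterable, List, Optional


def role_from_env(env: Iterable[str]) -> Optional[str]:
    """Extract IAM_ROLE from a list of "KEY=value" strings."""
    for item in env:
        key, _, value = item.partition("=")
        if key == "IAM_ROLE" and value:
            return value
    return None


def prewarm_by_ip(
    containers: List[dict],
    assume_role: Callable[[str], dict],
) -> Dict[str, dict]:
    """Build an IP -> credentials cache for every container that sets IAM_ROLE."""
    cache: Dict[str, dict] = {}
    by_role: Dict[str, dict] = {}
    for container in containers:
        # Config.Env can be null in inspect output, hence the `or []`.
        role = role_from_env(container.get("Config", {}).get("Env") or [])
        ip = container.get("NetworkSettings", {}).get("IPAddress")
        if not role or not ip:
            continue
        if role not in by_role:
            # Call STS once per distinct role, not once per container.
            by_role[role] = assume_role(role)
        cache[ip] = by_role[role]
    return cache
```

Because the cache is keyed by IP, a later credentials request can be answered from the request's source address alone. Containers on user-defined networks report their address under `NetworkSettings.Networks` instead, which this sketch does not handle.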

willglynn commented 8 years ago

swipely/iam-docker follows the Docker event stream to observe container creation so it can fetch credentials before the first credentials request arrives; see docker/event_handler.go. The credential store additionally retains credentials and refreshes prior to expiration.
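The event-driven approach can be sketched in Python as follows. This is not iam-docker's code (which is Go); it only illustrates the idea, assuming events arrive as dicts with the `Type`, `Action`, and `Actor.ID` fields that the Docker events API emits. `inspect` and `assume_role` are hypothetical callables, and the `IAM_ROLE` variable follows the metadataproxy convention mentioned earlier in this thread.

```python
from typing import Callable, Dict, Iterable


def handle_events(
    events: Iterable[dict],
    inspect: Callable[[str], dict],
    assume_role: Callable[[str], dict],
    cache: Dict[str, dict],
) -> None:
    """Prefetch credentials when a container starts; drop them when it dies."""
    for event in events:
        if event.get("Type") != "container":
            continue
        container_id = event.get("Actor", {}).get("ID")
        action = event.get("Action")
        if action == "start":
            # The event itself does not carry the environment, so look it up.
            env = inspect(container_id).get("Config", {}).get("Env") or []
            roles = [e.split("=", 1)[1] for e in env if e.startswith("IAM_ROLE=")]
            if roles and roles[0]:
                cache[container_id] = assume_role(roles[0])
        elif action == "die":
            cache.pop(container_id, None)
```

With the Docker SDK for Python, `docker.from_env().events(decode=True)` should yield dicts of this shape, so it could be passed as `events`; a periodic refresh pass would still be needed to renew entries before they expire.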

ryan-lane commented 8 years ago

Ah. Interesting. Nice. I'll have to apply that.

brandond commented 6 years ago

I've been running into this issue lately - CI jobs failing about 20% of the time due to the initial credential request taking longer than 1 second. I can add an initial request to prewarm the cache, but it'd be nice if metadataproxy would handle that on its own.

dschaller commented 3 years ago

Thank you for your contribution to this repository.

Closing this contribution as this repository is being archived.