travis-ci / cyclist

AWS ASG lifecycle thing :tada: :bicyclist:
MIT License
8 stars 2 forks source link

Implement per-instance authz #19

Closed meatballhat closed 8 years ago

meatballhat commented 8 years ago

I'm uncomfortable with long-lived auth tokens that are used by multiple EC2 instances. I would like to implement per-instance authz that goes kinda like:

meatballhat commented 8 years ago

ftr: AFAICT, this may be an OK fit for JWT, but I'm not seeing how JWT+HMAC is necessarily a better fit than throwing auth tokens around that are stored server-side in cyclist, at least with the way cyclist is currently deployed...

igorwwwwwwwwwwwwwwwwwwww commented 8 years ago

Looks sensible. Can you elaborate a little on the threat model and what kind of leakage/escalation you expect?

meatballhat commented 8 years ago

@igorwwwwwwwwwwwwwwwwwwww Sure thing!

Currently, cyclist effectively contains only two levels of privilege: guest and superuser. There are a few routes that are guest-readable and a few routes that require superuser authz, with the latter being mutative. The impact of a malicious request being made to one of the mutative routes is relatively low, given that said routes discard the request body. The worst case scenario I can think of is that an attacker could force the capacity of an autoscaling group to stay at effectively zero, which would be something of a build capacity DoS.

With that in mind, there are superuser tokens being baked into the user-data for every EC2 instance running a worker configured with start and stop hook scripts. Given that we've had incidents in the past where the instance metadata API was readable by untrusted docker containers running on such instances, implementing something like the per-instance auth above would reduce the severity of any future data leaks.

meatballhat commented 8 years ago

More on reducing severity of data leaks: If the above were implemented, then the leak of an instance token would allow an attacker to cause the instance to gracefully terminate "early", which would happen once, after which the leaked token would be unusable.

sarahhodne commented 8 years ago

This looks kinda similar to Vault's response wrapping, so we might be able to borrow some ideas/implementation details from them.

Two thing that comes to mind in relation to that:

Also, this might already be a part of the plan, but I didn't see it mentioned above: The only thing the shared token should be able to do is retrieve the token for a given instance ID?

meatballhat commented 8 years ago

This looks kinda similar to Vault's response wrapping, so we might be able to borrow some ideas/implementation details from them.

Yes!

  • With the suggested steps, it looks like it's theoretically possible for multiple places to get the instance token. If someone has the shared token, and knows the instance ID of an instance that is booting, they could get the instance token. Another option is to only allow retrieval of the token once, but that has to be weighed against the possibility of a network issue occurring and the booting host not getting the token, and then needing to shut down and a new instance to be spun up in its place.

Yes, it is theoretically possible for multiple places to the the instance token. My intent (poorly described in the description) is for the token retrieval route to respond successfully until a separate instance token confirmation request has been received. I'm not sure what you mean about needing to shut down and a new instance spun up in its place.

  • Should the generated instance token expire if it hasn't been fetched in a certain amount of time? I guess that means the instance didn't boot for some reason, so if someone gets the token it won't be the end of the world (since all they can do is shut down the instance, which won't work since it isn't running?), but it might be nice for keeping the datastore clean.

Yes! I would like the instance tokens to have a configurable expiry (~300s?), with every successful request (including heartbeats) extending the TTL.

Also, this might already be a part of the plan, but I didn't see it mentioned above: The only thing the shared token should be able to do is retrieve the token for a given instance ID?

Yes, I edited the description to make that more clear :heart:

sarahhodne commented 8 years ago
  • With the suggested steps, it looks like it's theoretically possible for multiple places to get the instance token. If someone has the shared token, and knows the instance ID of an instance that is booting, they could get the instance token. Another option is to only allow retrieval of the token once, but that has to be weighed against the possibility of a network issue occurring and the booting host not getting the token, and then needing to shut down and a new instance to be spun up in its place.

Yes, it is theoretically possible for multiple places to the the instance token. My intent (poorly described in the description) is for the token retrieval route to respond successfully until a separate instance token confirmation request has been received. I'm not sure what you mean about needing to shut down and a new instance spun up in its place.

What I meant about shutting down and spinning up a new instance is that we would want to balance making the token distribution more secure (by only allowing the token to be fetched once) against the downside of an instance needing to be replaced if a network issue happened and therefore not having a valid token and being unable to get it (since cyclist already thinks it has been received).

My understanding is that all you can do with a single instance token is shut down the instance it's attached to? If so, it's probably not terrible™ if one token leaks, but if someone is able to retrieve many tokens they could potentially DoS the system by shutting down a bunch of instances. As far as I can tell, that would require having the shared token, and guessing instance IDs early on in the booting process, and doing this over time since there's only a short time window per instance.

Using some numbers to estimate how unlikely it is to get an instance ID. Say it takes two minutes from the instance token is generated until the instance has gotten to the state where it gets the token (my understanding is that this happens shortly after boot, and also I think boots are shorter than two minutes, so this should be "rounded up"). Instance IDs are currently 8 hexadecimal characters, which makes roughly 4.3 billion possibilities. Spreading a brute force out against 2 minutes means you need to try about 36 million instance IDs per second to try them all within 2 minutes. Let's say cyclist is fast enough for you to be able to try 10,000 IDs per second. That means you have about a 0.03% chance of getting a valid instance ID in the 2-minute window. Taking into account our average instance lifetime and how often we scale up/down, I seem to get a <1% chance of being able to brute force the instance ID of one instance before it shuts down.

TL;DR We probably don't need to worry about people brute forcing instance IDs to get tokens at our current scale.

  • Should the generated instance token expire if it hasn't been fetched in a certain amount of time? I guess that means the instance didn't boot for some reason, so if someone gets the token it won't be the end of the world (since all they can do is shut down the instance, which won't work since it isn't running?), but it might be nice for keeping the datastore clean.

Yes! I would like the instance tokens to have a configurable expiry (~300s?), with every successful request (including heartbeats) extending the TTL.

I was mostly thinking of removing tokens that haven't even been "fetched", but if we can auto-remove tokens for instances that have stopped responding too that's even better. We'd probably want to limit how long the token is fetchable too, even if it's being used, just in case the "I got the token" message from the instance to cyclist doesn't come through for some reason?

meatballhat commented 8 years ago

@henrikhodne Thank you for such lovely feedback!

I like the idea of a TTL on the instance token fetch window, as this removes the need for a confirmation response. IIUC this would mean a change in the description like:

@@ description @@
  - the first request we expect from the launching instance is auth'd with a long-lived shared token *specifically intended for granting instance tokens*, at which time the instance token is provided
- - the instance uses the instance token to make another request, confirming receipt of the instance token, which then makes the instance token inaccessible through the shared token route
  - the instance is expected to persist the instance token locally and use its presence as an indicator that the shared token steps should be skipped in the future
  - all remaining requests for instance specific routes will only succeed if auth'd with instance token
sarahhodne commented 8 years ago

I like the idea of a TTL on the instance token fetch window, as this removes the need for a confirmation response. IIUC this would mean a change in the description like:

Good point! Didn't think of that. I don't see a downside to removing that step, so long as the token fetch window isn't excessively long. 5 minutes should be fine.