copumpkin commented 8 years ago

I've often wondered how I might bootstrap an AWS instance to authenticate to vault with no out-of-band manual intervention, by trusting AWS as an identity provider.

The existing app-id auth backend specifies that

> An out-of-band process run by security operators map unique user IDs to these app IDs. Example: when an instance is launched, a cloud-init system tells security operators a unique ID for this machine. This process can be scripted, but the key is that it is out-of-band and out of reach of configuration management. (Path: map/user-id/)

which can be painful for fully automated deployments, like in an AWS Autoscaling Group.

I'm wondering if we could use a simple scheme that relies on the oft-overlooked signed identity document provided by an EC2 instance's metadata server. It lives at http://169.254.169.254/latest/dynamic/instance-identity/pkcs7 and is a simple JSON document that includes the instance ID, some pieces of metadata, and a signature by a canonical AWS key (which is unfortunately remarkably hard to find, but can be obtained).

Since the document is signed once and there's no opportunity to inject a nonce into the document being signed, that document would need to be treated as a secret single-use authentication token. But if we do accept that, a brand new EC2 autoscaling instance could bootstrap itself into Vault in the following manner:

1) On first boot, the EC2 instance generates a standard asymmetric keypair.
2) The EC2 instance retrieves the pkcs7 signed identity document from the metadata service provided by the hypervisor.
3) The EC2 instance contacts the Vault server over an encrypted channel, passing it the identity document and the newly generated public key.
4) The Vault server verifies the AWS identity document signature, records/checks some metadata (e.g., launch time), and otherwise associates the public key with the instance. The Vault server should ensure that no identity document is used more than once.
5) All future Vault authentication from that instance is performed in standard ways using the generated public key.

Are my goals clear? Does this seem like a sensible way to achieve them?
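
For anyone following along, the client side of the steps above (fetch the document, bundle it with a fresh public key) could be sketched roughly as below. This is only a sketch under assumptions: the JSON field names `pkcs7` and `public_key` are made up for illustration, no Vault endpoint for this exists in the thread, and key generation is left to whatever tooling the instance already has.

```python
import json
import urllib.request

# Metadata endpoint quoted earlier in this comment.
PKCS7_URL = "http://169.254.169.254/latest/dynamic/instance-identity/pkcs7"


def fetch_pkcs7(url=PKCS7_URL, timeout=2.0):
    """Fetch the signed identity document from the instance metadata service."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def build_registration(pkcs7, public_key_pem):
    """Bundle the identity document and the new public key into one JSON body.

    The field names are hypothetical, chosen only for this illustration.
    """
    if not pkcs7 or not public_key_pem:
        raise ValueError("both the identity document and the public key are required")
    return json.dumps({"pkcs7": pkcs7, "public_key": public_key_pem}, sort_keys=True)
```

On a real instance the body would come from `build_registration(fetch_pkcs7(), public_key_pem)` and be sent to Vault over TLS; the private key never leaves the machine.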

jefferai commented 8 years ago

I like it. You may want to look at https://github.com/hashicorp/vault/pull/805 for reference as another stab at authentication via AWS (also be sure to look at my first comment for notes about timing w.r.t. developing a backend yourself and the underlying frameworks). One thing I like about this method is that as long as the AWS key is known, you don't even need an AWS token in order for a Vault backend to verify validity.

Some thoughts: 1) Ensuring that an identity document is only ever used once could be problematic. If there are timestamps in the metadata, you could set an upper limit on how long it would be valid for authentication, and likewise could then purge old identity documents rather than storage increasing in an unbounded way (which would eventually become problematic for the physical backend).

2) There'd need to be a way -- possibly defined via instance metadata -- to identify what permissions such a machine should get. If the instance metadata come with some kind of group identifier, that could map to e.g. roles in the backend.
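
To make point 1 concrete, here is a rough sketch of a single-use check that also purges old entries so storage stays bounded. Everything here is an assumption for illustration: the class name `SeenDocuments`, the in-memory dict (a real backend would use Vault's storage), and the idea that the caller passes in a launch timestamp taken from the already-verified document.

```python
import hashlib
import time


class SeenDocuments:
    """Remembers identity documents only for as long as they could be replayed."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self._expiry = {}  # document fingerprint -> time after which it is rejected anyway

    def purge(self, now):
        """Drop entries whose documents are already too old to be accepted."""
        self._expiry = {k: exp for k, exp in self._expiry.items() if exp > now}

    def accept(self, document, launched_at, now=None):
        """Return True only the first time a sufficiently recent document is seen."""
        now = time.time() if now is None else now
        self.purge(now)
        expires_at = launched_at + self.max_age
        if expires_at <= now:
            return False  # past the upper limit, no need to remember it
        fingerprint = hashlib.sha256(document).hexdigest()
        if fingerprint in self._expiry:
            return False  # replay within the validity window
        self._expiry[fingerprint] = expires_at
        return True
```

Because an expired document is rejected on age alone, its fingerprint can be forgotten, which is what keeps the stored set from growing without bound.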

copumpkin commented 8 years ago

Thanks for the quick response and the reference to the other scheme! I'll take a closer look.

> Ensuring that an identity document is only ever used once could be problematic. If there are timestamps in the metadata, you could set an upper limit on how long it would be valid for authentication, and likewise could then purge old identity documents rather than storage increasing in an unbounded way (which would eventually become problematic for the physical backend).

I'm not sure the "only once" aspect is essential for the security of the scheme, but feels like good practice. Enforcing a time limit since the machine spun up seems like a reasonable approximation to the original goal, and a less stateful one.

> There'd need to be a way -- possibly defined via instance metadata -- to identify what permissions such a machine should get. If the instance metadata come with some kind of group identifier, that could map to e.g. roles in the backend.

Good point. The document currently seems to contain information of the following form:

{
  "instanceId" : "i-c495bb93",
  "billingProducts" : [ "bp-xxx" ],
  "accountId" : "xxx",
  "imageId" : "ami-e80xxxx",
  "instanceType" : "c3.xlarge",
  "kernelId" : "aki-825ea7eb",
  "ramdiskId" : null,
  "pendingTime" : "2015-02-24T14:38:43Z",
  "architecture" : "x86_64",
  "region" : "us-east-1",
  "version" : "2010-08-31",
  "availabilityZone" : "us-east-1c",
  "privateIp" : "w.x.y.z",
  "devpayProductCodes" : null
}

(stolen from here)

Which likely wouldn't be much use without being able to cross-reference (via AWS API) against other properties of the instance.

I do have another (hackier) scheme that would tie a brand new instance to its IAM role (likely more interesting for auth purposes) while still retaining the "as long as the AWS key is known, you don't even need an AWS token in order for a Vault backend to verify validity" property you liked. I'd like to run a little further with this one, and then write up the other one if this proves too annoying.

jefferai commented 8 years ago

Don't worry too much about not needing an AWS key. It's a neat property, consuming AWS auth without a key, but practicality trumps academic interest :-)

fieldju commented 8 years ago

@copumpkin what are your thoughts about https://github.com/hashicorp/vault/pull/805? You may also want to check out https://github.com/hashicorp/vault/issues/406.

Some background reading: I believe #805 is based on http://ryandlane.com/blog/2015/06/16/custom-service-to-service-authentication-using-iamkms/, which was written by one of the devs for Confidant.

issacg commented 8 years ago

This seems so much simpler than #805, but the downside is that it still doesn't help to match the "user-id" (which seems to be what you're suggesting we use this to verify) with an "app-id", or am I missing something?

Regarding the time limit, AWS support just told me: "The 'pendingTime' date time value in the identity document represents when the instance was launched in a UTC format." So there's a date we can trust (because it's part of the signed document) to limit the timespan to some degree. I feel like that could possibly be coupled with the cubbyhole authentication model to further force a single token for a given instance-id in a single timeframe, too.
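
A minimal sketch of that time limit, assuming the document has already passed signature verification and that `pendingTime` always has the `2015-02-24T14:38:43Z` shape shown in the sample earlier in the thread. The function names and the ten-minute default are placeholders, not anything Vault defines.

```python
import json
from datetime import datetime, timedelta, timezone


def launch_time(document_json):
    """Parse pendingTime from the (already verified) identity document as UTC."""
    doc = json.loads(document_json)
    parsed = datetime.strptime(doc["pendingTime"], "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_fresh(document_json, now, max_age=timedelta(minutes=10)):
    """Accept the document only within max_age of launch; reject future timestamps."""
    age = now - launch_time(document_json)
    return timedelta(0) <= age <= max_age
```

This is the stateless variant: nothing is stored, so a document can be replayed inside the window, which is why pairing it with a one-token-per-instance rule is attractive.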

fieldju commented 8 years ago

In this model how would a user-id get mapped to a policy?

copumpkin commented 8 years ago

You'd need some sort of separate mapping layer on the Vault side that can take the instance ID and account ID (probably the most useful parts of the document for this purpose) and map them back to permissions that are meaningful to Vault. It's not beautiful, but should only need minimal EC2 read-only access on Vault's side to map the instance ID to e.g., the autoscaling group it came from, tags on the instance, or the instance profile attached to it. Those can then be used to inform the actual authorization decisions in a user-specified manner.

Sorry this is vague :smile:
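
To make it slightly less vague, the mapping layer might look something like the sketch below. All names are hypothetical: `lookup_group` stands in for whatever read-only EC2 query returns the autoscaling group, tag, or instance profile, and `group_policies` is a user-supplied table.

```python
def resolve_policies(document, lookup_group, group_policies, allowed_accounts):
    """Map a verified identity document to a list of Vault policy names.

    document         -- dict parsed from the verified identity document
    lookup_group     -- callable(instance_id) -> group name (stand-in for an EC2 lookup)
    group_policies   -- dict of group name -> list of policy names
    allowed_accounts -- set of AWS account IDs that may authenticate at all
    """
    if document.get("accountId") not in allowed_accounts:
        return []  # unknown account: no permissions
    group = lookup_group(document["instanceId"])
    return list(group_policies.get(group, []))
```

Keeping the lookup as an injected callable is a deliberate choice here: it leaves open whether Vault itself or some external service talks to EC2.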

issacg commented 8 years ago

@copumpkin where is the hard-to-find key located?

copumpkin commented 8 years ago

I had to contact AWS support for it (:open_mouth:) since I couldn't find it anywhere on the public internet. I asked them to update their docs and they said they would do that soon, but also asked me not to distribute it myself. It's a certificate, but it has no chain that's traceable back to a trusted root.

I'll ping them to see if they can publish it on the sooner side.

mostman1043 commented 8 years ago

This is all really great stuff. The one issue is that this method solves for server-based assets but not for serverless ones (Lambda, for example). The one thing about #805 is that you could imagine it working for something like Lambda.

issacg commented 8 years ago

I've really been mulling the idea of authenticating (specifically through auto-scaling groups) for a while (several months).

I think that at the end of the day, it boils down to what the Hashicorp folks mention in app-id: that there really needs to be an out-of-band process to decide who gets access and who doesn't. While there have been many ideas posted about how vault can do it (this one included), I don't think I'd want Vault to make the decision itself. Certainly, not by hard-coding a single method which needs to make API calls to Amazon or use a hardcoded secret embedded in Vault.

Also, at the end of the day, I want to design my setup to fit my needs as much as possible.

I might be more willing to go that path if/when Vault comes with an interface for external plugins, and the plugins can be managed out-of-band with my vault server.

If it's of any academic interest, I plan on using this document only as a means of verifying that the request is authenticated as coming from AWS, and then using AWS APIs to query the instance-data from the machine (which contains the chef runlist and environment - I don't use roles). Since that essentially fits the idea of the "userid" and "appid" respectively, based on that, I'll issue a token to Vault. Because I'm free to implement this any way I like, I can further secure it by checking if the machine is in my AWS account, in a VPC that makes sense, and even if the instance is registered with the auto-scaling group it's supposed to be in. I plan on doing this externally to Vault.

issacg commented 8 years ago

Also, for academic interest, I got the following response from Amazon support yesterday:

> I have received confirmation that you may share this public key outside your company.
>
> The documentation team has been made aware of this and they will be publishing this information in a future revision of our docs (they did not give an ETA, but it should be added soon).
>
> Please do let me know if I can do anything else to assist.
>
> Best regards,
>
> Michael M.
> Amazon Web Services

Based on that, here's the public key needed to make this all work:

copumpkin commented 8 years ago

@issacg great, thanks for releasing that! I wonder why my support person said I couldn't, but yours said you could :confused: maybe they like you better!

Anyway, this seems pretty straightforward to implement the basic idea for now. I don't think this has to be very complicated or need much external help. I also don't see much difference in teaching Vault how to speak to EC2 to query which IAM role a given instance ID is in vs. having it call out to someone else that can do the same thing.

To expand a bit:

1) The "AWS IAM role/identity document auth backend" allows us to map Vault policies to IAM role ARNs (which would be restricted via PassRole powers inside AWS).
2) Instance boots up, retrieves the identity document, and sends a registration message to Vault containing the signed identity document.
3) Vault verifies the document with the key above, maps the instance ID to its associated IAM role, and issues the node a token that's attached to the policy associated with that role.
4) The instance now uses that token going forward to speak to Vault, and gets the powers associated with its ARN.
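
As a rough server-side sketch of that flow (not how any actual Vault backend is written), assuming signature verification and the EC2 role lookup are supplied as callables. The names `login`, `verify`, `role_for_instance`, and `role_policies` are all invented for this illustration.

```python
import secrets


def login(pkcs7, verify, role_for_instance, role_policies):
    """Verify the document, map instance -> IAM role ARN -> policies, issue a token.

    verify            -- callable(pkcs7) -> dict; raises if the signature is bad
    role_for_instance -- callable(instance_id) -> IAM role ARN (stand-in for an EC2 query)
    role_policies     -- dict of role ARN -> list of Vault policy names
    """
    document = verify(pkcs7)
    role_arn = role_for_instance(document["instanceId"])
    policies = role_policies.get(role_arn)
    if not policies:
        raise PermissionError(f"no Vault policy mapped to {role_arn}")
    # A random opaque token stands in for whatever Vault would really issue.
    return {"token": secrets.token_urlsafe(32), "policies": list(policies)}
```

The only AWS-side control in this picture is who may pass which role to an instance; everything after that is a table lookup.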

issacg commented 8 years ago

My use-case doesn't map the IAM role directly.

There are a lot of folks with a lot of use-cases. Once plugins are opened, it would make sense to add these as plugins, but I still (personally) don't think this should be part of "core" Vault, so that people aren't forced to structure their AWS setup in one particular way.

copumpkin commented 8 years ago

@issacg just so I can better understand, what other mechanism would you want to use to automatically map your ASG nodes to a policy in vault? The only reason I'm going with IAM role is that it allows me to control access to it on the AWS side.

jefferai commented 8 years ago

Just wanted to point people to #948 if they didn't see the reference, as more food for thought.

issacg commented 8 years ago

@copumpkin anything is controllable from the IAM side, based on what permissions you give your users. IAM is no safer than anything else - at the end of the day, any person or machine with the ability to launch an instance that works will be able to set the identifying data - if your instance needs AWS permissions, then you'll allow any operator authorized to launch machines the PassRole permissions.

Anyway, currently I'm looking at instance data which is more flexible (for me) than the role (roles are more shared in my setup)

copumpkin commented 8 years ago

@issacg my only point is that on the AWS side, I can't meaningfully restrict some users from spinning up instances with certain tags or metadata. If someone has RunInstances powers, they can trivially set a VaultRole = "Admin" tag, or put something equivalent to that in user-data.

What's unique (to me at least, in a federated environment where there are multiple IAM users with different levels of power) about the IAM role is that it requires PassRole. So I can give IAM user JoeShmoe the power to PassRole a VaultAdmin role to an instance which gives that instance the power to do fancier things on Vault. IAM user BobLoblaw can also spin up instances but I haven't granted him the power to use VaultAdmin, so BobLoblaw is effectively not able to make machines that have elevated access to Vault.

Does that make sense? I don't think the IAM language in AWS is powerful enough to say "BobLoblaw can only create instances if he applies certain tags to them" (yet?).

P.S: in practice, managed policies might be a better fit for this sort of thing, but also add complexity that a first iteration of the idea wouldn't want.
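
For illustration, the grant JoeShmoe would need is an IAM policy allowing `iam:PassRole` on the VaultAdmin role, and BobLoblaw simply wouldn't have it. A small sketch that builds such a policy document; the account ID is a placeholder and the helper name is made up.

```python
import json


def pass_role_policy(account_id, role_name):
    """Build an IAM policy document allowing iam:PassRole for a single role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": f"arn:aws:iam::{account_id}:role/{role_name}",
            }
        ],
    }


# Placeholder account ID; attach the result to the IAM user who may launch
# instances carrying the VaultAdmin role.
print(json.dumps(pass_role_policy("123456789012", "VaultAdmin"), indent=2))
```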

jefferai commented 8 years ago

> BobLoblaw

@copumpkin just won the debate, show's over folks

(edit: no, not seriously)

issacg commented 8 years ago

I understood that, but that only works if VaultAdmin role isn't actually needed by the EC2 instance for anything other than authenticating to vault. Since you can only have a single role on EC2, it creates a bit of a problem if you want to use the Role for something else, for example S3 or CloudWatch. In that case, JoeShmoe needs to PassRole to VaultAdmin anyway...

I suppose you could avoid it by using the Role exclusively for Vault, and then using Vault to get AWS credentials to actually do anything with S3, but it seems a bit wasteful, IMHO

copumpkin commented 8 years ago

@issacg that's what I was saying about managed policies (the limit is now 10 per role, which seems sufficient), but yes, you're right. It's not ideal to use the role directly, which then forces me to go back to needing to query EC2 even in #948. Why can't these things just be simple? :smile:

issacg commented 8 years ago

:)

I had the same thought when I realized I wasn't going to get my user-data signed by AWS, and would need to fetch it by querying EC2 here.

mostman1043 commented 8 years ago

This is narrowing in on the solution we ended up with (which isn't ideal, but works). We knew we wanted to use IAM roles to secure access, so what we built is an out-of-band token management system that is protected via IAM roles. There are a bunch of ways you could go about doing this; we landed on combining the Vault token auth system with IAM-protected S3/Dynamo storage with at-rest encryption. Think of it as manual #805 :)

Basically, in order to get into the "token" store you'll need access to those OOB resources, which means you will need to have a specific managed policy attached to the Instance/Lambda Function/etc that you are operating.

issacg commented 8 years ago

Not to stir up an anthill here, but I wanted to update that the EC2 identity document keys are now in the official AWS documentation: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-identity-documents.html
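
With the certificate published, checking a document by hand could be sketched as below. Assumptions to flag: this shells out to the `openssl smime -verify` command rather than using any library, the PEM header/footer wrapping is my understanding of what the raw metadata response needs, and `cert_path` is wherever you saved the AWS certificate. Double-check the exact invocation against the linked AWS page before relying on it.

```python
import subprocess
import tempfile


def to_pem(pkcs7_body):
    """Wrap the raw base64 body from the metadata service in PKCS7 PEM markers."""
    return "-----BEGIN PKCS7-----\n" + pkcs7_body.strip() + "\n-----END PKCS7-----\n"


def verify_command(pkcs7_path, cert_path):
    """Build the openssl argv; -noverify skips chain validation because, as noted
    earlier in the thread, the AWS certificate has no chain to a trusted root."""
    return ["openssl", "smime", "-verify", "-in", pkcs7_path, "-inform", "PEM",
            "-certfile", cert_path, "-noverify"]


def verify(pkcs7_body, cert_path):
    """Return the signed JSON bytes, or raise CalledProcessError if openssl rejects it."""
    with tempfile.NamedTemporaryFile("w", suffix=".pem") as handle:
        handle.write(to_pem(pkcs7_body))
        handle.flush()
        result = subprocess.run(verify_command(handle.name, cert_path),
                                capture_output=True, check=True)
    return result.stdout
```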

jefferai commented 8 years ago

#1300 contains the backend that will be going into Vault as the official AWS auth backend. Closing, but I want to be clear that the final design was heavily inspired by all of the discussion around the various possibilities and we are very much appreciative of your efforts!

ewdurbin commented 8 years ago

I just wanted to return to this thread to thank all involved in bringing this into vault. I had been watching patiently for some time and the discussion/design around this feature was both enlightening and came to a very good end. Getting to delete a bunch of custom code and SQS/SNS complexity to obtain the same outcome has really simplified our lives over here :).

vishalnayak commented 8 years ago

@ewdurbin Happy to know that you find it useful!

cedws commented 1 year ago

I've asked AWS's security team if they would consider adding a timestamp or nonce value to the Instance Identity Document so there could be some kind of expiry but they basically told me no. I see that Vault does take some measures to prevent a signed document being maliciously used but to me they don't feel like enough.

hashicorp / vault

AWS identity document auth backend #828