hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Better documentation on configuring TLS #764

Closed blalor closed 8 years ago

blalor commented 8 years ago

I've really gone down the rabbit hole trying to get TLS configured with Vault. Given the importance of end-to-end encryption for the Vault service, I think the lack of documentation is a huge impediment.

So far I've determined the following client → server paths for talking to Vault, and the addresses used to make the connection, given an HA setup with a load balancer:

A CA will not grant a certificate that contains a non-public hostname, so it is impossible to buy a cert from a company like Trustwave that contains all of the above hostnames. My current plan -- which includes allowing public access to my Vault instance -- is to use an ELB in front of the Vault instances and have the ELB terminate SSL (which is somewhat insecure as the ELB then has access to the decrypted data). The Vault instances will then use certs from an internal CA that will have CN and SAN entries for localhost, vault.service.consul, and whatever pattern I decide on for the advertise_addr.
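To make that concrete, the Vault-side config I have in mind is roughly the following sketch (paths and hostnames are placeholders, and I'm assuming advertise_addr is set in the consul backend stanza):

```hcl
listener "tcp" {
  address       = "0.0.0.0:8200"
  # Cert and key issued by the internal CA; the SAN list needs to cover
  # localhost, vault.service.consul, and the advertise_addr hostname.
  tls_cert_file = "/etc/vault/tls/vault-internal.crt"
  tls_key_file  = "/etc/vault/tls/vault-internal.key"
}

backend "consul" {
  address        = "127.0.0.1:8500"
  path           = "vault"
  # Standbys hand this exact URL to clients, so the hostname must appear
  # in the certificate and the scheme must be https.
  advertise_addr = "https://vault01.example.internal:8200"
}
```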

It would be good to acknowledge the complexities of this, including guidance on how to create the certificates to address all the communication paths.

jefferai commented 8 years ago

Hi @blalor,

Given that you are allowing public access to your Vault instance(s), why not simply have your internal clients use the public address as well? Then you can use commercial CA certificates as you desire.

In addition, if you set it up that way, you don't need to use a load balancer (which as you've said you don't want to do).

Also, there are load balancers that do not need to decrypt the traffic going through them. HAProxy, for instance, can perform purely TCP-based proxying while still using SNI for routing.

blalor commented 8 years ago

Using the public address doesn't solve the problem of request forwarding from a standby Vault instance to the active instance. There's also the issue of health checks, which need to target the local instance.

ELBs support TCP pass-through mode, too.

My partial solution right now is to use a certificate with SAN entries for the public FQDN and wildcard entries for an internal -- but globally unique -- domain. If we don't use vault.service.consul that should be ok.

jefferai commented 8 years ago

Using the public address doesn't solve the problem of request forwarding from a standby Vault instance to the active instance.

You can specify what address is given to the client for forwarding. So you can use the public addresses there, as well.

There's also the issue of health checks, which need to target the local instance.

These can also use public addresses.

blalor commented 8 years ago

My point here was that there are some finer points to getting TLS configured properly, and there's no guidance given in the documentation. Pointing out the importance of setting advertise_addr correctly -- including scheme -- would be a huge step forward.
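Even something this small in the docs would help (hostname is a placeholder; this goes in the HA backend stanza):

```hcl
# advertise_addr is a full URL; the scheme is part of it. An "http://"
# value here redirects clients to plain HTTP even when the listener is
# serving TLS, so spell out "https://".
advertise_addr = "https://vault.example.com:8200"
```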

dansteen commented 8 years ago

I agree with this. It's not that there aren't solutions, it's just that it's not obvious what you are going to have to do going in. In our case, we are not trying to provide public access, but it's still difficult since each box has its own name, and the SSL certificate needs to match. We could use just the IP addresses, but then we need to add IP SANs to the certificate, and not all CAs will allow that (I actually have not found one that will yet, though I'm sure there are some). So we are left with the options of using a wildcard cert or generating certificates for each box. The problem with using a wildcard cert is that when you forward requests from standby vaults to active vaults, the standby has to use the name of the server itself (we can't use the IP or we are back to the same issue above). Once that happens, if the boxes are not all in the same subdomain, you will need more than one wildcard cert.

One alternative is to use a cert for each box. However, in a dynamic environment, this becomes complex since you need to generate all those certificates dynamically and manage all the CA infrastructure involved. I realize that vault can handle that for us, but vault is not up yet....

In the end, I wound up creating an ELB for the Vault servers and pointing its health check at /v1/sys/health. On the active box this returns a 200 HTTP code, and on the standby boxes it returns a 429. In this way, I have a single URL that always points to the active Vault server, and a single cert on all Vault servers. I then set the value of "advertise_addr" to that URL, and everything works fine. This is, however, completely bypassing the entire "forwarding" functionality of Vault in order to get this - fairly simple - config to work out.
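For reference, the Vault side of that is just the advertise_addr pointing at the ELB name (placeholder below); the 200-vs-429 routing is all done by the ELB's own health check against /v1/sys/health:

```hcl
backend "consul" {
  address        = "127.0.0.1:8500"
  # The ELB health-checks /v1/sys/health on every instance and keeps only
  # the node answering 200 (the active one) in service, so this URL always
  # reaches the active server.
  advertise_addr = "https://vault-elb.example.com"
}
```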

All of this was done via fairly frustrating trial and error, and while it works fine, it is clearly not how you guys intended things to be used. So I definitely agree with @blalor that some sort of supplement to the existing docs would be useful to figure out how this is supposed to be set up, and how it can work in dynamic environments.

jefferai commented 8 years ago

@dansteen Thanks for sharing. I agree that some documentation with common usage schemes would be nice (although in the interim the mailing list is usually quite helpful). Some comments:

One alternative is to use a cert for each box. However, in a dynamic environment, this becomes complex since you need to generate all those certificates dynamically and manage all the CA infrastructure involved. I realize that vault can handle that for us, but vault is not up yet....

I know that this is rather contrary to the way that people usually want to manage services these days but we generally recommend that Vault not be dynamically managed (auto-scaling group, etc). There are two main reasons:

  1. Unsealing is a manual process (we've yet to see a scheme that unseals it automatically without simply shifting the security issue elsewhere), so having Vault servers go up and down dynamically creates a headache for unsealing
  2. Keeping the Vault configuration static helps with bootstrapping the rest of the dynamic environment with secrets

In the end, I wound up creating an ELB for the Vault servers and pointing its health check at /v1/sys/health. On the active box this returns a 200 HTTP code, and on the standby boxes it returns a 429. In this way, I have a single URL that always points to the active Vault server, and a single cert on all Vault servers. I then set the value of "advertise_addr" to that URL, and everything works fine. This is, however, completely bypassing the entire "forwarding" functionality of Vault in order to get this - fairly simple - config to work out.

Actually -- this is bypassing the forwarding functionality, but it's one of our recommended approaches! The reason is that the forwarding is only a convenience to point clients to the active node rather than return an error (e.g. "sorry, standby node"). If you have a load balancer in front that performs that function by directing TCP or HTTPS streams on the fly, your standbys won't be hit in the first place. That doesn't mean you shouldn't configure advertise_addr, in case e.g. internal clients access a server directly. But this is definitely not a strange use case; for those that have stricter certificate requirements, and even for many users that don't, it's a supported and recommended approach. It is not at all "clearly not how you guys intended things to be used"!

dansteen commented 8 years ago

Ok. Good to know that is more or less what you guys had in mind.

I know that this is rather contrary to the way that people usually want to manage services these days but we generally recommend that Vault not be dynamically managed (auto-scaling group, etc).

I definitely understand this consideration. However, I would still like the Vault servers to be built using the same mechanism as everything else in our infrastructure (Chef in our case). What I have wound up doing is storing some tokens / secrets in data bags for testing purposes, and then checking for those when the cookbooks build the servers. If it finds secrets for the environment, it starts up Vault and does its thing. Otherwise it skips the Vault management Chef resources until the vault is unsealed.

This way, I can still build the Vault servers using Chef, and there is just the manual step of unsealing the vault, which does not cause any Chef failures while waiting for that to happen.

In any event, I think Vault solves a huge number of problems for us, and it's definitely the way we are going. At this point, however, there are definitely some growing pains in migrating. On the other hand, it's only v0.4 so that's to be expected! :-)

jefferai commented 8 years ago

It's 0.5 now :-D There are a huge number of improvements, see https://www.hashicorp.com/blog/vault-0.5.html

rmt commented 8 years ago

How about adding an environment variable called VAULT_TLS_SERVER_NAME to api/client.go?

If set, it would set clientTLSConfig.ServerName to that value.

margueritepd commented 8 years ago

After reading this discussion, I'm still unsure how to use Consul's service discovery with a TLS-enabled Vault server whose certificate is obtained from a public CA. Is the recommended approach not to query Consul's DNS for vault.service.consul, and instead to use a load balancer to do health checks and route to the master?

margueritepd commented 8 years ago

Hmm, looks like I can override the domain config for Consul. Then I can register a cert for vault.service.consul.<domain> and query that through Consul's DNS interface.
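In case it helps anyone else, the relevant piece is the Consul agent's domain setting (the domain below is a placeholder; older agents take the same option in JSON config):

```hcl
# Consul agent config: replace the default "consul." DNS domain with one
# we own, so the service name becomes vault.service.consul.example.com
# and a public CA can issue a cert for it.
domain = "consul.example.com"
```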

jefferai commented 8 years ago

That's certainly one way to do it!