gRPC should use acl_token implicitly

dmwilcox commented 5 years ago

Feature Description

Unlike HTTP requests which will implicitly use the 'acl_token' defined in a Consul clients config -- gRPC must explicitly provide 'x-consul-token'. This appears to be the case only in clusters with default ACL policy "deny".

Currently known to affect 1.3.1.

Use Case(s)

When running an Envoy container as a Consul Connect side-car -- providing a Consul token means putting it in a job file (not ideal), in the Consul service definition (also bad), or hard-coding it in the container (much worse).

If gRPC acted like HTTP and could use 'acl_token' when a query showed up at the localhost sockets token handling would be much safer. Per host tokens are extremely inconvenient currently as they must be defined in a job file/etc.

banks commented 5 years ago

@dmwilcox thanks for this.

The reason it doesn't work the same way is that gRPC is only used for Envoy's discovery API. In general using agent's default token for Connect is insecure - it's only secure if you have a one to one mapping of agent instances and service instances. Any other way grants any process that can talk to an agent the ability to act as the service who's ACL is used as acl_token.

While agent-per-service instance might be valid if you run traditional VMs with only one logical thing on each (this is the case that acl_token is useful in general). We don't see that being a good pattern for schedulers like Kube and Nomad because you then have to have a Consul agent inside every pod.

If the agent is outside the pod then the proposal you made is completely insecure of course - you'd get all pods on the host sharing the same ACL token regardless of service which means one token needs access to all (and in practice that means every pod can act as any other pod they happpen to be scheduled with).

providing a Consul token means putting it in a job file (not ideal), in the Consul service definition (also bad), or hard-coding it in the container (much worse).

I don't see that as the exhaustive list of options. Fundamentally every pod needs some kind of secret (ACL token) in order to authorize it to claim to be a particular service to Consul. Somehow you have to get that secret into the pod. From that point it's up to the pod's init process etc to get that secret (say it's in ENV) into the right places (e.g. the Envoy bootstrap).

We are working on changes and tooling that will make all of this much easier to do in k8s, but fundamentally you still need to have the secret from the pod to authorize or the system is not secure.

in the Consul service definition (also bad)

This also depends on how and where you are registering services - if for example you are using consul services register in an init container and/or you are registering directly in your app on startup then you are free to use the API and pull the token from ENV which seems to be exactly what you want!

Can you describe your setup, I've assumed Kubernetes like or similar just from use of "side car: and "containers" but I realise that's a big assumption. What config goes where, what is registering with Consul etc? That might help get to a solution that's workable. But having gRPC accept ACL tokens globally for the whole agent is unlikely to be something we'd want to implement since it breaks the security of our recommended container deployment model (shared agent per node not single one per pod).

Most of this answer applies to the scheduler/multi-tennant container case. For single-tennant (i.e. single service per host/VM) case there is a possible valid request here, although I'd like to understand why no other option is suitable before changing it because it becomes another confusing option that is totally insecure for majority of setups we see...

dmwilcox commented 5 years ago

Thanks @banks for the detailed response. I'm experimenting with Nomad + Rkt + Envoy.

I understand the desire to make things as secure by default as possible -- the reason this became a feature request is because of how inconsistent it is -- that gRPC has different behavior from HTTP.

It would also be extremely helpful if this requirement that gRPC be more secure than HTTP for the same service registration (both localhost sockets only) if Hashicorp were providing a good mechanism to request these tokens. Such as a Vault secrets backend that could protect the more privileged token needed to create these per service+sidecar tokens.

In the meantime I've settled for pulling a token from the environment but it is sub-optimal as it neither provides additional security -- as automating creation new tokens would -- nor is flexible as simply using the acl_token already configured on the host (only available as a localhost socket). Also once the containers can float dynamically I will need a better solution (one of the two above most likely).

You can read more about our setup in our support ticket that lead to me opening this issue, here: https://support.hashicorp.com/hc/en-us/requests/12685

Thanks for your thoughts on this -- I definitely want to get us to issuing per job tokens but it will take some additional work.

banks commented 5 years ago

It would also be extremely helpful if this requirement that gRPC be more secure than HTTP for the same service registration

In a sense gRPC is not more secure than HTTP:

the only thing gRPC can do is request proxy config for a sidecar including Leaf certificates for a service
if you do the equivalent with HTTP via /v1/agent/connect/ca/leaf/:service you also need a specific token and won't get to rely on the acl_token on the agent. Unlike all other HTTP endpoints things under /v1/agent are modifying local agent state (which may or may not result in changes reported to server) rather than just being proxied directly through to the servers (in which case we add the token by default).

I agree that's not super obvious from docs but in general that is the requirement to make Connect meaningfully secure.

... if Hashicorp were providing a good mechanism to request these tokens. Such as a Vault secrets backend ...

The existing Vault secrets backend can in theory work just fine for this task. In practice setting it up would require a new role being created for each service which may be more work than you imagine but actually creating unique tokens for each workload is possible with the existing Vault backend: https://www.vaultproject.io/docs/secrets/consul/index.html.

That said, we are certainly not happy with the complexity here and are working as a major priority on a whole new set of integrations that will make setting this up securely much easier (not to preempt the design but likely able to for instance consume a secret that's already provided by the platform (e.g. Kube SA JWT) and automatically create ACL tokens with the right access to the right services based on a templated policy.

dmwilcox commented 5 years ago

Thanks @banks for the suggestions. I've been playing around with the Consul backend in Vault quite a lot and it almost does what would be needed. Namely templating of consul policies so you could restrict things based on at least the node, if not the service, dynamically.

There is also the matter that Consul tokens from this backend expire which is quite unfortunate since unlike Vault tokens there don't appear to be any good ways to renew them.

Let me know if you have any suggestions for this as I don't see a clean way to implement the production suggestions for Consul Connect without a huge amount of manual work. Thank you!

banks commented 5 years ago

Yes Vault rotation is currently painful although possible with the right tooling (e.g. re-registering services with correct token periodically either in file or via API). We have plans to improve that story as part of the ACL auth methods that launched in 1.5.0 last week.

Ultimately that work is what will make securing connect practical once it expands to support auto-rotation (we have plans for tooling that should help make this easy rather than a problem left to the user) and to identity providers wider than Kubernetes (via Vault). Documentation for it is sparse but should be worked on this week! Part of it though allows for policies needed for Connect to be automatically generated and assigned when presenting a simpler identity token from another provider.

I'll try to remember to follow up here once we have a more complete documentation.

On Wed, May 8, 2019 at 2:58 AM dmwilcox notifications@github.com wrote:

Thanks @banks https://github.com/banks for the suggestions. I've been playing around with the Consul backend in Vault quite a lot and it almost does what would be needed. Namely templating of consul policies so you could restrict things based on at least the node, if not the service, dynamically.

There is also the matter that Consul tokens from this backend expire which is quite unfortunate since unlike Vault tokens there don't appear to be any good ways to renew them.

Let me know if you have any suggestions for this as I don't see a clean way to implement the production suggestions for Consul Connect without a huge amount of manual work. Thank you!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/hashicorp/consul/issues/5207#issuecomment-490317733, or mute the thread https://github.com/notifications/unsubscribe-auth/AAA5QU2F6PSMOXPAEBHVQGLPUIXVJANCNFSM4GO7DJ5Q .

stale[bot] commented 5 years ago

Hey there, We wanted to check in on this request since it has been inactive for at least 60 days. If you think this is still an important issue in the latest version of Consul or its documentation please reply with a comment here which will cause it to stay open for investigation. If there is still no activity on this issue for 30 more days, we will go ahead and close it.

Feel free to check out the community forum as well! Thank you!

stale[bot] commented 4 years ago

Hey there, This issue has been automatically closed because there hasn't been any activity for at least 90 days. If you are still experiencing problems, or still have questions, feel free to open a new one :+1:

ghost commented 4 years ago

Hey there,

This issue has been automatically locked because it is closed and there hasn't been any activity for at least 30 days.

If you are still experiencing problems, or still have questions, feel free to open a new one :+1:.

hashicorp / consul

gRPC should use acl_token implicitly #5207

Feature Description

Use Case(s)