hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Improve ACL errors context to make figuring workable ACLs out simpler #10830

Open chrisjohnson opened 3 years ago

chrisjohnson commented 3 years ago

ACLs in Consul are a mess to navigate. Any given agent may be making requests with half a dozen different tokens (acl.tokens.default, acl.tokens.agent, service.token, consul connect envoy ...), and it's not clear which token is being used when a permission error occurs. The error also doesn't tell me which permission is missing.

The guides and documentation are not organized well enough around ACLs to show which permissions are needed for which purposes, so ACLs are already a game of whack-a-mole. Error messages with more context would be a huge step forward; at least then I'd know where the moles are so I can whack them.

Feature Description

In general, every "permission denied" error should show the Accessor ID and the "slot" (acl.tokens.default, acl.tokens.agent, etc.) that the token came from, to hint to the user which type of request uses which token "slot".

jkirschner-hashicorp commented 2 years ago

@chrisjohnson: this is a mock-up of a potential direction we could take to make this better. Because it hasn't been vetted by others yet (engineering, design), it may need to change. However, I wanted to share it in this early form to get feedback from you (and anyone else viewing this):

How does this compare to what you were hoping for?

This just shows potential CLI changes. There would be corresponding HTTP API changes. And we'd also want to reflect this info on the GUI. Do you have a sense of how you'd prefer to explore such information, and why? (HTTP API, CLI, or GUI?)

Workflow for Resolving a Token

Task: A user sees an ACL permission denied error indicating that the ACL token used for the operation lacks the appropriate access to the requested resource. The user knows that they have namespace default policies in place. The user wants to troubleshoot - to understand why access was denied and change things so access will be approved next time.

Current Workflow

Question: Is this an accurate reflection of the process today? What should I correct?

  1. See generic “Permission Denied” error message
  2. Somehow infer which token was used
  3. Somehow infer which operation failed, consult the docs for which permission is missing
  4. Inspect the token, get back the policies, node identities, service identities, and roles
    1. Inspect all the policies, get back rules
    2. Inspect all the roles, get back policies
      1. Inspect all the policies, get back rules
  5. Notice that the token output doesn’t show the namespace defaults. Run consul namespace read ns, get back the policies and roles
    1. Inspect all the policies, get back rules
    2. Inspect all the roles, get back policies
      1. Inspect all the policies, get back rules
  6. Possibly check the default policy (deny or allow)
  7. Manually review all the information from steps 4-6 to try to understand why permission was denied
  8. If you can’t understand why permission was denied, return to step 2 or 3 because you might have been wrong
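
To make steps 4 and 5 of the current workflow concrete, here is a minimal Python sketch of the manual expansion: token to policies and roles, roles to policies, policies to rules, repeated for the namespace defaults. The `Policy`, `Role`, and `Token` classes and `collect_rules` are hypothetical stand-ins for data you would copy out of the CLI output by hand, not Consul APIs, and node and service identities are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Policy:
    name: str
    rules: List[str]  # rule text as shown when reading a policy


@dataclass
class Role:
    name: str
    policies: List[Policy] = field(default_factory=list)


@dataclass
class Token:
    accessor_id: str
    policies: List[Policy] = field(default_factory=list)
    roles: List[Role] = field(default_factory=list)


def collect_rules(token, namespace_policies=(), namespace_roles=()) -> List[Tuple[str, str]]:
    """Return (source, rule) pairs for everything attached to the token
    and to the namespace defaults (hypothetical helper, not a Consul API)."""
    found = []

    def add(layer, policies, roles):
        # Policies linked directly (workflow steps 4.1 / 5.1).
        for policy in policies:
            for rule in policy.rules:
                found.append((f"{layer} > policy {policy.name}", rule))
        # Policies reached through roles (workflow steps 4.2 / 5.2).
        for role in roles:
            for policy in role.policies:
                for rule in policy.rules:
                    found.append(
                        (f"{layer} > role {role.name} > policy {policy.name}", rule)
                    )

    add("token", token.policies, token.roles)
    add("namespace default", namespace_policies, namespace_roles)
    return found
```

Even in this simplified form, the output is only a flat list of rules with their sources; deciding which rule actually caused the denial (step 7) is still left to the reader.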

Proposed Workflow

  1. See a more detailed “Permission Denied” error message describing (1) what permission was lacking on (2) which resource and (3) how to get more information on why this is the case for the provided token.
  2. Use a CLI command to understand the compiled permissions of that token and how they apply to the resource in step 1.
  3. Based on step 2, modify policies as needed to obtain the necessary permissions, or use a different token (which can be checked using the CLI command in step 2).

Error Message Improvement (Permission Denied)

Current Message

Just says "Permission denied":

2021-07-15T17:03:28.642-0400 [ERROR] agent.proxycfg: Failed to handle update from watch: service_id=testing-ns/
myservice-sidecar-proxy id=leaf error="error filling agent cache: rpc error making call: rpc error making call:
Permission denied"

Proposed Message

Provides additional information, including:

2021-07-15T17:03:28.642-0400 [ERROR] agent.proxycfg: Failed to handle update from watch: service_id=testing-ns/
myservice-sidecar-proxy id=leaf error="error filling agent cache: rpc error making call: rpc error making call:
Permission denied: ACL token from agent config entry 'acl.tokens.default' lacks permission 'service:read' on service
'myservice-sidecar-proxy'; for more info, run: consul acl access explain
-token=2b58e043-178d-8f43-fb74-4ef511f3c0ac -resource=service -label='myservice-sidecar-proxy'"

Question: thoughts on this revised message? Any concerns about it? Info you think is important but missing?
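
As a rough illustration of the pieces the proposed message carries (token slot, missing permission, resource, and a follow-up command), here is a small Python sketch that assembles the same text. `format_denied` is a hypothetical helper, and `consul acl access explain` is the mock-up command from the message above, not an existing Consul command.

```python
def format_denied(slot, resource, access, label, accessor_id):
    """Build the mock-up "Permission denied" text from its parts.

    All names here are hypothetical; the explain command in the hint is
    part of the proposal and does not exist in Consul today.
    """
    permission = f"{resource}:{access}"  # e.g. "service:read"
    return (
        f"Permission denied: ACL token from agent config entry '{slot}' "
        f"lacks permission '{permission}' on {resource} '{label}'; "
        f"for more info, run: consul acl access explain "
        f"-token={accessor_id} -resource={resource} -label='{label}'"
    )


print(
    format_denied(
        slot="acl.tokens.default",
        resource="service",
        access="read",
        label="myservice-sidecar-proxy",
        accessor_id="2b58e043-178d-8f43-fb74-4ef511f3c0ac",
    )
)
```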

Understanding an Authorization Enforcement Decision

The current process is to use the "Current Workflow" steps 2-8. The new process would add utilities for "Proposed Workflow" steps 2-3. (Long-term, this information would be easiest to present in a GUI, but it would likely start with a CLI command.)

Explain access for a given token, resource, and label

The proposed error message above gives you the exact command to run to explain a "Permission denied":

for more info, run: consul acl access explain -token=2b58e043-178d-8f43-fb74-4ef511f3c0ac -resource=service -label='myservice-sidecar-proxy'

Provide the following details about the decision:

Questions:

$ consul acl token access explain -id=<token> -resource=key -label=admin/secret
Access Level:   deny
Enforcer:
  Type: role - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx - role name
        -> policy - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx - policy name
  Enforcement Layer: Token (1/3)
  Rule:
    namespace_prefix "" > key_prefix "admin/" {
      policy = "deny"
    }
Overridden 1:
  Override Reason: enforcer's "deny" takes precedence over "read"
  Type: policy - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx - policy name
  Enforcement Layer: Token (1/3)
  Rule:
    namespace_prefix "" > key_prefix "admin/" {
      policy = "read"
    }
Overridden 2:
  Override Reason: enforcer's rule has a longer prefix match
  Type: policy - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx - policy name
  Enforcement Layer: Token (1/3)
  Rule:
    namespace_prefix "" > key_prefix "" {
      policy = "list"
    }
Overridden 3:
  Override Reason: enforcer's match occurs at a higher layer (Token - 1/3)
  Type: policy - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx - policy name
  Enforcement Layer: Namespace Default (2/3)
  Rule:
    namespace_prefix "" > key_prefix "" {
      policy = "deny"
    }
Overridden 4:
  Override Reason: enforcer's match occurs at a higher layer (Token - 1/3)
  Type: Default Policy
  Enforcement Layer: Default Policy (3/3)
  Rule:
    namespace_prefix "" > key_prefix "" {
      policy = "deny"
    }
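
The override reasons in the mock-up output above imply an ordering: a higher enforcement layer wins first, then the longest matching prefix, then "deny" over other access levels on the same prefix. Below is a minimal Python sketch of that ordering only. The `Rule` class and `explain` function are hypothetical, exact-match rules are not handled, and this is not a description of how Consul's authorizer is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    layer: int    # 1 = token, 2 = namespace default, 3 = default policy
    prefix: str   # key_prefix value; "" matches every key
    access: str   # "deny", "list", "read", "write"
    source: str   # free-text description of where the rule came from


def explain(rules, label):
    """Return (enforcer, [(overridden_rule, reason), ...]) for a key label.

    Ordering assumed from the mock-up: lower layer number first, then
    longer prefix, then "deny" ahead of anything else on the same prefix.
    """
    matches = [r for r in rules if label.startswith(r.prefix)]
    if not matches:
        return None, []
    ranked = sorted(
        matches, key=lambda r: (r.layer, -len(r.prefix), r.access != "deny")
    )
    enforcer, rest = ranked[0], ranked[1:]
    overridden = []
    for r in rest:
        if r.layer > enforcer.layer:
            reason = "enforcer's match occurs at a higher layer"
        elif len(r.prefix) < len(enforcer.prefix):
            reason = "enforcer's rule has a longer prefix match"
        else:
            reason = "enforcer's \"{}\" takes precedence over \"{}\"".format(
                enforcer.access, r.access
            )
        overridden.append((r, reason))
    return enforcer, overridden
```

Feeding it the five rules from the mock-up with the label `admin/secret` selects the token-layer `key_prefix "admin/"` deny rule as the enforcer and lists the other four as overridden, in the same order as the output above. Ties between two non-deny access levels on the same prefix simply keep input order here, since the mock-up does not say how those would be resolved.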

Read the resolved access for a given token

The command described above is primarily intended to explain an enforcement decision (such as in response to an error message). This command is instead focused on explaining the compiled ruleset for a token and how to modify it (based on the policy source).

The usage shown below is the full output. It could also allow filtering by a resource (exact match or prefix) or namespace (exact match or prefix) to make it easier to view only what you need.

Questions:

$ consul acl token access read -namespace=testing-rx3 -id=2b58e043-178d-8f43-fb74-4ef511f3c0ac
Token:
  AccessorID:       2b58e043-178d-8f43-fb74-4ef511f3c0ac
  SecretID:         xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  Namespace:        testing-rx3
  Description:      service.token -- mitrx3n1.cmmint.net
  Local:            false
  Create Time:      2021-10-07 10:35:40.125000128 -0400 -0400
Rules:
  Resource "key":
    Layer 0: Token
      namespace_prefix "":
        key_prefix "": list (role - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx - role name)
        key_prefix "admin/": deny (policy - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx - policy name)
        key "admin/test": read (policy - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx - policy name)
        key "app/": write (policy - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx - policy name)

    Layer 1: Namespace Defaults

    Layer 2: Default Policy
      namespace_prefix "":
        key_prefix "": deny

  Resource “node”:
    Layer 0: Token
    ... enumerate the resolved rules as above for Resource "key" ...

  ... enumerate the other types of resources ...

jkirschner-hashicorp commented 2 years ago

Potential call signature improvement...

Original proposal:

$ consul acl token access explain -id=<token> -resource=key -label=admin/secret

Alternative:

$ consul acl token access explain -id=<token> -key=admin/secret
$ consul acl token access explain -id=<token> -service=myservice-sidecar-proxy
                                        ...   -<resource type>=<resource label>

jkirschner-hashicorp commented 2 years ago

@chrisjohnson: Consul 1.12 will include more verbose ACL error messages!

Instead of just Permission denied, they will be something like:

Permission denied: token with AccessorID '8a2d52a0-6b41-7077-8374-09d4fafa2d30' lacks permission 'service:read' on "foobar" in partition "foo", namespace "bar"
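
For anyone scraping logs, a message of this shape can be pulled apart with a regular expression. The Python sketch below assumes the wording matches the example above exactly; since the comment says messages will be "something like" this, the real text may differ, and `DENIED_RE` and `parse_denied` are hypothetical names.

```python
import re

# Pattern modeled on the single example message quoted above; the
# partition/namespace suffix is treated as optional (an assumption).
DENIED_RE = re.compile(
    r"Permission denied: token with AccessorID '(?P<accessor_id>[^']+)' "
    r"lacks permission '(?P<permission>[^']+)' "
    r'on "(?P<resource>[^"]+)"'
    r'(?: in partition "(?P<partition>[^"]+)", namespace "(?P<namespace>[^"]+)")?'
)


def parse_denied(line):
    """Return a dict of the message's fields, or None if the line does
    not contain a message of the assumed shape."""
    match = DENIED_RE.search(line)
    return match.groupdict() if match else None


example = (
    "Permission denied: token with AccessorID "
    "'8a2d52a0-6b41-7077-8374-09d4fafa2d30' lacks permission 'service:read' "
    'on "foobar" in partition "foo", namespace "bar"'
)
print(parse_denied(example))
```

The AccessorID extracted this way can then be passed to `consul acl token read -id=<AccessorID>` to inspect the token's policies and roles.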

I'll leave this issue open because there are further improvements that could be made in the future (stating the "slot" that a token comes from, the "explain" functionality).

Relevant PRs: #12308, #12470, #12550, #12567, #12597, #12620