hashicorp / consul-k8s

First-class support for Consul Service Mesh on Kubernetes
https://www.consul.io/docs/k8s

server-acl-init -create-client-token not utilizing secondary DC client-token policy #582

Open pedrohdz opened 3 years ago

pedrohdz commented 3 years ago

Overview of the Issue

A new client-only Consul Kubernetes cluster is failing to join a secondary DC. We are deploying with consul-helm with .Values.global.acls.manageSystemACLs enabled. The cause appears to be that the client-token ACL token is being associated with the client-token ACL policy of the primary DC rather than that of the secondary DC.

The Consul client logs show (kubectl logs consul-k7xm9):

2021-07-30T08:27:30.006Z [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=REDACTED:8300 error="rpcinsecure error making call: rpcinsecure error making call: Permission denied"
2021-07-30T08:27:30.016Z [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=REDACTED:8300 error="rpcinsecure error making call: rpcinsecure error making call: Permission denied"
2021-07-30T08:27:30.069Z [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=REDACTED:8300 error="rpcinsecure error making call: Permission denied"
2021-07-30T08:27:30.069Z [ERROR] agent.auto_config: No servers successfully responded to the auto-encrypt request

The client-token ACL policy (the one associated with the primary DC) is used when creating a client token for a new Kubernetes client cluster that is configured to use the secondary DC as its server.

Partial output from kubectl logs consul-server-acl-init-5dhnn shows that the client-token policy is used rather than client-token-REDACTED-dc2:

2021-07-29T10:38:18.451Z [INFO]  Bootstrap token is provided so skipping Consul server ACL bootstrapping
2021-07-29T10:38:18.946Z [INFO]  Success: calling /agent/self to get datacenter
2021-07-29T10:38:19.146Z [INFO]  Current datacenter: datacenter=REDACTED-dc2 primaryDC=REDACTED-dc1
2021-07-29T10:38:19.215Z [INFO]  Policy "client-token" already exists, skipping update
2021-07-29T10:38:19.215Z [INFO]  Success: creating client-token policy
...

Listing the ACL policies on REDACTED-dc2 shows that client-token-REDACTED-dc2 exists.

$ consul acl policy list | grep 'client-token.*:'
client-token:
client-token-REDACTED-dc2:

$ consul acl policy read -name client-token-REDACTED-dc2
ID:           REDACTED
Name:         client-token-REDACTED-dc2
Description:  client-token-REDACTED-dc2 Token Policy
Datacenters:  REDACTED-dc2
Rules:
node_prefix "" {
  policy = "write"
}
service_prefix "" {
  policy = "read"
}

REDACTED-dc2 is only used by the Consul client running on the dc2 Kubernetes cluster itself.

Taking a guess here, the issue appears to be in the following code, where the DC name is not appended to the policy name:

https://github.com/hashicorp/consul-k8s/blob/4a50fda5ab50fb2d6c99603b00a538ef432a6eed/subcommand/server-acl-init/create_or_update.go#L34-L38
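
One way to confirm which policy the token actually received (a rough check, assuming CONSUL_HTTP_TOKEN is set to the bootstrap token and the CLI is pointed at a REDACTED-dc2 server; <ACCESSOR_ID> is a placeholder and the output layout varies by Consul version):

# Find the accessor ID of the token described as "client-token Token"
consul acl token list -datacenter REDACTED-dc2

# Read it back; with this bug, Policies lists "client-token" rather than
# "client-token-REDACTED-dc2"
consul acl token read -datacenter REDACTED-dc2 -id <ACCESSOR_ID>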

Reproduction Steps

The general idea is:

  1. Create a Consul server deployment with both primary and secondary DCs using the consul-helm Helm chart, similar to the Secure Service Mesh Communication Across Kubernetes Clusters tutorial, with manageSystemACLs and TLS enableAutoEncrypt enabled.
  2. Create a third Kubernetes cluster configured as client only (server.enabled: false), pointing it at the secondary DC.
  3. The Consul client pods will fail to start.

Logs

Logs provided above.

Expected behavior

consul-k8s server-acl-init -create-client-token should associate the client-token ACL token with the client-token-REDACTED-dc2 ACL policy for the secondary DC, not the policy for the primary DC.

Environment details


lkysow commented 3 years ago

Hi, do you have the Helm yamls for each DC? I think there is indeed likely to be an issue here.

pedrohdz commented 3 years ago

@lkysow I should be able to get the Helm value files some time next week. Thanks…

pedrohdz commented 3 years ago

@lkysow, the Helm value files are in the Gists linked in the wget commands below.

Recreation instructions:

  1. Create DC1 and extract configuration data:

    kubectl --context=dc1 create namespace consul
    
    kubectl --context=dc1 --namespace=consul \
        create secret generic consul-gossip-encryption-key \
        --from-literal=key=$(consul keygen)
    
    wget 'https://gist.githubusercontent.com/pedrohdz/3e869f5b6dbfc3b49900c244ae67824e/raw/hashicorp+consul-k8s+issues+582+dc1-helm-values.yaml'
    
    helm --kube-context=dc1 --namespace=consul \
        install --values='hashicorp+consul-k8s+issues+582+dc1-helm-values.yaml' \
        consul hashicorp/consul --version="0.33.0" --wait
    
    kubectl --context=dc1 --namespace=consul \
        get secret consul-federation -o yaml > consul-federation-secret.yaml
    
    DEMO_CONSUL_BOOTSTRAP_TOKEN=$(kubectl \
        --context=dc1 --namespace=consul get secrets \
        consul-bootstrap-acl-token -o jsonpath='{.data.token}' | base64 -d)
  2. Create DC2:

    kubectl --context=dc2 create namespace consul
    
    kubectl \
        --context=dc2 --namespace=consul \
        apply --filename=consul-federation-secret.yaml
    
    wget 'https://gist.githubusercontent.com/pedrohdz/df324f75315a789eed53d0f331cf1d44/raw/hashicorp+consul-k8s+issues+582+dc2-helm-values.yaml'
    
    helm --kube-context=dc2 --namespace=consul \
        install --values=hashicorp+consul-k8s+issues+582+dc2-helm-values.yaml \
        consul hashicorp/consul --version="0.33.0" --wait
  3. Verify that everything is working:

    kubectl --context=dc1 --namespace=consul \
        exec statefulset/consul-server -- consul members -wan
  4. You will need the IP address of the server pod, assuming it has a network-exposed IP address:

    DEMO_CONSUL_DC2_IP=$(kubectl --context=dc2 \
        --namespace=consul get pods consul-server-0 -o jsonpath='{.status.podIP}')
    echo $DEMO_CONSUL_DC2_IP
  5. Set up the DC2 client only cluster:

    kubectl --context=dc2-client-only create namespace consul
    
    kubectl --context=dc2-client-only --namespace=consul \
        apply --filename=consul-federation-secret.yaml
    
    kubectl --context=dc2-client-only --namespace=consul \
        create secret generic copied-bootstrap-token \
        --from-literal=token="$DEMO_CONSUL_BOOTSTRAP_TOKEN"
    
    wget 'https://gist.githubusercontent.com/pedrohdz/a5f59554464e7e8d60bcf2a34bab40d9/raw/hashicorp+consul-k8s+issues+582+dc2-client-only-helm-values.yaml'
    
    helm --kube-context=dc2-client-only --namespace=consul \
        install --values=hashicorp+consul-k8s+issues+582+dc2-client-only-helm-values.yaml \
        --set="client.join[0]=$DEMO_CONSUL_DC2_IP" \
        --set="externalServers.hosts[0]=$DEMO_CONSUL_DC2_IP" \
        consul hashicorp/consul --version "0.33.0" --wait
  6. View the client errors:

    kubectl --context=dc2-client-only --namespace=consul logs -l 'app=consul,component=client'

    Note that the error seems to have changed, but the root cause appears to be the same:

    2021-08-25T07:09:08.401Z [WARN]  agent: Node info update blocked by ACLs: node=YYYYYY-YYYYYYY-YYYYYYY accessorID=XXXXX-XXXXXXX-XXXXXXXX
    2021-08-25T07:09:22.535Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=10.31.246.59:8300 error="rpc error making call: Permission denied"
    2021-08-25T07:09:22.535Z [WARN]  agent: Coordinate update blocked by ACLs: accessorID=XXXXX-XXXXXXX-XXXXXXXX
    2021-08-25T07:09:46.500Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=10.31.246.59:8300 error="rpc error making call: Permission denied"
    2021-08-25T07:09:46.504Z [WARN]  agent: Coordinate update blocked by ACLs: accessorID=XXXXX-XXXXXXX-XXXXXXXX
    2021-08-25T07:10:03.211Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=10.31.246.59:8300 error="rpc error making call: Permission denied"
    2021-08-25T07:10:03.211Z [WARN]  agent: Coordinate update blocked by ACLs: accessorID=XXXXX-XXXXXXX-XXXXXXXX

If you add the client-token-dc2 policy to the client-token Token on dc2, the errors go away and the nodes appear to finish registering properly (see the sketch after the log excerpt below).

2021-08-25T09:23:01.502Z [INFO]  agent: Synced node info
2021-08-25T09:23:09.140Z [INFO]  agent: Synced node info
2021-08-25T09:23:36.600Z [INFO]  agent: Synced node info
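
For reference, that manual workaround is roughly the following (a sketch, not a proper fix; <ACCESSOR_ID> is the client token's accessor ID, and -merge-policies keeps the existing client-token policy attached):

# Attach the dc2 policy to the existing client token on dc2
consul acl token update -datacenter dc2 \
    -id <ACCESSOR_ID> \
    -policy-name client-token-dc2 \
    -merge-policies
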
lkysow commented 3 years ago

Okay so I think this is covered in this doc: https://www.consul.io/docs/k8s/installation/deployment-configurations/single-dc-multi-k8s

Note: The Helm release name must be unique for each Kubernetes cluster. That is because the Helm chart will use the Helm release name as a prefix for the ACL resources that it creates, such as tokens and auth methods. If the names of the Helm releases are the same, the Helm installation in subsequent clusters will clobber existing ACL resources.

I think if you use a different prefix in your hashicorp+consul-k8s+issues+582+dc2-client-only-helm-values.yaml, e.g. global.name: consul-clientonly then it will work.
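
For example, on the client-only release (a sketch based on the install command in step 5 above; keep the client.join and externalServers --set flags from that step as well):

# Re-deploy the client-only release with a distinct prefix for its ACL resources
helm --kube-context=dc2-client-only --namespace=consul \
    upgrade --install \
    --values=hashicorp+consul-k8s+issues+582+dc2-client-only-helm-values.yaml \
    --set='global.name=consul-clientonly' \
    consul hashicorp/consul --version "0.33.0" --wait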

pedrohdz commented 3 years ago

@lkysow, No dice, unfortunately..

The consul-k8s-control-plane server-acl-init call on the Consul client cluster is still defaulting to creating a token with the description client-token Token and associating it with the client-token policy, which is set for dc1 only, not dc2.

It looks like the name of the token is hard-coded here: https://github.com/hashicorp/consul-k8s/blob/01d22a21cfc03960b29d97191ba1acebed5ede60/control-plane/subcommand/server-acl-init/command.go#L444-L448

Then this part fails to append the DC, which would associate the token with the client-token-dc2 policy: https://github.com/hashicorp/consul-k8s/blob/01d22a21cfc03960b29d97191ba1acebed5ede60/control-plane/subcommand/server-acl-init/create_or_update.go#L34-L38

I guess an option might be to use -resource-prefix instead of hard-coding the name client. I'm not sure if that would break existing deployments though. Another option is appending the DC to the policy name, which is likely the better solution since it would minimize the number of auto-created policies. The other option is that I'm totally missing something else. 🤓

I updated the Gists BTW.

pedrohdz commented 3 years ago

@lkysow, Question for you.. Should the global.name be set to different values for the two DCs (servers)? I currently have them set the same in the YAML files I provided.

Thanks!

lkysow commented 3 years ago

Ahhh I see what's happening. Yeah the client-only dc2 thinks it's not in federation mode because global.federation.enabled == false. Can you try setting that to true in the client-only dc2?
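
For example, re-running the client-only install from step 5 with that flag added (a sketch; the setting can equally go in the values file):

# Tell the client-only release it is part of a federated, secondary DC
helm --kube-context=dc2-client-only --namespace=consul \
    upgrade --install \
    --values=hashicorp+consul-k8s+issues+582+dc2-client-only-helm-values.yaml \
    --set='global.federation.enabled=true' \
    --set="client.join[0]=$DEMO_CONSUL_DC2_IP" \
    --set="externalServers.hosts[0]=$DEMO_CONSUL_DC2_IP" \
    consul hashicorp/consul --version "0.33.0" --wait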

Question for you.. Should the global.name be set to different values for the two DCs (servers)? I currently have them set the same in the YAML files I provided.

No, those can be the same name. The restriction for different names is only when you're sharing a Consul DC across two kube clusters.

pedrohdz commented 3 years ago

Huh... That seemed to do the trick (Gist updated), although the following language is a little confusing in this case:

https://github.com/hashicorp/consul-k8s/blob/3521e3d292b00be5f4d7e8adb3cce2a098d963dd/charts/consul/values.yaml#L227-L234

It implies that meshGateway.enabled must be true, which it is not. Or maybe I should be turning it on? I thought mesh gateways were only used when communicating between DC servers. I tried a release with meshGateway.enabled set to true and it seems to come up just fine.

lkysow commented 3 years ago

Hi, yes I agree it's confusing. Really what you're indicating by setting that in your client-only install values is that you're in a secondary DC. I think we might be able to get away without doing that check and then it should just work. I'll create a PR.