hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

ACL Replication not working in v1.4.2 #5416

Closed ChipV223 closed 4 years ago

ChipV223 commented 5 years ago

Overview of the Issue

When setting up ACL replication on the non-ACL (secondary) DC, I'm getting `failed to sync remote state: rpc error making call: ACL not found` errors in the logs of the secondary DC nodes, despite them using the same agent token that works in the ACL (primary) DC.

Reproduction Steps

  1. Create a 3-server cluster with ACLs enabled (i.e. the Primary DC)
  2. Create a second 3-server cluster (i.e. the Secondary DC) and have it WAN-joined with the Primary DC
  3. Add the necessary ACL information to the config (a sketch of the primary-side config appears after these steps). In addition, you'll want to create the policy and token for ACL replication and add that information to the config:
```
vagrant@n9:/etc/consul.d$ consul_e acl policy create -name "replication-token" -description "ACL Replication Token Policy" -rules @replicate_policy.json
ID:           211fe092-a5ee-54f0-4699-d0836d282982
Name:         replication-token
Description:  ACL Replication Token Policy
Datacenters:
Rules:
acl_prefix "" {
   policy = "write"
}

vagrant@n9:/etc/consul.d$ consul_e acl token create -description "ACL Replication Token" -policy-name "replication-token"
AccessorID:   94ecb2f7-1d1e-f49f-9058-1ffdd70157fc
SecretID:     ebde48d9-4781-8ec1-096f-d95efd738d60
Description:  ACL Replication Token
Local:        false
Create Time:  2019-02-21 16:26:53.855119156 +0000 UTC
Policies:
   211fe092-a5ee-54f0-4699-d0836d282982 - replication-token
```
  4. Restart the nodes in the Secondary DC one at a time. Once all nodes have restarted, you should see the errors come up in the logs almost immediately
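For reference, here's roughly what the ACL stanza looks like on the Primary (test1) servers. This is a reconstructed sketch mirroring the Secondary config included below, not the exact file from this setup; the file name `acl.json` and the management token value are placeholders:

```
vagrant@n9:/etc/consul.d$ cat acl.json
{
  "datacenter": "test1",
  "primary_datacenter": "test1",
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "tokens": {
      "master": "<management-token>",
      "agent": "af48238c-73ca-9a80-8a04-c6a5661d994f"
    }
  }
}
```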

Consul info

Server info from the Primary DC nodes:

```
vagrant@n9:/etc/consul.d$ consul_e info -token=testybesty
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = d282e472
    version = 1.4.2+ent
consul:
    acl = enabled
    bootstrap = false
    known_datacenters = 2
    leader = true
    leader_addr = 172.20.14.18:8300
    server = true
license:
    customer = permanent
    expiration_time = 2049-02-23 19:02:13.257944622 +0000 UTC
    features = Automated Backups, Automated Upgrades, Enhanced Read Scalability, Network Segments, Redundancy Zone, Advanced Network Federation
    id = permanent
    install_id = *
    issue_time = 2019-03-03 19:02:13.257944622 +0000 UTC
    package = premium
    product = consul
    start_time = 2019-03-03 18:57:13.257944622 +0000 UTC
raft:
    applied_index = 5317
    commit_index = 5317
    fsm_pending = 0
    last_contact = 0
    last_log_index = 5317
    last_log_term = 17
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:4aef7c10-a8ed-dcb6-0269-a5526ca894fd Address:172.20.14.18:8300} {Suffrage:Voter ID:c53b8284-e65e-1c0c-6706-1a042bbd78ca Address:172.20.14.19:8300} {Suffrage:Voter ID:d389c1b4-fb08-a185-92c4-a2d082441586 Address:172.20.14.20:8300}]
    latest_configuration_index = 5172
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 17
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 134
    max_procs = 1
    os = linux
    version = go1.11.4
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 12
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 29
    members = 3
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 103
    members = 6
    query_queue = 0
    query_time = 1
```
Server info from the Secondary nodes: not able to run `consul info` due to the aforementioned issue, but here's the server config:

```
{
  "bootstrap_expect": 3,
  "bind_addr": "0.0.0.0",
  "node_name": "Server2A",
  "server": true,
  "datacenter": "test2",
  "disable_remote_exec": true,
  "data_dir": "/tmp/consul_2",
  "log_level": "TRACE",
  "enable_syslog": true,
  "rejoin_after_leave": true,
  "disable_update_check": true,
  "client_addr": "0.0.0.0",
  "leave_on_terminate": true,
  "translate_wan_addrs": true,
  "ui": true,
  "domain": "chip.com",
  "advertise_addr": "172.20.14.21",
  "enable_debug": true,
  "retry_join": ["172.20.14.21", "172.20.14.22", "172.20.14.23"],
  "primary_datacenter": "test1",
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "enable_token_replication": true,
    "tokens": {
      "agent": "af48238c-73ca-9a80-8a04-c6a5661d994f",
      "replication": "ebde48d9-4781-8ec1-096f-d95efd738d60"
    }
  }
}
```

Operating system and Environment details

Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-31-generic x86_64)

Log Fragments

```
2019/03/03 19:08:03 [ERR] agent: failed to sync remote state: rpc error making call: ACL not found
2019/03/03 19:08:10 [ERR] agent: Coordinate update error: rpc error making call: ACL not found
2019/03/03 19:08:16 [DEBUG] memberlist: Initiating push/pull sync with: 172.20.14.23:8301
2019/03/03 19:08:20 [ERR] agent: failed to sync remote state: rpc error making call: ACL not found
2019/03/03 19:08:31 [ERR] agent: Coordinate update error: rpc error making call: ACL not found
2019/03/03 19:08:36 [ERR] agent: failed to sync remote state: rpc error making call: ACL not found
2019/03/03 19:08:46 [DEBUG] memberlist: Initiating push/pull sync with: 172.20.14.22:8301
2019/03/03 19:08:55 [ERR] agent: Coordinate update error: rpc error making call: ACL not found
2019/03/03 19:08:56 [DEBUG] memberlist: Stream connection from=172.20.14.19:59020
2019/03/03 19:09:00 [DEBUG] memberlist: Stream connection from=172.20.14.23:45656
2019/03/03 19:09:03 [DEBUG] memberlist: Stream connection from=172.20.14.18:42034
2019/03/03 19:09:03 [ERR] agent: failed to sync remote state: rpc error making call: ACL not found
2019/03/03 19:09:04 [DEBUG] memberlist: Stream connection from=172.20.14.22:46430
2019/03/03 19:09:16 [DEBUG] memberlist: Initiating push/pull sync with: 172.20.14.22:8301
2019/03/03 19:09:22 [ERR] agent: Coordinate update error: rpc error making call: ACL not found
```
rboyer commented 5 years ago

`acl` is not a prefix-able rule type, so `acl_prefix "" { policy = "write" }` is being ignored. If you update the rule to just be `acl = "write"`, it should work.
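Concretely, that means replacing the rules file contents and updating the existing policy in place, e.g. (a sketch reusing the policy ID from the reproduction steps above; by default `consul acl policy update` merges, replacing only the fields you pass):

```
vagrant@n9:/etc/consul.d$ cat replicate_policy.json
acl = "write"

vagrant@n9:/etc/consul.d$ consul_e acl policy update -id 211fe092-a5ee-54f0-4699-d0836d282982 -rules @replicate_policy.json
```

The secondary already references this token's SecretID in `acl.tokens.replication`, so no secondary-side config change should be needed once the policy rules are corrected.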

mkeeler commented 5 years ago

So when @ChipV223 and I were working through this before, there were a couple of issues:

1) The rule itself was wrong; fixing it allowed replication to work and resolved the errors.
2) Somehow the FSM in the secondary DC already contained token information, so the fallback mechanism (doing remote token/policy resolution against the primary until replication is sorted out) was not kicking in.

The first is user error, but the second is something we should probably look into at some point. A custom binary I provided him checks the local ACL replication status instead of raft indices to determine whether it should do the fallback. That can't just be PRed as-is, because checking replication status locally only works on the leader; the followers would need an extra RPC to the leader to determine the replication status.
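For anyone debugging a similar setup: the replication state in question is exposed via Consul's `/v1/acl/replication` HTTP endpoint, which reports fields such as `Enabled`, `Running`, `SourceDatacenter`, `ReplicatedIndex`, and `LastError`. A quick sketch of checking it on a secondary server (the token is whatever management/agent token is valid locally):

```
vagrant@n9:/etc/consul.d$ curl -s -H "X-Consul-Token: <valid-token>" \
    http://127.0.0.1:8500/v1/acl/replication
```

If `ReplicatedIndex` never advances while `LastError` keeps updating, the secondary is failing every replication pass rather than merely lagging behind.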

stale[bot] commented 5 years ago

Hey there, We wanted to check in on this request since it has been inactive for at least 60 days. If you think this is still an important issue in the latest version of Consul or its documentation please reply with a comment here which will cause it to stay open for investigation. If there is still no activity on this issue for 30 more days, we will go ahead and close it.

Feel free to check out the community forum as well! Thank you!

stale[bot] commented 4 years ago

Hey there, This issue has been automatically closed because there hasn't been any activity for at least 90 days. If you are still experiencing problems, or still have questions, feel free to open a new one :+1:

ghost commented 4 years ago

Hey there,

This issue has been automatically locked because it is closed and there hasn't been any activity for at least 30 days.

If you are still experiencing problems, or still have questions, feel free to open a new one :+1:.