hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

ACL replication breaks after upgrade from 1.9.5 to 1.14.3 #16273

Open kemko opened 1 year ago

kemko commented 1 year ago

Overview of the Issue

After upgrading Consul from 1.9.5 to 1.14.3, ACL replication breaks. It can be worked around with the rather strange steps described below. We decided to file a bug report since we could not find any notes about this behavior in the documentation.

Reproduction Steps

  1. Deploy at least three Consul clusters on version 1.9.5. One of them must be declared the primary datacenter; the rest must be configured to replicate ACLs from it.
  2. Upgrade all clusters to 1.14.3. After that, secondary clusters will log ACL replication errors periodically. See log 1.
  3. In the web UI of any secondary datacenter, create an empty policy. This temporarily fixes the problem in that DC, but not in the others. After a while, replication breaks again without any manual intervention. See log 2.
  4. Bind the empty policy from step 3 to any existing token. (For example, we bound it to the initial management token).

After that, the replication error will disappear for all datacenters and replication will work as expected.
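
To confirm whether replication has really recovered in each secondary datacenter, it may be easier to poll Consul's ACL replication status endpoint (`GET /v1/acl/replication`) than to watch the logs. The following is a minimal Python sketch, not a tested tool. It assumes the agent's HTTP API is reachable at an address you supply, that the timestamps in the response are UTC values ending in `Z`, and that the helper names (`fetch_replication_status`, `replication_problem`) are purely illustrative:

```python
import json
import urllib.request
from datetime import datetime


def parse_ts(value):
    """Parse a UTC RFC 3339 timestamp such as 2023-02-15T12:40:01.391Z (assumed format)."""
    head, _, frac = value.rstrip("Z").partition(".")
    parsed = datetime.strptime(head, "%Y-%m-%dT%H:%M:%S")
    # Keep at most microsecond precision; pad short fractions with zeros.
    return parsed.replace(microsecond=int(frac[:6].ljust(6, "0")) if frac else 0)


def fetch_replication_status(base_url, token=None):
    """Fetch the ACL replication status document from one Consul agent."""
    req = urllib.request.Request(f"{base_url}/v1/acl/replication")
    if token:
        req.add_header("X-Consul-Token", token)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def replication_problem(status):
    """Return a short problem description, or None if the status looks healthy."""
    if not status.get("Enabled"):
        return "replication is not enabled"
    if not status.get("Running"):
        return "replication is not running"
    last_ok = status.get("LastSuccess")
    last_err = status.get("LastError")
    if last_ok and last_err and parse_ts(last_err) > parse_ts(last_ok):
        return f"last error ({last_err}) is newer than last success ({last_ok})"
    return None
```

Calling `replication_problem(fetch_replication_status("http://127.0.0.1:8500"))` against a server in each secondary DC (the address here is only a placeholder) would show which datacenters are still failing after steps 3 and 4.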

Operating system and Environment details

Ubuntu 20.04.5 LTS, x86_64 GNU/Linux

Log Fragments

Log 1

```log
2023-02-15T12:40:01.391+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T12:40:01.391+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T12:40:01.391+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=1
2023-02-15T12:40:01.391+0300 [DEBUG] agent.server.replication.acl.policy: acl replication - downloaded updates: amount=1
2023-02-15T12:40:01.391+0300 [DEBUG] agent.server.replication.acl.policy: acl replication - performing updates
2023-02-15T12:40:01.391+0300 [WARN] agent.server.replication.acl.policy: ACL replication error (will retry if still leader): error="failed to update local ACL policies: Failed to apply policy upserts: node is not the leader"
```

Log 2

```log
2023-02-15T13:10:29.486+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:10:29.486+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T13:10:29.487+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=0
2023-02-15T13:10:29.487+0300 [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=1867938962
2023-02-15T13:12:44.368+0300 [INFO] agent.server.replication.acl.policy: started ACL Policy replication
2023-02-15T13:12:44.373+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:12:44.373+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T13:12:44.373+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=0
2023-02-15T13:12:44.373+0300 [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=1867938962
2023-02-15T13:15:40.920+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:17:53.696+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:17:53.696+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T13:17:53.696+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=0
2023-02-15T13:17:53.696+0300 [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=1867938962
2023-02-15T13:23:00.541+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:23:00.541+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T13:23:00.541+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=0
2023-02-15T13:23:00.541+0300 [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=1867938962
2023-02-15T13:28:01.043+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:28:01.043+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T13:28:01.043+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=0
2023-02-15T13:28:01.043+0300 [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=1867938962
2023-02-15T13:33:11.701+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:33:11.701+0300 [WARN] agent.server.replication.acl.policy: ACL replication remote index moved backwards, forcing a full ACL sync: from=1867938962 to=1692767365
2023-02-15T13:33:11.701+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T13:33:11.701+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=1
2023-02-15T13:33:11.705+0300 [DEBUG] agent.server.replication.acl.policy: acl replication - downloaded updates: amount=1
2023-02-15T13:33:11.706+0300 [DEBUG] agent.server.replication.acl.policy: acl replication - performing updates
2023-02-15T13:33:11.713+0300 [DEBUG] agent.server.replication.acl.policy: acl replication - upserted batch: number_upserted=1 batch_size=497
2023-02-15T13:33:11.713+0300 [DEBUG] agent.server.replication.acl.policy: acl replication - finished updates
2023-02-15T13:33:11.713+0300 [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=1692767365
2023-02-15T13:33:11.718+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:33:11.718+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T13:33:11.718+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=0
2023-02-15T13:33:11.718+0300 [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=1867938962
2023-02-15T13:38:26.839+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:38:26.839+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T13:38:26.839+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=0
2023-02-15T13:38:26.839+0300 [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=1867938962
2023-02-15T13:43:32.062+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:43:32.062+0300 [WARN] agent.server.replication.acl.policy: ACL replication remote index moved backwards, forcing a full ACL sync: from=1867938962 to=1692767365
2023-02-15T13:43:32.062+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T13:43:32.062+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=1
2023-02-15T13:43:32.067+0300 [DEBUG] agent.server.replication.acl.policy: acl replication - downloaded updates: amount=1
2023-02-15T13:43:32.067+0300 [DEBUG] agent.server.replication.acl.policy: acl replication - performing updates
2023-02-15T13:43:32.083+0300 [WARN] agent.server.replication.acl.policy: ACL replication error (will retry if still leader): error="failed to update local ACL policies: Failed to apply policy upserts: Changing the Rules for the builtin global-management policy is not permitted"
```

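
The two warnings that matter in these logs (the remote index moving backwards and the replication error) are easy to lose among the DEBUG lines. A small Python sketch for pulling them out of a log file follows. It assumes one log entry per line in the format shown above; the function name `scan_replication_log` and the tuple layout it returns are made up for illustration:

```python
import re

# timestamp, [LEVEL], logger name, then the message text
LINE_RE = re.compile(
    r"^(?P<ts>\S+) \[(?P<level>[A-Z]+)\]\s+(?P<logger>[\w.]+): (?P<msg>.*)$"
)
BACKWARDS_RE = re.compile(r"remote index moved backwards.*from=(\d+) to=(\d+)")
ERROR_RE = re.compile(r'ACL replication error .*error="(.*)"')


def scan_replication_log(lines):
    """Collect index regressions and replication errors from Consul log lines."""
    findings = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m["level"] != "WARN":
            continue  # only WARN entries carry the events of interest here
        back = BACKWARDS_RE.search(m["msg"])
        if back:
            findings.append((m["ts"], "index_regressed", int(back[1]), int(back[2])))
            continue
        err = ERROR_RE.search(m["msg"])
        if err:
            findings.append((m["ts"], "error", err[1]))
    return findings
```

Run over Log 2, this would report the regressions from 1867938962 to 1692767365 at 13:33:11 and 13:43:32, followed by the `global-management` policy error.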
huikang commented 1 year ago

Given the large gap between 1.9.x and 1.14.x, I am wondering whether upgrading to an intermediate version first (for example, 1.9.x to 1.10.x) would help narrow down the root cause.

https://developer.hashicorp.com/consul/docs/upgrading/instructions

akotlyar commented 1 year ago

Got the same error: replication does not work between 1.12.8 and 1.13.7. If both the primary and the secondary run 1.13.7, everything works, but if one of them runs 1.12.8, policy replication fails with an error.

akotlyar commented 1 year ago

Policy replication does not work with any 1.13.x version if one of the DCs is below version 1.13.x.

akotlyar commented 1 year ago

Is there any solution to this problem? We use 8 datacenters and have always performed the update according to the following instructions: "Upgrade the Consul agents in all DCs to version 1.x.x by following our General Upgrade Process. This should be done one DC at a time, leaving the primary DC for last"

But this scheme does not work when upgrading from 1.12.8 to 1.13.7. As soon as the first DC is upgraded, ACL replication on it fails. We set up a test environment with 3 datacenters and found the following: 1.12.8 (9) does not, in principle, synchronize ACLs with versions 1.13.x. If you upgrade the primary DC to version 1.13.x, replication breaks on all the other DCs still on 1.12.8; if you upgrade only one of the secondary DCs, replication breaks on that DC.

The only option I see is to upgrade Consul in all DCs at the same time, but that would affect more critical services, which I would like to avoid. Is this upgrade behavior intended, or is it a bug? The Specific Version Details page does not mention any change to the replication system in versions 1.13.x.
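
Before starting a rolling upgrade, it may help to check which DCs would end up on opposite sides of the 1.13 boundary reported in this thread. A rough Python sketch follows. The boundary is only what commenters here observed, not something confirmed by HashiCorp, and the datacenter names and the `straddles_boundary` helper are hypothetical:

```python
def parse_version(text):
    """Turn '1.13.7' into (1, 13, 7); assumes plain numeric dotted versions."""
    return tuple(int(part) for part in text.split("."))


def straddles_boundary(dc_versions, boundary=(1, 13, 0)):
    """Return (older, newer) DC name lists when versions fall on both sides of boundary.

    Returns two empty lists when every DC is on the same side.
    """
    older = sorted(dc for dc, v in dc_versions.items() if parse_version(v) < boundary)
    newer = sorted(dc for dc, v in dc_versions.items() if parse_version(v) >= boundary)
    return (older, newer) if older and newer else ([], [])
```

For example, `straddles_boundary({"dc1": "1.13.7", "dc2": "1.12.8"})` returns `(["dc2"], ["dc1"])`, which matches the mixed-version state in which replication was seen to fail.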

garry-t commented 4 days ago

Got the same case. We upgraded 1.11.4 -> 1.12.x -> 1.13.x -> 1.14.x -> 1.15.x -> 1.16.x -> 1.17.x -> 1.19.x. At one of the intermediate versions, replication stopped working and all keys were deleted in a secondary DC.