hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul Upgrade with Replicate Results in Missing KVs #8351

Closed alkalinecoffee closed 3 years ago

alkalinecoffee commented 4 years ago

Overview of the Issue

We have three datacenters running 1.7.2:

us-west-2 (primary DC)
us-east-1 (runs consul-replicate)
us-east-2 (runs consul-replicate)

We were hoping to upgrade to 1.7.3 to avoid the bug described at https://github.com/hashicorp/consul/issues/7396.

We use consul-replicate v0.4.0 (886abcc) to replicate a subset of keys from us-west-2 into the other datacenters (i.e. consul-replicate -prefix "apps@us-west-2").

  1. We created a new 3-node stack in us-east-1 running 1.7.3 and joined it to the existing 3-node 1.7.2 cluster.
  2. Once the new nodes joined the cluster, we noticed that the services we run that use consul-template began to fail with invalid configurations (null/missing KVs, etc.).
  3. We then immediately deactivated the new stack, reverting the cluster to 1.7.2.

Upon investigation, we noticed that the /apps folder in us-east-1 no longer appeared in the UI, yet consul-replicate was still logging the following lines to syslog:

2020/07/21 15:24:37.393702 [DEBUG] (runner) skipping because "apps/monitor/build/healthcheck-path" is already replicated
2020/07/21 15:24:37.393717 [DEBUG] (runner) skipping because "apps/monitor/build/https-enabled" is already replicated
2020/07/21 15:24:37.393731 [DEBUG] (runner) skipping because "apps/monitor/build/metrics-path" is already replicated

To clear this odd state out, we tried deleting the key path in us-east-1:

consul kv delete /apps
Success! Deleted key: apps
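
One detail worth checking in the output above: "Deleted key: apps" is what the CLI prints when it removes a single key, so this command may not have touched anything stored under the apps/ prefix. Below is a minimal sketch of a recursive delete instead. The helper names `normalize_prefix` and `delete_prefix` and the `DRY_RUN` variable are hypothetical and exist only for illustration; `-recurse` is the `consul kv delete` flag for removing every key under a prefix.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and DRY_RUN are illustrative, not part of Consul.

# Turn "/apps" or "apps" into "apps/" so the prefix matches only that folder.
normalize_prefix() {
  local p="${1#/}"   # drop one leading slash, if present
  p="${p%/}"         # drop one trailing slash, if present
  if [ -z "$p" ]; then
    # An empty prefix would match the whole KV store, so refuse it.
    echo "refusing to use an empty prefix" >&2
    return 1
  fi
  printf '%s/\n' "$p"
}

# Delete every key under a prefix.
# With DRY_RUN=1 the command is printed instead of executed.
delete_prefix() {
  local prefix
  prefix="$(normalize_prefix "$1")" || return 1
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "consul kv delete -recurse $prefix"
  else
    consul kv delete -recurse "$prefix"
  fi
}

# Example: preview the command for the folder from this report.
DRY_RUN=1 delete_prefix /apps
```

Previewing with `DRY_RUN=1` first is a deliberate choice here, since a recursive delete on the wrong prefix is hard to undo without a backup.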

We restarted consul-replicate, but the same "already replicated" messages appeared in the logs. We ended up re-importing the KVs from a backup file, which got us back to a healthy state.
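
For anyone trying to confirm a similar state, a rough way to see which keys are actually missing, independent of the UI and the consul-replicate log, is to list the keys in both datacenters and compare the lists. The sketch below is an assumption-laden illustration, not something from the original report: `missing_keys` and `list_keys` are hypothetical helper names, and `list_keys` assumes that `consul kv get -keys -separator=""` lists every key under the prefix and that the local agent can reach both datacenters.

```shell
#!/usr/bin/env bash
# Sketch only: function names are illustrative, not part of Consul.

# Print keys that appear in the source list but not in the replica list.
# Both arguments are files with one key per line.
missing_keys() {
  local src_sorted dst_sorted
  src_sorted="$(mktemp)"
  dst_sorted="$(mktemp)"
  LC_ALL=C sort -u "$1" > "$src_sorted"
  LC_ALL=C sort -u "$2" > "$dst_sorted"
  LC_ALL=C comm -23 "$src_sorted" "$dst_sorted"   # lines unique to the source
  rm -f "$src_sorted" "$dst_sorted"
}

# List every key under a prefix in one datacenter (needs a reachable agent).
# Assumption: an empty separator makes -keys return the full key names.
list_keys() {
  consul kv get -keys -separator="" -datacenter="$1" "$2"
}

# Usage against the datacenters in this report (not run here):
#   list_keys us-west-2 apps/ > primary.txt
#   list_keys us-east-1 apps/ > replica.txt
#   missing_keys primary.txt replica.txt
```

An empty result from `missing_keys` would suggest the replica holds every key the primary does; any printed key is one that replication skipped or that was deleted afterwards.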

Operating system and Environment details

Amazon Linux 1, EC2

ChipV223 commented 3 years ago

Hi @alkalinecoffee !

I just tried to repro this issue on my end and after the upgrade of the first DC, the Consul Replicate process running in DC2 was still operational. Here is how I set up my repro:

I'll close this for now since it's been a while since the last response and I haven't been able to repro this behavior. Feel free to drop a comment if you are still seeing it, and I can reopen and look into it further.