ebocchi opened this issue 4 years ago
We have been experiencing the same issue with a 1.7.3 -> 1.8.3 upgrade when performing it in a rolling manner.
Facing the same issue while upgrading 1.7.3 -> 1.8.3. We are unable to reproduce it in our dev environments.
I have the same problem; can anyone help?
up
Overview of the Issue
Consul was unable to elect a cluster leader when upgrading from 1.7.2 to 1.8.4 in a three-host cluster. The upgrade was performed by installing the newer version of the consul binary and restarting the service, one host at a time.
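The report does not include the exact upgrade commands; below is a minimal sketch of the per-host procedure described, assuming Consul runs as a systemd unit named `consul` and the binary is installed at `/usr/local/bin/consul` (both are assumptions, not values from this cluster):

```sh
# Rolling upgrade of one server; repeat per host, waiting for the node
# to rejoin and for a leader to be reported before moving to the next.
sudo systemctl stop consul

# Fetch and install the 1.8.4 binary (standard HashiCorp release URL;
# the install path is a common default, not confirmed by this report).
curl -sLo consul.zip \
  https://releases.hashicorp.com/consul/1.8.4/consul_1.8.4_linux_amd64.zip
unzip -o consul.zip consul
sudo mv consul /usr/local/bin/consul

sudo systemctl start consul
consul version   # confirm the node now runs 1.8.4
```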
The inability to elect a leader appeared after the first upgrade and restart. This made the Consul KV store unavailable, together with many administrative commands (e.g., `consul operator raft list-peers`, mentioned in the outage recovery guide). The `consul members` command alternated between returning an error and the list of peer servers and clients.
Rolling back the upgraded node to 1.7.2 did not fix the problem and caused the process to panic. The issue was fixed by upgrading all the server nodes to 1.8.4. At that stage, clients were still running 1.7.2 and working fine; they were progressively upgraded to 1.8.4 as well.
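Both commands referenced above are standard Consul CLI subcommands; a quick sketch of how the cluster state can be probed during such an outage (the exact output seen in this incident was not captured in the report):

```sh
# Query the Raft peer set; while no leader is elected this typically
# fails rather than listing peers.
consul operator raft list-peers

# Query the serf member list; in this incident it alternated between
# returning an error and the full list of servers and clients.
consul members
```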
Reproduction Steps
This problem was observed on one active cluster. Attempts to reproduce it on a second, testing cluster were unsuccessful. The two clusters share the same configuration and software versions but run on separate underlying infrastructure.
Operating system and Environment details
CentOS 7.8.2003 on VMs; Consul 1.7.2 upgraded to 1.8.4.
Log Fragments
Starting the new version on one of the three servers (here `server1`):
The upgraded node refutes alive messages and reports no cluster leader:
It then loops attempting to elect itself as leader but never syncs with the other servers in the cluster:
Reverting to 1.7.2 causes a panic:
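As a closing note, the manual procedure from the outage recovery guide mentioned above (not needed in this case, since upgrading all servers to 1.8.4 restored leader election) amounts to stopping every server and seeding a `raft/peers.json` file. A minimal sketch, assuming `data_dir = /opt/consul` and Raft protocol version 3; the node IDs and addresses are placeholders, not values from this cluster:

```sh
# Stop Consul on ALL servers before writing peers.json.
sudo systemctl stop consul

# peers.json for Raft protocol v3: one entry per server, using each
# node's ID (the node-id file in the data dir) and its server RPC
# address (port 8300 by default).
sudo tee /opt/consul/raft/peers.json <<'EOF'
[
  { "id": "11111111-1111-1111-1111-111111111111", "address": "10.0.0.1:8300", "non_voter": false },
  { "id": "22222222-2222-2222-2222-222222222222", "address": "10.0.0.2:8300", "non_voter": false },
  { "id": "33333333-3333-3333-3333-333333333333", "address": "10.0.0.3:8300", "non_voter": false }
]
EOF

# Restart all servers, then verify the peer set.
sudo systemctl start consul
consul operator raft list-peers
```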