hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

A way to force or reset consul CA root during leadership failure scenario. #6375

Open ericbrumfield opened 5 years ago

ericbrumfield commented 5 years ago

While testing and feeling out Consul, we configured Consul Connect with the Vault CA provider, and things worked well. At one point, however, we assumed that we could empty Vault and that Consul would be able to set up or change the root CA that is baked into the Raft data. Once our test Consul cluster was in this state, the servers would come online and fall into a very fast loop, failing to establish leadership with the following error repeated on the server nodes:

consul: failed to establish leadership: stored CA root "06:e7:b6:ab:8f:93:c2:50:45:bf:b1:8c:b6:75:74:8f:52:dd:47:85" is not the active root (f1:40:88:39:b7:ef:39:7e:28:ed:4d:f7:89:45:22:5f:75:06:e2:4c)
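For anyone who wants to detect this condition from server logs, here is a minimal sketch that pulls the two root IDs out of the error line above. The regular expression is based only on the single line quoted in this issue, and the function name `parse_root_mismatch` is hypothetical; other Consul versions may word the message differently.

```python
import re

# Pattern derived from the one error line quoted in this issue (an assumption:
# other Consul versions may format the message differently).
# The stored root ID is in double quotes, the active root ID is in parentheses.
ROOT_MISMATCH = re.compile(
    r'stored CA root "(?P<stored>[0-9a-f:]+)" is not the active root '
    r'\((?P<active>[0-9a-f:]+)\)'
)


def parse_root_mismatch(line):
    """Return (stored_id, active_id) if the line is the mismatch error, else None."""
    match = ROOT_MISMATCH.search(line)
    if match is None:
        return None
    return match.group("stored"), match.group("active")


# The log line reported in this issue.
LOG_LINE = (
    'consul: failed to establish leadership: stored CA root '
    '"06:e7:b6:ab:8f:93:c2:50:45:bf:b1:8c:b6:75:74:8f:52:dd:47:85" '
    'is not the active root '
    '(f1:40:88:39:b7:ef:39:7e:28:ed:4d:f7:89:45:22:5f:75:06:e2:4c)'
)

if __name__ == "__main__":
    print(parse_root_mismatch(LOG_LINE))
```

A log watcher could call `parse_root_mismatch` on each line and raise an alert as soon as it returns a pair, rather than waiting for someone to notice the leadership loop.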

After a lot of hunting through the docs and trying different ways to force a leader and get the certificate rolled or switched out, we ended up just rebuilding the 3 server nodes to fix this. I think we learned our lesson: never mess around with the Vault PKI mounts that the Consul Connect CA uses, otherwise the cluster gets into this state and it doesn't seem like you can ever bring it back online. While it's stuck electing a leader, it doesn't seem you can even work with a server node to attempt to fix the CA cert or roll it out for a new one. It's actually quite easy to end up here: all one has to do is modify the PKI mount in Vault that the Consul Connect CA is configured to use.

Are there any plans to add a way to force, expunge, or get rid of the root CA in Consul in a scenario like this, in order to get things running again and a leader elected? Possibly a way to "re-bootstrap" the Consul CA bits?

stale[bot] commented 5 years ago

Hey there, We wanted to check in on this request since it has been inactive for at least 60 days. If you think this is still an important issue in the latest version of Consul or its documentation please reply with a comment here which will cause it to stay open for investigation. If there is still no activity on this issue for 30 more days, we will go ahead and close it.

Feel free to check out the community forum as well! Thank you!

jstachowiak commented 3 years ago

This also affected us when we stopped ACL replication between datacenters by changing the primary_datacenter parameter. Consul was failing to establish leadership, complaining about the stored CA root. As a workaround we disabled Consul Connect, but this didn't resolve the underlying issue: when we enabled it again, we saw frequent failed leadership elections, this time without the error message.

Consul v1.9.0
Revision a417fe510
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

jstachowiak commented 3 years ago

I would probably consider this a bug: after you disable ACL replication, the built-in CA generates and stores a new root certificate, but the ActiveRootID still points to the primary datacenter's root certificate. This causes frequent failed leadership elections, which make it impossible to trigger a rotation in the hope of updating the active root certificate.
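To make the inconsistency described above concrete, here is a minimal sketch of a check over a CA roots document. It assumes the JSON shape returned by Consul's `/v1/connect/ca/roots` endpoint (a top-level `ActiveRootID` plus a `Roots` list whose entries carry `ID` and `Active`). The function name `find_root_problems` and the sample IDs are hypothetical, and whether the endpoint is reachable or reports this state while leadership is failing has not been verified here.

```python
def find_root_problems(roots_response):
    """Report mismatches between ActiveRootID and the roots listed alongside it.

    Assumes a dict shaped like the /v1/connect/ca/roots response:
    {"ActiveRootID": str, "Roots": [{"ID": str, "Active": bool, ...}, ...]}
    """
    active_id = roots_response.get("ActiveRootID")
    roots = roots_response.get("Roots") or []
    known_ids = {root["ID"] for root in roots}

    problems = []
    # Case 1: the pointer refers to a root that is not listed at all.
    if active_id not in known_ids:
        problems.append(f"ActiveRootID {active_id!r} is not among the listed roots")
    # Case 2: a root is flagged active but the pointer names a different root.
    for root in roots:
        if root.get("Active") and root["ID"] != active_id:
            problems.append(
                f"root {root['ID']!r} is flagged active but ActiveRootID is {active_id!r}"
            )
    return problems


# Hypothetical sample modeling the report: a new root was generated locally,
# but ActiveRootID still names the root inherited from the old primary.
SAMPLE = {
    "ActiveRootID": "old-primary-root",
    "Roots": [
        {"ID": "old-primary-root", "Active": False},
        {"ID": "new-local-root", "Active": True},
    ],
}

if __name__ == "__main__":
    for problem in find_root_problems(SAMPLE):
        print(problem)
```

An empty list means the pointer and the root flags agree. Any entry indicates the kind of stale `ActiveRootID` described in this comment.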