hashicorp / consul-template

Template rendering, notifier, and supervisor for @HashiCorp Consul and Vault data.
https://www.hashicorp.com/
Mozilla Public License 2.0

Simultaneous SSL update on all instances. #1704

Open scor2k opened 1 year ago

scor2k commented 1 year ago

Hi!

We use consul-template with Vault PKI to issue SSL certificates for a MySQL Galera cluster. While testing with a short certificate TTL (15m), the Galera cluster crashed because all nodes regenerated their certificates at the same time (our reload script sends ALTER INSTANCE RELOAD TLS; each time a new certificate is rendered).

We also hit the same issue with an Apache Kafka cluster (with SSL), where the TTL was 7 days. Admittedly, it happened only once in a month, but it did happen.

As a workaround, we offset the TTL by 1 day for each successive node. That reduces the chance of a collision, but it is not a real fix.

My question is simple: is there a distributed lock (via Consul) that would prevent all instances from updating their certificates at the same time?

mysqld config x 3 instances

$ cat my.cnf
[client]
port = 33306
socket = /tmp/mysql.sock
default-character-set = utf8

[mysqld]
pxc_encrypt_cluster_traffic=ON
user = mysql
ssl-ca = /opt/mysql/tls/server/server-ca.pem
ssl-cert = /opt/mysql/tls/server/server-cert.pem
ssl-key = /opt/mysql/tls/server/server-key.pem
...

consul-template configs x 3 instances

$ cat conf/consul-template.hcl
vault {
  address = "https://127.0.0.1:8200"

  unwrap_token = false
  renew_token  = true

  lease_renewal_threshold = 0.5

  ssl {
    enabled = true
    verify = true
    ca_path = "/opt/consul-template/tls/server-CA.cert"
    cert = "/opt/consul-template/tls/consul-template.cert"
    key = "/opt/consul-template/tls/consul-template.key"
    server_name = "127.0.0.1"
  }
}

# MYSQL
template {
  source = "/opt/consul-template/templates/mysql/server-ca.pem.tpl"
  destination = "/opt/mysql/tls/server/server-ca.pem"
  perms = 0640
  command = "/opt/consul-template/templates/mysql/reload.sh"
  error_on_missing_key = true
  left_delimiter  = "[["
  right_delimiter = "]]"
}
template {
  source = "/opt/consul-template/templates/mysql/server-cert.pem.tpl"
  destination = "/opt/mysql/tls/server/server-cert.pem"
  perms = 0640
  command = "/opt/consul-template/templates/mysql/reload.sh"
  error_on_missing_key = true
  left_delimiter  = "[["
  right_delimiter = "]]"
}
template {
  source = "/opt/consul-template/templates/mysql/server-key.pem.tpl"
  destination = "/opt/mysql/tls/server/server-key.pem"
  perms = 0640
  command = "/opt/consul-template/templates/mysql/reload.sh"
  error_on_missing_key = true
  left_delimiter  = "[["
  right_delimiter = "]]"
}

set -eo pipefail

STATUS=0

if [ -f '/opt/mysql/current/bin/mysql' -a -S '/tmp/mysql.sock' ]; then
  echo "ALTER INSTANCE RELOAD TLS;" | /opt/mysql/current/bin/mysql -u root -p'super-secure-password'
  STATUS=$?
fi

exit $STATUS

komapa commented 1 year ago

Why not add sleep $((1 + $RANDOM % 360)); to your mysql/reload.sh command? (Obviously, you can adjust the 360 seconds to whatever suits your needs.) Your certificates will be rotated before they expire, so you do not have to reload right away when the new ones are generated.
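A minimal sketch of that jitter idea, assuming bash. The helper names jitter_seconds and run_with_jitter are made up for illustration and are not part of consul-template. Keep in mind that consul-template applies a timeout to template commands (the command_timeout setting in the template block, as far as I know), so the jitter window would have to fit inside it.

```shell
#!/usr/bin/env bash
# Sketch only: add a random delay before running the reload command.
# jitter_seconds and run_with_jitter are hypothetical helper names.

# Print a random integer in [1, max]; bash's $RANDOM yields 0..32767.
jitter_seconds() {
  local max="${1:-360}"
  echo $(( 1 + RANDOM % max ))
}

# Sleep for a random delay of at most <max> seconds, then run the command.
run_with_jitter() {
  local max="$1"
  shift
  sleep "$(jitter_seconds "$max")"
  "$@"
}

# Example (not executed here):
#   run_with_jitter 360 /opt/consul-template/templates/mysql/reload.sh
```

This only spreads the reloads out in time; two nodes can still draw nearby delays, so it lowers the odds of overlap rather than ruling it out.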

scor2k commented 1 year ago

> Why not add sleep $((1 + $RANDOM % 360)); to your mysql/reload.sh command? (Obviously, you can adjust the 360 seconds to whatever suits your needs.) Your certificates will be rotated before they expire, so you do not have to reload right away when the new ones are generated.

Thank you for your reply, @komapa. Yes, it's a possible solution, but it guarantees nothing. We did something similar by offsetting each successive node's TTL by one unit (an hour, a day), but that also won't protect us in a case of bad luck.

komapa commented 1 year ago

I think you pretty much answered this yourself: use a Consul lock.

command = "consul lock -child-exit-code /some/consul/path/prefix /opt/consul-template/templates/mysql/reload.sh"

See: https://developer.hashicorp.com/consul/commands/lock
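If you would rather keep the lock logic in a script than in the template command, here is a rough sketch of a wrapper, assuming bash and a consul binary on the PATH. LOCK_PREFIX, CONSUL_BIN, and locked_run are names invented for this example; only consul lock and its -child-exit-code flag come from the command above.

```shell
#!/usr/bin/env bash
# Sketch only: serialize a command across nodes with `consul lock`.
# LOCK_PREFIX, CONSUL_BIN and locked_run are hypothetical names.

LOCK_PREFIX="${LOCK_PREFIX:-/some/consul/path/prefix}"
CONSUL_BIN="${CONSUL_BIN:-consul}"

# Run the given command while holding the lock under LOCK_PREFIX.
# -child-exit-code makes `consul lock` return the child's exit status.
locked_run() {
  "$CONSUL_BIN" lock -child-exit-code "$LOCK_PREFIX" "$@"
}

# Example (not executed here):
#   locked_run /opt/consul-template/templates/mysql/reload.sh
```

Note that the lock only covers the time the child command runs. It is released as soon as reload.sh exits, so if a node needs time to settle after ALTER INSTANCE RELOAD TLS, the script probably has to wait for that before returning. The lock also serializes the reload step only, not the certificate rendering itself.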