percona / percona-server-mongodb-operator

Percona Operator for MongoDB
https://www.percona.com/doc/kubernetes-operator-for-psmongodb/
Apache License 2.0

K8SPSMDB-755: Fix enabling/disabling TLS in a running cluster #1536

Closed egegunes closed 2 months ago

egegunes commented 3 months ago


CHANGE DESCRIPTION

This commit fixes two problems:

  1. --sslAllowInvalidCertificates is not removed from mongos args if TLS is disabled.
  2. Disabling or enabling TLS on a running cluster is not working.

The fix for 1 was straightforward, since it was an oversight. The fix for 2 is another story...

First of all, disabling TLS means changing spec.tls.mode to disabled from one of allowTLS, preferTLS, or requireTLS. Enabling TLS means changing spec.tls.mode from disabled to one of the *TLS options.
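The definition above can be sketched as a small classifier. This is only an illustration, not the operator's code: the type tlsChange and the function classifyTLSChange are hypothetical names, and the sketch assumes both inputs are already valid spec.tls.mode values.

```go
package main

import "fmt"

// tlsChange is a hypothetical type describing how spec.tls.mode changed.
type tlsChange string

const (
	noChange   tlsChange = "none"
	enableTLS  tlsChange = "enable"
	disableTLS tlsChange = "disable"
)

// classifyTLSChange compares the previous and the new spec.tls.mode.
// Only transitions to or from "disabled" count as enabling or disabling;
// moving between allowTLS, preferTLS and requireTLS is treated as no change.
func classifyTLSChange(oldMode, newMode string) tlsChange {
	oldOff := oldMode == "disabled"
	newOff := newMode == "disabled"
	switch {
	case !oldOff && newOff:
		return disableTLS
	case oldOff && !newOff:
		return enableTLS
	default:
		return noChange
	}
}

func main() {
	fmt.Println(classifyTLSChange("preferTLS", "disabled"))   // disable
	fmt.Println(classifyTLSChange("disabled", "requireTLS"))  // enable
	fmt.Println(classifyTLSChange("allowTLS", "requireTLS"))  // none
}
```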

Enabling and disabling didn't work because the change creates a discrepancy between the TLS mode used by the mongod processes running in the cluster and the one set in cr.yaml. When the user changes spec.tls.mode, the operator immediately treats the new value as the effective one and decides whether to use TLS based on it. This causes connection errors, since the operator can conclude from cr.yaml that TLS is disabled even though the running processes still enforce it.

To fix this issue, the operator now detects the effective TLS mode by looking at the running container args and decides whether to use TLS based on the --tlsMode flag of the mongod process.
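A minimal sketch of that detection, assuming the container args are available as a plain string slice. The function name effectiveTLSMode is hypothetical and the code in this PR may differ; the sketch accepts both the "--tlsMode=value" and the "--tlsMode value" forms and reports whether the flag was present at all, leaving the caller to decide what a missing flag means.

```go
package main

import (
	"fmt"
	"strings"
)

// effectiveTLSMode scans mongod container args for the --tlsMode flag.
// It returns the flag's value and true when the flag is present,
// or an empty string and false when it is not.
func effectiveTLSMode(args []string) (string, bool) {
	const flag = "--tlsMode"
	for i, arg := range args {
		// Form 1: --tlsMode=preferTLS
		if strings.HasPrefix(arg, flag+"=") {
			return strings.TrimPrefix(arg, flag+"="), true
		}
		// Form 2: --tlsMode preferTLS (value in the next element)
		if arg == flag && i+1 < len(args) {
			return args[i+1], true
		}
	}
	return "", false
}

func main() {
	// Illustrative args only; a real mongod container has many more.
	args := []string{"--bind_ip_all", "--tlsMode=preferTLS"}
	if mode, ok := effectiveTLSMode(args); ok {
		fmt.Println("effective TLS mode:", mode)
		fmt.Println("use TLS:", mode != "disabled")
	}
}
```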

But this doesn't fix all the problems...

During a smart update, the operator opens a PBM connection to the database to check whether an operation is running. In a sharded cluster, PBM first opens a connection to the replset to read configsvrConnectionString from the system.version collection, and then opens a connection to the config server replset, since PBM's collections are stored there. When the user sets spec.tls.mode to disabled, the operator first updates the config server statefulset and sets --tlsMode=disabled. Then it updates the rs0 statefulset, and during the smart update it correctly detects that the effective TLS mode is not disabled and uses TLS certificates to open the PBM connection. But then PBM tries to open a connection to the config server using the same TLS certificates and boom! The connection fails.

To fix this issue, we decided to restart the whole cluster (pause and unpause) when the user enables or disables TLS, so all pods are terminated first and then recreated with the new TLS mode.

To communicate the need for a full cluster restart, we decided to introduce a new annotation: percona.com/restart-cluster. When the operator detects that the TLS mode has changed to or from disabled, it annotates the CR with percona.com/restart-cluster and percona.com/update-mongos-first (required so that mongos pods are terminated first). After the cluster reaches the paused state, the operator removes these annotations.
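The annotation lifecycle could look roughly like the sketch below. It works on a plain map rather than a real CR object, the helper names requestRestart and clearRestart are hypothetical, and the annotation value "true" is an assumption, since the description does not say what value the operator writes.

```go
package main

import "fmt"

const (
	restartClusterAnnotation    = "percona.com/restart-cluster"
	updateMongosFirstAnnotation = "percona.com/update-mongos-first"
)

// requestRestart returns a copy of the CR annotations with both restart
// annotations added. The value "true" is an assumption for illustration.
func requestRestart(annotations map[string]string) map[string]string {
	out := make(map[string]string, len(annotations)+2)
	for k, v := range annotations {
		out[k] = v
	}
	out[restartClusterAnnotation] = "true"
	out[updateMongosFirstAnnotation] = "true"
	return out
}

// clearRestart returns a copy of the annotations without the two restart
// annotations. It is meant to be called once the cluster is paused.
func clearRestart(annotations map[string]string) map[string]string {
	out := make(map[string]string, len(annotations))
	for k, v := range annotations {
		if k == restartClusterAnnotation || k == updateMongosFirstAnnotation {
			continue
		}
		out[k] = v
	}
	return out
}

func main() {
	// "example.com/owner" is a made-up, unrelated annotation.
	ann := map[string]string{"example.com/owner": "dba"}

	ann = requestRestart(ann) // TLS mode changed to or from disabled
	fmt.Println(len(ann))     // 3

	ann = clearRestart(ann) // cluster reached the paused state
	fmt.Println(len(ann))   // 1
}
```

Returning copies instead of mutating the input keeps the sketch free of side effects; an operator would more likely patch the CR object directly.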

CHECKLIST

Jira

Tests

Config/Logging/Testability

JNKPercona commented 2 months ago
Test name Status
arbiter passed
balancer passed
custom-replset-name passed
cross-site-sharded passed
data-at-rest-encryption passed
data-sharded passed
demand-backup passed
demand-backup-eks-credentials passed
demand-backup-physical passed
demand-backup-physical-sharded passed
demand-backup-sharded passed
expose-sharded passed
ignore-labels-annotations passed
init-deploy passed
finalizer passed
ldap passed
ldap-tls passed
limits passed
liveness passed
mongod-major-upgrade passed
mongod-major-upgrade-sharded passed
monitoring-2-0 passed
multi-cluster-service passed
non-voting passed
one-pod passed
operator-self-healing-chaos passed
pitr passed
pitr-sharded passed
pitr-physical passed
pvc-resize passed
recover-no-primary passed
rs-shard-migration passed
scaling passed
scheduled-backup passed
security-context passed
self-healing-chaos passed
service-per-pod passed
serviceless-external-nodes passed
smart-update passed
split-horizon passed
storage passed
tls-issue-cert-manager passed
upgrade passed
upgrade-consistency passed
upgrade-consistency-sharded-tls passed
upgrade-sharded passed
users passed
version-service passed
We run 48 out of 48

commit: https://github.com/percona/percona-server-mongodb-operator/pull/1536/commits/4cc529f8d01fab946e8038eb4ad1d1e7fce2068d
image: perconalab/percona-server-mongodb-operator:PR-1536-4cc529f8

egegunes commented 2 months ago

Supporting enabling/disabling TLS automatically is not worth this complexity. Closing the ticket.