percona / percona-server-mongodb-operator

Percona Operator for MongoDB
https://www.percona.com/doc/kubernetes-operator-for-psmongodb/
Apache License 2.0

K8SPSMDB-733: Improve failure detection for PVC resize #1543

Closed egegunes closed 2 months ago

egegunes commented 2 months ago


CHANGE DESCRIPTION

The operator was reporting that a PVC resize had completed successfully even when it had failed. This commit fixes that behavior and introduces some improvements.

The operator determines whether a resize operation is in progress by checking the status.conditions of the PVCs. If a condition says the PVC is resizing, the operation is in progress; if there is no condition, the operation is assumed to have completed successfully. This approach is naive because the resize could have failed. With these changes, the operator also checks the events for the PVC and looks for a VolumeResizeFailed event. If one exists, the resize operation is declared failed.
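The decision described above can be sketched roughly as follows. This is a minimal illustration, not the operator's code: `PVC`, `Event`, `ResizeState`, and `resizeState` are hypothetical stand-ins for the Kubernetes API types, and giving the failure event precedence over the condition check is an assumption. Only the `Resizing` condition type and the `VolumeResizeFailed` event reason come from Kubernetes itself.

```go
package main

import "fmt"

// ResizeState is a hypothetical three-way outcome for a PVC resize.
type ResizeState string

const (
	ResizeInProgress ResizeState = "InProgress"
	ResizeFailed     ResizeState = "Failed"
	ResizeSucceeded  ResizeState = "Succeeded"
)

// PVC is a simplified stand-in for a PersistentVolumeClaim (assumption).
type PVC struct {
	Name       string
	Conditions []string // condition types taken from status.conditions
}

// Event is a simplified stand-in for a Kubernetes event (assumption).
type Event struct {
	InvolvedPVC string // name of the PVC the event refers to
	Reason      string
}

// resizeState classifies a resize: a VolumeResizeFailed event for the PVC
// means failure, a Resizing condition means the operation is still running,
// and the absence of both is treated as success.
func resizeState(pvc PVC, events []Event) ResizeState {
	for _, e := range events {
		if e.InvolvedPVC == pvc.Name && e.Reason == "VolumeResizeFailed" {
			return ResizeFailed
		}
	}
	for _, c := range pvc.Conditions {
		if c == "Resizing" {
			return ResizeInProgress
		}
	}
	return ResizeSucceeded
}

func main() {
	pvc := PVC{Name: "mongod-data-rs0-0"}
	events := []Event{{InvolvedPVC: "mongod-data-rs0-0", Reason: "VolumeResizeFailed"}}
	fmt.Println(resizeState(pvc, events))
}
```

Without the event check, the PVC in `main` would be classified as succeeded because it carries no condition, which is the false positive this change addresses.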

If a resize operation fails, the operator automatically reverts the PVC size in cr.yaml. This lowers the probability of a discrepancy between CR <-> STS <-> PVC. If the resize fails for only some PVCs, the operator still reverts the PVC size in cr.yaml to the size of the failed PVCs.
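The revert rule can be sketched in the same spirit. Again this is a hedged illustration under stated assumptions: `pvcResult` and `sizeAfterResize` are hypothetical names, sizes are kept as plain strings rather than Kubernetes quantities, and picking the first failed PVC's size is an assumption about how ties would be resolved.

```go
package main

import "fmt"

// pvcResult is a hypothetical summary of one PVC after a resize attempt.
type pvcResult struct {
	Name        string
	Failed      bool
	CurrentSize string // the size the PVC still has, e.g. "1Gi"
}

// sizeAfterResize returns the size that should be stored in the CR.
// If any PVC failed to resize, the CR is reverted to that PVC's current
// size, even when other PVCs were resized successfully.
func sizeAfterResize(requested string, results []pvcResult) (size string, reverted bool) {
	for _, r := range results {
		if r.Failed {
			return r.CurrentSize, true
		}
	}
	return requested, false
}

func main() {
	results := []pvcResult{
		{Name: "mongod-data-rs0-0", Failed: false, CurrentSize: "2Gi"},
		{Name: "mongod-data-rs0-1", Failed: true, CurrentSize: "1Gi"},
	}
	size, reverted := sizeAfterResize("2Gi", results)
	fmt.Println(size, reverted)
}
```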

CHECKLIST

Jira

Tests

Config/Logging/Testability

JNKPercona commented 2 months ago
Test name Status
arbiter passed
balancer passed
custom-replset-name passed
cross-site-sharded passed
data-at-rest-encryption passed
data-sharded passed
demand-backup passed
demand-backup-eks-credentials passed
demand-backup-physical passed
demand-backup-physical-sharded passed
demand-backup-sharded passed
expose-sharded passed
ignore-labels-annotations passed
init-deploy passed
finalizer passed
ldap passed
ldap-tls passed
limits passed
liveness passed
mongod-major-upgrade passed
mongod-major-upgrade-sharded passed
monitoring-2-0 passed
multi-cluster-service failure
non-voting passed
one-pod passed
operator-self-healing-chaos passed
pitr passed
pitr-sharded passed
pitr-physical passed
pvc-resize passed
recover-no-primary passed
rs-shard-migration passed
scaling passed
scheduled-backup passed
security-context passed
self-healing-chaos passed
service-per-pod passed
serviceless-external-nodes passed
smart-update passed
split-horizon passed
storage passed
tls-issue-cert-manager passed
upgrade passed
upgrade-consistency passed
upgrade-consistency-sharded-tls passed
upgrade-sharded passed
users passed
version-service passed
We ran 48 out of 48 tests.

commit: https://github.com/percona/percona-server-mongodb-operator/pull/1543/commits/c8a64a7855fb50a5cda895a4ab70e119af59ae40
image: perconalab/percona-server-mongodb-operator:PR-1543-c8a64a78