reactive-tech / kubegres

Kubegres is a Kubernetes operator that deploys one or more clusters of PostgreSQL instances and manages database replication, failover and backup.
https://www.kubegres.io
Apache License 2.0

Wrong PVC node #137

Open coderazzi opened 2 years ago

coderazzi commented 2 years ago

Hi, I have been using kubegres 1.15 since January on an AWS EKS cluster, with a one-replica topology (backups go to an EFS volume). In August, after a faulty EKS upgrade, I had to restart the whole cluster and manually recovered my databases from a previous backup. This created two pods: db-kubegres-1-0 (main) and db-kubegres-2-0 (replica). At some point the main instance failed, the replica was promoted, and a new replica was created.

Curiously, db-kubegres-3-0 was created on one node (ip-192-168-12-87.eu-west-2.compute.internal), but the associated PVC (postgres-db-db-kubegres-3-0) was created with the following metadata annotation: volume.kubernetes.io/selected-node: ip-192-168-69-151.eu-west-2.compute.internal

So the annotation points to the wrong node. In fact, this node is not in my cluster at all, although I assume it refers to a node that belonged to my cluster at some earlier point.

The problem is that the PVC then spawned a PV in a zone (eu-west-2c) different from the zone where the pod was scheduled (eu-west-2a). As a result, the pod failed to start: 0/2 nodes are available: 2 node(s) had volume node affinity conflict.

Removing the pod just recreated it, reusing the same PVC and failing in the same way. I had to manually recreate the PVC with the correct annotation and then restart the pod to get it working again.
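
For anyone hitting the same symptom, here is a minimal Python sketch of the check described above: it reports whether a PVC's volume.kubernetes.io/selected-node annotation names a node that no longer exists in the cluster. It is only an illustration, not part of kubegres. The function name find_stale_selected_node is hypothetical, and the sketch assumes the inputs are the parsed JSON of `kubectl get pvc <name> -o json` and the `items` list of `kubectl get nodes -o json`.

```python
from typing import Optional

# Annotation quoted in the report above.
SELECTED_NODE_ANNOTATION = "volume.kubernetes.io/selected-node"


def find_stale_selected_node(pvc: dict, nodes: list) -> Optional[str]:
    """Return the annotated node name if it is not a current node, else None.

    pvc   -- parsed JSON of one PVC (assumed: `kubectl get pvc <name> -o json`)
    nodes -- the `items` list from `kubectl get nodes -o json` (assumed)
    """
    annotations = pvc.get("metadata", {}).get("annotations") or {}
    selected = annotations.get(SELECTED_NODE_ANNOTATION)
    if selected is None:
        # No node has been recorded on this PVC, so nothing can be stale.
        return None
    current_names = {node["metadata"]["name"] for node in nodes}
    return None if selected in current_names else selected


if __name__ == "__main__":
    # Values taken from the report; the second node of the cluster is omitted.
    pvc = {
        "metadata": {
            "name": "postgres-db-db-kubegres-3-0",
            "annotations": {
                SELECTED_NODE_ANNOTATION:
                    "ip-192-168-69-151.eu-west-2.compute.internal",
            },
        }
    }
    nodes = [
        {"metadata": {"name": "ip-192-168-12-87.eu-west-2.compute.internal"}},
    ]
    stale = find_stale_selected_node(pvc, nodes)
    if stale is not None:
        print(f"PVC is pinned to a node that is not in the cluster: {stale}")
```

A non-None result only flags the stale annotation; it does not repair anything. The PVC still has to be recreated by hand, as described above.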