reactive-tech / kubegres

Kubegres is a Kubernetes operator that deploys one or more clusters of PostgreSQL instances and manages database replication, failover and backup.
https://www.kubegres.io
Apache License 2.0

CrashLoopBackOff & ReplicaStatefulSetDeploymentTimedOutErr on replicas #106

Closed: bastoune closed this issue 2 years ago

bastoune commented 2 years ago

After following the getting-started tutorial (https://www.kubegres.io/doc/getting-started.html), the first replica of the cluster keeps restarting with status CrashLoopBackOff.

Here is the current state, from kubectl get pod,statefulset,svc,configmap,pv,pvc -o wide -n heyliot-postgres:

NAME                  READY   STATUS             RESTARTS   AGE   IP            NODE                                  NOMINATED NODE   READINESS GATES
pod/postgres-db-1-0   1/1     Running            1          59m   10.244.1.83   aks-userpool3-19642675-vmss000002     <none>           <none>
pod/postgres-db-2-0   0/1     CrashLoopBackOff   10         27m   10.244.7.32   aks-sysnodepool-27761590-vmss000008   <none>           <none>

NAME                             READY   AGE   CONTAINERS      IMAGES
statefulset.apps/postgres-db-1   1/1     59m   postgres-db-1   postgres:11.15
statefulset.apps/postgres-db-2   0/1     58m   postgres-db-2   postgres:11.15

NAME                  TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE   SELECTOR
service/postgres      LoadBalancer   10.0.110.170   20.23.0.186   5432:30942/TCP   24m   app=postgres-db
service/postgres-db   ClusterIP      None           <none>        5432/TCP         58m   app=postgres-db,replicationRole=primary

NAME                             DATA   AGE
configmap/base-kubegres-config   7      59m
configmap/kube-root-ca.crt       1      63m

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                          STORAGECLASS   REASON   AGE    VOLUMEMODE
persistentvolume/pvc-47937ca1-a31c-4e84-a834-79d31ddb2b93   1Gi        RWO            Delete           Bound    heyliot-postgres/postgres-db-postgres-db-2-0   default                 58m    Filesystem
persistentvolume/pvc-be4c082e-a4f0-45f6-bae1-95dd3cc0fd87   1Gi        RWO            Delete           Bound    heyliot-postgres/postgres-db-postgres-db-1-0   default                 59m    Filesystem

NAME                                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE   VOLUMEMODE
persistentvolumeclaim/postgres-db-postgres-db-1-0   Bound    pvc-be4c082e-a4f0-45f6-bae1-95dd3cc0fd87   1Gi        RWO            default        59m   Filesystem
persistentvolumeclaim/postgres-db-postgres-db-2-0   Bound    pvc-47937ca1-a31c-4e84-a834-79d31ddb2b93   1Gi        RWO            default        58m   Filesystem

kubectl get events -n heyliot-postgres

LAST SEEN   TYPE      REASON                                    OBJECT                                                            MESSAGE
28m         Normal    Scheduled                                 pod/cm-acme-http-solver-mcjhp                                     Successfully assigned heyliot-postgres/cm-acme-http-solver-mcjhp to aks-userpool3-19642675-vmss000002
28m         Normal    Pulling                                   pod/cm-acme-http-solver-mcjhp                                     Pulling image "quay.io/jetstack/cert-manager-acmesolver:v1.3.1"
28m         Normal    Pulled                                    pod/cm-acme-http-solver-mcjhp                                     Successfully pulled image "quay.io/jetstack/cert-manager-acmesolver:v1.3.1" in 2.081205621s
28m         Normal    Created                                   pod/cm-acme-http-solver-mcjhp                                     Created container acmesolver
28m         Normal    Started                                   pod/cm-acme-http-solver-mcjhp                                     Started container acmesolver
27m         Normal    Killing                                   pod/cm-acme-http-solver-mcjhp                                     Stopping container acmesolver
27m         Normal    Sync                                      ingress/cm-acme-http-solver-pspp4                                 Scheduled for sync
27m         Normal    Sync                                      ingress/cm-acme-http-solver-pspp4                                 Scheduled for sync
60m         Normal    Pulled                                    pod/postgres-db-2-0                                               Container image "postgres:11.15" already present on machine
60m         Normal    Created                                   pod/postgres-db-2-0                                               Created container postgres-db-2
60m         Normal    Started                                   pod/postgres-db-2-0                                               Started container postgres-db-2
31m         Warning   BackOff                                   pod/postgres-db-2-0                                               Back-off restarting failed container
30m         Normal    Scheduled                                 pod/postgres-db-2-0                                               Successfully assigned heyliot-postgres/postgres-db-2-0 to aks-sysnodepool-27761590-vmss000008
30m         Normal    Pulled                                    pod/postgres-db-2-0                                               Container image "postgres:11.15" already present on machine
30m         Normal    Created                                   pod/postgres-db-2-0                                               Created container setup-replica-data-directory
30m         Normal    Started                                   pod/postgres-db-2-0                                               Started container setup-replica-data-directory
29m         Normal    Pulled                                    pod/postgres-db-2-0                                               Container image "postgres:11.15" already present on machine
29m         Normal    Created                                   pod/postgres-db-2-0                                               Created container postgres-db-2
29m         Normal    Started                                   pod/postgres-db-2-0                                               Started container postgres-db-2
33s         Warning   BackOff                                   pod/postgres-db-2-0                                               Back-off restarting failed container
30m         Normal    SuccessfulCreate                          statefulset/postgres-db-2                                         create Pod postgres-db-2-0 in StatefulSet postgres-db-2 successful
30m         Normal    BlockingOperationTimedOut                 kubegres/postgres-db                                              Blocking-Operation timed-out. 'OperationId': Replica DB count spec enforcement, 'StepId': Replica DB is deploying
30m         Warning   ReplicaStatefulSetDeploymentTimedOutErr   kubegres/postgres-db                                              Last deployment attempt of a Replica DB StatefulSet has timed-out after 300 seconds. The new Replica DB is still NOT ready. It must be fixed manually. Until the ReplicaDB is ready, most of the features of Kubegres are disabled for safety reason.  'Replica DB StatefulSet to fix': postgres-db-2 - Replica DB StatefulSet deployment timed-out
14m         Normal    Type                                      service/postgres                                                  NodePort -> LoadBalancer
14m         Normal    EnsuringLoadBalancer                      service/postgres                                                  Ensuring load balancer
14m         Normal    EnsuredLoadBalancer                       service/postgres                                                  Ensured load balancer
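
For anyone triaging a similar state, here is a minimal sketch of a common next step: list the pods whose containers are not all ready, then read the logs of the crashed container. The helper names `not_ready_pods` and `crash_logs` are hypothetical and not part of Kubegres; `setup-replica-data-directory` is the init container named in the events above, and `--previous` and `-c` are standard `kubectl logs` flags.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints
# the names of pods whose READY column (e.g. 0/1) shows fewer ready
# containers than expected. The header row is skipped.
not_ready_pods() {
  awk 'NR > 1 && NF >= 2 { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Hypothetical helper: prints the logs of the previous (crashed) run of a
# pod's main container, then the logs of the init container named in the
# events. Arguments: pod name, namespace.
crash_logs() {
  kubectl logs "$1" -n "$2" --previous
  kubectl logs "$1" -n "$2" -c setup-replica-data-directory
}

# Usage (needs cluster access, so it is left commented out):
#   kubectl get pods -n heyliot-postgres | not_ready_pods
#   crash_logs postgres-db-2-0 heyliot-postgres
```

The log output of the failing container is usually the quickest way to see why a replica never becomes ready.
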

bastoune commented 2 years ago

The cause was the PostgreSQL version I was using (11.15). I switched to PostgreSQL 14 and it works like a charm now.
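
As a rough illustration of that fix, the sketch below prints a Kubegres manifest with the image tag as a parameter. The resource name, namespace, volume size and replica count are taken from the output earlier in this thread; the helper name `kubegres_manifest`, the Secret name `postgres-secret` and its keys are hypothetical placeholders. It assumes a fresh deployment: a data volume initialized by PostgreSQL 11 cannot be started directly by PostgreSQL 14.

```shell
#!/bin/sh
# Hypothetical helper: prints a Kubegres manifest for the given PostgreSQL
# image (default: postgres:14). The Secret name and keys are placeholders
# and must match a Secret that exists in the namespace.
kubegres_manifest() {
  image="${1:-postgres:14}"
  cat <<EOF
apiVersion: kubegres.reactive-tech.io/v1
kind: Kubegres
metadata:
  name: postgres-db
  namespace: heyliot-postgres
spec:
  replicas: 2
  image: ${image}
  database:
    size: 1Gi
  env:
    - name: POSTGRES_PASSWORD
      valueFrom:
        secretKeyRef:
          name: postgres-secret
          key: superUserPassword
    - name: POSTGRES_REPLICATION_PASSWORD
      valueFrom:
        secretKeyRef:
          name: postgres-secret
          key: replicationUserPassword
EOF
}

# Usage (needs cluster access, so it is left commented out):
#   kubegres_manifest postgres:14 | kubectl apply -f -
```

Keeping the image tag as an argument makes it easy to try another version without editing the manifest by hand.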