reactive-tech / kubegres

Kubegres is a Kubernetes operator that deploys one or many clusters of PostgreSQL instances and manages database replication, failover and backup.
https://www.kubegres.io
Apache License 2.0

Bug in expanding database storage. #39

Closed tgates-nalej closed 3 years ago

tgates-nalej commented 3 years ago

I am following the getting started example, using an expandable K8s StorageClass on AWS:

 % cat ~/gp2-expandable-storage-class.yaml
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp2-expandable
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
reclaimPolicy: Delete
allowVolumeExpansion: true
mountOptions:
  - debug
 % k get sc gp2-expandable
NAME                       PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
gp2-expandable (default)   kubernetes.io/aws-ebs   Delete          Immediate           true                   93m

I created the Kubegres cluster with the following:

apiVersion: kubegres.reactive-tech.io/v1
kind: Kubegres
metadata:
  name: mypostgres
  namespace: default

spec:

   replicas: 3
   image: postgres:13.2

   database:
      storageClassName: gp2-expandable
      size: 200Mi

   env:
      - name: POSTGRES_PASSWORD
        valueFrom:
           secretKeyRef:
              name: mypostgres-secret
              key: superUserPassword

      - name: POSTGRES_REPLICATION_PASSWORD
        valueFrom:
           secretKeyRef:
              name: mypostgres-secret
              key: replicationUserPassword

I then created a database and a table, and inserted some rows into the table.

Then I edited the Kubegres CR, changing spec.database.size from 200Mi to 10Gi.
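
For anyone who wants to script the same change instead of editing the CR by hand, here is a minimal Python sketch that builds an equivalent `kubectl patch` command. It assumes the custom resource can be addressed as `kubegres` and that a JSON merge patch is acceptable for it; the helper name `size_patch_command` is hypothetical, and the sketch only prints the command rather than running it.

```python
import json
import shlex


def size_patch_command(name: str, namespace: str, new_size: str) -> list[str]:
    """Build a `kubectl patch` argv that sets spec.database.size."""
    # A JSON merge patch touching only the field that was edited by hand.
    patch = {"spec": {"database": {"size": new_size}}}
    return [
        "kubectl", "patch", "kubegres", name,
        "--namespace", namespace,
        "--type", "merge",
        "--patch", json.dumps(patch),
    ]


# Same resource and size as in the report above.
command = size_patch_command("mypostgres", "default", "10Gi")
# Print a shell-quoted form that can be reviewed before it is run.
print(shlex.join(command))
```

Printing the command first makes it possible to check the patch body before it reaches the cluster.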

The Kubegres operator then resized only one PVC, postgres-db-mypostgres-3-0, to 10Gi:

 % k get pvc
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS     AGE
postgres-db-mypostgres-1-0   Bound    pvc-7913a8da-3407-45d8-a3a2-b0cf0759a640   1Gi        RWO            gp2-expandable   17m
postgres-db-mypostgres-2-0   Bound    pvc-7442165d-c31c-4136-a598-03da51c9ad7f   1Gi        RWO            gp2-expandable   17m
postgres-db-mypostgres-3-0   Bound    pvc-22d3c475-a6aa-4a98-9ae5-780895092af9   10Gi       RWO            gp2-expandable   16m
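
The listing above shows that only one of the three PVCs reached the requested size. A minimal Python sketch for checking this programmatically, assuming the capacities are copied by hand from the `k get pvc` output; the helper names `parse_quantity` and `pvcs_below` are hypothetical, and only the binary suffixes Ki, Mi, Gi and Ti are handled:

```python
# Binary suffixes used by Kubernetes storage quantities (subset only).
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Convert a quantity such as '200Mi' or '10Gi' to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count without a suffix


def pvcs_below(capacities: dict[str, str], requested: str) -> list[str]:
    """Return the names of PVCs whose capacity is below the requested size."""
    target = parse_quantity(requested)
    return [name for name, cap in capacities.items() if parse_quantity(cap) < target]


# Capacities copied from the `k get pvc` output above.
capacities = {
    "postgres-db-mypostgres-1-0": "1Gi",
    "postgres-db-mypostgres-2-0": "1Gi",
    "postgres-db-mypostgres-3-0": "10Gi",
}
pending = pvcs_below(capacities, "10Gi")
print(pending)
```

With the values above, `pending` contains the PVCs for instances 1 and 2, which is the partial resize described in this report.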

And then it failed with a timeout error:

reactive-tech.io/kubegres/controllers/ctx/log.(*LogWrapper).ErrorEvent
    /workspace/controllers/ctx/log/LogWrapper.go:62
reactive-tech.io/kubegres/controllers/spec/enforcer/statefulset_spec.(*AllStatefulSetsSpecEnforcer).logSpecEnforcementTimedOut
    /workspace/controllers/spec/enforcer/statefulset_spec/AllStatefulSetsSpecEnforcer.go:169
reactive-tech.io/kubegres/controllers/spec/enforcer/statefulset_spec.(*AllStatefulSetsSpecEnforcer).EnforceSpec
    /workspace/controllers/spec/enforcer/statefulset_spec/AllStatefulSetsSpecEnforcer.go:109
reactive-tech.io/kubegres/controllers.(*KubegresReconciler).enforceAllStatefulSetsSpec
    /workspace/controllers/kubegres_controller.go:146
reactive-tech.io/kubegres/controllers.(*KubegresReconciler).enforceSpec
    /workspace/controllers/kubegres_controller.go:138
reactive-tech.io/kubegres/controllers.(*KubegresReconciler).Reconcile
    /workspace/controllers/kubegres_controller.go:100
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:198
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1
    /go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
    /go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
    /go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext
    /go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.UntilWithContext
    /go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99
2021-09-06T16:20:22.504Z    DEBUG   controller-runtime.manager.events   Warning {"object": {"kind":"Kubegres","namespace":"default","name":"mypostgres","uid":"74432c5f-c21f-4e23-967d-7f0ca4441dbc","apiVersion":"kubegres.reactive-tech.io/v1","resourceVersion":"106964060"}, "reason": "StatefulSetSpecEnforcementTimedOutErr", "message": "Last Spec enforcement attempt has timed-out for a StatefulSet. You must apply different spec changes to your Kubegres resource since the previous spec changes did not work. Until you apply it, most of the features of Kubegres are disabled for safety reason.  'StatefulSet's name': mypostgres-3, 'One or many of the following specs failed: ': StorageClassSize: 10Gi - Spec enforcement timed-out

The status of the Kubegres CR shows:

    "status": {
        "blockingOperation": {
            "operationId": "Enforcing StatefulSet's Spec",
            "statefulSetOperation": {
                "instanceIndex": 3,
                "name": "mypostgres-3"
            },
            "statefulSetSpecUpdateOperation": {
                "specDifferences": "StorageClassSize: 10Gi"
            },
            "stepId": "StatefulSet's spec is updating",
            "timeOutEpocInSeconds": 1630945221
        },
        "enforcedReplicas": 3,
        "lastCreatedInstanceIndex": 3,
        "previousBlockingOperation": {
            "operationId": "Replica DB count spec enforcement",
            "statefulSetOperation": {
                "instanceIndex": 3,
                "name": "mypostgres-3"
            },
            "statefulSetSpecUpdateOperation": {},
            "stepId": "Replica DB is deploying",
            "timeOutEpocInSeconds": 1630944608
        }
    }
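
The `timeOutEpocInSeconds` fields in the status are Unix timestamps in seconds. A small Python sketch, using only values copied from the status block above, converts them to UTC so they can be compared with the operator log (the helper name `to_utc` is hypothetical):

```python
from datetime import datetime, timezone

# Values copied from the status block above.
timeouts = {
    "blockingOperation": 1630945221,
    "previousBlockingOperation": 1630944608,
}


def to_utc(epoch_seconds: int) -> str:
    """Format a Unix timestamp (seconds) as an ISO 8601 UTC string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


for name, epoch in timeouts.items():
    print(name, to_utc(epoch))
# blockingOperation 2021-09-06T16:20:21Z
# previousBlockingOperation 2021-09-06T16:10:08Z
```

The first value falls about one second before the `StatefulSetSpecEnforcementTimedOutErr` event logged at 2021-09-06T16:20:22.504Z, which is consistent with the operator reporting the failure once that deadline had passed.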

How do I recover from this scenario?

edwardzjl commented 3 years ago

I ran into a very similar situation: only one PVC was increased to 10Gi and the other two stayed at 200Mi. By the way, how did you get the error message?

alex-arica commented 3 years ago

Thank you. Issue #34 was reported by another user with a similar problem.

It seems that expanding the storage does not work as expected in some use cases.

I am going to work on a fix for this potential bug next.

I am closing this issue; you can follow updates by watching #34.

alex-arica commented 3 years ago

I committed a potential bug fix in master.

To test it, please run the following command in your Kubernetes cluster:

kubectl apply -f https://raw.githubusercontent.com/reactive-tech/kubegres/main/kubegres.yaml

Could you please help me by testing it and letting me know if you are happy with it?

Once you have confirmed that it works for you, I will create a new tag, 1.10.

Please let me know and thanks in advance for your help!