Database rolled back - Githubissues

edwardzjl commented 3 years ago

We experienced a rollback yesterday and some data was lost.

I found that some pods were re-arranged.

We used to have postgres-1-0, postgres-2-0 and postgres-3-0, and now we have postgres-2-0 (restarted), postgres-3-0 and postgres-4-0 (new):

Also, the volumes were updated. We tried to expand volumes from 200m to 10G, but only volume 3 got expanded (like the demo in this issue).

Now we have 4 volumes all having 10G capacity (I assume volume postgres-db-postgres-1-0 have the data we lost, as we don't have important data now, is it ok I just delete the pv and pvc?)

We did not update kubegres or kubernetes, nor did we restarted the kubegres deployment, it happens automatically.

Here is the controller log:

alex-arica commented 3 years ago

Thank you for your message. Which version of Kubegres do you use?

From version 1.11 we disabled the feature allowing to expand the volume because Kubernetes does not support the automatic update of StatefulSet when PVC sizes are updated.

Please see this release for more details: https://github.com/reactive-tech/kubegres/releases/tag/v1.11

I recommand upgrading to the latest version of Kubegres: https://github.com/reactive-tech/kubegres/releases

You can expand the volume manually following the following steps: 1) Pause the Kubegres controller by running:

kubectl scale --replicas=0 deployment.apps/kubegres-controller-manager -n kubegres-system

2) Manually update the size of all your PVC. I assume that the storage class provisioner that you are using supports volume expansion.

3) Once the PVC sizes are updated, make a copy of your StatefulSets and delete them. And recreate them from the copy by making sure the size of the volumeClaimTemplates is set to the new size:

volumeClaimTemplates:
    - metadata:
        name: postgres-db
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: standard
        resources:
          requests:
            storage: [new size]

4) The restart of the PVC will re-create the pods. Once the pods are in a running state, you can resume the Kubegres controller by running:

kubectl scale --replicas=1 deployment.apps/kubegres-controller-manager -n kubegres-system

Once Kubernetes support the resizing of volume from StatefulSet, we will re-enable this feature in Kubegres. The relate issue is #49

I am going to add a documentation to explain how to manually expand storage in Kubegres website.

edwardzjl commented 3 years ago

Apologize for my poor english, it's not the volume problem.

We did try to expand the volume, about 20 days ago (right after we did the deployment with default volume of 200m). It did not success (with only 1 volume expanded, volume of postgres-3-0), and we leaved the state there, as the cluster works fine.

We used the half-expanded cluster for 19 days with no problem. But yesterday we experienced a data loss, and we found that the pods are re-arranged, postgres-1-0 was lost so does its data.

We are using kubegres 1.10 by the way.

alex-arica commented 3 years ago

Thank you for your message.

After you expanded the volume 20 days ago, have you checked each Pod logs to see whether the replication was still working?

I suppose you expanded the volume from the YAML of "kind: Kubegres"?

edwardzjl commented 3 years ago

Sorry, I didn't check the pod logs after the expansion. And yes, I changed the spec.database.size of kind: Kubegres

alex-arica commented 3 years ago

Unfortunately without the logs it's very difficult to know what happened.

My guess is after updating the size of the database in YAML, Kubegres prior to version 1.11 tried to update the size but Statefulsets were not getting refreshed due to Kubernetes lack of features on that side.

Consequently, after sometime Kubegres timed out the resize operation. Usually the message displayed when this happens is to ask the admin using Kubegres to fix the issue manually.

What is not clear is why replication would stop. I don't think that's related to database's resize operation. Most likely Postgres logs would have provided some clues.

I am closing this issue because of missing logs.

reactive-tech / kubegres

Database rolled back #55