reactive-tech / kubegres

Kubegres is a Kubernetes operator for deploying one or more clusters of PostgreSQL instances and managing database replication, failover and backup.
https://www.kubegres.io
Apache License 2.0

The update of an existing Kubegres resource fails if the field 'resources' contains a value with a decimal point #54

Closed samstride closed 2 years ago

samstride commented 3 years ago

Thank you for maintaining this repo.

I am looking for steps/recommendations for upgrading between minor versions and major versions.

I am guessing that upgrading between minor versions is as simple as changing the container image, i.e. postgres:13.2 -> postgres:13.4.

Now that the official image for Postgres 14 is available, are there any steps that need to be followed to go from postgres:13.2 -> postgres:14.0 ?

Cheers.

alex-arica commented 3 years ago

Thank you for your message. There is a feature request, #7, to automate the upgrade of a Postgres major version. It should be available by the end of 2021.

You can do it manually as follows:

1) Pause the Kubegres controller by running:

kubectl scale --replicas=0 deployment.apps/kubegres-controller-manager -n kubegres-system

2) Connect to each Pod with kubectl exec -it <podName> -- bash and run:

pg_upgrade

3) Once all Pods are upgraded, and are in a running state (make sure to check each Pod logs), you can resume the Kubegres controller by running:

kubectl scale --replicas=1 deployment.apps/kubegres-controller-manager -n kubegres-system

Please let me know if the above works for you in your dev environment.

samstride commented 3 years ago

Ok, I have noted these steps and will keep an eye out for availability of the automation.

Once again, thank you for maintaining this repo.

samstride commented 3 years ago

@alex-arica , sorry for re-opening this issue but wanted to clarify something for upgrading between minor versions.

Are the steps provided above meant for both major and minor version upgrades?

I tried to upgrade from 14.0 -> 14.1.

These are the steps I followed:

kubectl apply -f https://raw.githubusercontent.com/reactive-tech/kubegres/v1.13/kubegres.yaml

# my-postgres.yaml

apiVersion: kubegres.reactive-tech.io/v1
kind: Kubegres
metadata:
  name: mypostgres
  namespace: default

spec:

   replicas: 3
   image: postgres:14.1

   database:
      size: 200Mi

   env:
      - name: POSTGRES_PASSWORD
        valueFrom:
           secretKeyRef:
              name: mypostgres-secret
              key: superUserPassword

      - name: POSTGRES_REPLICATION_PASSWORD
        valueFrom:
           secretKeyRef:
              name: mypostgres-secret
              key: replicationUserPassword

kubectl apply -f my-postgres.yaml

Only 1 of the replicas got upgraded.

Did I miss something or are the steps the same as upgrade for major version?

alex-arica commented 3 years ago

Upgrading between minor versions should work. Postgres supports upgrades between minor versions.

In this case, what may have happened is that the first Pod to be upgraded had an issue and Kubegres did not continue the upgrade. For safety reasons, Kubegres upgrades a replica first. If that does not work, it stops the upgrade and logs that the failing Pod should be investigated manually.

Do you have the logs of the Kubegres controller?

To read the logs, you can follow these steps:

kubectl get all -n kubegres-system

kubectl logs pod/kubegres-controller-manager-[to replace] -c manager -n kubegres-system -f

samstride commented 3 years ago

kubectl get all -n kubegres-system

NAME                                               READY   STATUS    RESTARTS   AGE
pod/kubegres-controller-manager-6887874b9d-f7c4m   2/2     Running   0          41h

NAME                                                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/kubegres-controller-manager-metrics-service   ClusterIP   10.43.210.165   <none>        8443/TCP   34d

NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/kubegres-controller-manager   1/1     1            1           34d

NAME                                                     DESIRED   CURRENT   READY   AGE
replicaset.apps/kubegres-controller-manager-6887874b9d   1         1         1       41h
replicaset.apps/kubegres-controller-manager-75b6765589   0         0         0       34d

Logs are too big when I run kubectl logs pod/kubegres-controller-manager-6887874b9d-f7c4m -c manager -n kubegres-system -f. Pasting only the error that I see a lot in the logs.

ERROR   controllers.Kubegres    Last Spec enforcement attempt has timed-out for a StatefulSet. You must apply different spec changes to your Kubegres resource since the previous spec changes did not work. Until you apply it, most of the features of Kubegres are disabled for safety reason.     {"StatefulSet's name": "postgres-3", "One or many of the following specs failed: ": "Resources: &ResourceRequirements{Limits:ResourceList{cpu: {{1 0} {<nil>} 1 DecimalSI},memory: {{1073741824 0} {<nil>} 1Gi BinarySI},},Requests:ResourceList{cpu: {{5 -1} {<nil>}  DecimalSI},memory: {{524288000 0} {<nil>} 500Mi BinarySI},},}", "error": "Spec enforcement timed-out"}

I made sure there is enough CPU and memory.

I also reduced CPU to 200m and re-applied just to see what would happen:

DEBUG   controller-runtime.manager.events       Normal  {"object": {"kind":"Kubegres","namespace":"postgres","name":"postgres","uid":"7af18b13-af77-4271-81e6-1e2b6c29dd9c","apiVersion":"kubegres.reactive-tech.io/v1","resourceVersion":"32256132"}, "reason": "StatefulSetOperation", "message": "The Spec is NOT up-to-date for a StatefulSet. 'StatefulSet name': postgres-3, 'SpecName': Resources, 'Expected': &ResourceRequirements{Limits:ResourceList{cpu: {{1 0} {<nil>} 1 DecimalSI},memory: {{1073741824 0} {<nil>} 1Gi BinarySI},},Requests:ResourceList{cpu: {{2 -1} {<nil>}  DecimalSI},memory: {{524288000 0} {<nil>} 500Mi BinarySI},},}, 'Current': &ResourceRequirements{Limits:ResourceList{cpu: {{1 0} {<nil>} 1 DecimalSI},memory: {{1073741824 0} {<nil>} 1Gi BinarySI},},Requests:ResourceList{cpu: {{200 -3} {<nil>} 200m DecimalSI},memory: {{524288000 0} {<nil>} 500Mi BinarySI},},}"}

DEBUG   controller-runtime.manager.events       Warning {"object": {"kind":"Kubegres","namespace":"postgres","name":"postgres","uid":"7af18b13-af77-4271-81e6-1e2b6c29dd9c","apiVersion":"kubegres.reactive-tech.io/v1","resourceVersion":"32256132"}, "reason": "StatefulSetSpecEnforcementTimedOutErr", "message": "Last Spec enforcement attempt has timed-out for a StatefulSet. You must apply different spec changes to your Kubegres resource since the previous spec changes did not work. Until you apply it, most of the features of Kubegres are disabled for safety reason.  'StatefulSet's name': postgres-3, 'One or many of the following specs failed: ': Resources: &ResourceRequirements{Limits:ResourceList{cpu: {{1 0} {<nil>} 1 DecimalSI},memory: {{1073741824 0} {<nil>} 1Gi BinarySI},},Requests:ResourceList{cpu: {{2 -1} {<nil>}  DecimalSI},memory: {{524288000 0} {<nil>} 500Mi BinarySI},},} - Spec enforcement timed-out"}

DEBUG   controller-runtime.manager.events       Normal  {"object": {"kind":"Kubegres","namespace":"postgres","name":"postgres","uid":"7af18b13-af77-4271-81e6-1e2b6c29dd9c","apiVersion":"kubegres.reactive-tech.io/v1","resourceVersion":"32256132"}, "reason": "StatefulSetOperation", "message": "The Spec is NOT up-to-date for a StatefulSet. 'StatefulSet name': postgres-3, 'SpecName': Resources, 'Expected': &ResourceRequirements{Limits:ResourceList{cpu: {{1 0} {<nil>} 1 DecimalSI},memory: {{1073741824 0} {<nil>} 1Gi BinarySI},},Requests:ResourceList{cpu: {{2 -1} {<nil>}  DecimalSI},memory: {{524288000 0} {<nil>} 500Mi BinarySI},},}, 'Current': &ResourceRequirements{Limits:ResourceList{cpu: {{1 0} {<nil>} 1 DecimalSI},memory: {{1073741824 0} {<nil>} 1Gi BinarySI},},Requests:ResourceList{cpu: {{200 -3} {<nil>} 200m DecimalSI},memory: {{524288000 0} {<nil>} 500Mi BinarySI},},}"}
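For anyone decoding these entries: the quantities look like Go structs printed in raw form. Below is a minimal Python sketch for reading them, assuming the inner pair is an unscaled integer followed by a base-10 exponent (so {200 -3} would mean 200 x 10^-3) and that the next field is a cached string form of the value. That interpretation is an assumption, and QUANTITY_RE and decode_quantities are hypothetical names that are not part of Kubegres.

```python
import re
from fractions import Fraction

# Matches fragments such as "cpu: {{200 -3} {<nil>} 200m DecimalSI}".
# Assumption: the inner pair is (unscaled value, base-10 scale) and the
# field after {<nil>} is a cached string, which may be empty.
QUANTITY_RE = re.compile(r"(\w+): \{\{(-?\d+) (-?\d+)\} \{<nil>\} (\S*) (\w+)\}")


def decode_quantities(fragment):
    """Return {name: (exact value, cached string)} for each quantity found."""
    decoded = {}
    for name, value, scale, cached, _fmt in QUANTITY_RE.findall(fragment):
        exact = Fraction(int(value)) * Fraction(10) ** int(scale)
        decoded[name] = (exact, cached)
    return decoded


# The "Requests" parts of 'Expected' and 'Current' from the log above.
expected = decode_quantities(
    "cpu: {{2 -1} {<nil>}  DecimalSI},memory: {{524288000 0} {<nil>} 500Mi BinarySI}"
)
current = decode_quantities(
    "cpu: {{200 -3} {<nil>} 200m DecimalSI},memory: {{524288000 0} {<nil>} 500Mi BinarySI}"
)

print(expected["cpu"][0] == current["cpu"][0])  # True: same numeric value
print(expected["cpu"] == current["cpu"])        # False: cached strings differ
```

Under that assumption, 'Expected' and 'Current' both decode to 0.2 CPU, which would suggest the two specs differ only in how the value is represented.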

Please let me know if you need any other info.

Thanks for helping out.

alex-arica commented 3 years ago

Thank you for those details.

Looking at the logs, the issue seems to be caused not by the Postgres image upgrade but by the contents of the field "resources" in the YAML of "kind: Kubegres".

Could you please share the contents of the YAML containing the configuration of the Postgres cluster?

Could you please share the logs of the Pod "postgres-3" ?

samstride commented 3 years ago

I used the same resource values below when I first set it up.

apiVersion: kubegres.reactive-tech.io/v1
kind: Kubegres
metadata:
  name: postgres
  namespace: postgres

spec:

  replicas: 3
  image: postgres:14.1

  database:
    size: 20Gi
    storageClassName: postgres-nfs

  resources:
    limits:
      memory: "1Gi"
      cpu: "1"
    requests:
      memory: "500Mi"
      cpu: "0.5"

  failover:
    isDisabled: false

  backup:
    schedule: "45 */1 * * *"
    pvcName: postgres-backup
    volumeMount: /var/lib/backup

  env:
    - name: POSTGRES_PASSWORD
      valueFrom:
          secretKeyRef:
            name: postgres
            key: superUserPassword

    - name: POSTGRES_REPLICATION_PASSWORD
      valueFrom:
          secretKeyRef:
            name: postgres
            key: replicationUserPassword

Logs from postgres-3

kubectl logs -f postgres-3-0 -n postgres

PostgreSQL Database directory appears to contain a database; Skipping initialization

2021-11-16 20:57:11.240 GMT [1] LOG:  starting PostgreSQL 14.1 (Debian 14.1-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2021-11-16 20:57:11.240 GMT [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2021-11-16 20:57:11.240 GMT [1] LOG:  listening on IPv6 address "::", port 5432
2021-11-16 20:57:11.242 GMT [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-11-16 20:57:11.267 GMT [25] LOG:  database system was shut down in recovery at 2021-11-16 20:57:06 GMT
2021-11-16 20:57:11.269 GMT [25] LOG:  entering standby mode
2021-11-16 20:57:11.280 GMT [25] LOG:  redo starts at 1/137D3828
2021-11-16 20:57:11.280 GMT [25] LOG:  consistent recovery state reached at 1/137D3910
2021-11-16 20:57:11.280 GMT [25] LOG:  invalid record length at 1/137D3910: wanted 24, got 0
2021-11-16 20:57:11.281 GMT [1] LOG:  database system is ready to accept read-only connections
2021-11-16 20:57:11.296 GMT [29] LOG:  started streaming WAL from primary at 1/13000000 on timeline 1

alex-arica commented 3 years ago

Thank you for those details.

From those log details, I am not able to find the root cause of the issue that you are experiencing.

I will try reproducing this issue on my local environment by reusing the YAML that you shared. I will let you know once I have more info.

The steps that I will follow will be:

Please let me know if the steps above are the ones that you followed before experiencing this issue.

samstride commented 3 years ago

@alex-arica ,

I installed the Kubegres operator 1.12 and Postgres 14.

I upgraded the operator to 1.13 and changed Postgres to 14.1.

After deploying 14.1, since I ran into the issue, I modified the CPU value from 0.5 to 0.2 and back to 0.5.

alex-arica commented 3 years ago

I deployed a cluster of 3 Postgres pods with the following YAML:

apiVersion: kubegres.reactive-tech.io/v1
kind: Kubegres
metadata:
  name: mypostgres
  namespace: default
spec:

  replicas: 3
  image: postgres:14
  #port: 5432

  database:
    size: 200Mi

  env:
    - name: POSTGRES_PASSWORD
      valueFrom:
        secretKeyRef:
          name: mypostgres-secret
          key: superUserPassword

    - name: POSTGRES_REPLICATION_PASSWORD
      valueFrom:
        secretKeyRef:
          name: mypostgres-secret
          key: replicationUserPassword

Once the 3 pods were running as expected, I updated the YAML above by setting image: postgres:14.1. Kubegres upgraded all pods from version 14 to 14.1. Looking at the logs, all pods are running fine.

I could not reproduce the issue that you reported about the version upgrade.

I suggest that you start from a minimal configuration, like the YAML above. Then add more options to the YAML step by step until it fails. That way you can identify the specific configuration which fails.

Please let me know if you need any help.

samstride commented 3 years ago

@alex-arica , ok, so the upgrade from 14.0 to 14.1 worked as soon as I got rid of the resources section in the yaml.

Hmmm, looks like a bug which causes the upgrade to fail if the resources section is present?

alex-arica commented 2 years ago

Thank you for reporting this. I spent a few hours working out why updating the field "resources" fails when it contains a value with a decimal point.

For example, when creating a resource of kind: Kubegres, if the following value for cpu contains a decimal point:

...
requests:
      cpu: "0.5"
      memory: "500Mi"

Then when creating the resource of kind: Kubegres, Kubernetes would reformat that decimal point value to:

...
requests:
      cpu: "500m"
      memory: "500Mi"

Which is fine and works ok. However, after the creation of the resource, when we edit that cpu value to another decimal value, such as:

requests:
      cpu: "0.4"
      memory: "500Mi"

Kubernetes would keep the cpu value as above, with 0.4 rather than 400m, for the resource of kind: Kubegres. However, when the Kubegres operator sets that new cpu value on the StatefulSet, it gets formatted as follows:

requests:
      cpu: "400m"
      memory: "500Mi"

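So the equality comparison between the value in the Kubegres resource ("0.4") and the value in the StatefulSet ("400m") fails, even though both represent the same amount of CPU.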
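To illustrate the mismatch described above, here is a minimal Python sketch that compares CPU quantities by value instead of by their textual form. It is not the actual Kubegres change; cpu_to_millicores and same_cpu are hypothetical helpers, and the sketch only handles plain decimals and the "m" (milli) suffix.

```python
from decimal import Decimal


def cpu_to_millicores(quantity):
    """Convert a CPU quantity such as "0.4", "1" or "400m" to millicores.

    Assumption: only plain decimal values and the "m" suffix are handled.
    """
    text = quantity.strip()
    if text.endswith("m"):
        return Decimal(text[:-1])
    return Decimal(text) * 1000


def same_cpu(a, b):
    """Return True when both quantities denote the same amount of CPU."""
    return cpu_to_millicores(a) == cpu_to_millicores(b)


print("0.4" == "400m")          # False: the textual forms differ
print(same_cpu("0.4", "400m"))  # True: the values are equal
```

In Go, the resource.Quantity type from k8s.io/apimachinery provides a Cmp method for this kind of value comparison.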

I changed the equality check, and the fix is available in Kubegres version 1.14.

alex-arica commented 2 years ago

Kubegres version 1.14 is available with the changes that we discussed in this issue.

Please see the release page: https://github.com/reactive-tech/kubegres/releases/tag/v1.14

Thank you @samstride for reporting this issue.

To install Kubegres 1.14, please run:

kubectl apply -f https://raw.githubusercontent.com/reactive-tech/kubegres/v1.14/kubegres.yaml

I am closing this issue.