[Bug] Update probes with galera is not working

mperochon commented 1 month ago

Documentation

[x] I acknowledge that I have read the relevant documentation.

Hello,

I am trying to synchronize my nodes in recovery mode, but the data to replicate exceeds 120GB.

The replication fails because the probes fail (error code 500) and the pods keep restarting indefinitely.

I tried updating the probes by modifying the values of readinessProbe and livenessProbe, but they are not applied to my pods. I still have the default values.

Expected behaviour

Replication of all data on all nodes.

Steps to reproduce the bug

Restore a large backup onto the node-0 PVC storage.
Deploy Galera with 3 nodes.
Take a look at the node-1 logs.

Debug information

Related object events:

Name:            k8-dev-mariadb-test-1
Namespace:        galera-db
Priority:         0
Service Account:  k8-dev-mariadb-test
Node:             k8-dev-kubernetes-nodepo-node-283e6a/10.100.1.51
Start Time:       Tue, 23 Jul 2024 20:19:11 +0200
Status:           Running
IP:               10.2.4.223
IPs:
IP:           10.2.4.223
Controlled By:  StatefulSet/k8-dev-mariadb-test
Init Containers:
init:
Container ID:  containerd://1f680019953f543aa65b57893e8a0304f5519cefa4b83f13aa5fecf1fe7fa86e
Image:         docker-registry3.mariadb.com/mariadb-operator/mariadb-operator:v0.0.29
Image ID:      docker-registry3.mariadb.com/mariadb-operator/mariadb-operator@sha256:dcdada67dd9d85fec3670bf9af5a98cd28bd3b46998f9bc83ba247f807bc1370
Port:          <none>
Host Port:     <none>
Args:
  init
  --config-dir=/etc/mysql/mariadb.conf.d
  --state-dir=/var/lib/mysql
State:          Terminated
  Reason:       Completed
  Exit Code:    0
  Started:      Tue, 23 Jul 2024 20:19:43 +0200
  Finished:     Tue, 23 Jul 2024 20:19:52 +0200
Ready:          True
Restart Count:  0
Environment:
  MYSQL_TCP_PORT:            3306
  MARIADB_ROOT_HOST:         %
  MYSQL_INITDB_SKIP_TZINFO:  1
  CLUSTER_NAME:              cluster.local
  POD_NAME:                 k8-dev-mariadb-test-1 (v1:metadata.name)
  POD_NAMESPACE:             galera-db (v1:metadata.namespace)
  POD_IP:                     (v1:status.podIP)
  MARIADB_NAME:              k8-dev-mariadb-test
  MARIADB_ROOT_PASSWORD:     <set to the key 'GALERA_DB_ROOT_PASSWORD' in secret 'k8-dev-galera-db-credentials'>  Optional: false
Mounts:
  /etc/mysql/conf.d from config (rw)
  /etc/mysql/mariadb.conf.d from galera (rw)
  /var/lib/mysql from storage (rw)
  /var/run/secrets/kubernetes.io/serviceaccount from serviceaccount (rw)
Containers:
mariadb:
Container ID:   containerd://e48e725ea153fd54727bb1999b9c6d163a94ca39ed2527966f6285b793272cab
Image:          docker-registry1.mariadb.com/library/mariadb:10.6.18
Image ID:       docker-registry1.mariadb.com/library/mariadb@sha256:f7cc395e35257dfb332ad73a80c36e74a6990c209519760c7b9bba6b2ab47a86
Ports:          3306/TCP, 4567/TCP, 4568/TCP, 4444/TCP
Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP
State:          Running
  Started:      Tue, 23 Jul 2024 20:52:28 +0200
Last State:     Terminated
  Reason:       Error
  Exit Code:    137
  Started:      Tue, 23 Jul 2024 20:51:27 +0200
  Finished:     Tue, 23 Jul 2024 20:52:28 +0200
Ready:          False
Restart Count:  13
Liveness:       http-get http://:5555/liveness delay=20s timeout=5s period=5s #success=1 #failure=3
Readiness:      http-get http://:5555/readiness delay=20s timeout=5s period=5s #success=1 #failure=3
Environment:
  MYSQL_TCP_PORT:            3306
  MARIADB_ROOT_HOST:         %
  MYSQL_INITDB_SKIP_TZINFO:  1
  CLUSTER_NAME:              cluster.local
  POD_NAME:                 k8-dev-mariadb-test-1 (v1:metadata.name)
  POD_NAMESPACE:             galera-db (v1:metadata.namespace)
  POD_IP:                     (v1:status.podIP)
  MARIADB_NAME:             k8-dev-mariadb-test
  MARIADB_ROOT_PASSWORD:     <set to the key 'GALERA_DB_ROOT_PASSWORD' in secret 'k8-dev-galera-db-credentials'>  Optional: false
Mounts:
  /etc/mysql/conf.d from config (rw)
  /etc/mysql/mariadb.conf.d from galera (rw)
  /var/lib/mysql from storage (rw)
  /var/run/secrets/kubernetes.io/serviceaccount from serviceaccount (rw)
agent:
Container ID:  containerd://9e9dd73f63a4aa1ad273ec16816b004d72207628ffde8c1b251bf007330eb559
Image:         docker-registry3.mariadb.com/mariadb-operator/mariadb-operator:v0.0.29
Image ID:      docker-registry3.mariadb.com/mariadb-operator/mariadb-operator@sha256:dcdada67dd9d85fec3670bf9af5a98cd28bd3b46998f9bc83ba247f807bc1370
Port:          5555/TCP
Host Port:     0/TCP
Args:
  agent
  --addr=:5555
  --config-dir=/etc/mysql/mariadb.conf.d
  --state-dir=/var/lib/mysql
  --graceful-shutdown-timeout=1s
  --recovery-timeout=1h30m0s
  --kubernetes-auth
  --kubernetes-trusted-name=mariadb-operator
  --kubernetes-trusted-namespace=galera-db
State:          Running
  Started:      Tue, 23 Jul 2024 20:19:53 +0200
Ready:          True
Restart Count:  0
Liveness:       http-get http://:5555/health delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness:      http-get http://:5555/health delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
  MYSQL_TCP_PORT:            3306
  MARIADB_ROOT_HOST:         %
  MYSQL_INITDB_SKIP_TZINFO:  1
  CLUSTER_NAME:              cluster.local
  POD_NAME:                 k8-dev-mariadb-test-1 (v1:metadata.name)
  POD_NAMESPACE:             galera-db (v1:metadata.namespace)
  POD_IP:                     (v1:status.podIP)
  MARIADB_NAME:             k8-dev-mariadb-test
  MARIADB_ROOT_PASSWORD:     <set to the key 'GALERA_DB_ROOT_PASSWORD' in secret 'k8-dev-galera-db-credentials'>  Optional: false
Mounts:
  /etc/mysql/conf.d from config (rw)
  /etc/mysql/mariadb.conf.d from galera (rw)
  /var/lib/mysql from storage (rw)
  /var/run/secrets/kubernetes.io/serviceaccount from serviceaccount (rw)
Conditions:
Type                        Status
PodReadyToStartContainers   True 
Initialized                 True 
Ready                       False 
ContainersReady             False 
PodScheduled                True 
Volumes:
galera:
Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName:  galera-k8-dev-mariadb-test-1
ReadOnly:   false
storage:
Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName:  storage-k8-dev-mariadb-test-1
ReadOnly:   false
config:
Type:               Projected (a volume that contains injected data from multiple sources)
ConfigMapName:      k8-dev-mariadb-test-config-default
ConfigMapOptional:  <nil>
serviceaccount:
Type:                    Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds:  3600
ConfigMapName:           kube-root-ca.crt
ConfigMapOptional:       <nil>
DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                         node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type     Reason                  Age                   From                     Message
----     ------                  ----                  ----                     -------
Warning  FailedScheduling        33m                   default-scheduler        0/2 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
Normal   Scheduled               33m                   default-scheduler        Successfully assigned galera-db/k8-dev-mariadb-test-1 to k8-dev-kubernetes-nodepo-node-283e6a
Normal   SuccessfulAttachVolume  33m                   attachdetach-controller  AttachVolume.Attach succeeded for volume "ovh-managed-kubernetes-7fsy1c-pvc-8741bc63-4f9b-4d27-82bf-f885925fcc4f"
Normal   SuccessfulAttachVolume  33m                   attachdetach-controller  AttachVolume.Attach succeeded for volume "ovh-managed-kubernetes-7fsy1c-pvc-f13ac30a-a50f-48be-a244-b5fdb8507660"
Normal   Pulled                  33m                   kubelet                  Container image "docker-registry3.mariadb.com/mariadb-operator/mariadb-operator:v0.0.29" already present on machine
Normal   Created                 33m                   kubelet                  Created container init
Normal   Started                 33m                   kubelet                  Started container init
Normal   Started                 33m                   kubelet                  Started container agent
Normal   Started                 33m                   kubelet                  Started container mariadb
Normal   Pulled                  33m                   kubelet                  Container image "docker-registry3.mariadb.com/mariadb-operator/mariadb-operator:v0.0.29" already present on machine
Normal   Created                 33m                   kubelet                  Created container agent
Normal   Killing                 32m                   kubelet                  Container mariadb failed liveness probe, will be restarted
Normal   Created                 32m (x2 over 33m)     kubelet                  Created container mariadb
Normal   Pulled                  32m (x2 over 33m)     kubelet                  Container image "docker-registry1.mariadb.com/library/mariadb:10.6.18" already present on machine
Warning  Unhealthy               28m (x51 over 32m)    kubelet                  Readiness probe failed: HTTP probe failed with statuscode: 500
Warning  Unhealthy               8m12s (x32 over 32m)  kubelet                  Liveness probe failed: HTTP probe failed with statuscode: 500
Warning  BackOff                 3m17s (x87 over 27m)  kubelet                  Back-off restarting failed container mariadb in pod k8-dev-mariadb-test-1_galera-db(40e25506-23af-43ac-881f-b28ff377a0e4)

Environment details:

Kubernetes version: 1.30
Kubernetes distribution: OVH Kubernetes
mariadb-operator version: 0.0.29
Install method: helm
Install flavor: recommended

Additional context

This is my configuration:

{
                "apiVersion": "k8s.mariadb.com/v1alpha1",
                "kind": "MariaDB",
                "metadata": {
                    "name": `${CDK_PREFIX_STACK}-${CDK_ENVIRONMENT}-mariadb-test`,
                    "namespace": this.config.galeraDB.namespace,
                },
                "spec": {
                    "image": "docker-registry1.mariadb.com/library/mariadb:10.6.18",
                    "username": GALERA_DB_USER,
                    "passwordSecretKeyRef": {
                        "name": `${CDK_PREFIX_STACK}-${CDK_ENVIRONMENT}-galera-db-credentials`,
                        "key": "GALERA_DB_PASSWORD",
                    },
                    "database": this.config.galeraDB.defaultDBName,
                    "rootPasswordSecretKeyRef": {
                        "name": `${CDK_PREFIX_STACK}-${CDK_ENVIRONMENT}-galera-db-credentials`,
                        "key": "GALERA_DB_ROOT_PASSWORD",
                    },
                    "podSecurityContext": {
                        "runAsUser": 999,
                        "runAsGroup": 999,
                        "fsGroup": 999
                    },
                    "metrics": {
                        "enabled": true,
                    },
                    "storage": {
                        "volumeClaimTemplate": {
                            "storageClassName": this.config.galeraDB.storageClassName,
                            "accessModes": ["ReadWriteOnce"],
                            "resources": {
                                "requests": {
                                    "storage": this.config.galeraDB.storageSize
                                }
                            }
                        }
                    },
                    "replicas": 3,
                    "livenessProbe": {
                        "initialDelaySeconds": 3600,
                        "periodSeconds": 3600,
                        "successThreshold": 1,
                        "timeoutSeconds": 3600,
                    },
                    "readinessProbe": {
                        "initialDelaySeconds": 3600,
                        "periodSeconds": 3600,
                        "successThreshold": 1,
                        "timeoutSeconds": 3600,
                    },
                    "galera": {
                        "enabled": "true",
                        "recovery": {
                            "enabled": true,
                            "minClusterSize": 3,
                            "clusterMonitorInterval": "2h0m0s",
                            "clusterHealthyTimeout": "2h0m0s",
                            "clusterBootstrapTimeout": "2h0m0s",
                            "podRecoveryTimeout": "1h30m0s",
                            "podSyncTimeout": "2h0m0s"
                        },
                        "livenessProbe": {
                            "initialDelaySeconds": 3600,
                            "periodSeconds": 3600,
                            "successThreshold": 1,
                            "timeoutSeconds": 3600,
                        },
                        "readinessProbe": {
                            "initialDelaySeconds": 3600,
                            "periodSeconds": 3600,
                            "successThreshold": 1,
                            "timeoutSeconds": 3600,
                        },
                        "replicaThreads": 8,
                        "agent": {
                            "livenessProbe": {
                                "initialDelaySeconds": 3600,
                                "periodSeconds": 3600,
                                "successThreshold": 1,
                                "timeoutSeconds": 3600,
                            },
                            "readinessProbe": {
                                "initialDelaySeconds": 3600,
                                "periodSeconds": 3600,
                                "successThreshold": 1,
                                "timeoutSeconds": 3600,
                            },
                        }
                    },  
                    "affinity": {
                        "podAntiAffinity": {
                            "requiredDuringSchedulingIgnoredDuringExecution": [
                                {
                                    "topologyKey": "kubernetes.io/hostname"
                                }
                            ],
                        },
                    },
                    "podDisruptionBudget": {
                        "maxUnavailable": "66%"
                    },
                    "service": {
                        "type": "LoadBalancer",
                        "metadata": {
                            "annotations": {
                                "service.beta.kubernetes.io/ovh-loadbalancer-allowed-sources": `${this.config.privateNetwork.primarySubnet.cidr},${this.config.privateNetwork.secondarySubnet.cidr}`
                            }
                        },
                    },
                    "primaryService": {
                        "type": "LoadBalancer",
                        "metadata": {
                            "annotations": {
                                "service.beta.kubernetes.io/ovh-loadbalancer-allowed-sources": `${this.config.privateNetwork.primarySubnet.cidr},${this.config.privateNetwork.secondarySubnet.cidr}`
                            }
                        },
                    },
                    "secondaryService": {
                        "type": "LoadBalancer",
                        "metadata": {
                            "annotations": {
                                "service.beta.kubernetes.io/ovh-loadbalancer-allowed-sources": `${this.config.privateNetwork.primarySubnet.cidr},${this.config.privateNetwork.secondarySubnet.cidr}`
                            }
                        },
                    }
                }
            }

mmontes11 commented 1 month ago

Hey there @mperochon !

I have updated the v0.0.30 docs (to be released in the next few weeks) with some important considerations regarding restoring backups:

https://github.com/mariadb-operator/mariadb-operator/blob/release-v0.0.30/docs/BACKUP.md#important-considerations

probes fail (error code 500)

Probes are failing most likely because the credentials provided via spec.rootPasswordSecretKeyRef don't match the database internal state after the backup is restored. In other words, spec.rootPasswordSecretKeyRef should match the root password credentials provided in the backup.

the data to replicate exceeds 120GB.

Be sure to provide enough compute resources in the restore job so that the restoration process doesn't get stuck:

https://github.com/mariadb-operator/mariadb-operator/blob/release-v0.0.30/docs/BACKUP.md#restore-job

mmontes11 commented 1 month ago

Just another note regarding the probes: you can tweak the probe thresholds, but not the probe command. See:

https://github.com/mariadb-operator/mariadb-operator/blob/release-v0.0.30/docs/CONFIGURATION.md#probes
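
As a rough sketch of why the thresholds matter here: assuming the kubelet runs the first probe after initialDelaySeconds and restarts the container after failureThreshold consecutive failures spaced periodSeconds apart, the earliest possible restart can be estimated as below. The `ProbeTiming` type and the `earliestFailureSeconds` helper are hypothetical names used only for illustration. The estimate ignores probe timeouts and the termination grace period, and the failureThreshold of 3 for the requested values is an assumption, since the posted configuration does not set it.

```typescript
// Hypothetical shape holding the probe settings relevant to restart timing.
interface ProbeTiming {
  initialDelaySeconds: number;
  periodSeconds: number;
  failureThreshold: number;
}

// Earliest time, in seconds after container start, at which a probe that
// fails on every attempt reaches failureThreshold consecutive failures.
function earliestFailureSeconds(p: ProbeTiming): number {
  return p.initialDelaySeconds + (p.failureThreshold - 1) * p.periodSeconds;
}

// Values shown for the mariadb container in the pod description above.
const rendered: ProbeTiming = {
  initialDelaySeconds: 20,
  periodSeconds: 5,
  failureThreshold: 3,
};

// Values requested in the MariaDB resource; failureThreshold is assumed.
const requested: ProbeTiming = {
  initialDelaySeconds: 3600,
  periodSeconds: 3600,
  failureThreshold: 3,
};

console.log(earliestFailureSeconds(rendered));
console.log(earliestFailureSeconds(requested));
```

Under these assumptions the rendered values allow a restart roughly 30 seconds after the container starts, whereas the requested values would allow about 3 hours, which is why it matters whether the custom probe settings actually reach the pod.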

mmontes11 commented 1 month ago

Another question @mperochon:

Was the backup you are trying to restore taken from an external database? I'm especially curious whether it has a DROP TABLE mysql.global_priv; statement.

If so, please take a look at this, as it describes exactly this case:

https://github.com/mariadb-operator/mariadb-operator/blob/release-v0.0.30/docs/BACKUP.md#mysqlglobal_priv

TL;DR:

Remove all the statements related to mysql.global_priv in your backup:

grep -v "mysql\.global_priv" backup.2024-07-17T03:00:11Z.sql.sql > backup.2024-07-17T03:00:11Z.sql

Use the rootPasswordSecretKeyRef, username and passwordSecretKeyRef fields of the MariaDB CR to create the root and initial user respectively. These fields will be translated into DDL statements by the image entrypoint.
Rely on the User and Grant CRs to create additional users and grants. They will be translated into DDL statements (CREATE USER, GRANT) by the operator.
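
For anyone who prefers to do the filtering step in code rather than with grep, here is a minimal sketch that mirrors the `grep -v "mysql\.global_priv"` command above. The `stripGlobalPriv` function and the sample dump are hypothetical and exist only for illustration. Like grep, it works line by line, so a statement spanning several lines would only lose the lines that mention the table.

```typescript
// Keep every line of a SQL dump that does not mention mysql.global_priv,
// mirroring `grep -v "mysql\.global_priv"`.
function stripGlobalPriv(dump: string): string {
  return dump
    .split("\n")
    .filter((line) => !/mysql\.global_priv/.test(line))
    .join("\n");
}

// Hypothetical three-line dump, for illustration only.
const sample = [
  "DROP TABLE mysql.global_priv;",
  "CREATE TABLE app.users (id INT PRIMARY KEY);",
  "INSERT INTO app.users VALUES (1);",
].join("\n");

console.log(stripGlobalPriv(sample));
```

The function is pure (string in, string out), so reading the backup file and writing the filtered copy is left to the caller.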

mperochon commented 1 month ago

Hi @mmontes11,

We copy all the physical files to the PVC storage of the first node.

To restore the backup, we use this command:

mariabackup --copy-back --target-dir=/mnt/backupdata/latest/

Once the process has finished, we attach the PVC containing the backup to the first Galera node (node-0) at /var/lib/mysql.

After that, when Galera starts, the replication also starts, but it fails after 3 minutes because the probes fail.

Let me know if you need any more tests from me.

mmontes11 commented 1 month ago

What I've mentioned here applies to logical backups taken with mariadb-dump, which is not your case. The Galera backup limitations still apply though.

It would be very useful to get the logs from your agent container to understand why the probes are failing:

 kubectl logs mariadb-galera-0 -c agent

I can't really advise on your procedure, but here is another way to restore physical backups, via initContainers. This approach restores the physical backup on each node before it starts, so all the nodes start with the same data:

https://github.com/mariadb-operator/mariadb-operator/blob/main/examples/manifests/mariadb_init_mariabackup.yaml

As you can see, you will have to place the physical backup in a PVC named mariabackup beforehand.

We don't currently support physical backups natively, but we have plans for them in our roadmap. We plan to implement PITR based on physical backups and binary logs.

mperochon commented 1 month ago

Hi @mmontes11,

Thanks for your help. I ran the test with the initContainers option, and I get this error when I deploy the MariaDB object:

Error reconciling Init: Job.batch "k8-dev-mariadb-test-init" is invalid: spec.template.spec.initContainers[0].volumeMounts[3].name: Not found: "galera"

I just added these three parameters: initContainers, volumes and volumeMounts.

This is my deployment file:

{
                "apiVersion": "k8s.mariadb.com/v1alpha1",
                "kind": "MariaDB",
                "metadata": {
                    "name": `${CDK_PREFIX_STACK}-${CDK_ENVIRONMENT}-mariadb-test`,
                    "namespace": this.config.galeraDB.namespace,
                },
                "spec": {
                    "image": "docker-registry1.mariadb.com/library/mariadb:10.6.18",
                    "initContainers": [
                        {
                            "image": "docker-registry1.mariadb.com/library/mariadb:10.6.18",
                            "args": [
                                "mariadb-backup",
                                "--copy-back",
                                "--target-dir=/mnt/backup/latest/"
                            ]
                        }
                    ],
                    "volumeMounts": [
                        {
                            "name": "mariabackup",
                            "mountPath": "/mnt/backup/"
                        }
                    ],
                    "username": GALERA_DB_USER,
                    "passwordSecretKeyRef": {
                        "name": `${CDK_PREFIX_STACK}-${CDK_ENVIRONMENT}-galera-db-credentials`,
                        "key": "GALERA_DB_PASSWORD",
                    },
                    "database": this.config.galeraDB.defaultDBName,
                    "rootPasswordSecretKeyRef": {
                        "name": `${CDK_PREFIX_STACK}-${CDK_ENVIRONMENT}-galera-db-credentials`,
                        "key": "GALERA_DB_ROOT_PASSWORD",
                    },
                    "podSecurityContext": {
                        "runAsUser": 999,
                        "runAsGroup": 999,
                        "fsGroup": 999
                    },
                    "metrics": {
                        "enabled": true,
                    },
                    "storage": {
                        "size": "250Gi"
                    },
                    "volumes": [
                        {
                            "name": "mariabackup",
                            "persistentVolumeClaim": {
                                "claimName": "pvc-backup"
                            }
                        }
                    ],
                    "replicas": 4,
                    "galera": {
                        "enabled": "true",
                        "replicaThreads": 10,
                    },  
                    "affinity": {
                        "podAntiAffinity": {
                            "requiredDuringSchedulingIgnoredDuringExecution": [
                                {
                                    "topologyKey": "kubernetes.io/hostname"
                                }
                            ],
                        },
                    },
                    "podDisruptionBudget": {
                        "maxUnavailable": "66%"
                    },
                    "service": {
                        "type": "LoadBalancer",
                        "metadata": {
                            "annotations": {
                                "service.beta.kubernetes.io/ovh-loadbalancer-allowed-sources": `${this.config.privateNetwork.primarySubnet.cidr},${this.config.privateNetwork.secondarySubnet.cidr}`
                            }
                        },
                    },
                    "primaryService": {
                        "type": "LoadBalancer",
                        "metadata": {
                            "annotations": {
                                "service.beta.kubernetes.io/ovh-loadbalancer-allowed-sources": `${this.config.privateNetwork.primarySubnet.cidr},${this.config.privateNetwork.secondarySubnet.cidr}`
                            }
                        },
                    },
                    "secondaryService": {
                        "type": "LoadBalancer",
                        "metadata": {
                            "annotations": {
                                "service.beta.kubernetes.io/ovh-loadbalancer-allowed-sources": `${this.config.privateNetwork.primarySubnet.cidr},${this.config.privateNetwork.secondarySubnet.cidr}`
                            }
                        },
                    }
                }
}

[Bug] Update probes with galera is not working #744