percona / percona-server-mysql-operator

Percona Operator for MySQL
https://www.percona.com/doc/kubernetes-operator-for-mysql/ps/index.html
Apache License 2.0

Cluster cannot start after all pods are rebooted #683

Open chernomor opened 1 week ago

chernomor commented 1 week ago

Report

I set up a cluster according to https://docs.percona.com/percona-operator-for-mysql/ps/kubectl.html on a single-node k3s. All MySQL pods were working fine, but after I rebooted the k3s node the MySQL cluster cannot start.

More about the problem

I created the file /var/lib/mysql/sleep-forever in the master pod cluster1-mysql-0, so it is running now, but the slaves are in CrashLoopBackOff:

$ kubectl -n mysql-test get pods -o wide
NAME                                             READY   STATUS             RESTARTS          AGE   IP           NODE                                  NOMINATED NODE   READINESS GATES
cluster1-haproxy-0                               2/2     Running            4 (18h ago)       22h   10.42.0.54   rt-chernomorets.sas.yp-c.yandex.net   <none>           <none>
cluster1-haproxy-1                               2/2     Running            4 (18h ago)       22h   10.42.0.51   rt-chernomorets.sas.yp-c.yandex.net   <none>           <none>
cluster1-haproxy-2                               2/2     Running            4 (18h ago)       22h   10.42.0.52   rt-chernomorets.sas.yp-c.yandex.net   <none>           <none>
percona-server-mysql-operator-78ccf4bd45-67p2j   1/1     Running            2 (18h ago)       22h   10.42.0.47   rt-chernomorets.sas.yp-c.yandex.net   <none>           <none>
cluster1-orc-0                                   2/2     Running            4 (18h ago)       22h   10.42.0.48   rt-chernomorets.sas.yp-c.yandex.net   <none>           <none>
cluster1-orc-2                                   2/2     Running            4 (18h ago)       22h   10.42.0.50   rt-chernomorets.sas.yp-c.yandex.net   <none>           <none>
cluster1-orc-1                                   2/2     Running            4 (18h ago)       22h   10.42.0.53   rt-chernomorets.sas.yp-c.yandex.net   <none>           <none>
cluster1-mysql-0                                 3/3     Running            454 (41m ago)     18h   10.42.0.56   rt-chernomorets.sas.yp-c.yandex.net   <none>           <none>
cluster1-mysql-2                                 2/3     CrashLoopBackOff   464 (2m24s ago)   18h   10.42.0.49   rt-chernomorets.sas.yp-c.yandex.net   <none>           <none>
cluster1-mysql-1                                 1/3     CrashLoopBackOff   468 (26s ago)     18h   10.42.0.55   rt-chernomorets.sas.yp-c.yandex.net   <none>           <none>

Some bootstrap logs from a slave pod:

$ kubectl -n mysql-test exec -it cluster1-mysql-1  -- tail -f /var/lib/mysql/bootstrap.log
Defaulted container "mysql" out of: mysql, xtrabackup, pt-heartbeat, mysql-init (init)
2024/06/26 07:54:31 bootstrap failed: select donor: connect to 10-42-0-49.cluster1-mysql-unready.mysql-test: ping DB: dial tcp 10.42.0.49:33062: connect: connection refused
2024/06/26 07:54:41 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/26 07:54:41 bootstrap finished in 0.003150 seconds
2024/06/26 07:54:41 bootstrap failed: select donor: connect to 10-42-0-49.cluster1-mysql-unready.mysql-test: ping DB: dial tcp 10.42.0.49:33062: connect: connection refused
2024/06/26 07:54:51 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/26 07:54:51 bootstrap finished in 0.003226 seconds
2024/06/26 07:54:51 bootstrap failed: select donor: connect to 10-42-0-49.cluster1-mysql-unready.mysql-test: ping DB: dial tcp 10.42.0.49:33062: connect: connection refused
2024/06/26 07:55:01 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/26 07:55:01 bootstrap finished in 0.003110 seconds
2024/06/26 07:55:01 bootstrap failed: select donor: connect to 10-42-0-49.cluster1-mysql-unready.mysql-test: ping DB: dial tcp 10.42.0.49:33062: connect: connection refused
2024/06/26 08:00:31 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/26 08:00:31 bootstrap finished in 0.003058 seconds
2024/06/26 08:00:31 bootstrap failed: select donor: connect to 10-42-0-49.cluster1-mysql-unready.mysql-test: ping DB: dial tcp 10.42.0.49:33062: connect: connection refused
2024/06/26 08:00:41 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/26 08:00:41 bootstrap finished in 0.002679 seconds
2024/06/26 08:00:41 bootstrap failed: select donor: connect to 10-42-0-49.cluster1-mysql-unready.mysql-test: ping DB: dial tcp 10.42.0.49:33062: connect: connection refused
2024/06/26 08:00:51 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/26 08:00:51 bootstrap finished in 0.003255 seconds
2024/06/26 08:00:51 bootstrap failed: select donor: connect to 10-42-0-49.cluster1-mysql-unready.mysql-test: ping DB: dial tcp 10.42.0.49:33062: connect: connection refused
2024/06/26 08:01:01 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/26 08:01:01 bootstrap finished in 0.003455 seconds
2024/06/26 08:01:01 bootstrap failed: select donor: connect to 10-42-0-49.cluster1-mysql-unready.mysql-test: ping DB: dial tcp 10.42.0.49:33062: connect: connection refused
2024/06/26 08:01:11 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/26 08:01:11 bootstrap finished in 0.002918 seconds
2024/06/26 08:01:11 bootstrap failed: select donor: connect to 10-42-0-49.cluster1-mysql-unready.mysql-test: ping DB: dial tcp 10.42.0.49:33062: connect: connection refused
command terminated with exit code 137

Steps to reproduce

  1. Set up a MySQL cluster on a single node
  2. Reboot the node
  3. The MySQL pods do not start

Versions

  1. Kubernetes k3s version v1.29.5+k3s1 (4e53a323) go version go1.21.9

  2. Operator 83b9f60ec88d0cd2b5b1a2c2721bd6ae18fc7dc8, v0.7.0

  3. Database mysql Ver 8.0.36-28 for Linux on x86_64 (Percona Server (GPL), Release 28, Revision 47601f19)

Anything else?

No response

chernomor commented 1 week ago

As far as I can see, bootstrapAsyncReplication expects all cluster peers returned by getTopology to be reachable, but getTopology cannot connect to some peers because all pods are now in the CrashLoopBackOff state. I don't think it should be necessary for all pods to be available.

Another problem (or perhaps the first one), which is currently suppressed by sleep-forever: the master pod cannot start because it cannot resolve the primary name cluster1-mysql-0.cluster1-mysql.mysql-test retrieved from the replica status. This name does not resolve now because, while the pods are starting, they only have names under cluster1-mysql-unready.mysql-test (I could be wrong). I don't know how this can be fixed right now.

2024/06/25 16:51:32 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/25 16:51:32 Primary: cluster1-mysql-0.cluster1-mysql.mysql-test Replicas: [cluster1-mysql-1.cluster1-mysql.mysql-test cluster1-mysql-2.cluster1-mysql.mysql-test]
2024/06/25 16:51:32 FQDN: cluster1-mysql-0.cluster1-mysql.mysql-test
2024/06/25 16:51:32 lookup cluster1-mysql-0 [10.42.0.56]
2024/06/25 16:51:32 PodIP: 10.42.0.56
2024/06/25 16:51:32 bootstrap finished in 0.021992 seconds
2024/06/25 16:51:32 bootstrap failed: get primary IP: lookup cluster1-mysql-0.cluster1-mysql.mysql-test: lookup cluster1-mysql-0.cluster1-mysql.mysql-test on 10.43.0.10:53: server misbehaving
2024/06/25 16:51:42 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/25 16:51:42 Primary: cluster1-mysql-0.cluster1-mysql.mysql-test Replicas: [cluster1-mysql-1.cluster1-mysql.mysql-test cluster1-mysql-2.cluster1-mysql.mysql-test]
2024/06/25 16:51:42 FQDN: cluster1-mysql-0.cluster1-mysql.mysql-test
2024/06/25 16:51:42 lookup cluster1-mysql-0 [10.42.0.56]
2024/06/25 16:51:42 PodIP: 10.42.0.56
2024/06/25 16:51:42 bootstrap finished in 0.021340 seconds
2024/06/25 16:51:42 bootstrap failed: get primary IP: lookup cluster1-mysql-0.cluster1-mysql.mysql-test: lookup cluster1-mysql-0.cluster1-mysql.mysql-test on 10.43.0.10:53: server misbehaving

Some changes in deploy/cr.yaml:

--- a/deploy/cr.yaml
+++ b/deploy/cr.yaml
@@ -31,7 +31,7 @@ spec:
 #      group: cert-manager.io

   mysql:
-    clusterType: group-replication
+    clusterType: async
     autoRecovery: true
     image: percona/percona-server:8.0.36-28
     imagePullPolicy: Always
@@ -58,9 +58,12 @@ spec:
 #      periodSeconds: 10
 #      failureThreshold: 3
 #      successThreshold: 1
+#
+    startupProbe:
+      failureThreshold: 5

     affinity:
-      antiAffinityTopologyKey: "kubernetes.io/hostname"
+       antiAffinityTopologyKey: "none"
 #      advanced: