percona / percona-server-mysql-operator

Percona Operator for MySQL
https://www.percona.com/doc/kubernetes-operator-for-mysql/ps/index.html
Apache License 2.0
160 stars 26 forks source link

Cannot clone slaves after restore from big backup #740

Open chernomor opened 2 months ago

chernomor commented 2 months ago

Report

I create empty async cluster with 3 nodes and restore from backup. Master is alive and running, but slave cannot make clone.

More about the problem

...
Containers:
  mysql:
    Container ID:  containerd://5304dd2d181eca0d388fb606370d35d42f639a3bf70d6ed883525bbea2e39b61
    Image:         percona/percona-server:8.0.36-28
    Image ID:      docker.io/percona/percona-server@sha256:423acd206f94b34288d10ed041c3ba42543e26e44f3706621320504a010dd41f
    Ports:         3306/TCP, 33062/TCP, 33060/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Command:
      /opt/percona/ps-entrypoint.sh
    Args:
      mysqld
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 03 Sep 2024 17:37:54 +0300
      Finished:     Tue, 03 Sep 2024 17:38:32 +0300
    Ready:          False
    Restart Count:  11
    Limits:
      memory:  24G
    Requests:
      memory:   20G
    Liveness:   exec [/opt/percona/healthcheck liveness] delay=15s timeout=10s period=10s #success=1 #failure=3
    Readiness:  exec [/opt/percona/healthcheck readiness] delay=30s timeout=3s period=5s #success=1 #failure=3
    Startup:    exec [/opt/percona/bootstrap] delay=15s timeout=43200s period=10s #success=1 #failure=1
...

Events:
  Type     Reason          Age                 From               Message
  ----     ------          ----                ----               -------
  Normal   Scheduled       21m                 default-scheduler  Successfully assigned racktables-prestable/racktables-prestable-mysql-1 to iva-n01.net
  Normal   AddedInterface  21m                 multus             Add eth0 [fcff:0:675:1:c278:c2c4:8308:288/128] from k8s-pod-network
  Normal   Pulling         21m                 kubelet            Pulling image "registry/percona-server-mysql-operator:0.8.0-debug5"
  Normal   Pulled          21m                 kubelet            Successfully pulled image "registry/percona-server-mysql-operator:0.8.0-debug5" in 247ms (247ms including waiting). Image size: 101676545 bytes.
  Normal   Created         21m                 kubelet            Created container mysql-init
  Normal   Started         21m                 kubelet            Started container mysql-init
  Normal   Pulled          21m                 kubelet            Successfully pulled image "percona/percona-server:8.0.36-28" in 1.665s (1.665s including waiting). Image size: 403423459 bytes.
  Normal   Pulling         21m                 kubelet            Pulling image "percona/percona-xtrabackup:8.0.35-31"
  Normal   Started         21m                 kubelet            Started container xtrabackup
  Normal   Pulled          21m                 kubelet            Successfully pulled image "percona/percona-xtrabackup:8.0.35-31" in 1.492s (1.492s including waiting). Image size: 311446308 bytes.
  Normal   Created         21m                 kubelet            Created container xtrabackup
  Normal   Started         21m                 kubelet            Started container pt-heartbeat
  Normal   Pulled          21m                 kubelet            Successfully pulled image "percona/percona-toolkit:3.6.0" in 1.519s (1.519s including waiting). Image size: 268173151 bytes.
  Warning  Unhealthy       20m                 kubelet            Startup probe failed:
  Normal   Killing         20m                 kubelet            Container mysql failed startup probe, will be restarted
  Normal   Pulling         19m (x2 over 21m)   kubelet            Pulling image "percona/percona-server:8.0.36-28"
  Normal   Started         19m (x2 over 21m)   kubelet            Started container mysql
  Normal   Created         19m (x2 over 21m)   kubelet            Created container mysql
  Normal   Pulled          19m                 kubelet            Successfully pulled image "percona/percona-server:8.0.36-28" in 1.608s (1.608s including waiting). Image size: 403423459 bytes.
  Normal   Pulling         19m (x2 over 21m)   kubelet            Pulling image "percona/percona-toolkit:3.6.0"
  Normal   Created         19m (x2 over 21m)   kubelet            Created container pt-heartbeat
  Normal   Pulled          19m                 kubelet            Successfully pulled image "percona/percona-toolkit:3.6.0" in 1.751s (1.751s including waiting). Image size: 268173151 bytes.
  Warning  BackOff         11m (x41 over 19m)  kubelet            Back-off restarting failed container pt-heartbeat in pod racktables-prestable-mysql-1_racktables-prestable(ebd7c624-abcd-489f-aae1-f9870fd097fc)
  Warning  BackOff         78s (x70 over 16m)  kubelet            Back-off restarting failed container mysql in pod racktables-prestable-mysql-1_racktables-prestable(ebd7c624-abcd-489f-aae1-f9870fd097fc)
2024-09-03T14:18:50.181793Z 0 [System] [MY-013292] [Server] Admin interface ready for connections, address: 'fcff:0:675:1:c278:c2c4:8308:288'  port: 33062
2024-09-03T14:19:07.310339Z 11 [Note] [MY-013272] [Clone] Plugin Clone reported: 'Client: Task Connect.'
2024-09-03T14:19:07.394163Z 11 [Note] [MY-013272] [Clone] Plugin Clone reported: 'Client: Source ACK Connect.'
2024-09-03T14:19:07.394248Z 11 [Note] [MY-013457] [InnoDB] Clone Apply Begin Master Version Check
2024-09-03T14:19:07.426148Z 11 [Note] [MY-013457] [InnoDB] Clone Apply Version End Master Task ID: 0 Passed, code: 0: 
2024-09-03T14:19:07.594923Z 11 [Note] [MY-013457] [InnoDB] Clone Apply Begin Master Task
2024-09-03T14:19:07.595131Z 11 [Warning] [MY-013460] [InnoDB] Clone removing all user data for provisioning: Started
2024-09-03T14:19:07.595153Z 11 [Note] [MY-011977] [InnoDB] Clone Drop all user data
2024-09-03T14:19:07.797571Z 11 [Note] [MY-011977] [InnoDB] Clone: Fix Object count: 191 task: 0
2024-09-03T14:19:07.801136Z 11 [Note] [MY-011977] [InnoDB] Clone Drop User schemas
2024-09-03T14:19:07.801377Z 11 [Note] [MY-011977] [InnoDB] Clone: Fix Object count: 4 task: 0
2024-09-03T14:19:07.801490Z 11 [Note] [MY-011977] [InnoDB] Clone Drop User tablespaces
2024-09-03T14:19:07.803175Z 11 [Note] [MY-011977] [InnoDB] Clone: Fix Object count: 6 task: 0
2024-09-03T14:19:07.944354Z 11 [Note] [MY-011977] [InnoDB] Clone Drop: finished successfully
2024-09-03T14:19:07.944403Z 11 [Warning] [MY-013460] [InnoDB] Clone removing all user data for provisioning: Finished
2024-09-03T14:19:07.949864Z 11 [Note] [MY-013272] [Clone] Plugin Clone reported: 'Client: Command COM_INIT.'
2024-09-03T14:19:07.970329Z 11 [Note] [MY-013458] [InnoDB] Clone Apply State Change : Number of tasks = 1
2024-09-03T14:19:07.970613Z 11 [Note] [MY-013458] [InnoDB] Clone Apply State FILE COPY: 
2024-09-03T14:19:07.970751Z 11 [Note] [MY-011978] [InnoDB] Clone estimated size: 200.85 GiB Available space: 272.84 GiB
2024-09-03T14:19:13.188566Z 11 [Note] [MY-013272] [Clone] Plugin Clone reported: 'Client: Tune Threads from: 1 to: 2 prev: 1 target: 2.'
2024-09-03T14:19:13.267361Z 0 [Note] [MY-013272] [Clone] Plugin Clone reported: 'Client: Task Connect.'
2024-09-03T14:19:13.268066Z 0 [Note] [MY-013457] [InnoDB] Clone Apply Begin Task ID: 1
2024-09-03T14:19:13.268180Z 0 [Note] [MY-013272] [Clone] Plugin Clone reported: 'Client: Command COM_ATTACH.'
2024-09-03T14:19:18.406307Z 11 [Note] [MY-013272] [Clone] Plugin Clone reported: 'Client: Tune continue, Data: 33 MiB/sec, Target: 28 MiB/sec.'
2024-09-03T14:19:18.406368Z 11 [Note] [MY-013272] [Clone] Plugin Clone reported: 'Client: Tune Threads from: 2 to: 4 prev: 2 target: 4.'
2024-09-03T14:19:18.558648Z 0 [Note] [MY-013272] [Clone] Plugin Clone reported: 'Client: Task Connect.'
2024-09-03T14:19:18.558665Z 0 [Note] [MY-013272] [Clone] Plugin Clone reported: 'Client: Task Connect.'
2024-09-03T14:19:18.561384Z 0 [Note] [MY-013457] [InnoDB] Clone Apply Begin Task ID: 2
2024-09-03T14:19:18.561446Z 0 [Note] [MY-013457] [InnoDB] Clone Apply Begin Task ID: 3
2024-09-03T14:19:18.561498Z 0 [Note] [MY-013272] [Clone] Plugin Clone reported: 'Client: Command COM_ATTACH.'
2024-09-03T14:19:18.561593Z 0 [Note] [MY-013272] [Clone] Plugin Clone reported: 'Client: Command COM_ATTACH.'
2024-09-03T14:19:21.236974Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.36-28).
2024-09-03T14:19:21.238005Z 0 [Note] [MY-010067] [Server] Giving 4 client threads a chance to die gracefully
2024-09-03T14:19:21.238081Z 0 [Note] [MY-010117] [Server] Shutting down replica threads
2024-09-03T14:19:21.238099Z 0 [Note] [MY-010054] [Server] Event Scheduler: Killing the scheduler thread, thread id 7
2024-09-03T14:19:21.238112Z 0 [Note] [MY-010050] [Server] Event Scheduler: Waiting for the scheduler thread to reply
2024-09-03T14:19:21.238258Z 0 [Note] [MY-010048] [Server] Event Scheduler: Stopped
2024-09-03T14:19:21.305639Z 11 [Note] [MY-013458] [InnoDB] Clone Set Error code: 1317 Saved Error code: 0
2024-09-03T14:19:21.305721Z 11 [Note] [MY-013458] [InnoDB] Clone Set Error code: 1317 Saved Error code: 1317
2024-09-03T14:19:21.307106Z 11 [Note] [MY-013458] [InnoDB] Clone Set Error code: 1317 Saved Error code: 1317
2024-09-03T14:19:21.307141Z 11 [Note] [MY-013458] [InnoDB] Clone Apply set error code: 1317: Query execution was interrupted

Steps to reproduce

  1. create empty async cluster
  2. restore from big backup
  3. slave is broken

Versions

  1. Kubernetes v1.30.3
  2. Operator percona-server-mysql-operator:0.8.0 (with some debug patches)
  3. Database percona/percona-server:8.0.36-28

Anything else?

No response

chernomor commented 2 months ago
# less bootstrap.log
...
2024/09/03 17:24:47 Clone required: true
2024/09/03 17:24:47 Checking if a clone in progress
2024/09/03 17:24:47 Clone in progress: false
2024/09/03 17:24:47 Cloning from racktables-prestable-mysql-0.racktables-prestable-mysql.racktables-prestable
2024/09/03 17:24:57 bootstrap finished in 10.125126 seconds
2024/09/03 17:24:57 bootstrap failed: clone from donor racktables-prestable-mysql-0.racktables-prestable-mysql.racktables-prestable: clone instance: invalid connection

It's like to one of these timeouts: https://github.com/percona/percona-server-mysql-operator/blob/2d1be08725a3be28e158914f66431b3a3bcba1e2/cmd/db/db.go#L36