vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

Bug Report: PITR from timestamp or position not working #14765

Closed GenLN closed 6 months ago

GenLN commented 10 months ago

Overview of the Issue

Taking incremental backups (builtin method) and full backups (xtrabackup method). PITR fails when restoring from either a timestamp or a position. K8s deployment with the operator.

Reproduction Steps

  1. Deploy the following vschema:
{
    "tables": {
        "product": {},
        "customer": {},
        "corder": {}
    }
}
  2. Deploy the schema:
create table if not exists product(
  sku varbinary(128),
  description varbinary(128),
  price bigint,
  primary key(sku)
) ENGINE=InnoDB;
create table if not exists customer(
  customer_id bigint not null auto_increment,
  email varbinary(128),
  primary key(customer_id)
) ENGINE=InnoDB;
create table if not exists corder(
  order_id bigint not null auto_increment,
  customer_id bigint,
  sku varbinary(128),
  price bigint,
  primary key(order_id)
) ENGINE=InnoDB;
  3. Run a full backup manually the first time, then run a cron with a full backup every 12h and incremental backups every hour. Add a cron that inserts random data into the tables every 30min so the incrementals don't fail (they fail if the database hasn't changed since the last backup).

  4. Try a PITR from a timestamp: vtctldclient RestoreFromBackup --restore-to-timestamp "timestamp" tablet-alias

  5. View the error
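
The schedule in step 3 could be expressed as a crontab along these lines (a hypothetical sketch; the keyspace/shard name, connection details, and insert statement are assumptions, not taken from the reporter's setup):

```
# Hypothetical crontab for the schedule above.
0 */12 * * *  vtctldclient BackupShard commerce/0
0 * * * *     vtctldclient BackupShard --incremental-from-pos=auto commerce/0
*/30 * * * *  mysql -e "insert into product values (uuid(), 'dummy', 1)"
```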

Binary Version

vitess version v18.0.1
latest tag for vitess operator

Operating System and Environment details

K8S environment

Log Fragments

Here is the log output when running PITR from timestamp:

https://gist.github.com/GenLN/7b9ba323aee7778390a9269379b16c2d

shlomi-noach commented 10 months ago

I am unable to reproduce, but I think this issue description leaves a lot of implementation freedom.

To avoid waiting days for reproduction, I used this script:

#!/bin/bash
vtctldclient BackupShard commerce/0
for f in {1..3}; do
  echo "new full backup cycle"
  for i in {1..12}; do
    echo "new incremental backup cycle"
    sleep 30
    mysql -e "insert into corder values(rand()*1000000,rand()*1000000,rand()*1000000,rand()*1000000);"
    mysql -e "insert into corder values(rand()*1000000,rand()*1000000,rand()*1000000,rand()*1000000);"
    sleep 30
    vtctldclient BackupShard commerce/0 --incremental-from-pos=auto
  done
  vtctldclient BackupShard commerce/0
done

This essentially turns 12h into 12m, 1h into 1m, and 30min into 1min.

The total runtime is about 36 minutes (more with overhead). On my test host, the run spanned Dec 17 12:31:34 to Dec 17 13:07:51, or in RFC3339 format, 2023-12-17T12:31:34Z to 2023-12-17T13:07:51Z.

My tablets are:

$ mysql -e "show vitess_tablets"
+-------+----------+-------+------------+---------+------------------+------------+----------------------+
| Cell  | Keyspace | Shard | TabletType | State   | Alias            | Hostname   | PrimaryTermStartTime |
+-------+----------+-------+------------+---------+------------------+------------+----------------------+
| zone1 | commerce | 0     | PRIMARY    | SERVING | zone1-0000000100 | <redacted> | 2023-12-17T12:24:32Z |
| zone1 | commerce | 0     | REPLICA    | SERVING | zone1-0000000101 | <redacted> |                      |
| zone1 | commerce | 0     | RDONLY     | SERVING | zone1-0000000102 | <redacted> |                      |
+-------+----------+-------+------------+---------+------------------+------------+----------------------+

I ran the following restores:

vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:31:35Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:32:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:35:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:42:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:43:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:44:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:45:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:46:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:47:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:48:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:49:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:50:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:51:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:52:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:53:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:54:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:55:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T13:04:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T13:07:00Z" zone1-0000000101

and all completed without error.

Let's try to focus on the mysqlbinlog error you got:

mysqlbinlog command failed: exit status 1\nfailed to apply binlog file /tmp/restore-incremental-3212119968/vt/vtdataroot/vt_2469782763/bin-logs/vt-2548885007-bin.000091

Can you find anything in the mysqld logs? Is it possible mysqlbinlog is not in your path? (Can you check that mysqlbinlog is on the same path as mysqld?)
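
One quick way to answer that question from inside the container (a generic sketch, not a Vitess tool):

```shell
#!/bin/sh
# Report where a binary resolves in PATH, or flag it as missing.
check_bin() {
  if p=$(command -v "$1" 2>/dev/null); then
    echo "$1 -> $p"
  else
    echo "$1 NOT FOUND in PATH=$PATH"
  fi
}
# Run inside the container that runs mysqld:
check_bin mysqld
check_bin mysqlbinlog
```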

shlomi-noach commented 9 months ago

Checking in: does any of the above make sense?

GenLN commented 8 months ago

Hi, I will retry everything from the beginning and update with the results.

GenLN commented 8 months ago

Hi @shlomi-noach, I ran the test again and still have issues restoring a PITR backup.

I will describe my steps and paste some gists of the components I define, plus logs from the restore attempts. All vtctldclient commands are run from my localhost machine; before that I set up port-forwards using the script ./pf.sh:

$ ./pf.sh &
alias vtctldclient="vtctldclient --server=localhost:15999"
alias vtctlclient="vtctlclient --server=localhost:15999"
alias mysql="mysql -h 127.0.0.1 -P 15306 -u user"
  1. Deploy k8s Operator

  2. Deploy the Vitess cluster using values.yaml. You will notice I use a different dir for error/slow logs, as I have a fluentbit sidecar container which parses those logs and ships them to a Loki instance (mounting an emptyDir for these logs)...

  3. Deploy the schema and vschema for the keyspaces: commerce/- and customer, sharded into two equal shards (x-80, 80-x) -- schemas and vschemas taken from the Vitess docs for Vitess on k8s

  4. Deploy a cronjob that inserts dummy data (simulating writes on Vitess)

  5. Deploy pitr-backup-cronjob

  6. Try RestoreFromBackup; I have these output logs from the attempts: 1, 2, 3, 4, 5

    $ vtctldclient RestoreFromBackup --restore-to-timestamp "2024-02-15T06:30:00Z" zone1-2469782763
    $ vtctldclient RestoreFromBackup --restore-to-timestamp "2024-02-15T06:30:15Z" zone1-0790125915
    $ vtctldclient RestoreFromBackup --restore-to-timestamp "2024-02-15T06:00:15Z" zone1-0790125915
    $ vtctldclient RestoreFromBackup --restore-to-timestamp "2024-02-15T09:00:14Z" zone1-2469782763
    $ vtctldclient RestoreFromBackup --restore-to-timestamp "2024-02-15T09:00:16Z" zone1-2859626137 
  7. Restore Failed

  8. Here is the mysqld error.log from the commerce/- vttablet: error.log

Here is some info:

$ mysql -e "show vitess_tablets"
+-------+----------+-------+------------+---------+------------------+--------------+----------------------+
| Cell  | Keyspace | Shard | TabletType | State   | Alias            | Hostname     | PrimaryTermStartTime |
+-------+----------+-------+------------+---------+------------------+--------------+----------------------+
| zone1 | commerce | -     | PRIMARY    | SERVING | zone1-2548885007 | REDACTED     | 2024-02-14T11:06:07Z |
| zone1 | commerce | -     | REPLICA    | SERVING | zone1-2469782763 | REDACTED     |                      |
| zone1 | commerce | -     | REPLICA    | SERVING | zone1-0790125915 | REDACTED     |                      |
| zone1 | customer | -80   | PRIMARY    | SERVING | zone1-0120139806 | REDACTED     | 2024-02-14T14:30:45Z |
| zone1 | customer | -80   | REPLICA    | SERVING | zone1-2859626137 | REDACTED     |                      |
| zone1 | customer | -80   | REPLICA    | SERVING | zone1-2289928654 | REDACTED     |                      |
| zone1 | customer | 80-   | PRIMARY    | SERVING | zone1-4277914223 | REDACTED     | 2024-02-14T14:30:04Z |
| zone1 | customer | 80-   | REPLICA    | SERVING | zone1-0118374573 | REDACTED     |                      |
| zone1 | customer | 80-   | REPLICA    | SERVING | zone1-2298643297 | REDACTED     |                      |
+-------+----------+-------+------------+---------+------------------+--------------+----------------------+

Here is the list of backups for commerce/-, taken every 30 min. (The backup script keeps a counter of backups: if the counter doesn't exist or equals 12, it runs a FULL backup; otherwise it runs an incremental backup.)

$ vtctldclient GetBackups commerce/-
Handling connection for 15999
2024-02-14.144515.zone1-0790125915
2024-02-14.145018.zone1-2469782763
2024-02-14.145518.zone1-0790125915
2024-02-14.150014.zone1-0790125915
2024-02-14.153015.zone1-2469782763
2024-02-14.160015.zone1-2469782763
2024-02-14.163014.zone1-2469782763
2024-02-14.170015.zone1-2469782763
2024-02-14.173014.zone1-0790125915
2024-02-14.180015.zone1-0790125915
2024-02-14.183016.zone1-0790125915
2024-02-14.190015.zone1-2469782763
2024-02-14.193016.zone1-2469782763
2024-02-14.200014.zone1-2469782763
2024-02-14.203016.zone1-2469782763
2024-02-14.210014.zone1-2469782763
2024-02-14.213015.zone1-0790125915
2024-02-14.220014.zone1-0790125915
2024-02-14.223014.zone1-0790125915
2024-02-14.230015.zone1-2469782763
2024-02-14.233016.zone1-0790125915
2024-02-15.000014.zone1-0790125915
2024-02-15.003014.zone1-0790125915
2024-02-15.010015.zone1-0790125915
2024-02-15.013014.zone1-0790125915
2024-02-15.020015.zone1-2469782763
2024-02-15.023014.zone1-0790125915
2024-02-15.030015.zone1-2469782763
2024-02-15.033014.zone1-0790125915
2024-02-15.040015.zone1-0790125915
2024-02-15.043014.zone1-0790125915
2024-02-15.050017.zone1-2469782763
2024-02-15.053014.zone1-0790125915
2024-02-15.060015.zone1-0790125915
2024-02-15.063015.zone1-0790125915
2024-02-15.070014.zone1-0790125915
2024-02-15.073014.zone1-0790125915
2024-02-15.080015.zone1-2469782763
2024-02-15.083014.zone1-2469782763
2024-02-15.090014.zone1-2469782763
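
The full/incremental counter logic described above can be sketched like this (the file name, threshold handling, and commented-out commands are assumptions about the reporter's script, not its actual contents):

```shell
#!/bin/sh
# Sketch of the backup counter: a missing counter file or a value of 12
# triggers a FULL backup and resets the counter; otherwise take an
# incremental backup and bump the counter.
COUNTER_FILE="${COUNTER_FILE:-/tmp/backup_counter_commerce}"
count=$(cat "$COUNTER_FILE" 2>/dev/null || echo 12)
if [ "$count" -ge 12 ]; then
  echo "FULL backup"
  # vtctldclient BackupShard commerce/-
  echo 0 > "$COUNTER_FILE"
else
  echo "INCREMENTAL backup"
  # vtctldclient BackupShard --incremental-from-pos=auto commerce/-
  echo $((count + 1)) > "$COUNTER_FILE"
fi
```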

Here is the bin-logs dir of the vttablet I tried to restore from timestamp (binlog retention isn't defined, so I'm using the Vitess default):

$ ls vt/vtdataroot/vt_0790125915/bin-logs/
vt-0790125915-bin.000006  vt-0790125915-bin.000007  vt-0790125915-bin.000008  vt-0790125915-bin.index

Here is the listing of /tmp, which was referenced in the error message at the end of the restore attempt logs:

$ ls /tmp
vt-gh-ost
vttablet.ERROR
vttablet.INFO
vttablet.WARNING
vttablet.dev-vitess-vttablet-zone1-0790125915-0d0930ad.vitess.log.ERROR.20240214-112904.1
vttablet.dev-vitess-vttablet-zone1-0790125915-0d0930ad.vitess.log.INFO.20240214-112904.1
vttablet.dev-vitess-vttablet-zone1-0790125915-0d0930ad.vitess.log.WARNING.20240214-112904.1

$ cat /tmp/vttablet.dev-vitess-vttablet-zone1-0790125915-0d0930ad.vitess.log.ERROR.20240214-112904.1 
Log file created at: 2024/02/14 11:29:04
Running on machine: dev-vitess-vttablet-zone1-0790125915-0d0930ad
Binary: Built with gc go1.21.4 for linux/amd64
Previous log: <none>
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0214 11:29:04.512681       1 syslogger.go:149] can't connect to syslog

INCREMENTAL backups run using this function from the script (there is one vtorc pod per keyspace per shard, so I read the keyspace/shard label values from the vtorc pods and use them to run BackupShard):

# Function to perform incremental backup
perform_incremental_backup() {
    for vtorc_pod in $vtorc_pods; do
        echo "Processing vtorc pod: $vtorc_pod"

        # Fetch the keyspace and shard labels from the vtorc pod
        keyspace=$(kubectl get pod "$vtorc_pod" -o jsonpath='{.metadata.labels.planetscale\.com/keyspace}')
        shard=$(kubectl get pod "$vtorc_pod" -o jsonpath='{.metadata.labels.planetscale\.com/shard}')

        # Remove 'x' letters from the shard label
        shard="${shard//x/}"

        echo "=========================== Executing INCREMENTAL BACKUP command on $keyspace / $shard ======================================"

        # Run the BackupShard command with keyspace/shard as a single argument
        vtctldclient --server="${service_name}:${port}" --alsologtostderr BackupShard --incremental-from-pos=auto "${keyspace}/${shard}"
    done
}
shlomi-noach commented 8 months ago

Hi @GenLN ! Thanks for following up. In trying to narrow things down, were you able to follow my flow on https://github.com/vitessio/vitess/issues/14765#issuecomment-1859178702, and run that exact script on your servers?

Did you happen to check the mysqlbinlog binary as per https://github.com/vitessio/vitess/issues/14765#issuecomment-1859178702?

Realistically, I'm not going to be able to reproduce your specific setup on a k8s cluster, so I'm trying to eliminate what I think are irrelevant aspects of your setup, and narrow down to some basics.

shlomi-noach commented 8 months ago

Let's again focus on the specific error. From your logs above:

mysqlbinlog command failed: exit status 1: Can't restore backup: failed to apply binlog file /tmp/restore-incremental-2625521499/vt/vtdataroot/vt_0790125915/bin-logs/vt-2469782763-bin.000004

Thank you!

GenLN commented 8 months ago

Hi @shlomi-noach, I also tried your flow for testing PITR. The same error happened:

E0216 10:57:46.259043   77402 main.go:56] rpc error: code = Unknown desc = TabletManager.RestoreFromBackup on zone1-0790125915 error: Can't restore backup: failed to apply binlog file /tmp/restore-incremental-2976517831/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000001: rpc error: code = Unknown desc = mysqlbinlog command failed: exit status 1: Can't restore backup: failed to apply binlog file /tmp/restore-incremental-2976517831/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000001: rpc error: code = Unknown desc = mysqlbinlog command failed: exit status 1

Where should I check the mysqlbinlog path? My localhost machine, which executes the vtctldclient commands, does have mysqlbinlog.

As I said, I'm using this script to port-forward vtctld, vtgate, and vtadmin, and I use the vtctldclient tool present on my machine along with the other vt binaries:

local@machine$ ls /usr/local/vitess/bin 
mysqlctl   topo2topo   vtadmin   vtbench   vtcombo  vtctlclient  vtctldclient  vtgate  vttablet      zk     zkctld
mysqlctld  vtaclcheck  vtbackup  vtclient  vtctl    vtctld       vtexplain     vtorc   vttestserver  zkctl

local@machine$ vtctldclient --version
vtctldclient version Version: 18.0.0 (Git revision 9a6f5262f7707ff80ce85c111d2ff686d85d29cc branch 'HEAD') built on Mon Nov  6 12:16:43 UTC 2023 by runner@fv-az422-64 using go1.21.3 linux/amd64

local@machine$ whereis mysqlbinlog
mysqlbinlog: /usr/bin/mysqlbinlog

local@machine$ mysqlbinlog --version
mysqlbinlog  Ver 8.0.36 for Linux on x86_64 (MySQL Community Server - GPL)

local@machine$ echo $PATH
/usr/local/vitess/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin

These are images of the cluster

images:
    vtctld: vitess/vtctld:v18.0.1
    vtadmin: vitess/vtadmin:v18.0.1
    vtgate: vitess/lite:v18.0.1
    vttablet: vitess/lite:v18.0.1
    vtbackup: vitess/lite:v18.0.1
    vtorc: vitess/lite:v18.0.1
    mysqld:
      mysql80Compatible: vitess/lite:v18.0.1
    mysqldExporter: prom/mysqld-exporter:v0.14.0

vtctld doesn't have mysqlbinlog; vtadmin doesn't have mysqlbinlog; all vitess/lite images do have mysqlbinlog.

shlomi-noach commented 8 months ago

Where should I check mysqlbinlog path?

On the same container that runs your mysqld instances or vtmysqlctl instances.

GenLN commented 8 months ago

This is a k8s environment; mysqld is a container inside the vttablet pod. The same Docker image, vitess/lite:v18.0.1, is used for the mysqld and vttablet containers. VTTablet pod containers:

mysqld container

vitess@dev-vitess-vttablet-zone1-0790125915-0d0930ad:/$ whereis mysqlbinlog
mysqlbinlog: /usr/bin/mysqlbinlog

vitess@dev-vitess-vttablet-zone1-0790125915-0d0930ad:/$ echo $PATH
/vt/src/vitess.io/vitess/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

vttablet container

vitess@dev-vitess-vttablet-zone1-0790125915-0d0930ad:/$ whereis mysqlbinlog
mysqlbinlog: /usr/bin/mysqlbinlog

vitess@dev-vitess-vttablet-zone1-0790125915-0d0930ad:/$ echo $PATH
/vt/src/vitess.io/vitess/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
shlomi-noach commented 8 months ago

We have a visibility problem here. We're looking for the mysqlbinlog error output. Right now I believe you should be able to find it in the mysqld container logs.

Regardless, vitess should clearly output any error messages from mysqlbinlog. Please consider this patch: https://github.com/vitessio/vitess/commit/3652f6800f9b35cf70e0ffcf9a47936cdff20cf2 (I will create an orderly PR later on).

If you're able to apply this patch, build, and deploy to your k8s environment, I think we'll have a better idea of the error.

shlomi-noach commented 8 months ago

Related issue: https://github.com/vitessio/vitess/issues/15277

shlomi-noach commented 8 months ago

PR: https://github.com/vitessio/vitess/pull/15278

GenLN commented 7 months ago

Hi @shlomi-noach, I have updates on this issue. I used this image setup on k8s with Vitess v19.0.0. Important note: the docs say you can freely use the mysql:8.0.30 or even mysql:8.0.34 image, but incremental backups fail with those images, as the output shows no path for mysqlbinlog...

images:
    vtctld: vitess/lite:v19.0.0
    vtadmin: vitess/vtadmin:v19.0.0
    vtgate: vitess/lite:v19.0.0
    vttablet: vitess/lite:v19.0.0
    vtbackup: vitess/lite:v19.0.0
    vtorc: vitess/lite:v19.0.0
    mysqld:
      mysql80Compatible: vitess/lite:v19.0.0-mysql80
    mysqldExporter: prom/mysqld-exporter:v0.14.0

With this, when I run:

vtctldclient RestoreFromBackup --restore-to-timestamp "2024-03-08T10:23:00Z" zone1-0790125915

I get this output at the end:

Handling connection for 15999
E0308 11:35:55.012632  110704 main.go:56] rpc error: code = Unknown desc = TabletManager.RestoreFromBackup on zone1-0790125915: Can't restore backup: failed to apply binlog file /tmp/restore-incremental-3487966834/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000002: rpc error: code = Unknown desc = mysqlbinlog command failed: with error output: mysqlbinlog: File '/tmp/restore-incremental-3487966834/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000002' not found (OS errno 2 - No such file or directory)
ERROR: Could not open log file
: exit status 1

The whole log output is available here: gists

mattlord commented 7 months ago

@GenLN, the output does not show that the mysqlbinlog binary could not be found. It shows that it WAS found, and it was in fact mysqlbinlog that returned this error:

mysqlbinlog: File '/tmp/restore-incremental-3487966834/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000002' not found (OS errno 2 - No such file or directory)

For example:

❯ mysqlbinlog /tmp/noexist 1>/dev/null
mysqlbinlog: File '/tmp/noexist' not found (OS errno 2 - No such file or directory)
ERROR: Could not open log file

Why it's returning that I cannot say as I don't have access to your systems. Is it somehow not correct to use that specific binary log file? Is that not the right location? Does the file actually exist? I have no idea.

This is not clearly a bug to me at this point, and it may require a lot of back and forth, so it might be better to move this to the Vitess Slack.

shlomi-noach commented 7 months ago

If this requires a lot of back and forth we can move to Slack -- but I'd say this is a bug. The RestoreFromBackup command clearly expects the file to exist, and asks mysqlbinlog to apply it. I have no insight into why the file cannot be found -- could you e.g. run find /tmp/restore-incremental-3487966834/ ?

GenLN commented 7 months ago

@mattlord, sorry for the misunderstanding about this note: "Important note: the docs say you can freely use the mysql:8.0.30 or even mysql:8.0.34 image, but incremental backups fail with those images, as the output shows no path for mysqlbinlog..."

"no path for mysqlbinlog" happens when using the mysql:8.0.30 or mysql:8.0.34 image for mysqld. (It doesn't happen when using the mysqld image vitess/lite:v19.0.0-mysql80.)

@shlomi-noach, there isn't any such folder or file in either the mysqld container or the vttablet container.

mysqld container:

$ ls -alh /tmp/
total 8.0K
drwxrwxrwt 1 root root 4.0K Mar 11 10:32 .
drwxr-xr-x 1 root root 4.0K Mar  8 10:19 ..

$ ls -alh /vt/vtdataroot/vt_0790125915/
total 36K
drwxr-sr-x 7 vitess vitess 4.0K Mar 11 10:32 .
drwxrwsr-x 3 root   vitess 4.0K Mar  8 10:19 ..
drwxr-sr-x 2 vitess vitess 4.0K Mar 11 10:32 bin-logs
drwxr-s--- 8 vitess vitess 4.0K Mar 11 10:32 data
drwxr-sr-x 4 vitess vitess 4.0K Mar 11 10:32 innodb
-rw-r--r-- 1 vitess vitess 2.9K Mar 11 10:32 my.cnf
-rw-r----- 1 vitess vitess    5 Mar 11 10:32 mysql.pid
drwxr-sr-x 2 vitess vitess 4.0K Mar 11 10:32 relay-logs
-rw-r--r-- 1 vitess vitess    0 Mar 11 10:32 restore_in_progress
drwxr-sr-x 2 vitess vitess 4.0K Mar 11 10:32 tmp

$ ls -alh /vt/vtdataroot/vt_0790125915/tmp/
total 8.0K
drwxr-sr-x 2 vitess vitess 4.0K Mar 11 10:32 .
drwxr-sr-x 7 vitess vitess 4.0K Mar 11 10:32 ..

$ cat /vt/vtdataroot/vt_0790125915/restore_in_progress 
vitess@dev-vitess-vttablet-zone1-0790125915-0d0930ad:/$ 

vttablet:

$ ls -alh /tmp/
total 8.9M
drwxrwxrwt 1 root   root   4.0K Mar 11 10:32 .
drwxr-xr-x 1 root   root   4.0K Mar  8 10:19 ..
-rw-r--r-- 1 vitess vitess   57 Mar 11 09:58 percona-version-check
-rwxr-xr-x 1 vitess vitess 8.9M Mar  8 10:19 vt-gh-ost

/vt is a shared folder for each container in the vttablet pod, so the output for /vt/vtdataroot/... is the same for mysqld and vttablet.

Also, the mysqld container gets restarted during the restore process; this could be why the /tmp folder looks empty.

shlomi-noach commented 7 months ago

@GenLN right. By the time you go looking for it, /tmp/restore-incremental-* will have already been cleaned up by the restore process.

Looking at the log gist file:

commerce/- (zone1-0790125915): time:{seconds:1709894154 nanoseconds:781767909} file:"backup.go" line:475 value:"Restore: applying 2 incremental backups"
commerce/- (zone1-0790125915): time:{seconds:1709894154 nanoseconds:803742162} file:"builtinbackupengine.go" line:931 value:"Restoring incremental backup to position: 5f3b7323-dd35-11ee-9ec5-aae9b09cabe1:1-29"
commerce/- (zone1-0790125915): time:{seconds:1709894154 nanoseconds:803757012} level:WARNING file:"builtinbackupengine.go" line:1000 value:"engine \"pargzip\" doesn't support decompression, using \"pgzip\" instead"
commerce/- (zone1-0790125915): time:{seconds:1709894154 nanoseconds:803835914} file:"builtinbackupengine.go" line:1050 value:"Copying file 0: vt-0790125915-bin.000002"
commerce/- (zone1-0790125915): time:{seconds:1709894154 nanoseconds:830929001} file:"compression.go" line:222 value:"Decompressing backup using engine \"pgzip\""
commerce/- (zone1-0790125915): time:{seconds:1709894154 nanoseconds:831001633} file:"builtinbackupengine.go" line:1143 value:"closing decompressor"
commerce/- (zone1-0790125915): time:{seconds:1709894154 nanoseconds:831053795} file:"builtinbackupengine.go" line:777 value:"Done taking Backup \"0\""
commerce/- (zone1-0790125915): time:{seconds:1709894154 nanoseconds:835281495} file:"restore.go" line:278 value:"Restore: got a restore manifest: <nil>, err=rpc error: code = Unknown desc = mysqlbinlog command failed: with error output: mysqlbinlog: File '/tmp/restore-incremental-3487966834/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000002' not found (OS errno 2 - No such file or directory)\nERROR: Could not open log file\n: exit status 1\nfailed to apply binlog file /tmp/restore-incremental-3487966834/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000002, waitForBackupInterval=0s"

So the restore process was supposed to copy file vt-0790125915-bin.000002 into /tmp/restore-incremental-3487966834/vt/vtdataroot/vt_0790125915/bin-logs/, but when mysqlbinlog was told to apply said file at /tmp/restore-incremental-3487966834/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000002, the file could not be found.

There is no error in the restore process to indicate that it was unable to create said file. So we're in the dark again. It feels like perhaps a k8s permission issue? Perhaps the restore process did create the correct files, but then mysqlbinlog did not have the authorization to read said files?

I did not ask: are you comfortable building a vitess binary? I'd add a bunch of log entries from the moment the files are copied to validate their existence up till the point where we instruct mysqlbinlog to read them, and beyond.
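
In the meantime, the kind of validation being discussed can be approximated by hand (a hedged sketch, not the proposed patch; the path is just the one from the error message, and will normally have been cleaned up by the time you look):

```shell
#!/bin/sh
# Check that a binlog file exists and is readable before mysqlbinlog is
# asked to apply it; print a diagnosis either way.
check_binlog() {
  f="$1"
  if [ ! -e "$f" ]; then echo "MISSING: $f"
  elif [ ! -r "$f" ]; then echo "UNREADABLE: $f"
  else echo "OK: $f"
  fi
}
check_binlog /tmp/restore-incremental-3487966834/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000002
```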

GenLN commented 7 months ago

@shlomi-noach sure, just drop the files; I will pull them, build the Docker image, and use it for the Vitess cluster.

Does it matter which user is set as the xtrabackup_user? In this testing Vitess cluster I have set xtrabackup_user: vt_dba.

Also, is this normal content for the incremental backup dir? I notice the incremental backup file is named 0, which I guessed references shard name "0", but in k8s the initial shards of single-sharded keyspaces are named "-", not "0":

Name           Size             Type
0              588 B            application/x-gzip
MANIFEST       1.2 KB           text/plain; charset=utf-8
shlomi-noach commented 7 months ago

Does it affect which user is setup for xtrabackup_user?

As far as I can see, no.

Also is this a normal content of the dir for incremental backups?

Yes. The file name 0 has nothing to do with shard 0. If you look at a full backup, you'll see files named 0, 1, 2 and so forth.

shlomi-noach commented 7 months ago

I think I have it. You said:

/vt is shared folder for each container in vttablet pod so it is same output for mysqld and vttablet

But the incremental restore does not extract the files onto /vt. Instead, it extracts them onto /tmp/restore-incremental-*. This is done by vtctld or vttablet pod. The mysqld pod does not see those files.

GenLN commented 7 months ago

So that means the incremental backup files, which mysqlbinlog should apply, are downloaded to the wrong location.

Perhaps that location is fine for a Docker deployment (as you confirmed, it works as expected in a Docker env), but not for k8s?

Maybe I can try to persist the /tmp folder and share it between the containers in the pod, at least to see whether that works?
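
For reference, sharing a scratch directory between the two containers would look roughly like this in plain pod terms (a generic Kubernetes sketch, not vitess-operator configuration):

```yaml
# Hypothetical pod fragment: an emptyDir mounted at the same path in both
# the vttablet and mysqld containers, so files written by one are visible
# to the other.
spec:
  volumes:
    - name: restore-scratch
      emptyDir: {}
  containers:
    - name: vttablet
      volumeMounts:
        - name: restore-scratch
          mountPath: /tmp
    - name: mysqld
      volumeMounts:
        - name: restore-scratch
          mountPath: /tmp
```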

shlomi-noach commented 7 months ago

Why don't you try https://github.com/vitessio/vitess/pull/15440, see if that works?

mattlord commented 7 months ago

The xtrabackupengine uses the tmpdir value in the my.cnf file: https://github.com/vitessio/vitess/blob/main/go/vt/mysqlctl/xtrabackupengine.go

https://dev.mysql.com/doc/refman/8.0/en/temporary-files.html

That defaults to /tmp. You could change that to /vt/tmp here in your cluster definition: https://github.com/planetscale/vitess-operator/blob/main/docs/api.md#mysqldspec
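
The my.cnf override being suggested would look like this (illustrative only):

```
[mysqld]
tmpdir = /vt/tmp
```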

shlomi-noach commented 7 months ago

@mattlord I think you're referencing an unrelated setting. The path we're using is generated by vitess, not by MySQL. Vitess reads a backup manifest, then copies over the backup files (binary logs in our case) under said generated path, then requests mysqld to run mysqlbinlog using those files.

Moreover, this is done by the builtin engine, not the xtrabackup engine. I can add a configuration option in the builtin engine as an alternative to https://github.com/vitessio/vitess/pull/15440.

mattlord commented 7 months ago

Ah, I see: https://github.com/vitessio/vitess/blob/46975b20fa6b9dd77dc666cedded8b8f16bc757a/go/vt/mysqlctl/builtinbackupengine.go#L1008-L1013

That's the only place that the restore-incremental string literal is used.

GenLN commented 7 months ago

Why don't you try https://github.com/vitessio/vitess/pull/15440, see if that works?

So I should build a Vitess image with these changes and run the test? Using this procedure (bootstrap image, then base, then lite)? Vitess docker build

shlomi-noach commented 7 months ago

So I should build vitess image using these changes and run the test?

Yes please.

Using this procedure from bootstrap image to build base and then lite?

Honestly, I'm not sure. This is why I asked whether you're comfortable building a vitess binary.

GenLN commented 7 months ago

I think this solved the issue :+1: I built the image following those instructions, with the changes you made... Ran backups and successfully restored PITR, or so I believe... However, one vttablet is now set to the DRAINED type after the PITR. Here is the last log from the PITR:

commerce/- (zone1-0790125915): time:{seconds:1710170821 nanoseconds:31718605} file:"backup.go" line:450 value:"Restore: starting mysqld for mysql_upgrade"
commerce/- (zone1-0790125915): time:{seconds:1710170823 nanoseconds:34273131} file:"backup.go" line:456 value:"Restore: running mysql_upgrade"
commerce/- (zone1-0790125915): time:{seconds:1710170823 nanoseconds:34904826} file:"backup.go" line:463 value:"Restore: restarting mysqld after mysql_upgrade"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:42300139} file:"backup.go" line:475 value:"Restore: applying 6 incremental backups"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:74517229} file:"builtinbackupengine.go" line:933 value:"Restoring incremental backup to position: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-28"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:74537690} level:WARNING file:"builtinbackupengine.go" line:1003 value:"engine \"pargzip\" doesn't support decompression, using \"pgzip\" instead"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:74620512} file:"builtinbackupengine.go" line:1053 value:"Copying file 0: vt-0790125915-bin.000001"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:124447381} file:"compression.go" line:222 value:"Decompressing backup using engine \"pgzip\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:124621285} file:"builtinbackupengine.go" line:1146 value:"closing decompressor"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:124717567} file:"builtinbackupengine.go" line:778 value:"Done taking Backup \"0\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:149121436} file:"builtinbackupengine.go" line:958 value:"Applied binlog file: /vt/vtdataroot/restore-incremental-833467721/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000001"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:149138636} file:"builtinbackupengine.go" line:964 value:"Restored incremental backup files to: /vt/vtdataroot/restore-incremental-833467721"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:149354721} file:"builtinbackupengine.go" line:993 value:"Restore: returning replication position fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-28"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:149379262} file:"backup.go" line:484 value:"Restore: applied incremental backup: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-28"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:182249347} file:"builtinbackupengine.go" line:933 value:"Restoring incremental backup to position: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-30"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:182271147} level:WARNING file:"builtinbackupengine.go" line:1003 value:"engine \"pargzip\" doesn't support decompression, using \"pgzip\" instead"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:182352859} file:"builtinbackupengine.go" line:1053 value:"Copying file 0: vt-0790125915-bin.000002"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:254408236} file:"compression.go" line:222 value:"Decompressing backup using engine \"pgzip\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:254531709} file:"builtinbackupengine.go" line:1146 value:"closing decompressor"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:254575220} file:"builtinbackupengine.go" line:778 value:"Done taking Backup \"0\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:273075721} file:"builtinbackupengine.go" line:958 value:"Applied binlog file: /vt/vtdataroot/restore-incremental-1572766817/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000002"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:273090761} file:"builtinbackupengine.go" line:964 value:"Restored incremental backup files to: /vt/vtdataroot/restore-incremental-1572766817"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:273307676} file:"builtinbackupengine.go" line:993 value:"Restore: returning replication position fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-30"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:273327407} file:"backup.go" line:484 value:"Restore: applied incremental backup: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-30"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:354513357} file:"builtinbackupengine.go" line:933 value:"Restoring incremental backup to position: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-32"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:354532807} level:WARNING file:"builtinbackupengine.go" line:1003 value:"engine \"pargzip\" doesn't support decompression, using \"pgzip\" instead"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:354615069} file:"builtinbackupengine.go" line:1053 value:"Copying file 0: vt-2548885007-bin.000002"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:422943859} file:"compression.go" line:222 value:"Decompressing backup using engine \"pgzip\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:423037871} file:"builtinbackupengine.go" line:1146 value:"closing decompressor"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:423118783} file:"builtinbackupengine.go" line:778 value:"Done taking Backup \"0\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:441822098} file:"builtinbackupengine.go" line:958 value:"Applied binlog file: /vt/vtdataroot/restore-incremental-2102494854/vt/vtdataroot/vt_0790125915/bin-logs/vt-2548885007-bin.000002"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:441841879} file:"builtinbackupengine.go" line:964 value:"Restored incremental backup files to: /vt/vtdataroot/restore-incremental-2102494854"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:442064364} file:"builtinbackupengine.go" line:993 value:"Restore: returning replication position fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-32"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:442083524} file:"backup.go" line:484 value:"Restore: applied incremental backup: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-32"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:473260860} file:"builtinbackupengine.go" line:933 value:"Restoring incremental backup to position: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-34"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:473279410} level:WARNING file:"builtinbackupengine.go" line:1003 value:"engine \"pargzip\" doesn't support decompression, using \"pgzip\" instead"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:473361122} file:"builtinbackupengine.go" line:1053 value:"Copying file 0: vt-2548885007-bin.000004"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:525794483} file:"compression.go" line:222 value:"Decompressing backup using engine \"pgzip\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:525905336} file:"builtinbackupengine.go" line:1146 value:"closing decompressor"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:525985598} file:"builtinbackupengine.go" line:778 value:"Done taking Backup \"0\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:544864697} file:"builtinbackupengine.go" line:958 value:"Applied binlog file: /vt/vtdataroot/restore-incremental-751988353/vt/vtdataroot/vt_0790125915/bin-logs/vt-2548885007-bin.000004"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:544890838} file:"builtinbackupengine.go" line:964 value:"Restored incremental backup files to: /vt/vtdataroot/restore-incremental-751988353"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:545132853} file:"builtinbackupengine.go" line:993 value:"Restore: returning replication position fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-34"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:545149194} file:"backup.go" line:484 value:"Restore: applied incremental backup: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-34"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:569414389} file:"builtinbackupengine.go" line:933 value:"Restoring incremental backup to position: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-36"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:569432309} level:WARNING file:"builtinbackupengine.go" line:1003 value:"engine \"pargzip\" doesn't support decompression, using \"pgzip\" instead"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:569509981} file:"builtinbackupengine.go" line:1053 value:"Copying file 0: vt-2548885007-bin.000005"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:603878621} file:"compression.go" line:222 value:"Decompressing backup using engine \"pgzip\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:603963263} file:"builtinbackupengine.go" line:1146 value:"closing decompressor"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:604023374} file:"builtinbackupengine.go" line:778 value:"Done taking Backup \"0\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:624610263} file:"builtinbackupengine.go" line:958 value:"Applied binlog file: /vt/vtdataroot/restore-incremental-2566389258/vt/vtdataroot/vt_0790125915/bin-logs/vt-2548885007-bin.000005"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:624624424} file:"builtinbackupengine.go" line:964 value:"Restored incremental backup files to: /vt/vtdataroot/restore-incremental-2566389258"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:624812008} file:"builtinbackupengine.go" line:993 value:"Restore: returning replication position fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-36"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:624829268} file:"backup.go" line:484 value:"Restore: applied incremental backup: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-36"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:656648969} file:"builtinbackupengine.go" line:933 value:"Restoring incremental backup to position: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-38"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:656667249} level:WARNING file:"builtinbackupengine.go" line:1003 value:"engine \"pargzip\" doesn't support decompression, using \"pgzip\" instead"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:656740561} file:"builtinbackupengine.go" line:1053 value:"Copying file 0: vt-0790125915-bin.000003"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:748701142} file:"compression.go" line:222 value:"Decompressing backup using engine \"pgzip\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:748927507} file:"builtinbackupengine.go" line:1146 value:"closing decompressor"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:749001599} file:"builtinbackupengine.go" line:778 value:"Done taking Backup \"0\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770191402} file:"builtinbackupengine.go" line:958 value:"Applied binlog file: /vt/vtdataroot/restore-incremental-3497800045/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000003"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770242563} file:"builtinbackupengine.go" line:964 value:"Restored incremental backup files to: /vt/vtdataroot/restore-incremental-3497800045"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770500529} file:"builtinbackupengine.go" line:993 value:"Restore: returning replication position fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-38"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770520089} file:"backup.go" line:484 value:"Restore: applied incremental backup: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-38"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770529479} file:"backup.go" line:486 value:"Restore: done applying incremental backups"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770539710} file:"backup.go" line:489 value:"Restore: removing state file"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770574670} file:"backup.go" line:496 value:"Restore: complete"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770610801} file:"restore.go" line:278 value:"Restore: got a restore manifest: &{2024-03-11.151135.zone1-2548885007 xtrabackup fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-26 fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-26 <nil>  false 2024-03-11T15:11:35Z 2024-03-11T15:11:40Z f4bd2f84-dfb8-11ee-9e26-0610ce1e24bf zone1-2548885007 commerce - Ver 8.0.30 MySQL Community Server - GPL true <nil>}, err=<nil>, waitForBackupInterval=0s"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770633272} file:"restore.go" line:298 value:"Restore: pos=MySQL56/fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-26"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770645932} file:"restore.go" line:317 value:"Restore: disabling replication"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:853305236} file:"restore.go" line:357 value:"Restore: will set tablet type to DRAINED as this is a point in time recovery"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:853361248} file:"restore.go" line:360 value:"Restore: changing tablet type to DRAINED for cell:\"zone1\" uid:790125915"
shlomi-noach commented 7 months ago

Looks good! OK, we need to decide what the actual solution should be. As far as I can see, it's either https://github.com/vitessio/vitess/pull/15440, or, as @mattlord suggested, adding a command-line flag so you can configure the path yourself.

GenLN commented 7 months ago

Keep in mind that this only works if the vitess/lite image (with #15440 included) is used for mysqld. The same bug hits if the official MySQL image is used (I will test this with this PR, as I believe the vttablet container starts the Restore procedure, so maybe this fix will work with the official image)...

Also, another question: could we expect this to be released as a patch? I don't see any breaking changes here, just a bugfix for incremental restoration in a K8s environment. I will also do a few more tests with the image I built from your PR and bring more updates if I encounter any issues. I still need to perform a full PITR, as I haven't had a chance to get to that step yet. 😄

shlomi-noach commented 7 months ago

Keep in mind that this only works if the vitess/lite image (with https://github.com/vitessio/vitess/pull/15440 included) is used for mysqld. The same bug hits if the official MySQL image is used (I will test this with this PR, as I believe the vttablet container starts the Restore procedure, so maybe this fix will work with the official image)...

I'm not sure I understand. Could you please explain again?

Could we expect this to be released as a patch?

Yes.

shlomi-noach commented 7 months ago

Alternative solution: https://github.com/vitessio/vitess/pull/15451

GenLN commented 7 months ago

Keep in mind that this only works if the vitess/lite image (with #15440 included) is used for mysqld. The same bug hits if the official MySQL image is used (I will test this with this PR, as I believe the vttablet container starts the Restore procedure, so maybe this fix will work with the official image)...

I'm not sure I understand. Could you please explain again?

Could we expect this to be released as a patch?

Yes.

Here, I just tried using the official mysql:8.0.30 image in a K8s environment; vitess/lite:pitr-test is the image I built based on your PR #15440:

images:
    vtctld: myrepo/vitess/lite:pitr-test
    vtadmin: vitess/vtadmin:v19.0.0
    vtgate: myrepo/vitess/lite:pitr-test
    vttablet: myrepo/vitess/lite:pitr-test
    vtbackup: myrepo/vitess/lite:pitr-test
    vtorc: myrepo/vitess/lite:pitr-test
    mysqld:
      mysql80Compatible: mysql:8.0.30
    mysqldExporter: prom/mysqld-exporter:v0.14.0

Here I can't take incremental backups; only full backups work. Check the output:

commerce/- (zone1-2548885007): time:{seconds:1710231387 nanoseconds:286567256} file:"backup.go" line:152 value:"Starting backup 2024-03-12.081627.zone1-2548885007"
commerce/- (zone1-2548885007): time:{seconds:1710231387 nanoseconds:286628665} file:"builtinbackupengine.go" line:209 value:"Executing Backup at 2024-03-12 08:16:27.28599096 +0000 UTC m=+407.952151989 for keyspace/shard commerce/- on tablet zone1-2548885007, concurrency: 4, compress: true, incrementalFromPos: auto"
commerce/- (zone1-2548885007): time:{seconds:1710231387 nanoseconds:287800266} file:"builtinbackupengine.go" line:256 value:"auto evaluating incremental_from_pos"
commerce/- (zone1-2548885007): time:{seconds:1710231387 nanoseconds:366923115} file:"builtinbackupengine.go" line:263 value:"auto evaluated incremental_from_pos: MySQL56/e2dbb229-e047-11ee-9a75-7a4408213317:1-26"
commerce/- (zone1-2548885007): time:{seconds:1710231387 nanoseconds:405806202} level:ERROR file:"backup.go" line:178 value:"backup is not usable, aborting it: [rpc error: code = Unknown desc = mysqlbinlog not found in any of /usr/{sbin,bin,libexec,scripts}\nreading timestamps from binlog files [vt-2548885007-bin.000002]]"
E0312 09:16:27.471347   38542 main.go:56] rpc error: code = Unknown desc = TabletManager.Backup on zone1-2548885007: reading timestamps from binlog files [vt-2548885007-bin.000002]: rpc error: code = Unknown desc = mysqlbinlog not found in any of /usr/{sbin,bin,libexec,scripts}

And as you can see, the official mysql:8.0.30 image doesn't have mysqlbinlog in its PATH:

bash-4.4$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
bash-4.4$ whereis mysqlbinlog
bash: whereis: command not found
bash-4.4$ mysql
mysql                          mysql_migrate_keyring          mysql_upgrade                  mysqldump
mysql-secret-store-login-path  mysql_ssl_rsa_setup            mysqladmin                     mysqlpump
mysql_config                   mysql_tzinfo_to_sql            mysqld                         mysqlsh
bash-4.4$ mysqlbi
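
The search paths named in the error message can be checked directly inside the mysqld container. A minimal sketch, assuming a POSIX shell in the container; the directory list is taken verbatim from the error above:

```shell
# Reproduce the search Vitess reports: look for an executable mysqlbinlog
# in /usr/{sbin,bin,libexec,scripts}.
found=""
for d in /usr/sbin /usr/bin /usr/libexec /usr/scripts; do
  if [ -x "$d/mysqlbinlog" ]; then
    found="$d/mysqlbinlog"
    break
  fi
done
if [ -n "$found" ]; then
  echo "mysqlbinlog found at $found"
else
  echo "mysqlbinlog not found in any of /usr/{sbin,bin,libexec,scripts}"
fi
```

Running this inside the official mysql:8.0.30 container should reproduce the "not found" case shown in the backup error.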

And the reason I tested this is that the Vitess v19.0.0 release notes say: "If you are currently using vitess/lite as your mysqld image in your vitess-operator deployment we invite you to use an official MySQL image, such as mysql:8.0.30." Here is the link

So the conclusion is to use the vitess/lite image, or perhaps a custom-built image that uses the official MySQL image as a base and adds mysqlbinlog.
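
If the official image route is preferred, one workaround is a small wrapper image that copies mysqlbinlog onto the official base. This is only a sketch: the image tags and the /usr/bin/mysqlbinlog path inside vitess/lite are assumptions and should be verified before use.

```shell
# Hypothetical build script: layer mysqlbinlog from vitess/lite onto the
# official MySQL base image. Tags and paths are assumptions, not verified.
cat > Dockerfile.mysqld <<'EOF'
FROM vitess/lite:v19.0.0 AS vitess
FROM mysql:8.0.30
# Assumed location of mysqlbinlog inside vitess/lite; verify first.
COPY --from=vitess /usr/bin/mysqlbinlog /usr/bin/mysqlbinlog
EOF
echo "wrote Dockerfile.mysqld"
```

The resulting image (built with something like `docker build -t myrepo/mysql-with-binlog:8.0.30 -f Dockerfile.mysqld .`) would then replace mysql:8.0.30 under `mysqld.mysql80Compatible` in the operator spec above.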

shlomi-noach commented 7 months ago

Thank you for explaining. To be more precise:

The same bug hits if the official MySQL image is used

This is an altogether different issue: it is about not having the mysqlbinlog binary. The issue we're dealing with here is that the restore process places binlog files in a directory that is not shared between the tablet and mysqld pods.

shlomi-noach commented 7 months ago

Would you mind creating a new Issue re: the need to include a mysqlbinlog binary in a k8s deployment? This would be a documentation change.

GenLN commented 7 months ago

Yes, I agree; let's not expand this issue with other stuff that isn't bound to the same core problem... Should I report it somewhere, or create another issue for this?

GenLN commented 7 months ago

Would you mind creating a new Issue re: the need to include a mysqlbinlog binary in a k8s deployment? This would be a documentation change.

Opened this #15452