vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0
18.67k stars 2.1k forks source link

Errors while restoring a file do not fail fast #16855

Closed frouioui closed 1 month ago

frouioui commented 1 month ago

Overview of the Issue

When we restore a file with the builtin engine and one of the files fails to be restored (for whatever reason), we do not cancel all other ongoing file restore process. This can lead to very long waiting time until the restore properly fails even though we encountered the error a while ago.

We should be able to cancel all ongoing operations and fail fast as soon as we detect the error.

Reproduction Steps

  1. Vttablet flags

     --compression-engine-name=external \
     --external-compressor="nice -n 19 zstd -c -T7 -1" \
     --external-decompressor="nice -n 19 zstd -c -d" \
     --backup_engine_implementation=builtin \
     --builtinbackup-file-read-buffer-size=100 \
     --builtinbackup-file-write-buffer-size=100 \
     --builtinbackup_progress=500ms \
  2. Vtctld flags

    --builtinbackup-file-read-buffer-size=100 \
    --builtinbackup-file-write-buffer-size=100 \
    --builtinbackup_progress=500ms \
  3. Run the local examples

    ./101_initial_cluster.sh  
  4. Insert a decent amount of data in each table. I have 50000 rows in product, 10000 in customer, 100000 in corder.

  5. Take a backup

    vtctldclient BackupShard commerce/0  
  6. Corrupt one of the backup file, so that it generates a premature end error with zstd

    file_size=$(wc -c < /Users/florentpoinsard/Code/vitess-private/vtdataroot/backups/commerce/0/2024-09-25.184818.zone1-0000000102/172)
    truncated_size=$((file_size - 5))
    dd if=/Users/florentpoinsard/Code/vitess-private/vtdataroot/backups/commerce/0/2024-09-25.184818.zone1-0000000102/172 of=/Users/florentpoinsard/Code/vitess-private/vtdataroot/backups/commerce/0/2024-09-25.184818.zone1-0000000102/172.bis bs=1 count=$truncated_size
    mv /Users/florentpoinsard/Code/vitess-private/vtdataroot/backups/commerce/0/2024-09-25.184818.zone1-0000000102/172.bis /Users/florentpoinsard/Code/vitess-private/vtdataroot/backups/commerce/0/2024-09-25.184818.zone1-0000000102/172
  7. Do a restore

    vtctldclient RestoreFromBackup zone1-0000000102
  8. Observe the logs. They are somewhat similar to what we see on Grafana, the file 172 fails but the other file keep being processed, until it fails with the same error (mismatch sha).

    commerce/0 (zone1-0000000102): time:{seconds:1727290598 nanoseconds:165457000} file:"builtinbackupengine.go" line:779 value:"restoring  \"172\": 2777.79kb"
    commerce/0 (zone1-0000000102): time:{seconds:1727290598 nanoseconds:508870000} file:"builtinbackupengine.go" line:779 value:"restoring  \"34\": 1419.96kb"
    commerce/0 (zone1-0000000102): time:{seconds:1727290598 nanoseconds:527734000} file:"builtinbackupengine.go" line:779 value:"restoring  \"1\": 3.86kb"
    commerce/0 (zone1-0000000102): time:{seconds:1727290598 nanoseconds:646148000} file:"xtrabackupengine.go" line:787 value:"decompressor stderr: /*stdin*\\ : Read error (39) : premature end "
    commerce/0 (zone1-0000000102): time:{seconds:1727290598 nanoseconds:665443000} file:"builtinbackupengine.go" line:779 value:"restoring  \"172\": 2777.79kb"
    commerce/0 (zone1-0000000102): time:{seconds:1727290598 nanoseconds:770648000} file:"builtinbackupengine.go" line:1141 value:"closing decompressor"
    commerce/0 (zone1-0000000102): time:{seconds:1727290599 nanoseconds:8853000} file:"builtinbackupengine.go" line:779 value:"restoring  \"34\": 1419.96kb"
    commerce/0 (zone1-0000000102): time:{seconds:1727290599 nanoseconds:27726000} file:"builtinbackupengine.go" line:779 value:"restoring  \"1\": 3.86kb"
    commerce/0 (zone1-0000000102): time:{seconds:1727290599 nanoseconds:165451000} file:"builtinbackupengine.go" line:779 value:"restoring  \"172\": 2777.79kb"
    commerce/0 (zone1-0000000102): time:{seconds:1727290599 nanoseconds:166713000} file:"builtinbackupengine.go" line:1141 value:"closing decompressor"
    commerce/0 (zone1-0000000102): time:{seconds:1727290599 nanoseconds:166743000} file:"builtinbackupengine.go" line:774 value:"Completed restoring  \"1\""
    commerce/0 (zone1-0000000102): time:{seconds:1727290599 nanoseconds:320550000} file:"builtinbackupengine.go" line:1141 value:"closing decompressor"
    commerce/0 (zone1-0000000102): time:{seconds:1727290599 nanoseconds:320551000} file:"builtinbackupengine.go" line:774 value:"Completed restoring  \"34\""
    commerce/0 (zone1-0000000102): time:{seconds:1727290599 nanoseconds:320903000} file:"restore.go" line:278 value:"Restore: got a restore manifest: <nil>, err=can't restore file 172 to vt_commerce/corder.ibd: hash mismatch for vt_commerce/corder.ibd, got 78ec71b7 expected c003ea3f\nfailed to restore files, waitForBackupInterval=0s"
    E0925 12:56:39.338507   92346 main.go:56] rpc error: code = Unknown desc = TabletManager.RestoreFromBackup on zone1-0000000102 error: Can't restore backup: failed to restore files: can't restore file 172 to vt_commerce/corder.ibd: hash mismatch for vt_commerce/corder.ibd, got 78ec71b7 expected c003ea3f: Can't restore backup: failed to restore files: can't restore file 172 to vt_commerce/corder.ibd: hash mismatch for vt_commerce/corder.ibd, got 78ec71b7 expected c003ea3f

Despite the error on the fourth line of the logs, files 172, 34, and 1 keep getting restored. We only see the error when the last opened file is done restoring (Completed restoring 34).

Binary Version

18, 19, 20, main

Operating System and Environment details

n/a

Log Fragments

No response