vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0
18.47k stars 2.09k forks source link

Fail fast when builtinbackup fails to restore a single file #16856

Open frouioui opened 1 day ago

frouioui commented 1 day ago

Description

This PR fixes https://github.com/vitessio/vitess/issues/16855. Since each file is restored concurrently, we need to cancel all the restore at once as soon as a goroutine fails. This way we prevent the restore process to take forever and stall.

I do not have a programatic E2E test for this PR as it was way to flaky even locally. However, I have described a step-by-step reproduction process on #16855.

This is a bug all the way from 18 to main, backporting to all branches.

Logs

Before

On the first line, we detect the error. But we continue restoring all the ongoing file after that, without canceling them.

commerce/0 (zone1-0000000102): time:{seconds:1727389202 nanoseconds:504998000} file:"backup.go" line:519 value:"decompressor stderr: /*stdin*\\ : Read error (39) : premature end "
commerce/0 (zone1-0000000102): time:{seconds:1727389202 nanoseconds:524027000} file:"builtinbackupengine.go" line:1145 value:"closing decompressor"
commerce/0 (zone1-0000000102): time:{seconds:1727389202 nanoseconds:524051000} file:"builtinbackupengine.go" line:779 value:"Completed restoring  \"173\""
commerce/0 (zone1-0000000102): time:{seconds:1727389202 nanoseconds:564330000} file:"builtinbackupengine.go" line:1145 value:"closing decompressor"
commerce/0 (zone1-0000000102): time:{seconds:1727389202 nanoseconds:893811000} file:"builtinbackupengine.go" line:784 value:"restoring  \"34\": 1421.60kb"
commerce/0 (zone1-0000000102): time:{seconds:1727389202 nanoseconds:993136000} file:"builtinbackupengine.go" line:784 value:"restoring  \"171\": 2780.91kb"
commerce/0 (zone1-0000000102): time:{seconds:1727389202 nanoseconds:994575000} file:"builtinbackupengine.go" line:784 value:"restoring  \"172\": 252.01kb"
commerce/0 (zone1-0000000102): time:{seconds:1727389203 nanoseconds:393780000} file:"builtinbackupengine.go" line:784 value:"restoring  \"34\": 1421.60kb"
commerce/0 (zone1-0000000102): time:{seconds:1727389203 nanoseconds:493087000} file:"builtinbackupengine.go" line:784 value:"restoring  \"171\": 2780.91kb"
commerce/0 (zone1-0000000102): time:{seconds:1727389203 nanoseconds:494524000} file:"builtinbackupengine.go" line:784 value:"restoring  \"172\": 252.01kb"
commerce/0 (zone1-0000000102): time:{seconds:1727389203 nanoseconds:601967000} file:"builtinbackupengine.go" line:779 value:"Completed restoring  \"171\""
commerce/0 (zone1-0000000102): time:{seconds:1727389203 nanoseconds:601978000} file:"builtinbackupengine.go" line:1145 value:"closing decompressor"
commerce/0 (zone1-0000000102): time:{seconds:1727389203 nanoseconds:804764000} file:"builtinbackupengine.go" line:1145 value:"closing decompressor"
commerce/0 (zone1-0000000102): time:{seconds:1727389203 nanoseconds:804770000} file:"builtinbackupengine.go" line:779 value:"Completed restoring  \"34\""
commerce/0 (zone1-0000000102): time:{seconds:1727389203 nanoseconds:805104000} file:"restore.go" line:283 value:"Restore: got a restore manifest: <nil>, err=can't restore file 172 to vt_commerce/customer.ibd: hash mismatch for vt_commerce/customer.ibd, got 0ca874fd expected ea1dd5df\nfailed to restore files, waitForBackupInterval=0s"
E0926 16:20:03.818039   37724 main.go:56] rpc error: code = Unknown desc = TabletManager.RestoreFromBackup on zone1-0000000102: Can't restore backup: failed to restore files: can't restore file 172 to vt_commerce/customer.ibd: hash mismatch for vt_commerce/customer.ibd, got 0ca874fd expected ea1dd5df

After

Here we can see that current/ongoing restore are getting canceled. Besides file 72, which I think is due to concurrency, we log the error (first line) from the stderr of ztsd but there is enough time until we cancel all the context for 72 to begin and finish. But once we detect the error at the top level loop in restoreFiles, all executions are canceled.

commerce/0 (zone1-0000000102): time:{seconds:1727391201 nanoseconds:213399000} file:"backup.go" line:519 value:"decompressor stderr: /*stdin*\\ : Read error (39) : premature end "
commerce/0 (zone1-0000000102): time:{seconds:1727391201 nanoseconds:221890000} file:"builtinbackupengine.go" line:1171 value:"closing decompressor"
commerce/0 (zone1-0000000102): time:{seconds:1727391201 nanoseconds:221926000} file:"builtinbackupengine.go" line:795 value:"Completed restoring  \"162\""
commerce/0 (zone1-0000000102): time:{seconds:1727391201 nanoseconds:222180000} file:"builtinbackupengine.go" line:1077 value:"Copying file 76: performance_schema/events_stages_su_116.sdi"
commerce/0 (zone1-0000000102): time:{seconds:1727391201 nanoseconds:222465000} file:"compression.go" line:168 value:"Decompressing using external command: \"nice -n 19 zstd -c -d\""
commerce/0 (zone1-0000000102): time:{seconds:1727391201 nanoseconds:232724000} file:"builtinbackupengine.go" line:1171 value:"closing decompressor"
commerce/0 (zone1-0000000102): time:{seconds:1727391201 nanoseconds:232753000} file:"builtinbackupengine.go" line:795 value:"Completed restoring  \"76\""
commerce/0 (zone1-0000000102): time:{seconds:1727391201 nanoseconds:274510000} file:"builtinbackupengine.go" line:1171 value:"closing decompressor"
commerce/0 (zone1-0000000102): time:{seconds:1727391201 nanoseconds:274759000} file:"builtinbackupengine.go" line:792 value:"Canceled \"171\""
commerce/0 (zone1-0000000102): time:{seconds:1727391201 nanoseconds:274767000} file:"builtinbackupengine.go" line:792 value:"Canceled \"34\""
commerce/0 (zone1-0000000102): time:{seconds:1727391201 nanoseconds:274789000} file:"builtinbackupengine.go" line:792 value:"Canceled \"172\""
commerce/0 (zone1-0000000102): time:{seconds:1727391201 nanoseconds:279935000} file:"builtinbackupengine.go" line:1171 value:"closing decompressor"
commerce/0 (zone1-0000000102): time:{seconds:1727391201 nanoseconds:280043000} file:"builtinbackupengine.go" line:1171 value:"closing decompressor"
commerce/0 (zone1-0000000102): time:{seconds:1727391201 nanoseconds:280295000} file:"restore.go" line:283 value:"Restore: got a restore manifest: <nil>, err=can't restore file 172 to vt_commerce/customer.ibd: hash mismatch for vt_commerce/customer.ibd, got 048f1e72 expected 6e9641a7;can't restore file 171 to vt_commerce/corder.ibd: hash mismatch for vt_commerce/corder.ibd, got aaffb3c0 expected 2d504b92\nfailed to restore files, waitForBackupInterval=0s"
E0926 16:53:21.293970   73953 main.go:56] rpc error: code = Unknown desc = TabletManager.RestoreFromBackup on zone1-0000000102: Can't restore backup: failed to restore files: can't restore file 172 to vt_commerce/customer.ibd: hash mismatch for vt_commerce/customer.ibd, got 048f1e72 expected 6e9641a7;can't restore file 171 to vt_commerce/corder.ibd: hash mismatch for vt_commerce/corder.ibd, got aaffb3c0 expected 2d504b92
vitess-bot[bot] commented 1 day ago

Review Checklist

Hello reviewers! :wave: Please follow this checklist when reviewing this Pull Request.

General

Tests

Documentation

New flags

If a workflow is added or modified:

Backward compatibility

codecov[bot] commented 1 day ago

Codecov Report

Attention: Patch coverage is 60.00000% with 12 lines in your changes missing coverage. Please review.

Project coverage is 69.45%. Comparing base (2e2b223) to head (9479e8b). Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
go/vt/mysqlctl/builtinbackupengine.go 60.00% 12 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #16856 +/- ## ========================================== + Coverage 69.43% 69.45% +0.01% ========================================== Files 1571 1571 Lines 203021 203088 +67 ========================================== + Hits 140970 141046 +76 + Misses 62051 62042 -9 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.