scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

Changed sctool progress error cause for the case of restore to the enospc node #4087

Open · mikliapko opened this issue 1 week ago

mikliapko commented 1 week ago

I'm trying to run a restore operation while one of the cluster nodes has reached ENOSPC (no space left on device).

With the previous Manager 3.3, `sctool progress` returned:

Command "sudo sctool  -c b9aa5ae1-47e4-44e9-ab5f-55eec8d1b348 progress restore/64fba353-7554-4e18-ab07-fc1dc9246b88"
Restore progress
Run:          ac3cce8a-5e29-11ef-bb8e-42010a8e0047
Status:       ERROR (restoring backed-up data)
Cause:        not restored bundles [3git_0w5m_29g3k2upi5ed4wir0b 3git_0w5n_0dae827vzb5uhtv1a3 3git_0w5n_1b5j41yiuue2fw70aj 3git_0w5p_5bm5c24avh8hrvg56j 3git_0w5p_5wels2cdt19og8k3zv 3git_0w5n_4jys02so7o0vaqc5kr 3git_0w5p_1lvc025b4v1qmyjnsr]: create run progress: validate free disk space: not enough disk space
Start time:   19 Aug 24 12:50:54 UTC
End time:     19 Aug 24 12:54:10 UTC
Duration:     3m15s
Progress:     0% | 0%
Snapshot Tag: sm_20240819124215UTC

╭───────────┬──────────┬────────┬─────────┬────────────┬────────╮
│ Keyspace  │ Progress │   Size │ Success │ Downloaded │ Failed │
├───────────┼──────────┼────────┼─────────┼────────────┼────────┤
│ keyspace1 │  0% | 0% │ 2.777G │       0 │          0 │      0 │
╰───────────┴──────────┴────────┴─────────┴────────────┴────────╯

with a direct indication of the root cause: `create run progress: validate free disk space: not enough disk space`.

The latest Manager (3.4 release candidate) returns a somewhat different cause, `failed to restore sstables from location gcs:manager-backup-tests-sct-project-1-us-east1 table keyspace1.standard1 (993831798 bytes). See logs for more info`, which fails one of the SCT tests:

Restore progress
Run:        1a69a489-953e-11ef-8156-42010a8e0086
Status:     ERROR (restoring backed-up data)
Cause:      failed to restore sstables from location gcs:manager-backup-tests-sct-project-1-us-east1 table keyspace1.standard1 (993831798 bytes). See logs for more info
Start time: 28 Oct 24 15:05:43 UTC
End time:   28 Oct 24 15:12:32 UTC
Duration:   6m48s
Progress:   0% | 0%
Snapshot Tag:   sm_20241028145652UTC
╭───────────┬──────────┬────────┬─────────┬────────────┬────────╮
│ Keyspace  │ Progress │   Size │ Success │ Downloaded │ Failed │
├───────────┼──────────┼────────┼─────────┼────────────┼────────┤
│ keyspace1 │  0% | 0% │ 2.777G │       0 │          0 │      0 │
╰───────────┴──────────┴────────┴─────────┴────────────┴────────╯

Argus run.

@Michal-Leszczynski I suppose the new error message looks fine as well. I just want to make sure it was an intentional change rather than an unexpected side effect of refactoring. Please let me know if the change is intentional, and I'll adjust the SCT test accordingly.

Michal-Leszczynski commented 1 week ago

In general, the restore task no longer fails directly because of insufficient disk space, since we implemented batch and host retries some time ago. This means that the failed node won't participate in the restore anymore, and the failed batch will be restored by other nodes. The problem is that the node without disk space will still cause problems for other nodes, and the restore will fail because of `std::runtime_error (send_meta_data: got error code=-1 from node=10.138.0.37)`.
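To illustrate the mechanism, here is a minimal sketch of the batch/host retry behavior described above. All names (`restoreBatch`, `restoreWithRetries`) are hypothetical, for illustration only; the Manager's actual implementation differs:

```go
package main

import (
	"errors"
	"fmt"
)

type batch struct{ id string }

// restoreBatch pretends to restore one batch on one host.
// The ENOSPC node always fails its batches.
func restoreBatch(host string, b batch) error {
	if host == "10.138.0.37" {
		return errors.New("validate free disk space: not enough disk space")
	}
	return nil
}

// restoreWithRetries excludes a host after it fails and re-assigns
// the failed batch to the remaining hosts, as described above.
func restoreWithRetries(hosts []string, batches []batch) error {
	alive := append([]string(nil), hosts...)
	for _, b := range batches {
		restored := false
		for i := 0; i < len(alive); i++ {
			host := alive[i]
			if err := restoreBatch(host, b); err != nil {
				fmt.Printf("host %s failed on batch %s: %v; excluding host\n", host, b.id, err)
				alive = append(alive[:i], alive[i+1:]...) // drop the failed host
				i--
				continue
			}
			restored = true
			break
		}
		if !restored {
			return fmt.Errorf("failed to restore batch %s on any host", b.id)
		}
	}
	return nil
}

func main() {
	hosts := []string{"10.138.0.37", "10.138.0.38"}
	batches := []batch{{id: "3git_0w5m_29g3k2upi5ed4wir0b"}}
	if err := restoreWithRetries(hosts, batches); err != nil {
		fmt.Println("restore failed:", err)
	}
}
```

In this model the ENOSPC node simply drops out of the rotation and its batches succeed elsewhere; the restore only fails when, as described above, the broken node still interferes with the healthy ones at the Scylla level.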

Michal-Leszczynski commented 1 week ago

We should improve error handling to reduce the number of retry-related errors in the logs. We could also work on improving the error message itself. However, neither of those changes will be included in the 3.4 release, and they are not that urgent in general.
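As a rough illustration of what improving the error message could mean (a sketch under assumed names, not the Manager's actual error handling), the generic per-table failure could wrap the root cause so it stays visible in the final `Cause:` line:

```go
package main

import (
	"errors"
	"fmt"
)

var errNoDiskSpace = errors.New("validate free disk space: not enough disk space")

// restoreTable pretends all retries for a table failed and wraps the
// first root cause instead of discarding it.
func restoreTable(table string) error {
	rootCause := errNoDiskSpace
	return fmt.Errorf("failed to restore sstables for table %s: %w", table, rootCause)
}

func main() {
	err := restoreTable("keyspace1.standard1")
	fmt.Println(err)
	// The wrapped chain still lets callers (and the message) see ENOSPC.
	fmt.Println(errors.Is(err, errNoDiskSpace)) // true
}
```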

mikliapko commented 1 week ago

Alright, I will disable the failing error-message validation until we improve the logging approach.
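Until then, a relaxed validation could accept either wording. A minimal sketch, assuming a hypothetical `isExpectedEnospcCause` helper (the real SCT test is written in Python and its helpers differ, so this is illustrative only):

```go
package main

import (
	"fmt"
	"regexp"
)

var knownEnospcCauses = []*regexp.Regexp{
	// Manager 3.3: direct ENOSPC indication.
	regexp.MustCompile(`validate free disk space: not enough disk space`),
	// Manager 3.4 RC: generic per-table failure that points at the logs.
	regexp.MustCompile(`failed to restore sstables from location .+\. See logs for more info`),
}

// isExpectedEnospcCause reports whether the sctool progress Cause line
// matches any of the known ENOSPC-related messages.
func isExpectedEnospcCause(cause string) bool {
	for _, re := range knownEnospcCauses {
		if re.MatchString(cause) {
			return true
		}
	}
	return false
}

func main() {
	cause := "failed to restore sstables from location gcs:manager-backup-tests-sct-project-1-us-east1 table keyspace1.standard1 (993831798 bytes). See logs for more info"
	fmt.Println(isExpectedEnospcCause(cause)) // true
}
```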