scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

Don't backup when compaction is running #3770

Open Michal-Leszczynski opened 3 months ago

Michal-Leszczynski commented 3 months ago

Creating snapshots while compaction is running can lead to increased disk consumption. It might be a good idea for SM to wait for running compactions to finish first, as described in https://github.com/scylladb/scylla-enterprise/issues/3809#issuecomment-1931976661. Note that there is no way to prevent compaction from starting once the snapshots have been taken.

Connected issues:

cc: @karol-kokoszka @tzach

karol-kokoszka commented 2 months ago

Candidate for 3.2.9 (or 3.3.1, depending on tablets)

karol-kokoszka commented 2 months ago

grooming notes

To check for ongoing compaction you should:

1. query /task_manager/list_module_tasks/compaction;
2. filter out the tasks whose state is in {done, failed};
3. wait for the rest (/task_manager/wait_task/{task_id}).

If regular compaction should also be waited for, you should instead:

1. query /task_manager/list_module_tasks/compaction with the internal flag on;
2. filter out the tasks whose state is in {done, failed} or that have a non-zero parent_id;
3. wait for the rest (/task_manager/wait_task/{task_id}).
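The steps above could be sketched roughly as follows. This is only an illustration, not SM code: the endpoint paths and the field names state, parent_id and task_id come from the notes, while the base URL, the `internal=true` query parameter spelling, the all-zero UUID used as the "zero" parent_id, and both function names are assumptions.

```python
import json
import urllib.request

# Assumption: an unset parent_id is reported as the all-zero UUID.
ZERO_ID = "00000000-0000-0000-0000-000000000000"
FINISHED_STATES = {"done", "failed"}


def tasks_to_wait_for(tasks, include_regular=False):
    """Return the task_ids that still have to finish before a snapshot.

    tasks is the decoded JSON list returned by list_module_tasks.
    With include_regular=True, tasks that have a parent are skipped,
    because waiting for the top-level task already covers them.
    """
    pending = []
    for task in tasks:
        if task.get("state") in FINISHED_STATES:
            continue
        if include_regular and task.get("parent_id") not in (None, "", ZERO_ID):
            continue
        pending.append(task["task_id"])
    return pending


def wait_for_compaction(base_url, include_regular=False):
    """Block until every ongoing compaction task on one node has finished.

    base_url is hypothetical, e.g. the address of the node's REST API.
    """
    url = f"{base_url}/task_manager/list_module_tasks/compaction"
    if include_regular:
        url += "?internal=true"  # assumed spelling of the internal flag
    with urllib.request.urlopen(url) as resp:
        tasks = json.load(resp)
    for task_id in tasks_to_wait_for(tasks, include_regular):
        # wait_task returns only once the task has completed.
        with urllib.request.urlopen(f"{base_url}/task_manager/wait_task/{task_id}") as resp:
            resp.read()
```

Keeping the filtering in a separate pure function makes the two variants (with and without regular compaction) easy to check without a running node.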


The worst case scenario: While compaction is running, Scylla rewrites the SSTable files. If a snapshot of the same SSTables is taken just before the compaction finishes, hard links to these SSTables are created. As a result, the SSTables that compaction has already replaced cannot be removed from disk, because a hard link still points to each file. The files therefore keep consuming disk space even though the node no longer needs them (they are needed only to complete the backup).
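The hard link effect can be reproduced outside Scylla with plain files. The sketch below is an assumption-laden stand-in: the file names are made up, and "compaction" is simulated by simply deleting the original file after a "snapshot" hard link has been created.

```python
import os
import tempfile


def space_held_by_snapshot(size=4096):
    """Show that a hard link keeps file data alive after the original is removed.

    Returns (link count before removal, link count after removal,
    bytes still reachable through the snapshot link).
    """
    with tempfile.TemporaryDirectory() as workdir:
        sstable = os.path.join(workdir, "sstable-data.db")    # hypothetical name
        snapshot = os.path.join(workdir, "snapshot-data.db")  # hypothetical name

        with open(sstable, "wb") as f:
            f.write(b"x" * size)

        # Taking a snapshot: a second directory entry for the same data.
        os.link(sstable, snapshot)
        links_before = os.stat(snapshot).st_nlink

        # Compaction finishing: the old SSTable is deleted, but only its
        # name goes away, because the snapshot still references the data.
        os.remove(sstable)
        links_after = os.stat(snapshot).st_nlink

        return links_before, links_after, os.path.getsize(snapshot)
```

The data is released only when the last link is removed, which for a backup means after the snapshot has been uploaded and deleted.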

In this case disk utilization could double, although the probability of such a situation seems low.


The problem that we want to address here refers mainly to the major compaction process: https://opensource.docs.scylladb.com/stable/kb/compaction.html


The proposal is to back off and retry the backup task until no major compaction is running. However, major compaction may last for a long time, which creates the risk that the backup won't be created at the expected time.
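One way to bound that risk is to cap the total waiting time. The following is a minimal sketch of such a back off/retry loop, not the proposed SM implementation: the function name, the delays, and the idea of a hard deadline are all assumptions, and `is_compacting` stands for whatever check is used to detect a running major compaction.

```python
import time


def wait_until_idle(is_compacting, max_wait, initial_delay=1.0, factor=2.0,
                    max_delay=60.0, sleep=time.sleep, clock=time.monotonic):
    """Retry with exponential backoff until no compaction is running.

    Returns True if the backup may start, or False if max_wait seconds
    passed while compaction was still running. sleep and clock are
    injectable so the loop can be exercised without real waiting.
    """
    deadline = clock() + max_wait
    delay = initial_delay
    while is_compacting():
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        delay = min(delay * factor, max_delay)
    return True
```

When the function returns False, the caller still has to choose between skipping the backup and running it anyway while accepting the extra disk usage; that policy decision is left open here.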

Because of the risk of not having a backup at the scheduled time, this issue needs to be brought to planning. Its priority is rather low, as it looks like an edge case.

(cc: @tzach )