leftytennis commented 3 months ago

The prune command deletes old or unwanted revisions and unused chunks from a storage.

Quick overview

NAME:
   duplicacy prune - Prune revisions by number, tag, or retention policy

USAGE:
   duplicacy prune [command options]

OPTIONS:
   -id <snapshot id>            delete revisions with the specified snapshot ID instead of the default one
   -all, -a                     match against all snapshot IDs
   -r <revision> [+]            delete the specified revisions
   -t <tag> [+]                 delete revisions with the specified tags
   -keep <n:m> [+]              keep 1 revision every n days for revisions older than m days
   -keep-max <n>                keep max n most recent revisions matching the tag -t
   -exhaustive                  remove all unreferenced chunks (not just those referenced by deleted snapshots)
   -exclusive                   assume exclusive access to the storage (disable two-step fossil collection)
   -dry-run, -d                 show what would have been deleted
   -delete-only                 delete fossils previously collected (if deletable) and don't collect fossils
   -collect-only                identify and collect fossils, but don't delete fossils previously collected
   -ignore <id> [+]             ignore revisions with the specified snapshot ID when deciding if fossils can be deleted
   -storage <storage name>      prune revisions from the specified storage
   -threads <n>                 number of threads used to prune unreferenced chunks

Usage

duplicacy prune [command options]

Options

Options marked with [+] can be passed more than once.

`-id <snapshot id>`

Delete revisions with the specified snapshot ID instead of the default one.

Example:

duplicacy prune -id computer-2

`-all, -a`

Run the prune command against all snapshot IDs in the selected storage.

Example:

duplicacy prune -all

`-r <revision> [+]`

Delete the specified revisions.

Examples:

duplicacy prune -r 6              # delete revision 6
duplicacy prune -r 344-350        # delete revisions 344 through 350 (inclusive)
duplicacy prune -r 310 -r 1322    # delete only the revisions 310 and 1322

`-t <tag> [+]`

Delete revisions with the specified tags.

`-keep <n:m> [+]`

Keep 1 revision every n days for revisions older than m days.

The retention policies are specified by the -keep option, which accepts an argument in the form of two numbers n:m, where n is the number of days between two consecutive revisions to keep, and m means that the policy only applies to revisions at least m days old. If n is zero, any revisions older than m days will be removed. The -keep and -keep-max options are mutually exclusive.

Examples:

duplicacy prune -keep 1:7                # Keep a revision per (1) day for revisions older than 7 days
duplicacy prune -keep 7:30               # Keep a revision every 7 days for revisions older than 30 days
duplicacy prune -keep 30:180             # Keep a revision every 30 days for revisions older than 180 days
duplicacy prune -keep 0:360              # Keep no revisions older than 360 days

Multiple -keep options must be sorted by their m values in decreasing order.

For example, to combine the above policies into one line, it would become:

duplicacy prune -keep 0:360 -keep 30:180 -keep 7:30 -keep 1:7

`-keep-max <n>`

Keep max n most recent revisions for the specified tag -t.

A single tag -t must be specified when using -keep-max. The -keep-max and -keep options are mutually exclusive.

Examples:

duplicacy prune -keep-max 24 -t hourly    # Keep 24 most recent revisions with tag hourly
duplicacy prune -keep-max  7 -t daily     # Keep  7 most recent revisions with tag daily
duplicacy prune -keep-max  4 -t weekly    # Keep  4 most recent revisions with tag weekly
duplicacy prune -keep-max  3 -t monthly   # Keep  3 most recent revisions with tag monthly
duplicacy prune -keep-max  4 -t quarterly # Keep  4 most recent revisions with tag quarterly
duplicacy prune -keep-max  3 -t yearly    # Keep  3 most recent revisions with tag yearly

The -keep-max option must specify a number >= 0. If n is 0, all revisions matching the tag will be pruned. If n is greater than 0, then n of the most recent snapshot revisions will be kept.

`-exhaustive`

Remove all unreferenced chunks (not just those referenced by deleted revisions).

The -exhaustive option will scan the list of all chunks in the storage, so it will find not only unreferenced chunks from deleted revisions, but also chunks that have become unreferenced for other reasons, such as those from an incomplete backup.

It will also find any file that does not look like a chunk file.

In contrast, a normal prune command will only identify chunks that were referenced by the deleted revisions and are not referenced by any other revision.

Example:

duplicacy prune -exhaustive

`-exclusive`

Assume exclusive access to the storage (disable two-step fossil collection).

The -exclusive option will assume that no other clients are accessing the storage, effectively disabling the two-step fossil collection algorithm.

With this option, the prune command will immediately remove unreferenced chunks.

WARNING: Only run -exclusive when you are sure that no other backup is running, on any other device or repository.

Example:

duplicacy prune -exclusive

`-dry-run, -d`

This option is used to test what changes the prune command would make. It is guaranteed not to make any changes to the storage, not even creating the local fossil collection file.

Example:

After running this nothing will be modified in the storage, but duplicacy will show all output just like a normal run:

duplicacy prune -dry-run -all -exhaustive -exclusive

`-delete-only`

Delete fossils previously collected (if deletable) and don't collect fossils.

Example:

duplicacy prune -delete-only

`-collect-only`

Identify and collect fossils, but don't delete fossils previously collected.

Example:

duplicacy prune -collect-only

The -delete-only option will skip the fossil collection step, while the -collect-only option will skip the fossil deletion step.

`-ignore <id> [+]`

Ignore revisions with the specified snapshot ID when deciding if fossils can be deleted.

`-storage <storage name>`

Prune revisions from the specified storage instead of the default one.

Example:

duplicacy prune -storage google-drive

`-threads <n>`

This option is used to specify more than one thread to prune chunks. This is generally useful to increase pruning speed.

:bulb: You should test the best number of threads for your connection and storage provider, but using more than 30 threads is not advised as it will not improve speeds significantly.

Example:

duplicacy prune -keep 1:7 -threads 10 # use 10 threads for the pruning process

Notes

:bulb: Revisions to be deleted can be specified by numbers, by a tag, by retention policies, or by any combination of these categories.

:bulb: Only one repository should run prune

Since Duplicacy encourages multiple repositories backing up to the same storage (so that deduplication will be efficient), users might want to run prune from each different repository.

The design of Duplicacy, however, was based on the assumption that only one instance would run the prune command (using -all). This can greatly simplify the implementation.

It is also somewhat wasteful to have a prune command working on one repository id only, since it still needs to download all backups for all other repository ids in order to decide which chunks are to be deleted.

Finally, in theory race conditions can happen when two instances try to operate on the same chunk at the same time, but in practice this may never happen, especially if the prune command runs after the backup so that the instances start at random times.

:bulb: Pruning is logged

All prune actions are logged by default locally, on the machine where the prune command is executed, under .duplicacy/logs. The prune logs are named similarly to prune-log-20171230-142510.

In the same folder you will also find log files which are empty. There is no need to worry if the files are empty as this means that in that particular prune operation, nothing was pruned from the storage.

:bulb: `-exhaustive` should be used sparingly

The -exhaustive option is only needed when there are known unreferenced chunks in the storage, for example, when a backup is interrupted by the user or terminated due to an error and the files in the repository change afterwards.

It is not recommended to run the prune command regularly with this option without a recent incomplete backup, mainly because if there is an ongoing backup from a different computer, the prune command will mark as fossils all new chunks uploaded by that backup.

Although in the fossil deletion step the prune command can correctly identify that these chunks are actually referenced and thus turn them back into chunks, the cost of extra API calls can be excessive.

:bulb: The last revision can only be deleted in `-exclusive` mode

The latest revision from each repository can’t be deleted in non-exclusive mode because, in theory, a backup for that repository may be in progress and using the latest revision as its base. Removing the latest revision would then cause some chunks to be removed even though they are needed by the backup in progress.

:warning: Corner cases when prune may delete too much

There are two corner cases in which a fossil that is still needed may be mistakenly deleted. The first is a backup that takes more than 7 days and started before the chunk was marked as a fossil: the prune command will think the repository has become inactive and will exclude it from the criteria for determining which fossils are safe to delete.

The other case is an initial backup from a newly recreated repository that also started before the chunk was marked as a fossil. Since the prune command doesn't know about such a repository at fossil deletion time, it may think the fossil isn't needed any more by any backup and thus delete it permanently.

Therefore, a check command must be used if a backup is an initial backup or takes more than 7 days. Once a backup passes the check command, it is guaranteed that it won't be affected by any future prune operations.

:bulb: Individual files cannot be pruned

Note that duplicacy always prunes entire revisions of entire snapshots, not of individual files. In other words: it is not possible to remove backups of specific files from the storage. This means, for example, if you realize after a couple of months, that you have accidentally been backing up some huge useless files, the only way to remove them from the storage to free up space is to prune each and every revision in which they are included.

Two-step fossil collection algorithm

The prune command implements the two-step fossil collection algorithm. It will first find fossil collection files from previous runs and check if contained fossils are eligible for permanent deletion (the fossil deletion step). Then it will search for snapshots to be deleted, mark unreferenced chunks as fossils (by renaming) and save them in a new fossil collection file stored locally (the fossil collection step).

For fossils collected in the fossil collection step to be eligible for safe deletion in the fossil deletion step, at least one new snapshot from each snapshot id must be created between two runs of the prune command. However, some repositories may not be set up to back up on a regular schedule and would thus block other repositories from deleting any fossils. Duplicacy by default will ignore repositories that have no new backup in the past 7 days, and you can also use the -ignore option to skip certain repositories when deciding the deletion criteria.

leftytennis commented 3 months ago

This PR adds a -keep-max option to the prune command, which lets you specify the maximum number of revisions to keep for the specified tag.

leftytennis commented 3 months ago

I have tested the prune -keep-max parameter and it's working the way I intended.

A couple of comments:

I updated duplicacy_snapshotmanager_test.go to pass keep_max = -1, which means that keep_max is not enabled for any of the PruneSnapshot calls made from the test functions. I don't know how to test that since I don't know how to use go test, nor did I add any new test functions to duplicacy_snapshotmanager_test.go, but I'd be happy to do so if you can nudge me in the right direction.
I added a function MinInt in duplicacy_utils.go that returns the smaller of the two ints passed to it. I'd found math.Min, which only works with floats (which I could easily cast), but it requires Go 1.21 and duplicacy's go.mod seems to require 1.19. I did actually change that to 1.21 and successfully built and tested, but I didn't include that in the PR for fear of breaking something else.
I elected to allow -keep-max 0, which would allow all snapshot revisions matching a tag to be pruned, but that could easily be changed if you deem that too "dangerous". Since you already allow -keep 0:m, I decided to leave it that way.
I added a full, future wiki-page update to the description of this PR. My last commits to duplicacy were years ago, and I seem to recall the docs were part of the repository. I didn't see a way to include wiki updates in the PR, but if/when you merge, I'll update the wiki page for prune, etc.
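
For reference, a helper like the MinInt mentioned above could look like the following. This is a sketch, not the code from the PR. Go's built-in `min` was added in Go 1.21, so a small helper of this kind keeps the code buildable with an older go.mod version. The table of cases in `main` has the same shape as a table-driven test: moved into a `_test.go` file as a `TestMinInt(t *testing.T)` function, it could be run with `go test`.

```go
package main

import "fmt"

// MinInt returns the smaller of two ints.
func MinInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	// Each case lists two inputs and the expected minimum.
	cases := []struct{ a, b, want int }{
		{3, 5, 3},
		{5, 3, 3},
		{-1, 0, -1},
		{4, 4, 4},
	}
	for _, c := range cases {
		got := MinInt(c.a, c.b)
		fmt.Printf("MinInt(%d, %d) = %d (want %d)\n", c.a, c.b, got, c.want)
	}
}
```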

Please review and let me know if you have comments/feedback for anything you'd like to see changed.

leftytennis commented 1 month ago

Closing due to not being accepted and I have implemented this differently in my forked repository... The new functionality is --keep-days, which allows you to specify how many days to keep snapshots... When combined with a tag, can be useful... i.e. --keep-days 1 -t hourly

gilbertchen / duplicacy

Add -keep-max option to the prune command #669
