Closed leftytennis closed 1 month ago
This PR adds a -keep-max option to the prune command, which lets you specify the maximum number of snapshots to keep that match the specified prune tag.
NAME:
duplicacy prune - Prune revisions by number, tag, or retention policy
USAGE:
duplicacy prune [command options]
OPTIONS:
-id <snapshot id> delete revisions with the specified snapshot ID instead of the default one
-all, -a match against all snapshot IDs
-r <revision> [+] delete the specified revisions
-t <tag> [+] delete revisions with the specified tags
-keep <n:m> [+] keep 1 revision every n days for revisions older than m days
-keep-max <n> keep max n most recent revisions matching the tag -t
-exhaustive remove all unreferenced chunks (not just those referenced by deleted snapshots)
-exclusive assume exclusive access to the storage (disable two-step fossil collection)
-dry-run, -d show what would have been deleted
-delete-only delete fossils previously collected (if deletable) and don't collect fossils
-collect-only identify and collect fossils, but don't delete fossils previously collected
-ignore <id> [+] ignore revisions with the specified snapshot ID when deciding if fossils can be deleted
-storage <storage name> prune revisions from the specified storage
-threads <n> number of threads used to prune unreferenced chunks
-keep-max <n>
Keep max n most recent revisions for the specified tag -t.
A single tag -t must be specified when using -keep-max. The -keep-max and -keep options are mutually exclusive.
duplicacy prune -keep-max 24 -t hourly # Keep 24 most recent revisions with tag hourly
duplicacy prune -keep-max 7 -t daily # Keep 7 most recent revisions with tag daily
duplicacy prune -keep-max 4 -t weekly # Keep 4 most recent revisions with tag weekly
duplicacy prune -keep-max 3 -t monthly # Keep 3 most recent revisions with tag monthly
duplicacy prune -keep-max 4 -t quarterly # Keep 4 most recent revisions with tag quarterly
duplicacy prune -keep-max 3 -t yearly # Keep 3 most recent revisions with tag yearly
The -keep-max option must specify a number >= 0. If n is 0, all revisions matching the tag will be pruned. If n is greater than 0, then the n most recent snapshot revisions will be kept.
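The selection rule described above can be sketched in Go as follows. This is only an illustration and not the code from this PR: the `revision` type and the `revisionsToPrune` function are hypothetical names, and the sketch assumes that a higher revision number means a more recent revision. The `minInt` helper plays the role of the `MinInt` function mentioned in the comments below.

```go
package main

import (
	"fmt"
	"sort"
)

// revision is a hypothetical stand-in for a snapshot revision.
type revision struct {
	Number int
	Tag    string
}

// minInt returns the smaller of two ints.
func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// revisionsToPrune returns the revision numbers a keep-max rule would delete:
// every revision carrying the tag except the keepMax highest-numbered ones.
// Revisions with other tags are never selected.
func revisionsToPrune(revs []revision, tag string, keepMax int) ([]int, error) {
	if keepMax < 0 {
		return nil, fmt.Errorf("keep-max must be >= 0, got %d", keepMax)
	}
	var matching []int
	for _, r := range revs {
		if r.Tag == tag {
			matching = append(matching, r.Number)
		}
	}
	sort.Ints(matching) // oldest first, assuming higher number = more recent
	keep := minInt(keepMax, len(matching))
	return matching[:len(matching)-keep], nil
}

func main() {
	revs := []revision{{1, "daily"}, {2, "hourly"}, {3, "hourly"}, {4, "daily"}, {5, "hourly"}}
	toPrune, err := revisionsToPrune(revs, "hourly", 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(toPrune) // revisions 3 and 5 are kept, so only 2 is listed
}
```

With `keepMax` set to 0 the function returns every revision matching the tag, which mirrors the "all revisions matching the tag will be pruned" case.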
I have tested the prune -keep-max parameter and it's working the way I intended.
A couple of comments:
I updated duplicacy_snapshotmanager_test.go to pass keep_max = -1, which means that keep_max is not enabled for any of the PruneSnapshot calls made from the test functions. I don't know how to test the new option since I don't know how to use go test, and I did not add any new test functions to duplicacy_snapshotmanager_test.go, but I'd be happy to do so if you can nudge me in the right direction.
I added a function MinInt in duplicacy_utils.go that returns the smaller of the two ints passed to it. I'd found math.Min, but it only works with floats (which I could easily cast); the built-in min that handles ints requires Go 1.21, and duplicacy's go.mod seems to require 1.19. I did actually change that to 1.21 and successfully built and tested, but I didn't include that in the PR for fear of breaking something else.
I elected to allow -keep-max 0, which would allow all snapshot revisions matching a tag to be pruned, but that could easily be changed if you deem that too "dangerous". Since you already allow -keep 0:m, I decided to leave it in that way.
I added a full, future wiki-page update to the description of this PR. My last commits to duplicacy were years ago, and I seem to recall the docs were part of the repository then. I didn't see a way to include wiki updates in the PR, but if/when you merge, I'll update the wiki page for prune, etc.
Please review and let me know if you have comments/feedback for anything you'd like to see changed.
Closing due to not being accepted and I have implemented this differently in my forked repository... The new functionality is --keep-days, which allows you to specify how many days to keep snapshots... When combined with a tag, can be useful... i.e. --keep-days 1 -t hourly
The prune command has the task of deleting old/unwanted revisions and unused chunks from a storage.
Click here for a list of related forum topics.
Quick overview
Usage
duplicacy prune [command options]
Options
Options marked with [+] can be passed more than once.
-id <snapshot id>
Delete revisions with the specified snapshot ID instead of the default one.
Example:
-all, -a
Run the prune command against all snapshot IDs in selected storage.
Example:
-r <revision> [+]
Delete the specified revisions.
Examples:
-t <tag> [+]
Delete revisions with the specified tags.
-keep <n:m> [+]
Keep 1 revision every n days for revisions older than m days.
The retention policies are specified by the -keep option, which accepts an argument in the form of two numbers n:m, where n indicates the number of days between two consecutive revisions to keep, and m means that the policy only applies to revisions at least m days old. If n is zero, any revisions older than m days will be removed. The -keep and -keep-max options are mutually exclusive.
Examples:
Multiple -keep options must be sorted by their m values in decreasing order.
For example, to combine the above policies into one line, it would become:
-keep-max <n>
Keep max n most recent revisions for the specified tag -t.
A single tag -t must be specified when using -keep-max. The -keep-max and -keep options are mutually exclusive.
Examples:
The -keep-max option must specify a number >= 0. If n is 0, all revisions matching the tag will be pruned. If n is greater than 0, then the n most recent snapshot revisions will be kept.
-exhaustive
Remove all unreferenced chunks (not just those referenced by deleted revisions).
The -exhaustive option will scan the list of all chunks in the storage, so it will find not only unreferenced chunks from deleted revisions, but also chunks that have become unreferenced for other reasons, such as those from an incomplete backup.
It will also find any file that does not look like a chunk file.
In contrast, a normal prune command will only identify chunks referenced by deleted revisions but not by any other revisions.
Example:
-exclusive
Assume exclusive access to the storage (disable two-step fossil collection).
The -exclusive option will assume that no other clients are accessing the storage, effectively disabling the two-step fossil collection algorithm.
With this option, the prune command will immediately remove unreferenced chunks.
WARNING: Only run -exclusive when you are sure that no other backup is running, on any other device or repository.
Example:
-dry-run, -d
This option is used to test what changes the prune command would make. It is guaranteed not to make any changes to the storage, not even creating the local fossil collection file.
Example:
After running this nothing will be modified in the storage, but duplicacy will show all output just like a normal run:
-delete-only
Delete fossils previously collected (if deletable) and don't collect fossils.
Example:
-collect-only
Identify and collect fossils, but don't delete fossils previously collected.
Example:
The -delete-only option will skip the fossil collection step, while the -collect-only option will skip the fossil deletion step.
-ignore <id> [+]
Ignore revisions with the specified snapshot ID when deciding if fossils can be deleted.
-storage <storage name>
Prune revisions from the specified storage instead of the default one.
Example:
-threads <n>
This option is used to specify more than one thread to prune chunks. This is generally useful to increase pruning speed.
:bulb: You should test the best number of threads for your connection and storage provider, but using more than 30 threads is not advised as it will not improve speeds significantly.
Example
duplicacy prune -keep 1:7 -threads 10 # use 10 threads for the pruning process
Notes
:bulb: Revisions to be deleted can be specified by numbers, by a tag, by retention policies, or by any combination of these categories.
:bulb: Only one repository should run prune
Since :d: encourages multiple repositories backing up to the same storage (so that deduplication will be efficient), users might want to run prune from each different repository.
The design of :d:, however, was based on the assumption that only one instance would run the prune command (using -all). This can greatly simplify the implementation.
It is also somewhat wasteful to have a prune command working on one repository id only, since it still needs to download all backups for all other repository ids in order to decide which chunks are to be deleted.
Finally, in theory race conditions can happen when two instances try to operate on the same chunk at the same time, but in practice this may never happen, especially if the prune command runs after the backup, so that the instances start at random times.
:bulb: Pruning is logged
All prune actions are logged by default locally, on the machine where the prune command is executed, under .duplicacy/logs. The prune logs have names similar to prune-log-20171230-142510.
In the same folder you will also find log files which are empty. There is no need to worry if the files are empty, as this means that nothing was pruned from the storage in that particular prune operation.
:bulb: -exhaustive should be used sparingly
The -exhaustive option is only needed when there are known unreferenced chunks in the storage, for example, when a backup is interrupted by the user or terminated due to an error and the files in the repository change afterwards.
It is not recommended to run the prune command regularly with this option without a recent incomplete backup, mainly because if there is an ongoing backup from a different computer, the prune command will mark as fossils all new chunks uploaded by that backup.
Although in the fossil deletion step the prune command can correctly identify that these chunks are actually referenced and thus turn them back into chunks, the cost of extra API calls can be excessive.
:bulb: The last revision can only be deleted in -exclusive mode
The latest revision from each repository can’t be deleted in non-exclusive mode because, in theory, a backup for that repository may be in progress that uses the latest revision as its base, so removing the latest revision would cause some chunks to be removed even though they are needed by the backup in progress.
:warning: Corner cases when prune may delete too much
There are two corner cases in which a fossil that is still needed may be mistakenly deleted. The first occurs when a backup that started before the chunk was marked as a fossil takes more than 7 days: the prune command will think the repository has become inactive and will exclude it from the criteria for determining which fossils are safe to delete.
The other case occurs with an initial backup from a newly recreated repository that also started before the chunk was marked as a fossil. Since the prune command doesn't know of the existence of such a repository at fossil deletion time, it may think the fossil isn't needed any more by any backup and thus delete it permanently.
Therefore, a check command must be used if a backup is an initial backup or takes more than 7 days. Once a backup passes the check command, it is guaranteed that it won't be affected by any future prune operations.
:bulb: Individual files cannot be pruned
Note that duplicacy always prunes entire revisions of entire snapshots, not of individual files. In other words: it is not possible to remove backups of specific files from the storage. This means, for example, if you realize after a couple of months, that you have accidentally been backing up some huge useless files, the only way to remove them from the storage to free up space is to prune each and every revision in which they are included.
Two-step fossil collection algorithm
The prune command implements the two-step fossil collection algorithm. It will first find fossil collection files from previous runs and check whether the fossils they contain are eligible for permanent deletion (the fossil deletion step). Then it will search for snapshots to be deleted, mark unreferenced chunks as fossils (by renaming) and save them in a new fossil collection file stored locally (the fossil collection step).
For fossils collected in the fossil collection step to be eligible for safe deletion in the fossil deletion step, at least one new snapshot from each snapshot id must be created between two runs of the prune command. However, some repositories may not be set up to back up on a regular schedule, and would thus block other repositories from deleting any fossils. Duplicacy by default will ignore repositories that have no new backup in the past 7 days, and you can also use the -ignore option to skip certain repositories when deciding the deletion criteria.