lxc / incus

Powerful system container and virtual machine manager
https://linuxcontainers.org/incus
Apache License 2.0

Snapshot retention policy (instances and storage volumes) #51

Open · amartin3225 opened this issue 1 year ago

amartin3225 commented 1 year ago

With #9 making significant changes to the structure of the incus snapshot command (and backwards compatibility therefore more likely to be broken anyway), I'd like to suggest a feature to increase snapshot retention granularity. Several use cases warrant maintaining a history of snapshots, but not at the same duration or granularity as on the server where the Incus container/VM is actually running.

Sanoid now supports pre-snapshot, post-snapshot, and post-prune hooks, so if incus had a command like incus snapshot {create,delete} --metadata-only --snapshot-name="xxx" that only created/deleted the database entry without actually running zfs snapshot/zfs destroy, this would work: sanoid could be left in charge of creating and deleting the actual ZFS snapshots.

Is it possible to add the above type of enhanced functionality for snapshots to incus, or provide some type of integration with 3rd-party tools like sanoid/syncoid?
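
For concreteness, the sanoid integration being asked about would roughly be a post-prune hook along these lines. This is only a sketch: the --metadata-only and --snapshot-name flags are the proposal above rather than existing incus options, and the SANOID_* variable names and dataset layout are assumptions about what sanoid passes to its hook scripts.

#!/bin/sh
# Hypothetical post-prune hook: sanoid has already destroyed the ZFS snapshot,
# so ask Incus to drop only its matching database record.
instance="${SANOID_TARGET##*/}"   # e.g. tank/containers/web1 -> web1 (layout assumed)
incus snapshot delete --metadata-only --snapshot-name="$SANOID_SNAPNAME" "$instance"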

benginow commented 5 months ago

Hi! I'm a UT student working on a group project and would love to take on this issue if it's still available.

stgraber commented 5 months ago

Hey @benginow,

This issue is currently marked as a maybe as it's not clear whether what's requested here can actually be made to fit Incus' design.

So I wouldn't recommend starting with this particular issue :)

amartin3225 commented 5 months ago

I'd be happy to discuss this in more detail if there's anything about this request that could/should be tweaked to make it align better with Incus' design.

stgraber commented 5 months ago

So in general, I'm definitely not in favor of adding support for creating metadata-only snapshots: without something immediately filling them, it effectively leads to a broken database state, which may prevent Incus from performing needed data transitions on update, or even from running routine background tasks that span all instances and their snapshots.

If you absolutely want to go down that path, your best bet is to basically do:

incus snapshot create INSTANCE SNAPSHOTNAME
zfs destroy POOL/containers/INSTANCE@SNAPSHOTNAME

That should result in the same thing but without us having to actually support it ;)

Snapshot retention is also a bit tricky because our fundamental design is based around each snapshot having an expiry date. That's a bit different from backup systems, which often don't work that way and instead rely on an overall retention policy that considers the current snapshot set and trims it accordingly.

It'd be interesting to figure out if, given say:

snapshot.retention.day=4 (keep max 4 snapshots for the past 24h period)
snapshot.retention.week=3 (keep max 3 snapshots for the past 7 days period, excluding the current day)
snapshot.retention.month=4 (keep max 4 snapshots for the past month, excluding the current week)
snapshot.retention.year=10 (keep max 10 snapshots for the past year, excluding the current month)

it would be possible to already assign an expiry date to the snapshot at the time it's created, by looking at the snapshots that already exist to determine how long the new one should be kept overall.

If that's possible to do and leads to a useful result, then I think we could implement that.

If not, then the best option would be for 3rd party backup tools that want to manage snapshots and backups on Incus to trigger expiry-less snapshots themselves and then delete them based on their own policy.
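
For reference, the 3rd-party-driven route would look roughly like this from the tool's side (a sketch only: snapshot names and timing are illustrative, and it assumes the instance isn't configured with an automatic snapshot expiry, so the retention decision lives entirely in the external tool):

# taken on the backup tool's own schedule, with no Incus-side expiry applied
incus snapshot create INSTANCE backup-$(date +%Y%m%d-%H%M)
# ...later, once the tool's own policy decides a given snapshot is no longer needed
incus snapshot delete INSTANCE backup-20240101-0300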

amartin3225 commented 5 months ago

> It'd be interesting to figure out if, given say:
>
> snapshot.retention.day=4 (keep max 4 snapshots for the past 24h period)
> snapshot.retention.week=3 (keep max 3 snapshots for the past 7 days period, excluding the current day)
> snapshot.retention.month=4 (keep max 4 snapshots for the past month, excluding the current week)
> snapshot.retention.year=10 (keep max 10 snapshots for the past year, excluding the current month)
>
> it would be possible to already assign an expiry date to the snapshot at the time it's created, by looking at the snapshots that already exist to determine how long the new one should be kept overall.

I think this would work. The only case this doesn't cover is differing retention policies across servers. For example, if you host a container with a busy database, you might not want to keep any snapshots from more than a couple of days ago on the running Incus server itself due to the size of the snapshots, while the backup server (which may have more space) could keep additional snapshots. Would these snapshot.retention.<period> attributes be set per instance (and be replicated with incus copy --refresh to other Incus servers) or globally per Incus server?

> If not, then the best option would be for 3rd party backup tools that want to manage snapshots and backups on Incus to trigger expiry-less snapshots themselves and then delete them based on their own policy.

I think the problem with this route, at least historically, is that if you delete the ZFS snapshot out from under Incus (e.g. sanoid cleans up a snapshot and then tries to run incus snapshot delete as a post-prune hook to clean up the metadata), Incus fails with an error (understandably, since it can't find the snapshot it expects to find in the zpool). I suppose an alternative way to handle this would be some type of incus snapshot clean command that would check all snapshots and delete from the database any that no longer have a corresponding snapshot in ZFS (e.g. because something else modified or deleted them). That would allow a 3rd-party tool to modify the snapshots and then let Incus "catch up" with those changes.
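
For illustration only, a rough external approximation of such a clean-up command could look like the loop below. This is not an existing incus command; it assumes incus snapshot list supports CSV output with the snapshot name in the first column, and it uses the POOL/containers/INSTANCE@SNAPSHOT dataset form from earlier in this thread, which can differ between setups.

# Drop any Incus snapshot record whose ZFS snapshot no longer exists.
for snap in $(incus snapshot list INSTANCE --format csv | cut -d, -f1); do
    zfs list -t snapshot "POOL/containers/INSTANCE@$snap" >/dev/null 2>&1 \
        || incus snapshot delete INSTANCE "$snap"
done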

stgraber commented 5 months ago

Those proposed config keys would be instance settings; you could make them apply to multiple instances through profiles.
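
For example, applying the proposed keys through a profile might look like this (the snapshot.retention.* keys are the proposal above and don't exist in Incus today; profile and instance names are placeholders):

incus profile create retention-busy-db
incus profile set retention-busy-db snapshot.retention.day=4
incus profile set retention-busy-db snapshot.retention.week=3
incus profile add INSTANCE retention-busy-db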

For the --refresh case with a remote server, differing retention settings based on what I proposed above won't really work, since --refresh will overwrite the instance config to match the source (though using a profile would avoid that issue). More importantly, because those config keys would just be used to calculate the correct snapshot expiry, a different policy on the target won't have any effect.

One thing I think we could do is incus copy --refresh --new-snapshots-only or something along those lines, which would only transfer the newer snapshots and not backfill anything the target may have deleted. Again, that's of limited use since the expiry of the snapshots is pre-calculated, so you'd still need something to trim snapshots after the transfer, but at least you'd only need to trim the new ones and not constantly battle the old ones being copied again.
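
For illustration, the refresh flow being discussed might look like this, where "backup" is a placeholder remote and --new-snapshots-only is only the proposal above, not an existing option:

# Refresh the copy on the backup server, transferring only snapshots it doesn't already have
# and skipping any it has since deleted (hypothetical flag).
incus copy INSTANCE backup:INSTANCE --refresh --new-snapshots-only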

I'd consider incus snapshot delete INSTANCE SNAP failing due to the snapshot already being gone as a bit of a bug and something we could fix (assuming we can accurately differentiate an error about a missing snapshot from another more important error).

amartin3225 commented 5 months ago

> I'd consider incus snapshot delete INSTANCE SNAP failing due to the snapshot already being gone as a bit of a bug and something we could fix (assuming we can accurately differentiate an error about a missing snapshot from another more important error).

This would be great if possible (and useful even irrespective of this feature request).

The proposed instance config keys, along with incus copy --refresh --new-snapshots-only, implement most of the desired capability (except for the ability to have a different retention policy on a backup server), so I think that's sufficient to meet this need.

stgraber commented 5 months ago

Updated the issue to focus on those proposed additional config keys for both instances and storage volumes.

As mentioned above, those will effectively have to conflict with snapshot.expiry and will be used to determine an ultimate snapshot expiry date at the time of snapshot creation based on pre-existing snapshots on the instance or storage volume.

stgraber commented 5 months ago

I've filed https://github.com/lxc/incus/issues/745 for the other part of this issue.

The last thing mentioned here, the issue with deleting snapshots from Incus when the underlying storage-level snapshot is already gone, should be filed separately if it still occurs today.

amartin3225 commented 4 months ago

Thank you!

> The last thing mentioned here, the issue with deleting snapshots from Incus when the underlying storage-level snapshot is already gone, should be filed separately if it still occurs today.

I just tested and am unable to recreate this issue on Incus 6.0 LTS, so no further action is needed.

nfournil commented 3 months ago

Nice feature! I'm currently doing this with a daily cron script. I have a related nice-to-have for this retention policy: after an account is deleted, a customer has a legal window in which they can still ask for their data, so we must keep that data for several months. In Ceph, deleting a volume on a mirrored system also deletes it on the destination mirror (and stopping mirroring deletes the image too), and Incus deletes the image by default. It would be nice to have a switch to apply a move-to-trash policy on delete instead. Ceph's trash has a policy to remove entries after XXX days (so it's perfect!). The difference is that "rbd image rm ..." becomes "rbd trash mv ...".
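
For context, the trash-based flow being suggested looks like this at the Ceph level (pool and image names are placeholders; restore takes the image ID shown by trash ls):

rbd trash mv POOL/IMAGE          # instead of: rbd rm POOL/IMAGE
rbd trash ls POOL                # trashed images stay listed (by ID) until purged
rbd trash restore POOL/IMAGE_ID  # recoverable while the entry is still in the trash
rbd trash purge POOL             # removes trashed images once their deferment period has passed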

stgraber commented 3 months ago

Yeah, we may be able to have the Ceph driver do that; we'd just need to validate Ceph's behavior when a name gets reused, basically how to name the trash entries so that conflicts can be avoided.