oetiker / znapzend

zfs backup with remote capabilities and mbuffer integration.
www.znapzend.org
GNU General Public License v3.0

More flexible handling of offline destinations #384

Closed: snyman closed this 4 years ago

snyman commented 5 years ago

My use case is to have two backup destinations which I alternate between being local and being offline/offsite (i.e. I pull the "off" backup disks and store them offsite). With this setup, both destinations are almost never online at the same time: when switching, I pull the live disk, take it to the offsite location, swap it for the alternate, and bring that one back to install. Because of that, the original behaviour of never cleaning up source snapshots if any destination failed meant the source snapshots would never get cleaned up, which is not ideal.

This PR changes that by marking the last sent snapshot on every successful send, ensuring that the most recently sent snapshot for each destination is never deleted, and finally (optionally) cleaning up the source snapshots. So unless the destination changes unexpectedly, there should still be a common snapshot to synchronize incrementally from.

The cleaning of source snapshots is gated behind an option because the original behaviour is probably safer in general, and so should remain the default for use cases where offline destinations are not the usual condition. However, marking the sent snapshots and ensuring the most recent one is not deleted seems like potentially useful behaviour in general, so those are always on. I also fixed an inconsistency in how deleting destination snapshots works in the presence of failures (the behaviour differed depending on the order the destinations were synced in).

BTW, this is the first time I've done anything in Perl, so I'm sorry in advance if I've made any egregious mistakes due to that.

coveralls commented 5 years ago

Coverage Status

Coverage increased (+0.5%) to 87.883% when pulling 1af99ababbf512cf7a6244435ee8fb0cf492043f on snyman:master into e9d65e62b05cf8e38987d52ab2ef909f033413d8 on oetiker:master.

JMoVS commented 5 years ago

I think this PR would be the perfect opportunity to not only tag the last synced snapshot locally but also bookmark it. That way, it would be practically risk-free to clean up the source according to the normal plan and still be able to sync up to the destination, simply by using (maybe even always) the marked bookmark as the start of the incremental send.
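For illustration, a minimal sketch of the bookmark idea with plain zfs commands (the dataset, bookmark, and host names are made up, and this is not what znapzend currently does):

```sh
# Create a snapshot and a bookmark pointing at it; the bookmark survives
# even if the snapshot itself is later destroyed on the source.
zfs snapshot tank/data@2020-01-01-000000
zfs bookmark tank/data@2020-01-01-000000 'tank/data#dst_a-last-sent'

# A later snapshot can still be sent incrementally, using the bookmark as
# the starting point (the destination must already hold the snapshot the
# bookmark was created from).
zfs send -i 'tank/data#dst_a-last-sent' tank/data@2020-01-02-000000 | \
    ssh backuphost zfs recv backup/data
```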

oetiker commented 5 years ago

@JMoVS note that a local snapshot can be synced to multiple destinations ...

JMoVS commented 5 years ago

@oetiker Yep, that's why one could simply name the local snapshot's bookmark after the destination.

Is it possible to create more than one bookmark for a single snapshot?

jimklimov commented 4 years ago

As a side comment, note that ZFS bookmarks, holds, etc. may not be available on systems that run znapzend, due to (for whatever reason) old zpool/zfs versions on disk and/or older kernels running them. So at least some error-checking (and resilience) is due when adding the tagging of sent snapshots.

A possible alternative could be to add the tags via ZFS dataset (user) properties set on the snapshots themselves; this should be possible with ZFS versions as old as I could check (circa Solaris 10u8, or maybe even earlier).
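As a rough sketch of that alternative (the property name here is made up purely for illustration, not what znapzend would actually use):

```sh
# ZFS user properties (the name must contain a colon) can be set directly
# on snapshots, so a "sent" marker does not need bookmark/hold support:
zfs set org.znapzend:dst_a_synced=1 tank/data@2020-01-01-000000

# ...and read back later, e.g. when looking for a common incremental base:
zfs get -H -o value org.znapzend:dst_a_synced tank/data@2020-01-01-000000
```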

hansoostendorp commented 4 years ago

I ran into the same problem today. I set up a source dataset to sync with two destination datasets, one local, one offsite. I keep snapshots on the source dataset every minute for 1 hour. I shut down the offsite backup server yesterday, and turned it on again today.

I noticed the snapshots on the source weren't cleaned up. I had 1500+ snapshots. All these 1500 snapshots are now syncing to the offsite backup destination, to be deleted again. This takes a lot of time and could create a problem. I don't have a guarantee the offsite server is always available.

This PR would solve that problem! I really like the solution, looks solid. This was also mentioned in #425

I don't know why this PR didn't make it into a release. Is there anything I can do to help?

oetiker commented 4 years ago

You are right, this one sort of got swept under the carpet.

oetiker commented 4 years ago

Unfortunately the code base has diverged a bit... anyone care to fix the conflicts?

jimklimov commented 4 years ago

Taking a look, since it mostly diverges from what I touched recently :)

jimklimov commented 4 years ago

My initial reading (even today, and in the comments posted months ago above) was that this PR would record the name of the last-synced snapshot for a particular destination policy in the (root) source dataset, using an attribute along the lines of dst_N_synced=snapname.

On a deeper review now, I have come to the conclusion that it instead just sets a dst_N_synced=1 flag on each snapshot that is known to be synced, and when looking for common points with offline destinations it searches from the newest snapshot down into history until it gets a hit.
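Conceptually, something like the following (a sketch only; the property and dataset names are illustrative, and the real code does this in Perl rather than shell):

```sh
# Walk this dataset's snapshots from newest to oldest and stop at the first
# one flagged as already synced to destination "a".
zfs list -H -t snapshot -d 1 -o name -S creation tank/data | \
while read -r snap; do
    if [ "$(zfs get -H -o value org.znapzend:dst_a_synced "$snap")" = "1" ]; then
        echo "common incremental base for dst_a: $snap"
        break
    fi
done
```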

With the intention of not deleting such snapshots in place, this seems reasonable (and so we can clean up older snapshots). However, I wonder whether there is a good case for also tagging as synced any intermediate snapshots (if we had skipped scheduled updates and have now sent several increments)?

The logic at 906ac4f4cae4cd2bf49e7d0d8b9eeaf150a65cc2 seems to take care of my other nagging thought (so maybe other readers were confused too :) ): that a cleanup of snapshots for dst_X might kill the latest snapshot for dst_Y. I think this should not happen with that commit in place.

jimklimov commented 4 years ago

Thinking more about this solution, I think it would have an Achilles' heel with the "inherited" policy modes (e.g. --runonce=pool/coveredbypolicy/childdataset --inherited): at least on my system, while zfs get can retrieve attributes of snapshots and does not refuse -r nor -s inherited,local, it does not handle them in the same way as it can create or destroy snapshots recursively.

So to be safe, we would have to iterate up to the (grand-)parent with the policy (or generally up to the pool root) until we get a hit for an -s inherited,local setting of dst_X_synced, or until we run out of parents without a hit. Hopefully this would not be prohibitive on the I/O side, as it is nearly what ZFS itself does for "usual" datasets, and for repetitive requests the data would come from the cache.
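A rough sketch of that parent-walk, under the assumption that the parents carry same-named recursive snapshots (property and dataset names are again illustrative):

```sh
ds="pool/coveredbypolicy/childdataset"
snap="2020-01-01-000000"
while [ -n "$ds" ]; do
    # Only accept a value actually set locally or inherited, not the "-" placeholder.
    val=$(zfs get -H -o value -s local,inherited org.znapzend:dst_a_synced "${ds}@${snap}" 2>/dev/null)
    if [ -n "$val" ] && [ "$val" != "-" ]; then
        echo "dst_a_synced found via ${ds}@${snap}: $val"
        break
    fi
    # Climb one level towards the pool root, stopping after the top-level dataset.
    case "$ds" in */*) ds="${ds%/*}" ;; *) ds="" ;; esac
done
```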

I thought another solution could be to go with my first interpretation of this PR, as noted above: record the last-known destination snapshot name in the dataset properties (not as a flag on the snapshot itself). This would be inherited, but it could also be messier to maintain, specifically in view of --runonce --inherited. Probably not prohibitively messier, though: going this way, after a successful sync of a higher-level dataset (up to the one with the policy), one should take care to define its known last-synced snapshot names, and to undefine any such attributes if set in its children. In fact, the revised code of this PR already moved somewhat in this direction when I had to handle the block of recursive cleanup (it should look at all children to see whether they define the flag locally for some snapshot name matching the one in the root of the processed tree... though with the recursive lookup and the zfs tooling I have, it would probably only match if a descendant non-snapshot dataset defined the attribute we seek).
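A sketch of that first interpretation (illustrative names again, not the PR's actual behaviour):

```sh
# Record the last-synced snapshot *name* on the dataset carrying the backup
# plan; children pick it up through normal property inheritance.
zfs set org.znapzend:dst_a_synced=2020-01-01-000000 tank/data
zfs get -H -o value,source org.znapzend:dst_a_synced tank/data/child

# If a child had set the property locally at some point, drop the local
# value so the inherited one from the policy root wins again.
zfs inherit org.znapzend:dst_a_synced tank/data/child
```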

Posted what I have currently as #506 ...

jimklimov commented 4 years ago

I think there is also a loophole: if a destination snapshot is cleaned up on a different schedule (which lets the source snapshot survive for now), the dst_X_synced flag is not cleared from the same-named source snapshot; at the moment there is not even a routine like deleteSnapshotProperties().
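For reference, clearing such a flag would presumably boil down to something like the following (hypothetical property name; znapzend has no helper for this yet, as noted above):

```sh
# "zfs inherit" on a snapshot drops a locally-set user property, which is
# effectively the delete operation a deleteSnapshotProperties() helper
# would have to wrap.
zfs inherit org.znapzend:dst_a_synced tank/data@2020-01-01-000000
```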

While normally such a snapshot with a no-longer-valid flag would not be the newest source snapshot made (and tagged) by znapzend, since destination and then source cleanup usually happens after successful replication, I feel there is room for misleading discrepancies...

jimklimov commented 4 years ago

I suppose this PR can now be closed: a variant of the code made it to master (thanks @snyman!), and then some ;)