Want a separate command for safe removal of unneeded overdue snapshots

jimklimov commented 5 years ago

While investigating an issue where znapzend sometimes refuses to clean up old snapshots, or does not manage to do it in time (when I/O is slow), I thought it would be nice to have a mode that just drops the older autosnaps safely, e.g. when they are beyond the retention policy timeout and their disappearance would not preclude further incremental syncs. The deepest original problem in the stack is the (source) system overflowing the pool with data referenced by snapshots that we expected to be long gone by now. Manually killing off older snapshots has sometimes misfired, leaving no common snapshots between dst and src.

jimklimov commented 5 years ago

My primary question in this direction at the moment is whether the current code in the ZnapZend.pm sendRecvDestroy() suffices for this (so that making the send/recv part optional, and enabled by default, would cut it), or whether this new feature would need a new way to discover which snapshots exist on all src/dst combos and are obsolete, so they can be killed off. At the least, to solve the original practical problem, it should find which src snaps we won't be sending anymore (because some newer snaps are seen on all dst's).

oetiker commented 5 years ago

I would also assume that just running the cleanup step should be feasible.

jimklimov commented 5 years ago

It seems so, but without very deep digging in the logic I am not sure whether the cleanup step assumes that send/recv happened before it and succeeded, so that all snaps are by definition in place. The start of the send/recv routine has a good-looking test for whether there are compatible snaps in the destination... maybe it can be snatched and adapted into the "destroy-only" codepath.

Currently I'm busy with other work, so I'm background-processing this idea in general; inputs/corrections are welcome ;)

oetiker commented 5 years ago

Yes, it cannot know whether it 'may' remove something without actually syncing ... note there is another patch which records the status of the syncing to work around this problem:

https://github.com/oetiker/znapzend/pull/384

jimklimov commented 5 years ago

I wonder if it is easy to use the existing code to just tackle the original problem directly:

1. Load the lists of existing auto-snapshots on src and dst.
2. Correlate them to detect and warn about anticipated problems: no common snaps (which would cause issues on resync, requiring removal of all snaps on DST or renaming and recreating that dataset), an inaccessible DST, or no space on the SRC or DST pool or quota'ed dataset (so making or receiving snaps is quickly known to "will have failed"). Some of these use-cases keep repeating for us and cause the normal loop to fail.
3. Going back into the darker ages from the oldest snapshot among the newest common snapshots of SRC and DST#N (i.e. the oldest snapshot that must remain in SRC), and excluding this snapshot, evaluate whether any of the older snaps may be removed from SRC according to the retention policy. Likewise for each DST, going back from its newest common snapshot (excluding it).

Unless I've missed something, this should produce warnings about what could be blocking normal operations for the daemon, as well as a list of snapshots that are safe to remove, almost as quickly as it takes to run zfs list -o ... on the root SRC/DST datasets involved (preferably with recursion, to reduce the number of zfs callouts).

Does anything stick out as a problem in such an approach?
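To make the proposal easier to poke holes in, here is a minimal Python sketch of the selection logic only. It is not znapzend code (znapzend is written in Perl), and every name in it (`Plan`, `newest_common`, `plan_cleanup`) is hypothetical. It assumes the snapshot lists have already been gathered as mappings of snapshot name to creation time, that a common snapshot carries comparable timestamps on both sides, and it collapses the retention policy into a single maximum age per side, which is much simpler than a real tiered plan. It also chooses, conservatively, to remove nothing from SRC if any DST has no common snapshot.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Result of a dry-run analysis; nothing is destroyed by this sketch."""
    warnings: list = field(default_factory=list)
    src_removable: list = field(default_factory=list)
    dst_removable: dict = field(default_factory=dict)


def newest_common(src, dst):
    """Return the name of the newest snapshot present on both sides, or None."""
    common = set(src) & set(dst)
    if not common:
        return None
    return max(common, key=lambda name: src[name])


def plan_cleanup(src, dsts, now, src_max_age, dst_max_age):
    """Work out which snapshots could be dropped without breaking incrementals.

    src: {snapshot_name: creation_time} for the source dataset.
    dsts: {dst_label: {snapshot_name: creation_time}} for each destination.
    *_max_age: simplified stand-in for the retention policy (an assumption).
    """
    plan = Plan()
    anchors = []  # newest common snapshot for each destination
    for label, dst in dsts.items():
        anchor = newest_common(src, dst)
        if anchor is None:
            plan.warnings.append(
                f"{label}: no common snapshot with source, resync would fail"
            )
            plan.dst_removable[label] = []
            continue
        anchors.append(anchor)
        # On each DST: only snapshots strictly older than its anchor, and expired.
        plan.dst_removable[label] = sorted(
            name for name, created in dst.items()
            if created < dst[anchor] and now - created > dst_max_age
        )

    # Conservative choice: touch SRC only if every DST has a common snapshot.
    if anchors and len(anchors) == len(dsts):
        # The oldest of the per-DST anchors must remain on SRC.
        keep_from = min(src[name] for name in anchors)
        plan.src_removable = sorted(
            name for name, created in src.items()
            if created < keep_from and now - created > src_max_age
        )
    return plan


if __name__ == "__main__":
    src = {"s1": 100, "s2": 200, "s3": 300, "s4": 400}
    dsts = {
        "dstA": {"s1": 100, "s2": 200, "s3": 300},
        "dstB": {"s1": 100, "s2": 200},
    }
    result = plan_cleanup(src, dsts, now=1000, src_max_age=750, dst_max_age=850)
    print(result)
```

In the example data, dstB lags behind dstA, so its newest common snapshot is the one that pins what SRC must keep. The checks for an inaccessible DST or a full pool or quota from the list above are left out; they would add further entries to `warnings`.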

jimklimov commented 5 years ago

As an added benefit, judging from today's experience, such "safe snapshot deletion" would especially help when the pool (or quota) is full and no more snaps can be created. In this case, honest sending (and the subsequent post-factum cleanup) is blocked from succeeding, even if a lot of space could be gained by dropping older unneeded snapshots. In fact, it might make sense to start the big loop with such a cleanup, when/if it is implemented and stable.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

oetiker / znapzend

Want a separate command for safe removal of unneeded overdue snapshots #389