oetiker / znapzend

zfs backup with remote capabilities and mbuffer integration.
www.znapzend.org
GNU General Public License v3.0

Design problem with "enabled=off, recursion=on" handling #597

Closed: jimklimov closed this issue 1 year ago

jimklimov commented 1 year ago

Filing this mostly as an FYI caveat; the best available fix may simply be to document it:

By design, partially recursive snapshots (e.g. we want rpool except rpool/swap) work as follows: znapzend creates a recursive snapshot of the dataset that locally defines a znapzend schedule, then walks its sub-datasets and removes the just-created snapshots (only those, by known name) wherever enabled=off. This leaves a time window in which a host crash, reboot, etc. can leave unintended snapshots in place.
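To make the two-step sequence concrete, here is a minimal Python sketch of the order of operations described above. It is not znapzend's actual code (znapzend is written in Perl); the function name `plan_recursive_snapshot`, the `enabled` map and the snapshot name are hypothetical, and the map is assumed to already hold the effective enabled setting for every dataset. The sketch only builds `zfs` command lines and does not run them.

```python
from typing import Dict, List


def plan_recursive_snapshot(
    root: str, enabled: Dict[str, bool], snap_name: str
) -> List[List[str]]:
    """Build the command sequence: one recursive snapshot, then cleanups.

    `enabled` is a hypothetical map of dataset name -> effective
    enabled setting (True for on, False for off).
    """
    # Step 1: a single recursive snapshot of the dataset that defines
    # the schedule; this also snapshots every descendant.
    commands = [["zfs", "snapshot", "-r", f"{root}@{snap_name}"]]

    # Step 2: remove the just-created snapshot, by its known name, from
    # each descendant that has enabled=off. A crash between step 1 and
    # the end of step 2 leaves these snapshots behind.
    for dataset in sorted(enabled):
        is_descendant = dataset.startswith(root + "/")
        if is_descendant and not enabled[dataset]:
            commands.append(["zfs", "destroy", f"{dataset}@{snap_name}"])
    return commands
```

For example, with `rpool/swap` disabled under `rpool`, the plan is one `zfs snapshot -r` followed by one `zfs destroy` of the swap snapshot; everything between those two commands is the window in which an interruption leaves the unwanted snapshot behind.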

Besides potentially hogging space on datasets with high data turnover, this also causes messages like:

Dec  5 14:33:18 ci-oi znapzend[17860]: [ID 702911 daemon.warning] ERROR: suspending cleanup source dataset rpool/export because 1 send task(s) failed:
Dec  5 14:33:18 ci-oi znapzend[17860]: [ID 702911 daemon.warning]  +-->   ERROR: snapshot(s) exist on destination, but no common found on source and destination: clean up destination znapzend:pond/export/DUMP/ci-oi/rpool/export/home/builder/.ccache (i.e. destroy existing snapshots)

The error is emitted because the local rpool/export/home/builder/.ccache had a znapzend-made snapshot and so became a candidate for sending off-site, although it should not have been considered at all.

On one hand, this message makes it possible to notice the problem at all (if someone occasionally looks at dmesg). On the other hand, the failure causes space hogging in nearby datasets (e.g. all children of rpool/export in the log above) and forces the system to track many more snapshots (their minuscule overheads do add up en masse), because for safety reasons cleanup is suspended and these frequent snapshots are not weeded out according to the retention schedule in the configuration (e.g. thinning from every 30 minutes for the most recent hours down to once a week for a year).
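Until this is handled in the tool itself, leftovers of this kind could be spotted with a small check like the Python sketch below. It is only an illustration under stated assumptions: `find_stray_snapshots` and the `enabled` map are hypothetical names, the snapshot list is assumed to come from something like `zfs list -H -t snapshot -o name`, and the check does not distinguish znapzend-made snapshots from manually created ones.

```python
from typing import Dict, Iterable, List


def find_stray_snapshots(
    snapshots: Iterable[str], enabled: Dict[str, bool]
) -> List[str]:
    """Return snapshots that sit on datasets marked enabled=off.

    `snapshots` holds full names such as "pool/fs@name"; `enabled` is a
    hypothetical map of dataset -> effective enabled setting. Datasets
    missing from the map are treated as enabled.
    """
    stray = []
    for snap in snapshots:
        # Split "dataset@snapname" at the first "@".
        dataset, _, _snap_name = snap.partition("@")
        if enabled.get(dataset, True) is False:
            stray.append(snap)
    return sorted(stray)
```

Any name it returns is a candidate for manual review before destroying, since the sketch cannot tell who created the snapshot.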

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.