For now, as an FYI caveat, maybe the best-shot solution is to document this:
By design of partially-recursive snapshots (e.g. we want `rpool` except `rpool/swap`), the znapzend logic actually creates a recursive snapshot of the dataset that locally defines a znapzend schedule, and then walks its sub-datasets to remove the just-created snapshots (only those, by known name) wherever `enabled=off` is set. This leaves a time window in which the host can crash/reboot/etc. and leave the unintended snapshots in place.
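The sequence described above can be sketched roughly as follows. This is a simplified Python model for illustration only, not the actual znapzend code; the dataset names, the snapshot name, and the boolean `enabled` flag are stand-ins for the real ZFS properties:

```python
# Toy model of znapzend's partially-recursive snapshot logic.
# Maps dataset name -> whether znapzend is enabled on it.
datasets = {
    "rpool": True,        # defines the schedule
    "rpool/export": True,
    "rpool/swap": False,  # enabled=off: should end up with no snapshot
}
snapshots = set()

def snapshot_recursive(root, snapname):
    """Step 1: like 'zfs snapshot -r root@snapname' -- hits ALL children."""
    for ds in datasets:
        if ds == root or ds.startswith(root + "/"):
            snapshots.add(ds + "@" + snapname)

def prune_disabled(snapname):
    """Step 2: walk the children and destroy the just-made snapshot
    (known by name) wherever the dataset has enabled=off."""
    for ds, enabled in datasets.items():
        if not enabled:
            snapshots.discard(ds + "@" + snapname)

snapname = "2020-12-05-143000"  # illustrative snapshot name
snapshot_recursive("rpool", snapname)
# <-- a crash/reboot between these two calls leaves the unintended
#     rpool/swap snapshot in place
prune_disabled(snapname)
```

The gap between the two calls is exactly the window the paragraph above describes: the recursive snapshot is atomic, but the cleanup of disabled children is a separate, later pass.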
Besides potentially hogging space on datasets with high data turnover, this also causes messages like:
```
Dec 5 14:33:18 ci-oi znapzend[17860]: [ID 702911 daemon.warning] ERROR: suspending cleanup source dataset rpool/export because 1 send task(s) failed:
Dec 5 14:33:18 ci-oi znapzend[17860]: [ID 702911 daemon.warning] +--> ERROR: snapshot(s) exist on destination, but no common found on source and destination: clean up destination znapzend:pond/export/DUMP/ci-oi/rpool/export/home/builder/.ccache (i.e. destroy existing snapshots)
```
The error is emitted because the local `rpool/export/home/builder/.ccache` had a znapzend-made snapshot and so was a candidate for sending off-site, while it should not have been considered at all.
On one hand this message allows the problem to be noticed at all (if someone occasionally looks at `dmesg`); on the other hand it causes space-hogging in nearby datasets (e.g. all of the `rpool/export` children in the post above), as well as requiring the system to track many more snapshots (their minuscule overheads do add up en masse), because for safety reasons these frequent snapshots are not weeded out according to the retention schedule in the configuration (e.g. dropping from every 30 minutes for a few recent hours down to once a week for a year).
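To give a sense of scale for the "overheads add up" point: a back-of-the-envelope count, using illustrative retention numbers (not znapzend defaults), shows how a disabled dataset that keeps collecting stray snapshots compares to one under a real retention schedule:

```python
# Rough snapshot counts after one year, assuming a snapshot
# every 30 minutes (illustrative schedule).
per_day = 24 * 60 // 30  # 48 snapshots/day

# Dataset under retention: keep 30-min snapshots for 6 hours,
# then thin to one weekly snapshot kept for a year
# (hypothetical numbers, for illustration only).
retained = 6 * 2 + 52    # 12 recent + 52 weekly = 64

# Disabled dataset accumulating stray snapshots that nothing
# ever prunes: the count just grows linearly.
stray = per_day * 365    # 17520 snapshots after a year

print(retained, stray)
```

So a single disabled dataset can end up tracking hundreds of times more snapshots than its retention-managed siblings, which is where both the space-hogging and the bookkeeping overhead come from.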
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.