oetiker / znapzend

zfs backup with remote capabilities and mbuffer integration.
www.znapzend.org
GNU General Public License v3.0

Is there no support for recursive send (using `zfs send -R` replication streams)? #438

Closed jimklimov closed 3 years ago

jimklimov commented 4 years ago

Looking at the code around https://github.com/oetiker/znapzend/blob/66baf324d60526ab565844e11c363e5b39cb4c19/lib/ZnapZend/ZFS.pm#L339 (and at the processes actually running on my system), I think znapzend only sends an incremental or full update of the lastSnapshot for one dataset at a time. So for a fairly tree-ish ZFS system with hundreds of datasets it eventually (not all at once) forks hundreds of zfs send|mbuffer|(ssh|)zfs recv pipelines, each with its own latency hit of the zfs processes talking to the kernel for several seconds or minutes before actual I/O begins (related to #104 - similar use-case).

If we have a recursive = on setting for snapshot creation and cleaning, why not use it for sending too? At least when it succeeds we are done much more quickly; if it fails, we can re-evaluate which datasets and snapshots exist on both sides and catch up one by one to pin down the specific errors.
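
A minimal sketch of the difference, assuming a tree tank/data whose children all already hold the older snapshot on the destination backup/data (dataset names, host and mbuffer options are illustrative, not znapzend's actual code):

# Today, roughly: one pipeline per dataset, so N children mean
# N separate zfs send | mbuffer | zfs recv round trips.
:; zfs send -I 'tank/data/child1@old' 'tank/data/child1@new' \
     | mbuffer -q -s 256k -m 1G | ssh backuphost zfs recv -u backup/data/child1
# ... repeated for child2, child3, ...

# Proposed shortcut: one recursive replication stream for the whole tree.
# -R packs the dataset and all descendants (with their snapshots),
# -I keeps the intermediate snapshots.
:; zfs send -R -I 'tank/data@old' 'tank/data@new' \
     | mbuffer -q -s 256k -m 1G | ssh backuphost zfs recv -u backup/data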

The #104 discussion goes further into wanna-woulda land, with different retention policies and thus different sending schedules (as well as, in my earlier posted wishlist, exclusion support - so I could, e.g., set up the pool retention rules once but point-exclude the swap/dump/... datasets). Both of these directions call for a more sophisticated schedule calculator than the current all-or-nothing, but that is doable. Those features would be great, and with such a scheduler the computer could work out the needed scope itself (e.g. differently-scheduled zfs send -R calls for different sub-trees that share the same settings under some point) instead of admins doing that grunt work, so recursive actions and differing policies would not contradict each other.

Note that zfs send -R comes with caveats - for example, receiving with zfs recv -F removes snapshots and datasets that are absent on the source, so that flag should probably not be used here and be reserved for specific situations (e.g. someone modified a destination dataset mounted with atime=on or some such - do a single old-style sync of each such dataset with rollback, as we do now, to make incremental sends work again, and then a massive replication send without rollbacks).
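
A rough sketch of that two-phase idea, with hypothetical dataset names (this is not what znapzend currently does):

# Phase 1: catch up a destination child that was modified locally,
# rolling back only that one dataset - essentially today's per-dataset path.
:; zfs send -I 'tank/data/odd@old' 'tank/data/odd@new' \
     | ssh backuphost zfs recv -u -F backup/data/odd

# Phase 2: once every child is back in sync, switch to one recursive
# replication stream, received WITHOUT -F so nothing that exists only
# on the destination gets destroyed.
:; zfs send -R -I 'tank/data@old' 'tank/data@new' \
     | ssh backuphost zfs recv -u backup/data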

jimklimov commented 4 years ago

Thinking about it a bit more over the past week of digging in this code, a safe path forward seems to be:

Hopefully this approach stays backwards-compatible for complex configurations while letting simpler ones benefit from a single, faster send/receive command per tree; in particular I hope it would keep the zfs recv half of the equation from starting hundreds of times, each with its own evaluation of the pools and so on, and so would cut days-long runs of numerous small backups down to something much shorter.

oetiker commented 4 years ago

I like the idea of adding a 'cut through' path for 'the normal case'. Not sure about the opportunistic receive ... it is a backup, so I would like znapzend to complain if things do not work out.

griznog commented 4 years ago

Looking forward to testing this, as I currently have a number of filesystems, $HOME for example, where each user has a filesystem under the parent, so pool/home might contain 500 or 600 (and growing) child filesystems. There is no interest in allowing the children distinct settings, so even a simple option that just ignored any child settings and forced the -R attempt (or failed) would be perfect for this use case. Currently we have to keep backing off our snapshot/replication frequency as the number of users/directories grows. But we very much want to keep the individual per-user filesystems/snapshots, since occasionally we have to go in and drop one person's snapshots after they've done something that requires purging their $HOME, and we want that activity confined to the single affected user.
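
For what it's worth, that last point - dropping one person's snapshots without touching anyone else - stays a one-liner as long as the per-user child filesystems exist (the username is a placeholder):

# List only this dataset's own snapshots (-d 1) and destroy them one by one.
:; zfs list -H -t snapshot -o name -d 1 pool/home/someuser | xargs -n1 zfs destroy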

jimklimov commented 3 years ago

As a stressed reminder to self, and just as data points: there is a use-case for dataset trees with some branches pruned via zfs set org.znapzend:enabled=off backupSet-with-policy/excludedChild, and I've just checked that this is honored if I only define this one custom attribute in the child dataset:

### rpool/SHARED/var/mail is a backupSet with policy
:; zfs create rpool/SHARED/var/mail/test
:; zfs set org.znapzend:enabled=off rpool/SHARED/var/mail/test
:; znapzend --runonce=rpool/SHARED/var/mail

On the destination I got new snapshots of the parent dataset, but no mention of the excluded child. (In the debug log, the source dataset was recursively snapshotted and then that snapname was removed from the excluded children, as expected.)

This is something that operations with recursive send would have to take into account. For a first shot, maybe bluntly: only do zfs send -R if there are no excluded children under a given branch (and no children with a different/incompatible explicitly defined retention policy) - where the branch may not be the backupset root but a deeper point in its sub-tree; and handle the more complicated cases the same way as now, dataset by dataset.
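
A hypothetical pre-check along those lines (not existing znapzend code; tank/data stands for the candidate branch): look for descendants that override the znapzend property locally, and only take the zfs send -R shortcut when there are none:

# Any descendant with org.znapzend:enabled set locally (not inherited)
# means the branch has excluded or differently-configured children,
# so fall back to the current per-dataset sends for it.
:; zfs get -r -s local -H -o name,value org.znapzend:enabled tank/data \
     | awk '$1 != "tank/data"' | grep -q . \
     && echo "local overrides below tank/data: send dataset by dataset" \
     || echo "no overrides: safe to try zfs send -R on tank/data"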

UPDATE: I checked what happens if I make a further child, .../test/sub, which is enabled for replication. Currently this fails because the parent dataset does not exist on the destination (unless an admin creates it manually), but it does try:

:; zfs create rpool/SHARED/var/mail/test/sub
:; zfs set org.znapzend:enabled=on rpool/SHARED/var/mail/test/sub
:; znapzend --runonce=rpool/SHARED/var/mail
...
# zfs send -I 'rpool/SHARED/var/mail@znapzend-auto-2020-09-13T10:13:25Z' 'rpool/SHARED/var/mail@znapzend-auto-2020-09-13T10:22:45Z'|/opt/csw/bin/amd64/mbuffer -q -s 256k -W 600 -m 1G|zfs recv -u -F naspool/snapshots/rpool/SHARED/var/mail
# zfs list -H -o name -t snapshot rpool/SHARED/var/mail@znapzend-auto-2020-09-13T10:22:45Z 
# zfs set org.znapzend:dst_0=naspool/snapshots/rpool/SHARED/var/mail rpool/SHARED/var/mail@znapzend-auto-2020-09-13T10:22:45Z
# zfs set org.znapzend:dst_0_synced=1 rpool/SHARED/var/mail@znapzend-auto-2020-09-13T10:22:45Z
[Sun Sep 13 14:23:09 2020] [debug] sending snapshots from rpool/SHARED/var/mail/test to naspool/snapshots/rpool/SHARED/var/mail/test
[Sun Sep 13 14:23:09 2020] [debug] Are we sending "--since"? since=="0", skipIntermediates=="0", forbidDestRollback=="0", justCreated=="false"
# zfs list -H -o name -t snapshot -s creation -d 1 rpool/SHARED/var/mail/test
# zfs list -H -o name -t snapshot -s creation -d 1 naspool/snapshots/rpool/SHARED/var/mail/test
cannot open 'naspool/snapshots/rpool/SHARED/var/mail/test': dataset does not exist
[Sun Sep 13 14:23:09 2020] [debug] sending snapshots from rpool/SHARED/var/mail/test/sub to naspool/snapshots/rpool/SHARED/var/mail/test/sub
[Sun Sep 13 14:23:09 2020] [debug] Are we sending "--since"? since=="0", skipIntermediates=="0", forbidDestRollback=="0", justCreated=="false"
# zfs list -H -o name -t snapshot -s creation -d 1 rpool/SHARED/var/mail/test/sub
# zfs list -H -o name -t snapshot -s creation -d 1 naspool/snapshots/rpool/SHARED/var/mail/test/sub
cannot open 'naspool/snapshots/rpool/SHARED/var/mail/test/sub': dataset does not exist
# zfs send 'rpool/SHARED/var/mail/test/sub@znapzend-auto-2020-09-13T10:22:45Z'|/opt/csw/bin/amd64/mbuffer -q -s 256k -W 600 -m 1G|zfs recv -u -F naspool/snapshots/rpool/SHARED/var/mail/test/sub
cannot create 'naspool/snapshots/rpool/SHARED/var/mail/test/sub@znapzend-auto-2020-09-13T10:22:45Z': parent does not exist
mbuffer: error: outputThread: error writing to <stdout> at offset 0x0: Broken pipe
mbuffer: warning: error during output to <stdout>: Broken pipe
[Sun Sep 13 14:24:20 2020] [warn] ERROR: cannot send snapshots to naspool/snapshots/rpool/SHARED/var/mail/test/sub
[Sun Sep 13 14:24:20 2020] [warn] ERROR: suspending cleanup source dataset rpool/SHARED/var/mail because 1 send task(s) failed:
[Sun Sep 13 14:24:20 2020] [warn]  +-->   ERROR: cannot send snapshots to naspool/snapshots/rpool/SHARED/var/mail/test/sub
[Sun Sep 13 14:24:20 2020] [info] done with backupset rpool/SHARED/var/mail in 94 seconds
[Sun Sep 13 14:24:20 2020] [debug] send/receive worker for rpool/SHARED/var/mail done (9435)

After I created an empty naspool/snapshots/rpool/SHARED/var/mail/test manually, the next znapzend run succeeded in sending over the sub.
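
For anyone hitting the same wall, the manual workaround is just creating the intermediate dataset on the destination (names taken from the log above; canmount=off is optional and merely keeps the empty placeholder from being mounted):

:; zfs create -o canmount=off naspool/snapshots/rpool/SHARED/var/mail/test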

jimklimov commented 3 years ago

@griznog : Your use-case with individual homedir datasets makes sense (also for quotas, reservations, access to snapshots via CIFS Shadow Copies or Time Machine, and other per-user zfs settings).

I believe the goal you are after - having the growing populace replicated automatically - should already be handled by the --autoCreation parameter passed to the daemon in your (customized) service definition.
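
Roughly like this - only --autoCreation is confirmed in this thread; the remaining flags and paths are from memory of the znapzend manual and may differ on your install:

# Let the daemon create missing destination datasets for new children.
:; znapzend --daemonize --autoCreation --logto=/var/log/znapzend.log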

griznog commented 3 years ago

@jimklimov we do use --autoCreation and it works great; the issue is that replication of home requires 1000+ ssh connections, each of which takes time - from watching the logs it's at least 5-10 seconds per filesystem. We get some speedup by using an ssh ControlMaster setup, but as the number of child filesystems grows we still have to stretch the replication interval to accommodate doing this serially.
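
For reference, the ControlMaster arrangement amounts to a few lines in the sending user's ~/.ssh/config (the host name is a placeholder):

# Multiplex all per-dataset transfers over one ssh connection instead of
# paying the full handshake cost for every child filesystem.
Host backuphost
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m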

jimklimov commented 3 years ago

@griznog : in my experience there is a large lag while zfs recv looks through the target pool's snapshot guids before it decides how to receive each new snapshot. That overhead is relatively negligible for large snapshots, but annoying for empty ones that still take many seconds to transfer 324 bytes or so. And alas it happens for manual zfs send -R activities too, though maybe with a bit less overhead than many individual sends - or maybe not... I'm not sure anymore :-\

I wonder if implementing the opposite option - configurably parallel sends of the child datasets of one configured backupSet schedule (extending the original mode where we zfs send things one by one anyway, and only parallelize different schedules) - could be a suitable approach: more bulk data transferred in any given second and less time spent doing ONLY those guid lookups, or whatever the receiver waits on initially...
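
Purely as an illustration of that idea - not something znapzend does today - a shell-level equivalent of "configurably parallel per-child sends" could look like this, assuming GNU xargs, that every child already holds the older snapshot on the destination, and that the destination parents exist:

# Push the direct children of tank/home in up to 4 parallel incremental
# streams; dataset, snapshot and host names are made up.
:; zfs list -H -o name -d 1 tank/home | tail -n +2 \
     | xargs -P 4 -I '{}' sh -c \
       'zfs send -I "{}@old" "{}@new" | ssh backuphost zfs recv -u "backup/{}"'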

griznog commented 3 years ago

We aren't very particular about how we get the speedup, just hoping we can get something faster 😄 A quick test of a manual zfs send -R on our $HOME shows it is indeed not a silver bullet for this problem: it takes several minutes before it even starts sending any data.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.