oetiker / znapzend

zfs backup with remote capabilities and mbuffer integration.
www.znapzend.org
GNU General Public License v3.0

Can speed up a lot with recursive "zfs destroy" #385

Closed: jimklimov closed this issue 4 years ago

jimklimov commented 5 years ago

The behavior I currently see in practice, and in the code, is that after the snapshots are made and sent, they are destroyed either as one huge list of arguments by default, or one by one in oracleMode. Our setup uses an extensive tree of datasets with a recursive znapzend policy rooted at a low-hanging branch of the pool, so there are typically thousands of snapshots to delete, and processing each one takes the host several seconds because of all the synchronous contexts. The loop simply does not keep up with the job, and we run out of space because obsolete snapshots keep referencing data we no longer need.

Note 1: if I use backgrounded zfs destroy commands from a loop and/or a recursive destroy, it takes about as long to destroy hundreds of snapshots as it takes to destroy one, so there ought to be some common big lock there that effectively batches and coalesces these requests.
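
For comparison, this is roughly what the two cleanup styles look like on the CLI (a sketch with hypothetical dataset and snapshot names; the point is that the recursive form is effectively one batched operation for the whole subtree):

# one-by-one: every child dataset pays the full synchronous zfs overhead
for ds in $(zfs list -H -r -o name -t filesystem,volume pool/branch); do
    zfs destroy "$ds@2018-01-01-000000"   # errors on children lacking this snapshot are harmless here
done

# recursive: a single command, which in practice takes about as long as destroying one snapshot
zfs destroy -r pool/branch@2018-01-01-000000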

Note 2: in fact, freeing the released blocks happens in the background and asynchronously even on later Solaris 10 releases and on current illumos, and can continue for minutes after the zfs CLI commands have completed and returned. But getting those blocks onto the hit-list in the first place takes a considerable while.

I see that creation of snapshots takes the ZFS recursion support into account at https://github.com/oetiker/znapzend/blob/c604a86857430258c2b8479c356437c0f61a4dc6/lib/ZnapZend/ZFS.pm#L221 but removal does not seem to: https://github.com/oetiker/znapzend/blob/c604a86857430258c2b8479c356437c0f61a4dc6/lib/ZnapZend/ZFS.pm#L235

My suggestion is to start the cleanup phase with a quick recursive deletion of the specified snapshot name, issued at the root dataset of the branch that carries the local znapzend zetup, especially when we know it was making recursive snapshots in the first place. Then we can follow up with the existing logic to find snapshots in child datasets that were possibly missed and should also be removed. Hopefully this latter part would usually have nothing to do.
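
In terms of zfs commands, the proposed cleanup would look roughly like this (a sketch with hypothetical names, not the eventual implementation):

# phase 1: one recursive destroy per obsolete snapshot name, issued at the policy root
zfs destroy -r pool/branch@2018-01-01-000000

# phase 2: the existing per-child logic sweeps up stragglers, e.g. snapshots that
# exist on a child but were never part of a recursive snapshot from the root
zfs list -H -o name -t snapshot -s creation -d 1 pool/branch/child
# ...then destroy whatever obsolete names remain, as today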

I'll try to give this idea a run in our deployments before PRing, but comments are welcome in general :)

jimklimov commented 5 years ago

It seems the best place for the change is to first process $backupSet->{src} (rather than the generalized $srcSubDataSets right away), using a destroySnapshots() extended with recursive-flag support like the one used in createSnapshots(), around https://github.com/oetiker/znapzend/blob/4ccbe714186ac1fbc72a81e0548e6178279b8c76/lib/ZnapZend.pm#L354

Adding the recursion right into destroySnapshots() as a forced code path would likely backfire: the child datasets would no longer have the named snapshots (already listed in sendRecvCleanup()) to delete, so we would call zfs in vain, wasting resources and/or failing. I think recursive deletion should be independent of the one-by-one mode, which should then work from a list of whatever survived the recursive deletion.
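
A small illustration of the failure mode being avoided (hypothetical names; the exact error text varies between ZFS implementations):

# the recursive pass already removed the child's snapshot...
zfs destroy -r pool/branch@2018-01-01-000000
# ...so a later per-child destroy of the same, already-listed name has nothing left to act on
zfs destroy pool/branch/child@2018-01-01-000000   # fails: the snapshot no longer exists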

oetiker commented 5 years ago

that sounds like a good approach! looking forward to your PR

jimklimov commented 5 years ago

The PR got to a state where it seems to work for me without producing surprises or Perl warnings, so feel free to test.

I made a stack of datasets and allowed my non-root user to play with those:

sudo zfs create rpool/export/test
sudo zfs create rpool/export/test/dst
sudo zfs create rpool/export/test/src
sudo zfs create rpool/export/test/src/child
sudo zfs allow -ldu jim clone,create,destroy,diff,mount,promote,rollback,snapshot,share,sharenfs,sharesmb,canmount,mountpoint,send,receive,mount,hold rpool/export/test

and made a setup for quick testing (every minute, little retention):

$ sudo ./bin/znapzendzetup create --recursive SRC '3min=>1min' rpool/export/test/src DST '1min=>1min' rpool/export/test/dst
*** backup plan: rpool/export/test/src ***
dst_0           = rpool/export/test/dst
dst_0_plan      = 1minute=>1minute
enabled         = on
mbuffer         = off
mbuffer_size    = 1G
post_znap_cmd   = off
pre_znap_cmd    = off
recursive       = on
src             = rpool/export/test/src
src_plan        = 3minutes=>1minute
tsformat        = %Y-%m-%d-%H%M%S
zend_delay      = 0

and bombarded it with

$  ./bin/znapzend -d --features=oracleMode --runonce rpool/export/test/src

and

$ ./bin/znapzend -d --runonce rpool/export/test/src

jimklimov commented 5 years ago

Example output:

jim@jimoo018:~/shared/znapzend$ ./bin/znapzend -d --features=oracleMode --runonce rpool/export/test/src
[Thu Oct 11 23:02:36 2018] [info] znapzend (PID=1743) starting up ...
[Thu Oct 11 23:02:36 2018] [info] refreshing backup plans...
[Thu Oct 11 23:02:36 2018] [info] found a valid backup plan for rpool/export/test/src...
[Thu Oct 11 23:02:36 2018] [info] znapzend (PID=1743) initialized -- resuming normal operations.
[Thu Oct 11 23:02:36 2018] [debug] snapshot worker for rpool/export/test/src spawned (1747)
[Thu Oct 11 23:02:36 2018] [info] creating recursive snapshot on rpool/export/test/src
# zfs snapshot -r rpool/export/test/src@2018-10-11-230236
[Thu Oct 11 23:02:36 2018] [info] checking ZFS dependent datasets from 'rpool/export/test/src' explicitely excluded
# zfs list -H -o name -t filesystem,volume
# zfs get -H -s local -o value org.znapzend:enabled rpool/export/test/src
# zfs get -H -s local -o value org.znapzend:enabled rpool/export/test/src/child
[Thu Oct 11 23:02:36 2018] [debug] snapshot worker for rpool/export/test/src done (1747)
[Thu Oct 11 23:02:36 2018] [debug] send/receive worker for rpool/export/test/src spawned (1752)
[Thu Oct 11 23:02:36 2018] [info] starting work on backupSet rpool/export/test/src
# zfs list -H -r -o name -t filesystem,volume rpool/export/test/src
[Thu Oct 11 23:02:36 2018] [debug] sending snapshots from rpool/export/test/src to rpool/export/test/dst
# zfs list -H -o name -t snapshot -s creation -d 1 rpool/export/test/src
# zfs list -H -o name -t snapshot -s creation -d 1 rpool/export/test/dst
# zfs send -I rpool/export/test/src@2018-10-11-225226 rpool/export/test/src@2018-10-11-230236|zfs recv -F rpool/export/test/dst
[Thu Oct 11 23:02:37 2018] [debug] sending snapshots from rpool/export/test/src/child to rpool/export/test/dst/child
# zfs list -H -o name -t snapshot -s creation -d 1 rpool/export/test/src/child
# zfs list -H -o name -t snapshot -s creation -d 1 rpool/export/test/dst/child
# zfs send -I rpool/export/test/src/child@2018-10-11-225226 rpool/export/test/src/child@2018-10-11-230236|zfs recv -F rpool/export/test/dst/child
# zfs list -H -o name -t snapshot -s creation -d 1 rpool/export/test/dst
[Thu Oct 11 23:02:37 2018] [debug] cleaning up snapshots recursively under rpool/export/test/dst
# zfs destroy -r rpool/export/test/dst@2018-10-11-225208
# zfs destroy -r rpool/export/test/dst@2018-10-11-225226
[Thu Oct 11 23:02:37 2018] [debug] now will look if there is anything to clean in children of rpool/export/test/dst
# zfs list -H -o name -t snapshot -s creation -d 1 rpool/export/test/dst/child
# zfs list -H -o name -t snapshot -s creation -d 1 rpool/export/test/src
[Thu Oct 11 23:02:37 2018] [debug] cleaning up snapshots recursively under rpool/export/test/src
# zfs destroy -r rpool/export/test/src@2018-10-11-222250
# zfs destroy -r rpool/export/test/src@2018-10-11-222426
# zfs destroy -r rpool/export/test/src@2018-10-11-222505
# zfs destroy -r rpool/export/test/src@2018-10-11-222644
# zfs destroy -r rpool/export/test/src@2018-10-11-223057
# zfs destroy -r rpool/export/test/src@2018-10-11-223503
# zfs destroy -r rpool/export/test/src@2018-10-11-223610
# zfs destroy -r rpool/export/test/src@2018-10-11-223709
# zfs destroy -r rpool/export/test/src@2018-10-11-223954
# zfs destroy -r rpool/export/test/src@2018-10-11-224007
# zfs destroy -r rpool/export/test/src@2018-10-11-224150
# zfs destroy -r rpool/export/test/src@2018-10-11-224234
# zfs destroy -r rpool/export/test/src@2018-10-11-224311
# zfs destroy -r rpool/export/test/src@2018-10-11-224433
# zfs destroy -r rpool/export/test/src@2018-10-11-224506
# zfs destroy -r rpool/export/test/src@2018-10-11-224631
# zfs destroy -r rpool/export/test/src@2018-10-11-224716
# zfs destroy -r rpool/export/test/src@2018-10-11-224834
# zfs destroy -r rpool/export/test/src@2018-10-11-225120
# zfs destroy -r rpool/export/test/src@2018-10-11-225208
# zfs destroy -r rpool/export/test/src@2018-10-11-225226
[Thu Oct 11 23:02:37 2018] [debug] now will look if there is anything to clean in children of rpool/export/test/src
# zfs list -H -o name -t snapshot -s creation -d 1 rpool/export/test/src/child
[Thu Oct 11 23:02:37 2018] [info] done with backupset rpool/export/test/src in 1 seconds
[Thu Oct 11 23:02:37 2018] [debug] send/receive worker for rpool/export/test/src done (1752)

oetiker commented 5 years ago

are you cleaning recursively in any case, or only for filesets which have recursive enabled?

jimklimov commented 5 years ago

Deployed this change to our Solaris 10 server (backported to the cswznapzend release); no data seems to have been eaten :)

The difference in timing for comparable resync jobs was dramatic, especially where the ZFS trees were big and branchy. Here is a run with the old code a couple of days ago (znap.log) and with the new one today (znap2.log), for three different trees:

/var/tmp/znap.log:real  15m35.561s
/var/tmp/znap.log:user  0m2.046s
/var/tmp/znap.log:sys   0m10.186s

/var/tmp/znap2.log:real 6m24.706s
/var/tmp/znap2.log:user 0m1.682s
/var/tmp/znap2.log:sys  0m10.616s

###

/var/tmp/znap.log:real  1509m31.119s
/var/tmp/znap.log:user  1m56.800s
/var/tmp/znap.log:sys   4m5.472s

/var/tmp/znap2.log:real 73m36.980s
/var/tmp/znap2.log:user 0m23.250s
/var/tmp/znap2.log:sys  1m39.852s

###

/var/tmp/znap.log:real  0m59.732s
/var/tmp/znap.log:user  0m1.306s
/var/tmp/znap.log:sys   0m4.248s

/var/tmp/znap2.log:real 0m26.057s
/var/tmp/znap2.log:user 0m0.866s
/var/tmp/znap2.log:sys  0m1.819s

Note that the loop which lists remaining snapshots in the recursive child datasets one by one should perhaps also be optimized into a single zfs call and one later destroySnapshots() call: just looking around to find there was nothing to do took about 6 minutes.
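
A sketch of that possible follow-up optimization (hypothetical, reusing the test datasets from above): one recursive listing over the whole policy subtree could feed a single destroySnapshots() call, instead of one listing per child.

# one recursive listing over the whole subtree...
zfs list -H -r -o name -t snapshot -s creation rpool/export/test/src
# ...instead of one 'zfs list -H -o name -t snapshot -s creation -d 1 <child>' per child dataset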

jimklimov commented 5 years ago

And yes, the recursive cleanup mode currently only kicks in (both for the destinations and the source) if the source dataset's policy has recursive backup enabled.

jimklimov commented 5 years ago

So with this patch on the server for a week, it has not suffered any unexpected losses :) and mostly manages to complete its snap/sync/cleanup loops. With a 2hr=>1hr policy it sees some 3-4 snapshots in the source (where I think +1 is by design, so 3 snapshots in place are okay), rather than a couple of days of backlog like when we had 1day=>1hr and no recursive destroy.

jimklimov commented 4 years ago

I also read today about the ability to have a recursive backup plan but have it not enabled on some child datasets. I wonder whether this recursive mass-removal would contradict anything there (e.g. cleaning up datasets that we did not intend to actively back up? then it's good... or datasets that we did not intend to touch at all? then it's bad...).

jimklimov commented 4 years ago

Closing the issue as the solution proposed in #386 got merged today :)