CSWznapzend on Solaris 10: Can't prune old snapshots: invalid argument name

jimklimov commented 6 years ago

I'm using the CSWznapzend package (based on the 0.13 release, I guess) on Solaris 10u10. It tends to fail when cleaning up older snapshots once some time has passed; I think this is because it calls the zfs CLI with syntax that this Solaris version does not support:

[Thu Jul 12 09:18:30 2018] [debug] cleaning up snapshots on backup/export/DUMP/regular/ucs-oracle-gz
# zfs destroy backup/export/DUMP/regular/bigdata@znapzend-auto-2018-07-03T00:00:00Z,znapzend-auto-2018-07-04T00:00:00Z,znapzend-auto-2018-07-05T00:00:00Z
cannot open 'backup/export/DUMP/regular/bigdata@znapzend-auto-2018-07-03T00:00:00Z,znapzend-auto-2018-07-04T00:00:00Z,znapzend-auto-2018-07-05T00:00:00Z': invalid dataset name
[Thu Jul 12 09:18:30 2018] [warn] ERROR: cannot destroy snapshot(s) backup/export/DUMP/regular/bigdata@znapzend-auto-2018-07-03T00:00:00Z
[Thu Jul 12 09:18:30 2018] [info] done with backupset backup/export in 0 seconds
[Thu Jul 12 09:18:30 2018] [debug] send/receive worker for backup/export done (4658)

I am not sure whether newer znapzend versions have fixed this, but at least CSW users should know that the problem exists in the version available to them. I'll try to install the source version to check when I have time (I won't be online much in the next few weeks).

The effect is that pools fill up with old auto-snapshots that are never deleted, and eventually the system collapses. A collateral issue seems to be that if a pool is full (or a dataset tree quota is exceeded) and znapzend can't clean away old snapshots, it can't create new ones and can't send anything to the backup at all; it does not even try to send whatever it already has.

oetiker commented 6 years ago

Use --features=oracleMode as described in the manual page.
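
For readers hitting the same error, here is a minimal sketch of the difference: instead of one zfs destroy call with a comma-separated snapshot list (which the Solaris 10 zfs rejects as an invalid dataset name), one call is issued per snapshot. The helper name split_destroy is hypothetical and this is not znapzend's actual code; it only prints the commands, and it assumes a POSIX shell (e.g. /usr/xpg4/bin/sh, ksh or bash, not the legacy Solaris /bin/sh).

```shell
# Hypothetical helper: turn "dataset@snap1,snap2,..." into one
# "zfs destroy" command line per snapshot. It only prints the
# commands, so the output can be reviewed before running anything.
split_destroy() {
    spec=$1
    dataset=${spec%%@*}   # everything before the first '@'
    snaps=${spec#*@}      # the comma-separated snapshot names
    old_ifs=$IFS
    IFS=,
    for snap in $snaps; do
        echo "zfs destroy ${dataset}@${snap}"
    done
    IFS=$old_ifs
}
```

Calling it with the snapshot spec from the log above prints three separate zfs destroy lines, which is the form the older zfs CLI accepts.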

jimklimov commented 6 years ago

Thanks a lot, I can confirm this works with the CSW version, at least in runonce mode. For the service daemon, it seems the method/init script has to be hacked directly at /var/opt/csw/svc/method/svc-cswznapzend. I'll see how that goes...

jimklimov commented 6 years ago

One caveat is that deleting a single dataset (snapshot) takes at least a few seconds because of some synchronous zfs metadata updates (even when there is no referenced user data to free), so with two weeks' worth of backlog on a branchy dataset tree, it has not finished after a day of runtime.

If the auto-snapshots are made with recursion, they should also be destroyed with recursion from the 'src' parent, at least as a first pass (and if that fails, retried one by one). That takes about as long as destroying one dataset. I am not sure whether the current code does this, so just a note ;)
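
A rough sketch of that idea, not taken from znapzend: try a single zfs destroy -r on the parent's snapshot first, and only fall back to per-dataset destroys if that fails. The function name destroy_snapshot_tree and the ZFS_CMD override variable are made up for illustration; a POSIX shell is assumed.

```shell
# Assumption: ZFS_CMD can be overridden for dry runs or testing.
ZFS_CMD=${ZFS_CMD:-zfs}

# Hypothetical helper: destroy snapshot $2 on dataset $1 and on all
# of its descendants.
destroy_snapshot_tree() {
    parent=$1
    snap=$2
    # First pass: one recursive destroy from the parent dataset.
    if "$ZFS_CMD" destroy -r "${parent}@${snap}"; then
        return 0
    fi
    # Fallback: walk the dataset tree and destroy one by one.
    # Children that never had this snapshot will report an error,
    # which is reflected in the return code.
    rc=0
    for ds in $("$ZFS_CMD" list -H -o name -r -t filesystem,volume "$parent"); do
        "$ZFS_CMD" destroy "${ds}@${snap}" || rc=1
    done
    return $rc
}
```

For example, destroy_snapshot_tree backup/export znapzend-auto-2018-07-03T00:00:00Z would attempt the whole tree in one call before falling back.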

Also, from my practice, firing off hundreds of zfs destroy commands into the background takes about as long as running a single one, so it does a lot of good quickly.
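
Something along these lines is what I mean by backgrounding, as an untested-on-production sketch: start every destroy as a background job and wait for all of them. The names destroy_in_background and ZFS_CMD are hypothetical, and there is no limit on the number of parallel jobs here, which may or may not be wise on a busy pool.

```shell
# Assumption: ZFS_CMD can be overridden for dry runs or testing.
ZFS_CMD=${ZFS_CMD:-zfs}

# Hypothetical helper: destroy every snapshot given as an argument,
# each in its own background job, then wait for all jobs to finish.
destroy_in_background() {
    for snap in "$@"; do
        "$ZFS_CMD" destroy "$snap" &
    done
    # Block until every background destroy has exited.
    wait
}
```

Note that a bare wait does not report individual failures, so a follow-up zfs list -t snapshot is still needed to see what survived.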

oetiker / znapzend

CSWznapzend on Solaris 10: Can't prune old snapshots: invalid argument name #369