oetiker / znapzend

zfs backup with remote capabilities and mbuffer integration.
www.znapzend.org
GNU General Public License v3.0

Thinning out after syncing seems impractical #579

Closed DavHau closed 1 year ago

DavHau commented 1 year ago

Hello, I'm not an expert on zfs snapshotting, so I might be missing something here. I just wanted to share my opinion on how znapzend currently deletes (thins out) local snapshots, since the behavior seems weird to me.

I think the order of operations is impractical in general. Currently znapzend sends first and deletes afterwards. Wouldn't it be better the other way round: first delete, then send? A rough sketch of what I mean follows below.
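To illustrate, here is a minimal sketch in plain zfs commands (tank/data, backup/data and the snapshot names are placeholders; this is not znapzend's actual code, just the order of operations I mean):

# current behaviour, as I understand it: send everything, then thin out
zfs send -i tank/data@old tank/data@new | ssh dst zfs recv backup/data
zfs destroy tank/data@obsolete    # thinning happens only after the send

# proposed order: thin out according to the plan first, then send what remains
zfs destroy tank/data@obsolete
zfs send -i tank/data@old tank/data@new | ssh dst zfs recv backup/data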

Issue 1: Network overhead

Znapzend seems to sync snapshots first and then thin them out. I'm now in a situation where I was disconnected from my destination for around 3 months. It is now syncing hourly snapshots over the internet, just to delete roughly 98% of that data afterwards on the destination. In other words: znapzend produces about 98% network overhead; only about 2% of the transferred data is actually kept. I'm often on a constrained network, and that overhead impacts the experience significantly.
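Some rough numbers behind the 98% figure (assuming an hourly source plan and a destination plan that thins down to roughly daily/weekly snapshots; my exact plan differs in detail):

# ~3 months offline            ≈ 90 days * 24  ≈ 2160 hourly snapshots accumulated
# kept on destination after thinning            ≈ 40-50 snapshots
# => roughly 98% of the transferred increments are destroyed right after receiving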

Issue 2: Local drive getting full

Another annoying side effect of this order of operations is that, since I was disconnected for 3 months, my SSD is filling up. Because the remote is gone, znapzend just never deletes anything. I've seen there is a --cleanOffline flag that is supposed to fix this, but it seems to be buggy (see error below) and not to work at all.

But even assuming that worked, my question is: why is this not the default? And why does it even wait for a failed sync attempt before it starts deleting? Why not just delete first and then sync over the network?

The man page on --cleanOffline describes why it could be a risk, but I can't really follow: "The most recent common snapshot for each destination (as tracked on source for resilience) will not be deleted from source, but this is still a potentially dangerous option: if the preserved snapshot somehow gets deleted from the destination, it may require a full re-replication the next time it is online."
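If I understand the incremental replication mechanics correctly, the risk the man page refers to looks roughly like this (placeholder names again, not znapzend's code):

# incremental send: works only while the destination still has the common snapshot
zfs send -i tank/data@common tank/data@latest | ssh dst zfs recv backup/data

# if @common is gone on the destination, the only remaining option is a full send
zfs send tank/data@latest | ssh dst zfs recv backup/data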

I don't really get that point. It could potentially be dangerous if the wrong snapshot gets deleted on the destination, but why should that ever happen? Computers don't do things accidentally. The only reason the wrong snapshot would be deleted is that a user deleted it manually. But I'd argue that this is a non-issue: a user should never delete znapzend-managed snapshots manually; they should be handled by znapzend only. If a user touches data they shouldn't touch, chaos is inevitable, and there is nothing you can do to protect them. Therefore, I don't see why this risk needs to be accounted for or why it should influence the design of the tool.


Error for --cleanOffline (the datasets it claims do not exist are snapshots that actually do exist):

[2022-07-21 23:42:11.45575] [855779] [warn] ERROR: cannot send snapshots to [...]
[2022-07-21 23:42:11.45601] [855779] [warn] ERROR: 1 send task(s) below failed for master/home, but "cleanOffline" mode is on, so proceeding to clean up source dataset carefully:
[2022-07-21 23:42:11.45606] [855779] [warn]  +-->   ERROR: cannot send snapshots to [...]
[2022-07-21 23:42:11.45611] [855779] [debug] checking to clean up snapshots recursively from source master/home
cannot open 'master@2022-07-21-234200': dataset does not exist
cannot open 'master@2022-07-21-233358': dataset does not exist
cannot open 'master@2022-07-21-230116': dataset does not exist
...
[500 more of these lines]
...
cannot open 'master@2022-05-17-010000': dataset does not exist
cannot open 'master@2022-05-16-200000': dataset does not exist
cannot open 'master@2022-05-16-190000': dataset does not exist
[2022-07-21 23:42:21.67076] [855779] [debug] cleaning up 543 source snapshots recursively under master/home
cannot destroy snapshot master/home@2022-06-12-150000: dataset is busy
cannot destroy snapshot master/home@2022-07-10-190000: dataset is busy
cannot destroy snapshot master/home@2022-05-29-010000: dataset is busy
...
[500 more of these lines]
...
cannot destroy snapshot master/home@2022-05-31-180000: dataset is busy
cannot destroy snapshot master/home@2022-07-02-220000: dataset is busy
cannot destroy snapshot master/home@2022-07-18-190000: dataset is busy
[2022-07-21 23:42:21.93575] [855779] [warn] ERROR: cannot destroy snapshot(s) master/home@2022-04-26-010000
[2022-07-21 23:42:21.93600] [855779] [debug] now will look if there is anything to clean in children of source master/home
[2022-07-21 23:42:21.93628] [855779] [info] done with backupset master/home in 21 seconds
[2022-07-21 23:42:21.93820] [855701] [debug] send/receive worker for master/home done (855779)
stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

DavHau commented 1 year ago

Still an issue