oetiker / znapzend

zfs backup with remote capabilities and mbuffer integration.
www.znapzend.org
GNU General Public License v3.0

Replication-only plan bloated my ZFS, prevent accumulating SRC snapshots #425

Closed. Harvie closed this issue 3 years ago.

Harvie commented 5 years ago

I am using znapzend to sync 37 LXC containers to another server. My main goal is to have a secondary server with a data replica no older than 15 minutes. I have the following setup:

src_plan        = 1hour=>15minutes,3days=>1hour
dst_a_plan      = 1hour=>15minutes,3days=>1hour
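
For reference, this setup was created with znapzendzetup roughly as shown below; the pool/dataset name and the destination host are placeholders, not my real ones:

znapzendzetup create --recursive --tsformat='%Y-%m-%d-%H%M%S' \
  SRC '1hour=>15minutes,3days=>1hour' tank/lxc \
  DST:a '1hour=>15minutes,3days=>1hour' root@backup-host:backup/lxc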

That worked great for me and everything was syncing within a few minutes as expected, but here's the problem: the DST server went down for 22 days, so local snapshots started accumulating. This means the following:

37 datasets * 22 days * 24 hours * 4 snapshots per hour = 78144 SNAPSHOTS TOTAL

In fact I've just checked the real numbers and it's even more: 147673 little snapshots, mostly around 2-10 MiB each.
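
(For reference, I counted them with a plain recursive listing along these lines; the dataset name is a placeholder:)

zfs list -H -t snapshot -o name -r tank/lxc | wc -l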

At this point I don't even care about the snapshots, I just want to get back in sync as soon as possible. BUT ZFS send is slow with so many snapshots. My hardware is simply not up to the task of managing so many snapshots... This has also eaten almost all of the free space in the pool. If I don't manage to sync fast enough, the pool will fill up completely and I will have to delete all the snapshots. Everything is jammed up just because the backup server was down for some time... This is hell. What should I do?

I would be very happy if znapzend were able to drop all my SRC snapshots except for the ones that fit the criteria 1hour=>15minutes,3days=>1hour and, of course, the one snapshot of each dataset needed for an incremental send. That way the number of snapshots would be greatly reduced and I would be able to get back in sync (without having to do a full, non-incremental send).

Do you think it would be possible to add a znapzend setting which would allow expiring unsynced SRC snapshots if they are not needed for an incremental diff? I understand the consequences of losing rollback capability if things get out of sync, but I will have a Nagios check for that anyway... Obviously this should be implemented as an opt-in feature.
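
The manual equivalent of what I am asking for would be roughly the per-dataset cleanup sketched below. All names are hypothetical; the %-range of zfs destroy is inclusive on both ends, so it has to end one snapshot before the newest one already present on DST, and -n keeps it a dry run:

# @znapzend-OLDEST .. @znapzend-BEFORE-LAST-SYNCED are everything older than
# the newest snapshot DST already has (which must be kept for incremental send)
zfs destroy -nv tank/lxc/ct100@znapzend-OLDEST%znapzend-BEFORE-LAST-SYNCED
# drop -n to actually destroy once the dry-run output looks right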

Harvie commented 5 years ago

Also, it does not even make sense to locally store SRC snapshots that are older than the retention policy for DST, because they will get remotely destroyed right after replication anyway... The only exception is the oldest one needed for an incremental send.

Another related problem is that when a single one of the datasets can't be replicated, znapzend suspends purging of old snapshots for ALL datasets, which amplifies the problem of snapshot bloat.

DSpeichert commented 5 years ago

The only exception is the oldest one needed for an incremental send.

Or a bookmark. https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSBookmarksWhatFor
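
A rough sketch of that idea (dataset and snapshot names are hypothetical): a bookmark keeps just enough metadata to serve as the source of an incremental send, so the already-sent snapshot itself can be destroyed:

# turn the last-synced snapshot into a bookmark, then free the snapshot itself
zfs bookmark tank/lxc/ct100@znapzend-LAST-SYNCED tank/lxc/ct100#znapzend-LAST-SYNCED
zfs destroy tank/lxc/ct100@znapzend-LAST-SYNCED
# later, the bookmark can still seed an incremental send to the destination
zfs send -i '#znapzend-LAST-SYNCED' tank/lxc/ct100@znapzend-NEW | ssh backup-host zfs recv backup/lxc/ct100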

jimklimov commented 4 years ago

Another related problem is that when a single one of the datasets can't be replicated, znapzend suspends purging of old snapshots for ALL datasets, which amplifies the problem of snapshot bloat.

And for that we have a special use case in the sprawl hell: a full source dataset, so snapshots cannot be created (out of quota), so nothing can be sent, so nothing can be cleaned...

Harvie commented 4 years ago

Any news on this one? It's not funny; it can starve the whole system of disk space...

lheckemann commented 4 years ago

#384 implements this improvement, as far as I understand.

jimklimov commented 4 years ago

Also, it does not even make sense to locally store SRC snapshots that are older than the retention policy for DST, because they will get remotely destroyed right after replication anyway.

I get the idea behind this complaint and share it in some cases, but it is not a universally true statement.

For one, you might have enough local storage, and enough data churn, that on one hand you want to roll back and/or pick data from snapshots reaching further (and/or in more detail) into the past, while also keeping remote destinations to recover from catastrophic events. Recovery from local snapshots is usually faster, and local snapshots may be easier to compare against, depending on your data use cases and the workflows involved...

For another, you might be replacing destinations and/or planning an additional one, e.g. stuck with a small temporary USB drive as a znapzend PoC but expecting a NAS to arrive on your LAN any month now, so there is no reason to automatically chop off old source snapshots (if they do not impair performance) just because they are old.

So those are a few ideas from recent practice :) That said, a (non-default) option to cut source history based on the longest destination policy may make sense for some users. Just hope you don't have an interim typo asking znapzendzetup to do its magic with a wrongly short policy, nuking your source rollbacks in the process ;p

Harvie commented 4 years ago

I get the idea behind this complaint and share it in some cases, but it is not a universally true statement.

You can set a longer local retention period individually. In my case these snapshots were overdue both locally and remotely.

In my case I had the retention period set longer remotely than locally.

They were locally expired, but still kept because they had not been transferred to the destination yet. However, they would be deleted on the destination right after the transfer anyway, because they were already expired remotely as well (even before the transfer was able to happen).

jimklimov commented 4 years ago

I think #384 is not fully a solution to this problem (e.g. it won't help avoid sending snapshots that will be deleted on arrival, though the skipIntermediates feature can help in this regard), but it can help against the bloat problem by allowing znapzend to delete automatic snapshots that have expired and whose loss is not fatal for resuming replication when the destination pool comes back.

Since today (with #506 merged), that logic should be part of the master branch... beware :)

Harvie commented 4 years ago

To clarify things: I am not really concerned about sending expired snapshots only for them to be deleted. That would still be reasonable, as it will only occur after long downtime of the destination server, which should not happen very often (unless you use it for opportunistic backup to a home NAS only when you are on your home network).

My problem was that during such downtime of the ("non-critical") destination server, the expired snapshots accumulated on the (critical) source server and caused the filesystem to run out of space, which caused painful downtime of the source server.

The low-availability server brought down the "high"-availability server. No bueno.

jimklimov commented 4 years ago

Yeah, I felt that pain too, more than once :)

Harvie commented 3 years ago

Can we expect this to get fixed please?

jimklimov commented 3 years ago

I believe that with #384 and later PRs, znapzend should now delete local source snapshots that are old enough by policy, except the newest one known to have been sent to a destination, so that replication can resume from it and the source does not overflow. I think this mode kicks in with the cleanOffline option. When you repair a destination and want to skip lots of history, you can run znapzend --runonce=... --feature=skipIntermediates to send just one big incremental change from long ago to now.
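
Roughly, that recovery sequence looks like the sketch below; the dataset name is a placeholder, and the exact option spellings (e.g. --features vs --feature) should be checked against znapzend --help for your version:

# run the daemon so expired source snapshots get pruned even while a destination is down
znapzend --daemonize --cleanOffline
# once the destination is repaired, catch up with one big incremental
# instead of replaying every intermediate snapshot
znapzend --runonce=tank/lxc --features=skipIntermediates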

Are these solutions enough? (They were for me, keeping a system with a flaky NAS afloat, hands-off, for much longer.)

Harvie commented 3 years ago

Sounds interesting... Unfortunately it did not make it into v0.20.0, so we should probably wait for the release of v0.21.0, as I don't want to run master on production servers...

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Harvie commented 3 years ago

bump

lheckemann commented 3 years ago

Fixed by #506 I think?

Harvie commented 3 years ago

Cool, I will try v0.21.0 :-)

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Harvie commented 9 months ago

I am running znapzend 0.21.1 with "--cleanOffline" and it does not seem to delete the local snapshots as expected when the destination is offline.
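
The easy way to see it is listing the oldest remaining snapshots by creation time, e.g. (dataset name is a placeholder):

zfs list -t snapshot -o name,creation -s creation -r tank/lxc | head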