jimsalterjrs / sanoid

These are policy-driven snapshot management and replication tools which use OpenZFS for underlying next-gen storage. (Btrfs support plans are shelved unless and until btrfs becomes reliable.)
http://www.openoid.net/products/
GNU General Public License v3.0

How to solve keeping multiple snapshots on the target and not accidentally deleting the last 'common' snapshot? #533

Closed · darkpixel closed this issue 3 years ago

darkpixel commented 4 years ago

I have 24 boxes that all use sanoid for snapshots and syncoid to back snapshots up to a set of remote boxes.

I goofed up a few days ago. All the source boxes take yearly, monthly, weekly, daily, hourly, and frequent (15-minute) snapshots. I run with the --no-stream option because I don't need every single intermediate snapshot backed up off-site, just a once-per-day off-site backup. Sending all the intermediate snapshots from a 24-hour window amounts to several hundred gigs due to temporary data churn throughout the day, whereas a diff between two snapshots taken 24 hours apart is usually under 5 GB.
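
For context, the source-side policy looks roughly like the sanoid.conf sketch below; the dataset name and every count except frequently are illustrative placeholders, not my real config:

# /etc/sanoid/sanoid.conf on a source box (sketch; names and most counts are illustrative)
[tank/officeshare]
        use_template = production
        recursive = yes

[template_production]
        frequently = 96     # 24 hours at 15-minute intervals
        hourly = 36
        daily = 30
        weekly = 4
        monthly = 12
        yearly = 1
        autosnap = yes
        autoprune = yes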

Running syncoid without --no-sync-snap creates a snapshot and sends it off-site. But it also deletes the previous snapshot for some reason. This means I only get the most current backup off-site. I'd like to have several days' worth of off-site backups to roll back to.

So I figured I would use --no-sync-snap. This mostly accomplishes what I want--it finds the latest snapshot and sends it off-site.

Unfortunately, what I failed to account for is that it's usually syncing a 'frequent' snapshot.

I had 'frequent' snapshots set to keep 96 of them hanging around. 96 snapshots / 24 hours = 4/hr (every 15 minutes).

I'm sure you can see where this is going... ;)

Due to some maintenance, backups were unable to run on time a few days ago. The source boxes dutifully kept removing frequent snapshots... and now there is no common snapshot between the source and target boxes.

Oops. I immediately increased the 'frequent' snapshots to keep 4 days' worth, and I'm planning to re-sync about 40 TB of data across some really slow Comcast connections.
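
The arithmetic for the bump is simple: 4 snapshots/hour × 24 hours × 4 days = 384, so the template change is just this (sketch, same illustrative template name as above):

# keep roughly 4 days of 15-minute snapshots on the sources
[template_production]
        frequently = 384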

Anyway, I'm trying to find a better solution to this, but reading through the docs hasn't turned up anything fruitful.

Ideally, I would love to NOT use --no-sync-snap, have syncoid skip deleting the previously synced snapshot, and manage the cleanup of old off-site backups manually (or with a tool). Or maybe an option like --keep 5 or something.

Or, with --no-sync-snap, tell syncoid to consider only snapshots with 'daily' in the name? (I've seen a few tickets requesting that feature.)

Any pointers on how I can get data retention on the target while also not using the 'more ephemeral' snapshots?
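
For illustration, what I'm after is roughly the manual equivalent below: an incremental raw send from the last daily that exists on both sides to the newest daily. The snapshot names here are made up; --raw matches my --sendoptions.

# manual daily-only incremental raw send (sketch; snapshot names are hypothetical)
zfs send -w -i tank/officeshare@autosnap_2020-04-16_00:00:01_daily \
    tank/officeshare@autosnap_2020-04-17_00:00:01_daily | \
    ssh root@off-site-backup-host zfs receive tank/backups/officeshare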

phreaker0 commented 4 years ago

@darkpixel syncoid doesn't delete old snapshots. In your case you aren't sending any intermediate snapshots (--no-stream), so they never existed on the remote to begin with. You should drop the --no-stream option so those intermediate snapshots are also sent, and let sanoid on the remote clean them up however you like. If #153 gets implemented some time in the future, one could also skip some intermediate snapshots.
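
On the remote that boils down to a prune-only sanoid configuration, roughly like the sketch below; the dataset path and retention counts are placeholders:

# sanoid.conf on the backup target (sketch): prune received snapshots, never create new ones
[tank/backups/officeshare]
        use_template = backup
        recursive = yes

[template_backup]
        autosnap = no
        autoprune = yes
        frequently = 0
        hourly = 0
        daily = 30
        monthly = 3
        yearly = 0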

darkpixel commented 4 years ago

I don't want intermediate snapshots. Sending intermediate snapshots means transferring several hundred gigs of unneeded data every day instead of the ~5 GB difference between daily snapshots. There's a lot of temporary data written during work hours, e.g. a 75 MB image gets written to the drive and then gets 'processed' and deleted 30 minutes later.

When I run syncoid --recursive --no-stream --sendoptions="--raw" tank/officeshare root@off-site-backup-host:tank/backups/officeshare, it creates a snapshot @syncoid_<hostname>_d-a-t-e-t:i:m:e (ugh--colons in a snapshot name). This gets sent to the remote box. When I run the command again, it creates a new snapshot, sends it to the remote box, and then the first syncoid snapshot I sent gets removed. Is there a config option I'm missing?

darkpixel commented 4 years ago

I mean...I'm perfectly fine if the remote box ends up with @syncoid... snapshots piling up, but that doesn't appear to happen.

darkpixel commented 4 years ago

Just tested:

# Source host
syncoid --recursive --no-stream --sendoptions="--raw" tank/virt root@off-site-backup-host:tank/backups/uslogsd/virt
# Target host
root@uswuxsdbkp01:~# zfs list -rt snapshot -o name -s name tank/backups/uslogsd/virt/vm-100-disk-1
NAME
tank/backups/uslogsd/virt/vm-100-disk-1@autosnap_2020-04-14_05:45:01_frequently
tank/backups/uslogsd/virt/vm-100-disk-1@autosnap_2020-04-15_01:45:01_frequently
tank/backups/uslogsd/virt/vm-100-disk-1@autosnap_2020-04-16_02:00:02_frequently
tank/backups/uslogsd/virt/vm-100-disk-1@autosnap_2020-04-16_18:30:01_frequently
tank/backups/uslogsd/virt/vm-100-disk-1@autosnap_2020-04-17_02:45:01_frequently
tank/backups/uslogsd/virt/vm-100-disk-1@autosnap_2020-04-17_23:15:01_frequently
tank/backups/uslogsd/virt/vm-100-disk-1@autosnap_2020-04-21_03:00:01_frequently
tank/backups/uslogsd/virt/vm-100-disk-1@syncoid_uslogsdnas01_2020-04-21:10:25:57

Ran the same sync command again on the source host

# Target host
root@uswuxsdbkp01:~# zfs list -rt snapshot -o name -s name tank/backups/uslogsd/virt/vm-100-disk-1
NAME
tank/backups/uslogsd/virt/vm-100-disk-1@autosnap_2020-04-14_05:45:01_frequently
tank/backups/uslogsd/virt/vm-100-disk-1@autosnap_2020-04-15_01:45:01_frequently
tank/backups/uslogsd/virt/vm-100-disk-1@autosnap_2020-04-16_02:00:02_frequently
tank/backups/uslogsd/virt/vm-100-disk-1@autosnap_2020-04-16_18:30:01_frequently
tank/backups/uslogsd/virt/vm-100-disk-1@autosnap_2020-04-17_02:45:01_frequently
tank/backups/uslogsd/virt/vm-100-disk-1@autosnap_2020-04-17_23:15:01_frequently
tank/backups/uslogsd/virt/vm-100-disk-1@autosnap_2020-04-21_03:00:01_frequently
tank/backups/uslogsd/virt/vm-100-disk-1@syncoid_uslogsdnas01_2020-04-21:11:21:22

Looks like the first snapshot @syncoid_uslogsdnas01_2020-04-21:10:25:57 was deleted after @syncoid_uslogsdnas01_2020-04-21:11:21:22 was transferred successfully.
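
For anyone checking the same thing: whether a snapshot is actually a usable common base is easy to verify by comparing its GUID on both sides. The source-side path below is my guess based on the listing above.

# sketch: the GUIDs must match for the snapshot to be a valid incremental base
zfs get -H -o value guid tank/virt/vm-100-disk-1@syncoid_uslogsdnas01_2020-04-21:11:21:22
ssh root@off-site-backup-host zfs get -H -o value guid \
    tank/backups/uslogsd/virt/vm-100-disk-1@syncoid_uslogsdnas01_2020-04-21:11:21:22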

phreaker0 commented 4 years ago

Ah, now I know what you mean. The syncoid snapshots are designed to ensure that replication is always possible, and after a successful replication only the last one is kept. There is no switch to keep them.

darkpixel commented 4 years ago

I guess the simplest path forward would be to add some sort of --no-delete option to syncoid (similar to --force-delete). But I'm not sure if that's something that everyone would want included in syncoid.

darkpixel commented 4 years ago

As a temporary workaround, I am launching syncoid twice: once with --no-sync-snap and a second time without it. This gives me the 'backup history' I want, but also ensures there's a sync snapshot that won't get auto-deleted.
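
In script form the workaround is just two passes, roughly (same paths as my earlier example):

# nightly off-site backup (sketch of the two-pass workaround)
# pass 1: send the newest existing sanoid snapshot so backup history accumulates on the target
syncoid --recursive --no-stream --no-sync-snap --sendoptions="--raw" \
    tank/officeshare root@off-site-backup-host:tank/backups/officeshare
# pass 2: let syncoid create its own sync snapshot as a stable common base for next time
syncoid --recursive --no-stream --sendoptions="--raw" \
    tank/officeshare root@off-site-backup-host:tank/backups/officeshare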

delxg commented 4 years ago

Would the combination of --create-bookmark --no-sync-snap do the trick?

phreaker0 commented 4 years ago

@delxg yes, this should work too.
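
Something like this (paths as in the earlier examples):

# sketch: no sync snapshots, but leave a bookmark of the newest sent snapshot on the source
syncoid --recursive --no-stream --no-sync-snap --create-bookmark \
    --sendoptions="--raw" tank/officeshare \
    root@off-site-backup-host:tank/backups/officeshare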

darkpixel commented 4 years ago

Does anyone have a good pointer to an explanation of bookmarks? I've Googled and read a bunch of stuff, but I'm still missing something...

I'll play around with it on a test server and see if that works.

phreaker0 commented 4 years ago

@darkpixel https://www.reddit.com/r/zfs/comments/5op68q/can_anyone_here_explain_zfs_bookmarks/

darkpixel commented 4 years ago

@phreaker0 Yeah--I found that one while Googling. I'm still missing something. But I just ran a few tests and it works, so I'll chalk it up to 'magic' until I get more time to sit down and study it.

I'm not sure if you want this closed, or if you want to keep it open. Having a history of snapshots is useful, but then again it looks like the combo of --create-bookmark and --no-sync-snap does what I want in a round-about way. I am a little concerned about restores, though, since it appears you can send from a bookmark but not receive from a bookmark. I'll have to do a bit more testing.
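
From my test runs, the mechanism seems to boil down to roughly this (made-up dataset and snapshot names):

# sketch of the bookmark mechanism (hypothetical names)
zfs snapshot tank/test@snap1
zfs bookmark tank/test@snap1 tank/test#snap1
zfs destroy tank/test@snap1      # the bookmark survives the snapshot
zfs snapshot tank/test@snap2
# a bookmark can only act as the *source* side of an incremental send;
# the receiving end still needs a real snapshot to increment from
zfs send -i tank/test#snap1 tank/test@snap2 | \
    ssh root@off-site-backup-host zfs receive tank/backups/test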

phreaker0 commented 4 years ago

@darkpixel I think you missed the linked PR; I made an option to keep the sync snapshots :-) This issue will auto-close if the PR gets merged.

darkpixel commented 4 years ago

Oops. Yeah, I missed it. I'll merge it into my local copy and test.

darkpixel commented 4 years ago

I found a pretty good explanation of bookmarks that filled in some of the 'holes' I was missing from the reddit thread: https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSBookmarksMechanism

Tested the patch on one production server and it worked perfectly. I will be pushing it to a few production servers tomorrow and if there are no issues it'll be running on ~40 servers by Friday evening.
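
For anyone finding this later: with the patch applied, the nightly run is just the normal invocation plus the new keep flag. I'm showing it as --keep-sync-snap here, which is the name the option appears to have ended up with in later syncoid releases, but check the linked PR for the exact flag:

# sketch: normal replication, but previously created sync snapshots are no longer pruned
syncoid --recursive --no-stream --keep-sync-snap --sendoptions="--raw" \
    tank/officeshare root@off-site-backup-host:tank/backups/officeshare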

SinisterCrayon commented 3 years ago

Just a quick note on this old thread: I solved this issue by using Sanoid on the destination host with a simple configuration that preserves 30 days of dailies plus 3 monthlies. My backup script just fires off "sanoid --cron" on the remote host before the backup runs, which prunes old backups and creates fresh ones.

I don't do it via cron because my backup host is designed to be turned off most of the time; my backup job sends a WOL packet via a jump host (a Raspberry Pi on the same network as the backup target), then the script runs Sanoid, backs up, and then shuts the system down (if a scrub isn't running).

And yes: I trigger a scrub via a cron job on the main array, so that when it runs its own scrub it also turns on the backup destination and fires off a scrub there too, creating a tmp file that my script looks for to identify that it's in "scrub mode". I deliberately don't monitor for the scrub finishing, since that gets me to remote into the host after the scrub has run, check for corruption and so forth, and resolve anything if necessary.
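
Roughly, the backup script looks like this; the hostnames, MAC address, flag file, and dataset paths are all placeholders, and the real script has more error handling:

#!/bin/sh
# sketch of the flow described above
ssh pi@jump-host wakeonlan AA:BB:CC:DD:EE:FF        # wake the backup target via the Raspberry Pi
sleep 120                                           # give it time to boot and import the pool
ssh root@backup-target sanoid --cron                # prune/create snapshots per its own sanoid.conf
syncoid --recursive tank root@backup-target:backup/tank
# only power the target back down if it isn't in "scrub mode"
ssh root@backup-target '[ -f /tmp/scrub-running ] || poweroff'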

This all works together pretty well and means I maintain old snapshots that are different from the snapshots on my main array (which has frequents as well as dailies, weeklies, and monthlies).