jimsalterjrs / sanoid

These are policy-driven snapshot management and replication tools which use OpenZFS for underlying next-gen storage. (Btrfs support plans are shelved unless and until btrfs becomes reliable.)
http://www.openoid.net/products/
GNU General Public License v3.0

Feature request: syncoid --health-backup #516

Open · darkpixel opened this issue 4 years ago

darkpixel commented 4 years ago

I would love to have an option on syncoid so I can verify backups are working properly. Something along the lines of:

syncoid --health-backup --hours --warn 24 --crit 48 tank/officeshare
OK tank/officeshare: Last backup 4 hours ago

or

syncoid --health-backup --hours --warn 24 --crit 48 tank/officeshare
WARN tank/officeshare: Last backup 28 hours ago

or

syncoid --health-backup --hours --warn 24 --crit 48 tank/officeshare
CRIT tank/officeshare: Last backup 52 hours ago
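For illustration, roughly what such a check amounts to if scripted externally today. This is a sketch only: the script name, dataset, thresholds, and Nagios-style exit codes below are assumptions, not existing syncoid behavior; it just asks ZFS for the age of the newest snapshot on the destination dataset.

    #!/bin/sh
    # check_backup_age.sh (hypothetical name) -- warn/crit on the age of the newest
    # snapshot of a destination dataset, in the spirit of the request above.
    DATASET="tank/officeshare"
    WARN_HOURS=24
    CRIT_HOURS=48

    # creation time (epoch seconds) of the newest snapshot directly under $DATASET
    newest=$(zfs list -H -p -t snapshot -o creation -d 1 -s creation "$DATASET" | tail -n 1)
    if [ -z "$newest" ]; then
        echo "CRIT $DATASET: no snapshots found"
        exit 2
    fi

    age_hours=$(( ( $(date +%s) - newest ) / 3600 ))

    if [ "$age_hours" -ge "$CRIT_HOURS" ]; then
        echo "CRIT $DATASET: Last backup $age_hours hours ago"; exit 2
    elif [ "$age_hours" -ge "$WARN_HOURS" ]; then
        echo "WARN $DATASET: Last backup $age_hours hours ago"; exit 1
    else
        echo "OK $DATASET: Last backup $age_hours hours ago"; exit 0
    fi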
redmop commented 4 years ago

If you are running sanoid, just use the sanoid --monitor-* switches on the destination.

darkpixel commented 4 years ago

That only tells me about local snapshots.

I want to know the last time syncoid successfully transferred a snapshot to the destination.

jimsalterjrs commented 4 years ago

Negative. Set up sanoid on your backup target with a policy similar to the default backup policy template (which prunes snapshots but never takes them) and use sanoid --monitor-snapshots.

If you haven't gotten a recent backup (within the guidelines defined in the monitoring settings of the policy) you'll get warn or crit output instead of ok output. Best part is, this also detects if you aren't getting snapshots created on the source for some reason--no new dailies or monthlies on the source means no new dailies or monthlies on the target either, even if replication ITSELF is still functional.
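Concretely, the target-side setup being described is roughly the following. A sketch only: the retention counts and warn/crit numbers are illustrative; check the sanoid.conf example that ships with the project for the real backup template.

[template_backup]
    # snapshots are replicated in by syncoid, never taken locally
    autosnap = no
    # but prune them locally as they age out
    autoprune = yes
    hourly = 30
    daily = 90
    monthly = 12
    yearly = 0
    # thresholds consumed by sanoid --monitor-snapshots (illustrative values)
    daily_warn = 48
    daily_crit = 60

[tank/backups]
    use_template = backup
    recursive = yes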

The --monitor-* arguments feed directly into Nagios automated monitoring if desired, producing the correct output to function with that platform (or descendants like Icinga).

This does assume you're running sanoid on the source as well as the target. By design, sanoid entirely ignores snapshots that don't match its naming conventions: a manually created snapshot will never be removed by sanoid, and it won't satisfy sanoid's policies on snapshot freshness either.

Syncoid itself will only ever remove snapshots matching its own naming policy, including the hostname local to the machine actually running the syncoid process. (This can create confusion if you have multiple hosts with identical hostnames, or if multiple hosts are running syncoid against the same remote source. In those cases it's usually advisable to use --no-sync-snap on all, or at least all but one, of the systems running syncoid against that source.)

darkpixel commented 4 years ago

If I'm running --no-sync-snap, I might get any one of a frequent, hourly, daily, weekly, monthly, or yearly snapshot. That makes the --monitor-snapshots command freak out that I'm missing tons of snapshots.

Even if I take snapshots, I'm still not including intermediate snapshots when I sync, because I'm also backing up Windows VMs and there's a lot of data churn during the day. Syncing from a snapshot one evening to a snapshot the following evening amounts to ~8 GB, but syncing every intermediate snapshot amounts to 30+ GB.
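For reference, the kind of invocation that produces that behavior looks roughly like this; the backup host and destination path are placeholders, and --no-stream is the syncoid switch that sends only the newest snapshot instead of every intermediate:

    syncoid --no-sync-snap --no-stream tank/officeshare root@backuphost:tank/backups/usbvesd/officeshare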

Additionally, it looks like the monitoring code (if I remove all but the daily_warn and daily_crit lines) doesn't pay attention to creation times, just names.

If I'm sending snapshots without using the --no-sync-snap option, I get a snapshot in the format syncoid_* which also doesn't match.

Lastly, since I'm currently backing up 20+ machines, I can no longer write one policy pointing to tank/backups. Each server dumps its backup data into tank/backups/<servername>. I would need to write 20+ policies pointing to tank/backups/<servername>.

Am I doing it wrong?

darkpixel commented 4 years ago

Oh, and in Nagios or Icinga, I'd rather have a per-host alert that tank/officeshare doesn't have a current backup versus having my backup machine display one huge alert covering every single remote server that should be backing up.

This is going to be nearly impossible to figure out when looking through the Nagios/Icinga dashboard--not to mention it won't come through via SMS, and Slack will probably block the alert because it's way too big:

[Screenshot from 2020-03-01 13-17-10]

jimsalterjrs commented 4 years ago

If I'm running --no-sync-snap, I might get any one of a frequent, hourly, daily, weekly, monthly, or yearly snapshot. That makes the --monitor-snapshots command freak out that I'm missing tons of snapshots.

I think you've got a conceptual error about replication here—or at least, replication using syncoid. Syncoid does replication -I, meaning you get the full snapshot tree. You'll get ALL new snapshots from the source. As long as the source has all the snapshots necessary to keep sanoid from freaking out, the target will too after the first replication. (If your target has a different policy than the source and wants deeper snapshot depth, it'll throw warnings until enough time has gone by for its archive depth to increase beyond what the source has, as the source prunes snapshots that the target does not.)
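For reference, the distinction in play, with hypothetical snapshot names (either stream would normally be piped into zfs receive on the target):

    # -I: full incremental stream -- every snapshot between A and B is sent, so the
    # target ends up with the same snapshot tree as the source
    zfs send -I tank/quickbooks@autosnap_A tank/quickbooks@autosnap_B

    # -i: single-step incremental -- only the delta between the two named snapshots,
    # no intermediate snapshots land on the target
    zfs send -i tank/quickbooks@autosnap_A tank/quickbooks@autosnap_B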

Fair point on the Nagios/Icinga warnings. Although it's worth noting that the OK/Warn/Crit part does come through, even if the text is truncated and you have to go figure out which dataset triggered it.

I'd be very open to a PR that allows individual dataset (or dataset tree) monitoring, e.g. sanoid --monitor-snapshots -r pool/vmimages/client1, sanoid --monitor-snapshots -r pool/vmimages/client2, or sanoid --monitor-snapshots pool/vmimages/client3/dc0, which would only check the snapshots for vmimages/client1 (recursively), vmimages/client2 (recursively), and vmimages/client3/dc0 (just that one dataset, and no others).

There would then be the operational challenge that you need to remember to create those individual monitoring items in your system, rather than just slapping a single --monitor-snapshots on the whole box and calling it a day, but that's between you and your procedures. :)
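For instance, generating those individual items could itself be scripted. A rough sketch only: the per-dataset --monitor-snapshots syntax is the proposal above, not something sanoid supports today, and the parent dataset path is a placeholder.

    #!/bin/sh
    # Enumerate the immediate children of a backup parent and run the proposed
    # per-dataset check against each one (hypothetical syntax, see above).
    PARENT="tank/backups"
    zfs list -H -o name -d 1 "$PARENT" | tail -n +2 | while read -r ds; do
        sanoid --monitor-snapshots -r "$ds"
    done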

jimsalterjrs commented 4 years ago

Lastly, since I'm currently backing up 20+ machines, I can no longer write one policy pointing to tank/backups. Each server dumps its backup data into tank/backups/<servername>. I would need to write 20+ policies pointing to tank/backups/<servername>.

I don't understand why you say you'd need separate policies for the separate clients. I just have a single policy for managing tons of clients on some of my own backup boxes. You can use process_children_only=yes to exclude an empty parent dataset that doesn't get new snapshots replicated in.

[data/backup/clients]
    use_template = backup
    recursive = yes
    process_children_only = yes

That's a stanza from one of my backup servers; each client to be backed up is under /data/backup/clientname, with several datasets (and in some cases several individual servers' datasets) under there. The one policy handles them all.

darkpixel commented 4 years ago

I think you've got a conceptual error about replication here—or at least, replication using syncoid. Syncoid does replication -I, meaning you get the full snapshot tree.

It doesn't appear that way: sh -c zfs send -w -i 'tank/quickbooks'@'autosnap_2020-03-01_19:00:01_frequently' 'tank/quickbooks'@'autosnap_2020-03-02_06:15:01_frequently' | pv -s 4096 | lzop | mbuffer -R 500k -q -s 128k -m 16M 2>/dev/null | ssh -p 224 -S /tmp/syncoid-root-root@<redacted>-1583130352 root@<redacted> ' mbuffer -q -s 128k -m 16M 2>/dev/null | lzop -dfc | zfs receive -s -F '"'"'tank/backups/usbvesd/quickbooks'"'"' 2>&1'

I'm using --no-sync-snap because I don't want 10-20 GB of garbage from all the intermediate snapshots--and I'd rather have 14 nightly snapshots per dataset on my backup servers versus thousands. The backup servers don't need that level of granularity for disaster recovery anyway.

I'd be very open to a PR that allows individual dataset (or dataset tree) monitoring, e.g. sanoid --monitor-snapshots -r pool/vmimages/client1, sanoid --monitor-snapshots -r pool/vmimages/client2, or sanoid --monitor-snapshots pool/vmimages/client3/dc0, which would only check the snapshots for vmimages/client1 (recursively), vmimages/client2 (recursively), and vmimages/client3/dc0 (just that one dataset, and no others).

Aack! I completely forgot everything I knew about Perl 17 years ago. I swore I'd never go back. ;)

There would then be the operational challenge that you need to remember to create those individual monitoring items in your system

I'm using Salt for configuration management everywhere--including creating the initial datasets on the server and a 'launch_syncoid_backup.sh' file to ensure they get backed up. It also generates my Icinga configs. That cut our deployment times down from ~25 days to ~8 hours.

You can use process_children_only=yes to exclude an empty parent dataset

I think I can't. I don't snapshot and back up tank on the source; I individually back up tank/virt, etc. On the target, they go into tank/backups/usbvesd/{officeshare,virt,etc}. That means usbvesd is a child of tank/backups, which the policy applies to. Maybe I could work around this by backing up to tank/backups/usbvesd-{officeshare,virt,etc}.
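For concreteness, the layout in question expressed against the stanza format above; the names are the ones from this thread, and whether the empty usbvesd level still trips monitoring is exactly the open question:

[tank/backups]
    use_template = backup
    recursive = yes
    process_children_only = yes

    # resulting datasets on the target:
    #   tank/backups                        (skipped by process_children_only)
    #   tank/backups/usbvesd                (empty intermediate -- a child, so still evaluated)
    #   tank/backups/usbvesd/officeshare    (replicated in by syncoid)
    #   tank/backups/usbvesd/virt           (replicated in by syncoid)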