jimsalterjrs / sanoid

These are policy-driven snapshot management and replication tools which use OpenZFS for underlying next-gen storage. (Btrfs support plans are shelved unless and until btrfs becomes reliable.)
http://www.openoid.net/products/
GNU General Public License v3.0

Issue with ZFS Replication Using Syncoid #937

Open Mikesco3 opened 1 month ago

Mikesco3 commented 1 month ago

Description

I am experiencing recurring issues with ZFS replication using syncoid.

I have scheduled a script to run every two hours to synchronize datasets between an SSD and a pool of hard drives and to another server.

The script often fails during the zfs send / zfs receive step, with errors like those shown under Error Sample below.

My Script:

Here is the script that I have scheduled:

#!/usr/bin/bash

## tfh-fs00 Server to pve2
/usr/sbin/syncoid --force-delete --identifier=pve2 fast200/_VMs/vm-111-disk-0 pve2:tank100/vm-10111-disk-0 && \
/usr/sbin/syncoid --force-delete --identifier=pve2 fast200/_VMs/vm-111-disk-1 pve2:tank100/vm-10111-disk-1 && \
/usr/sbin/syncoid --force-delete --identifier=pve2 fast200/_VMs/vm-111-disk-2 pve2:tank100/vm-10111-disk-2

## tfh-fs00 Server from SSD fast200 to HDs on rpool
/usr/sbin/syncoid --force-delete --identifier=rpool fast200/_VMs/vm-111-disk-0 rpool/_VMs/vm-10111-disk-0 && \
/usr/sbin/syncoid --force-delete --identifier=rpool fast200/_VMs/vm-111-disk-1 rpool/_VMs/vm-10111-disk-1 && \
/usr/sbin/syncoid --force-delete --identifier=rpool fast200/_VMs/vm-111-disk-2 rpool/_VMs/vm-10111-disk-2 

Schedule

It runs every two hours via cron:

17 */2 * * * (/root/mirrorVMs_to-PVE2-Shadows.sh) > /dev/null
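Since that cron entry discards stdout, a variant I may use while debugging is to append everything (stdout and stderr) to a log file so the full mbuffer/zfs messages from each run are kept. This is just a sketch; /var/log/mirrorVMs.log is an arbitrary path:

# hypothetical debugging variant: keep all output from each run
17 */2 * * * /root/mirrorVMs_to-PVE2-Shadows.sh >> /var/log/mirrorVMs.log 2>&1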

Error Sample

mbuffer: error: outputThread: error writing to <stdout> at offset 0x40000: Broken pipe
mbuffer: warning: error during output to <stdout>: Broken pipe
CRITICAL ERROR:  zfs send  -I 'fast200/_VMs/vm-111-disk-2'@'syncoid_rpool_tfh-pve1_2024-07-11:04:23:02-GMT-05:00' 'fast200/_VMs/vm-111-disk-2'@'syncoid_rpool_tfh-pve1_2024-07-12:20:20:47-GMT-05:00' | mbuffer  -q -s 128k -m 16M | pv -p -t -e -r -b -s 950304232 |  zfs receive  -s -F 'rpool/_VMs/vm-10111-disk-2' 2>&1 failed: 256

mbuffer: error: outputThread: error writing to <stdout> at offset 0x20000: Broken pipe
mbuffer: warning: error during output to <stdout>: Broken pipe
CRITICAL ERROR:  zfs send  -I 'fast200/_VMs/vm-111-disk-1'@'syncoid_rpool_tfh-pve1_2024-07-12:20:17:45-GMT-05:00' 'fast200/_VMs/vm-111-disk-1'@'syncoid_rpool_tfh-pve1_2024-07-12:22:17:22-GMT-05:00' | mbuffer  -q -s 128k -m 16M | pv -p -t -e -r -b -s 117088184 |  zfs receive  -s -F 'rpool/_VMs/vm-10111-disk-1' 2>&1 failed: 256

CRITICAL ERROR:  zfs send  -I 'fast200/_VMs/vm-111-disk-0'@'syncoid_rpool_tfh-pve1_2024-07-11:06:17:17-GMT-05:00' 'fast200/_VMs/vm-111-disk-0'@'syncoid_rpool_tfh-pve1_2024-07-12:18:17:27-GMT-05:00' | mbuffer  -q -s 128k -m 16M | pv -p -t -e -r -b -s 34944 |  zfs receive  -s -F 'rpool/_VMs/vm-10111-disk-0' 2>&1 failed: 256

Reproduction

When I run the syncoid lines manually, some go through fine, and then one will throw this error:

/usr/sbin/syncoid --force-delete --identifier=rpool fast200/_VMs/vm-111-disk-1 rpool/_VMs/vm-10111-disk-1
INFO: Sending incremental fast200/_VMs/vm-111-disk-1@syncoid_rpool_tfh-pve1_2024-07-12:20:17:45-GMT-05:00 ... syncoid_rpool_tfh-pve1_2024-07-12:23:17:37-GMT-05:00 to rpool/_VMs/vm-10111-disk-1 (~ 1.8 GB):
cannot restore to rpool/_VMs/vm-10111-disk-1@autosnap_2024-07-13_02:00:18_hourly: destination already exists
64.0KiB 0:00:00 [ 423KiB/s] [>                                                                            ]  0%            
mbuffer: error: outputThread: error writing to <stdout> at offset 0x30000: Broken pipe
mbuffer: warning: error during output to <stdout>: Broken pipe
CRITICAL ERROR:  zfs send  -I 'fast200/_VMs/vm-111-disk-1'@'syncoid_rpool_tfh-pve1_2024-07-12:20:17:45-GMT-05:00' 'fast200/_VMs/vm-111-disk-1'@'syncoid_rpool_tfh-pve1_2024-07-12:23:17:37-GMT-05:00' | mbuffer  -q -s 128k -m 16M | pv -p -t -e -r -b -s 1968025760 |  zfs receive  -s -F 'rpool/_VMs/vm-10111-disk-1' 2>&1 failed: 256
Use of uninitialized value $existing in string eq at /usr/sbin/syncoid line 750.
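The "destination already exists" line suggests the receive side already has a snapshot with the same name as one of the intermediates being sent (with -I, every snapshot between the two endpoints is included), presumably because sanoid is also taking its own autosnaps on the rpool replica datasets. A rough way to confirm that, and to clear it if the destination-side snapshot is disposable, would be something like:

# compare snapshot lists on source and destination, oldest first
zfs list -t snapshot -o name -s creation fast200/_VMs/vm-111-disk-1
zfs list -t snapshot -o name -s creation rpool/_VMs/vm-10111-disk-1

# if an autosnap exists on the destination only (created locally, not replicated in)
# and it is safe to discard, remove it and re-run syncoid
# (the snapshot name below is just the one from the error above)
zfs destroy rpool/_VMs/vm-10111-disk-1@autosnap_2024-07-13_02:00:18_hourly

If that is what is happening, it is likely the same reason the stock template_backup below sets autosnap = no for replication targets; template_shadows does not, so sanoid keeps creating local autosnaps on the replica datasets that can collide with the ones replicated in.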

Troubleshooting I've attempted:

Additionally, here is the portion of my sanoid.conf that is relevant to this:


[fast200/_VMs]
    use_template = production
    recursive = yes

[rpool]
    use_template = production
    recursive = yes

[rpool/_VMs]
    use_template = production
    recursive = yes

[fast200/_VMs/vm-112-disk-0]
    use_template = ignore

[fast200/_VMs/vm-112-disk-1]
    use_template = ignore

[rpool/_VMs/vm-112-disk-0]
    use_template = ignore

## This is for the replica VM of tfh-fs00
[rpool/_Shadows]
    use_template = shadows

[rpool/_VMs/vm-10111-disk-0]
    use_template = shadows

[rpool/_VMs/vm-10111-disk-1]
    use_template = shadows

[rpool/_VMs/vm-10111-disk-2]
    use_template = shadows

#############################
# templates below this line #
#############################

[template_production]
    frequently = 0
    hourly = 36
    daily = 8
    monthly = 1
    yearly = 0
    autosnap = yes
    autoprune = yes

[template_backup]
    autoprune = yes
    frequently = 0
    hourly = 0
    daily = 31
    monthly = 6
    yearly = 0

    ### don't take new snapshots - snapshots on backup
    ### datasets are replicated in from source, not
    ### generated locally
    autosnap = no

    ### monitor hourlies and dailies, but don't warn or
    ### crit until they're over 48h old, since replication
    ### is typically daily only
    hourly_warn = 2880
    hourly_crit = 3600
    daily_warn = 48
    daily_crit = 60

[template_shadows]
    autoprune = yes
    frequently = 0
#   hourly = 0
    daily = 31
    monthly = 6
    yearly = 0

[template_hotspare]
    autoprune = yes
    frequently = 0
    hourly = 30
    daily = 9
    monthly = 0
    yearly = 0

    ### don't take new snapshots - snapshots on backup
    ### datasets are replicated in from source, not
    ### generated locally
    autosnap = no

    ### monitor hourlies and dailies, but don't warn or
    ### crit until they're over 4h old, since replication
    ### is typically hourly only
    hourly_warn = 4h
    hourly_crit = 6h
    daily_warn = 2d
    daily_crit = 4d

[template_scripts]
    ### information about the snapshot will be supplied as environment variables,
    ### see the README.md file for details about what is passed when.
    ### run script before snapshot
    pre_snapshot_script = /path/to/script.sh
    ### run script after snapshot
    post_snapshot_script = /path/to/script.sh
    ### run script after pruning snapshot
    pruning_script = /path/to/script.sh
    ### don't take an inconsistent snapshot (skip if pre script fails)
    #no_inconsistent_snapshot = yes
    ### run post_snapshot_script when pre_snapshot_script is failing
    #force_post_snapshot_script = yes
    ### limit allowed execution time of scripts before continuing (<= 0: infinite)
    script_timeout = 5

[template_ignore]
    autoprune = no
    autosnap = no
    monitor = no

I'm practically pulling my hair out; I don't have this issue on any of my other Proxmox servers...
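For what it's worth, a way to sanity-check whether sanoid is actually applying these templates, without taking or pruning anything, is a simulated run (just a sketch using sanoid's own flags):

# simulate: show what sanoid would prune under the current config, with no changes made
sanoid --prune-snapshots --verbose --readonly
# Nagios-style view of current snapshot coverage per dataset
sanoid --monitor-snapshots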

Mikesco3 commented 1 month ago

Update:

I copied over the executables from version 2.1.0 and my problem disappeared....

Mikesco3 commented 1 month ago

I think I may have found the issue... I must have forgotten to enable sanoid.timer. I ran systemctl enable --now sanoid.timer, and it turns out sanoid wasn't pruning the old snapshots. I see it pruning a bunch of old snapshots now, so I'm cautiously optimistic...
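A quick way to confirm the timer really is enabled and firing, and to watch what sanoid does on each run, is something like the following (assuming the standard sanoid.timer / sanoid.service unit names):

# confirm the timer is enabled and see its last/next run times
systemctl is-enabled sanoid.timer
systemctl list-timers sanoid.timer
# show what the sanoid service did on recent runs (pruning activity shows up here)
journalctl -u sanoid.service --since "2 hours ago"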