jimsalterjrs / sanoid

These are policy-driven snapshot management and replication tools which use OpenZFS for underlying next-gen storage. (Btrfs support plans are shelved unless and until btrfs becomes reliable.)
http://www.openoid.net/products/
GNU General Public License v3.0

Incremental syncs fail unless I modify syncoid and remove `2>&1` from the sync command #940

Closed: wizeguyy closed this issue 4 months ago

wizeguyy commented 4 months ago

I am trying to set up syncoid to replicate from my host at home to a remote host at rsync.net (using their ZFS service). Both hosts are working fine: I can use ZFS send/receive manually, and everything behaves as expected.

When I use syncoid, though, only the initial sync succeeds. Subsequent incremental syncs fail. I have some loose theories, but first, here is the problem:

The problem (play by play):

Here is the command I run at first (running as user with sudo):

$ syncoid --sshkey=/home/user/.ssh/id_ed25519 -r --sendoptions="wp" tank/test_dataset root@myuser.rsync.net:data1/hawkeye/test_dataset

The first time I run this, the command succeeds, and I can see the dataset and any snapshots available on the remote. The second time I run this command, however, the command fails with the following error, and the new snapshots are not sent:

Sending incremental tank/test_dataset@syncoid_hawkeye_2024-07-24:20:52:38-GMT00:00 ... syncoid_hawkeye_2024-07-25:14:58:24-GMT00:00 (~ 14 KB):
41.8KiB 0:00:00 [ 103KiB/s] [============================================================================================================================================================================================] 298%            
CRITICAL ERROR: sudo zfs send -w -p  -I 'tank/test_dataset'@'syncoid_hawkeye_2024-07-24:20:52:38-GMT00:00' 'tank/test_dataset'@'syncoid_hawkeye_2024-07-25:14:58:24-GMT00:00' | pv -p -t -e -r -b -s 14352 | lzop  | mbuffer  -q -s 128k -m 16M | ssh     -i /home/user/.ssh/id_ed25519 -S /tmp/syncoid-root@myuser.rsync.net-1721919491-3389 root@myuser.rsync.net ' mbuffer  -q -s 128k -m 16M | lzop -dfc |  zfs receive  -s -F '"'"'data1/hawkeye/test_dataset'"'"' 2>&1' failed: 512 at /usr/sbin/syncoid line 889.
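A note on reading `failed: 512`: syncoid is a Perl script, and assuming the number it prints is Perl's raw `$?` wait status (an assumption, not confirmed in this thread), the exit code of the failed command sits in the high byte. The sketch below decodes such a status; the function name `decode_wait_status` is made up for illustration.

```shell
#!/bin/sh
# Decode a raw wait status (the form Perl's $? uses) into its parts.
# Assumption: the "512" in syncoid's error message is such a raw status.
decode_wait_status() {
    status=$1
    exit_code=$(( status >> 8 ))   # high byte: the command's exit code
    signal=$(( status & 127 ))     # low 7 bits: terminating signal, 0 if none
    echo "exit_code=$exit_code signal=$signal"
}

decode_wait_status 512
```

Under that assumption, 512 corresponds to exit code 2 with no signal. That would point at a command in the pipeline exiting non-zero rather than at a dropped SSH connection, for which `ssh` itself returns 255.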

To diagnose this issue, I tried pasting the failed command directly:

sudo zfs send -w -p  -I 'tank/test_dataset'@'syncoid_hawkeye_2024-07-24:20:52:38-GMT00:00' 'tank/test_dataset'@'syncoid_hawkeye_2024-07-25:14:58:24-GMT00:00' | pv -p -t -e -r -b -s 14352 | lzop  | mbuffer  -q -s 128k -m 16M | ssh     -i /home/user/.ssh/id_ed25519 -S /tmp/syncoid-root@myuser.rsync.net-1721919491-3389 root@myuser.rsync.net ' mbuffer  -q -s 128k -m 16M | lzop -dfc |  zfs receive  -s -F '"'"'data1/hawkeye/test_dataset'"'"' 2>&1'

This time, the command completed without printing any error, but when I checked on the remote, the snapshots still had not been received. I started whittling down the command, in case an error was getting swallowed.

Oddly, I managed to get the command to succeed just by removing the error output redirection (same command, but with `2>&1` removed from the end):

sudo zfs send -w -p  -I 'tank/test_dataset'@'syncoid_hawkeye_2024-07-24:20:52:38-GMT00:00' 'tank/test_dataset'@'syncoid_hawkeye_2024-07-25:14:58:24-GMT00:00' | pv -p -t -e -r -b -s 14352 | lzop  | mbuffer  -q -s 128k -m 16M | ssh     -i /home/user/.ssh/id_ed25519 -S /tmp/syncoid-root@myuser.rsync.net-1721919491-3389 root@myuser.rsync.net ' mbuffer  -q -s 128k -m 16M | lzop -dfc |  zfs receive  -s -F '"'"'data1/hawkeye/test_dataset'"'"''

This time the command succeeds and my snapshots arrive as expected on the remote machine. But I am not sure why this would have any impact.

The bandaid fix:

As stated at the end of the play-by-play, just removing `2>&1` from the command makes things work.

Please help:

My fix feels like a hack/bandaid, partly because I don't really understand why it works. Does anyone know why this would change the behavior? I tried this on multiple machines (Ubuntu and Arch), so I'd be surprised if this has anything to do with my environment.

jimsalterjrs commented 4 months ago

What shell are you using? On FreeBSD (which rsync.net uses), for example, the default shell is often csh or tcsh, neither of which supports Bourne-style redirection with the same syntax. Setting the shell to either sh (Bourne) or bash (Bourne Again SHell) resolves the issue.
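For context, in Bourne-family shells `2>&1` duplicates stderr onto stdout. csh and tcsh have no per-descriptor redirection; their `>&` sends both streams to a file, so `2>&1` there is most likely read as an extra argument `2` followed by a redirect to a file named `1`. That reading would also fit the silent failure described above. The sketch below shows the Bourne behavior locally, using a hypothetical helper `run_in_sh` that hands a command string to `sh` explicitly; it is not something syncoid itself does.

```shell
#!/bin/sh
# Hypothetical helper: run a command string under sh explicitly,
# so Bourne-style redirections are parsed by a Bourne shell.
run_in_sh() {
    sh -c "$1"
}

# Prints one line to stdout and one line to stderr.
noisy='echo to-stdout; echo to-stderr >&2'

# Without 2>&1, the $(...) capture sees stdout only (stderr is discarded here).
only_out=$(run_in_sh "$noisy" 2>/dev/null)

# With 2>&1 applied inside the sh command, both streams land in the capture.
both=$(run_in_sh "{ $noisy; } 2>&1")

printf 'stdout only: %s\n' "$only_out"
printf 'merged:\n%s\n' "$both"
```

The same wrapping could in principle be applied to a remote command (`ssh user@host "sh -c '... 2>&1'"`), though the nested quoting needs care; changing the remote login shell, as suggested above, avoids that entirely.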

wizeguyy commented 4 months ago

I am running these from my Ubuntu and Arch machines. I typically run zsh, but I have switched to bash and sh and see the same issue.

wizeguyy commented 4 months ago

Oh duh, I am just realizing the command in question is being executed on the remote (FreeBSD). Let me change the shell on the remote and retry.

wizeguyy commented 4 months ago

Confirmed... setting shell to bash on the FreeBSD remote fixes the issue. Seems so obvious now. Thanks @jimsalterjrs!
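For anyone hitting the same symptom, a quick pre-flight check is to look at which shell family the remote account uses before running syncoid. The sketch below classifies a shell path; `shell_family` is a hypothetical helper, and the list of shell names is illustrative rather than exhaustive.

```shell
#!/bin/sh
# Hypothetical pre-flight check: classify a login shell path by family.
# The path would come from the remote, for example via:
#   ssh user@host 'echo $SHELL'
shell_family() {
    case "${1##*/}" in               # strip the directory, keep the basename
        csh|tcsh)                 echo csh ;;
        sh|ash|bash|dash|ksh|zsh) echo bourne ;;
        *)                        echo unknown ;;
    esac
}

shell_family /bin/tcsh
shell_family /usr/local/bin/bash
```

A result of `csh` suggests the remote will not interpret `2>&1` the way syncoid's generated command expects.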

wizeguyy commented 4 months ago

P.S. You should consider sharing a Monero or Bitcoin address. I'd love to buy you a few beers/coffees.

jimsalterjrs commented 4 months ago

Sorry, I don't do crypto! If you'd like to support what I'm doing, there are a couple of fiat options. You can be a supporting member at https://discourse.practicalzfs.com/ for $5/mo, or you can PayPal me any arbitrary amount at paypal@jrs-s.net. Glad your problem got sorted!