jimsalterjrs / sanoid

These are policy-driven snapshot management and replication tools which use OpenZFS for underlying next-gen storage. (Btrfs support plans are shelved unless and until btrfs becomes reliable.)
http://www.openoid.net/products/
GNU General Public License v3.0

syncoid: ControlSocket already exists #902

Closed sdettmer closed 7 months ago

sdettmer commented 7 months ago

Hello,

Thank you for sharing this great tool and for all the effort put into it!

I wrote systemd units for each pool (using different --identifier=EXTRA parameters), but sometimes I see errors like

ControlSocket /tmp/syncoid-root@pve-1711926601 already exists, disabling multiplexing

Often (but not always), I see other errors related to that, like

CRITICAL ERROR: zfs send -I 'tank1/home'@'syncoid_nas-datenklo2_2024-04-01:01:10:02-GMT02:00' 'tank1/home'@'syncoid_nas-datenklo2_2024-04-01:04:10:02-GMT02:00' | lzop | mbuffer -R 100M -q -s 128k -m 16M 2>/dev/null | ssh -S /tmp/syncoid-root@pve-1711937401 root@nas ' mbuffer -r 40M -q -s 128k -m 16M 2>/dev/null | lzop -dfc | zfs receive -s -F '"'"'tank1/datenklo/homel'"'"' 2>&1' failed: 65280 at /usr/sbin/syncoid line 817.

If I understood correctly, 65280 is just Perl's raw wait status for a subprocess that exited with a nonzero code (65280 = 255 << 8, i.e. exit code 255). Are these follow-up errors of the ControlSocket one?
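For reference, Perl reports a child's raw wait status in `$?`, with the exit code in the high byte and the terminating signal in the low bits. A minimal sketch of that decoding in shell arithmetic follows; the function name `decode_wait_status` is hypothetical and only for illustration:

```shell
#!/bin/sh
# Decode a raw 16-bit wait status (as Perl reports it in $?) into
# exit code and terminating signal. The function name is illustrative.
decode_wait_status() {
    status=$1
    exit_code=$(( status >> 8 ))   # high byte: exit code of the child
    signal=$(( status & 127 ))     # low 7 bits: signal number, 0 if none
    echo "exit=$exit_code signal=$signal"
}

decode_wait_status 65280   # the value from the syncoid error above
```

For 65280 this prints `exit=255 signal=0`. ssh exits with 255 when ssh itself hits an error, so this value may point at the SSH connection rather than at `zfs receive`, but that is an inference and not something the logs above confirm.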

I don't understand what this means. In the remote logs I see that snapshots are created and old ones are pruned around the same time (by sanoid), and that the SSH connection was later disconnected by the peer.

I think 1711926601 is a Unix timestamp. The file name contains neither the --identifier=EXTRA value nor a pool name or a PID, so syncoid processes for different pools that start within the same second would share a ControlSocket filename, which is probably bad.

Would it help to add a PID or a random number, or even better to use a temp file generator as discussed in #532, which seems to have similar proposals and a patch? As that patch was apparently not accepted, it may not be that simple, or I may have a different issue.
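To illustrate the suspected collision and the two ideas above, here is a small shell sketch. It is not syncoid's code: `root@nas` is a placeholder host, and the variable names and the `control` file name are made up for the example.

```shell
#!/bin/sh
# Sketch: why host + timestamp alone can collide, and two ways to avoid it.
# "root@nas" is a placeholder host; none of this is syncoid's actual code.
rhost="root@nas"
now=$(date +%s)

# Two jobs started within the same second build the same path:
job_a="/tmp/syncoid-$rhost-$now"
job_b="/tmp/syncoid-$rhost-$now"
[ "$job_a" = "$job_b" ] && echo "collision: $job_a"

# Option 1: include the process ID ($$ works in both shell and Perl).
with_pid="/tmp/syncoid-$$-$rhost-$now"
echo "with pid: $with_pid"

# Option 2: let mktemp create a private directory and put the socket inside.
sockdir=$(mktemp -d "${TMPDIR:-/tmp}/syncoid.XXXXXX")
socket="$sockdir/control"
echo "with mktemp: $socket"
rmdir "$sockdir"   # clean up the demo directory
```

The PID variant is the smaller change, since two concurrently running processes never share a PID. The `mktemp -d` variant additionally avoids predictable names in a world-writable `/tmp`.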

root@pve:~# sanoid --version
/usr/sbin/sanoid version 2.1.0
(Getopt::Long::GetOptions version 2.51; Perl version 5.32.1)

sdettmer commented 7 months ago

I'll try this patch, which adds the PID to the socket name:

--- /usr/sbin/syncoid.dist      2021-04-01 17:41:44.000000000 +0200
+++ /usr/sbin/syncoid   2024-04-01 12:02:30.824282463 +0200
@@ -1488,7 +1488,7 @@
        if ($rhost ne "") {
                if ($remoteuser eq 'root' || $args{'no-privilege-elevation'}) { $isroot = 1; } else { $isroot = 0; }
                # now we need to establish a persistent master SSH connection
-               $socket = "/tmp/syncoid-$rhost-" . time();
+               $socket = "/tmp/syncoid-$$-$rhost-" . time();

                open FH, "$sshcmd -M -S $socket -o ControlPersist=1m $args{'sshport'} $rhost exit |";
                close FH;

phreaker0 commented 7 months ago

@sdettmer This is already fixed in master: https://github.com/jimsalterjrs/sanoid/blob/19fc237476452bfa7499e6dfda77a8a6eee20b4f/syncoid#L1790