Added --send-delay option, fixes race condition with direct TCP transfer

dismantl commented 1 year ago

Kinda surprised no one else has run into this before me, but I recently tried running a backup job to a remote server using direct TCP transfer and kept hitting a race condition. The send side of the netcat connection would execute before the listen side was up, get "Connection refused", and the backup job would hang:

$ zfs-autobackup -v --ssh-target root@remote.example.com --progress --strip-path=1 --force --send-pipe "nc -v remote.example.com 1234" --recv-pipe "nc -nvlp 1234" aws data
...
  #### Synchronising
  [Source] zfs send custom pipe   : nc -v remote.example.com 1234
  [Target] zfs recv custom pipe   : nc -nvlp 1234
  [Source] hotrod: sending to data
  [Source] hotrod/crypt: sending to data/crypt
  [Source] hotrod/crypt/gitlab: sending to data/crypt/gitlab
  [Target] data/crypt/gitlab@aws-20230816111510: receiving incremental
! [Source] STDERR > nc: connect to remote.example.com (X.X.X.X) port 1234 (tcp) failed: Connection refused
! [Target] STDERR > Listening on 0.0.0.0 1234

I retried a bunch of times and this happened almost every time, with the occasional success. I then reran with --debug and --dry-run to see the command pipeline that gets executed:

...
# [Target] CMDSKIP> (zfs send --large-block --embed --raw --verbose --parsable --props -i @aws-20230815084358 hotrod/crypt/gitlab@aws-20230816102918 | nc -v remote.example.com 1234) | (ssh root@remote.example.com 'nc -nvlp 1234 | zfs recv -u -v -F -s data/crypt/gitlab')
...

I realized that since I'm using netcat for direct TCP transfer, the pipe from the send commands to the recv commands is superfluous: the data travels over the netcat connection, and nothing in the pipeline makes the sender wait for the listener. So I could just add a sleep before the zfs send | nc commands to make sure they don't start before the netcat listener is ready.
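
Roughly, the idea looks like the sketch below. This is only an illustration, not the code from this PR: the helper name delayed_send_command and the shortened zfs send command are made up for the example.

```python
import shlex


def delayed_send_command(send_cmd, delay_seconds=0):
    """Return send_cmd, prefixed with a sleep when a delay is requested.

    Hypothetical helper for illustration; not part of zfs-autobackup.
    """
    if delay_seconds < 0:
        raise ValueError("delay must be non-negative")
    if delay_seconds == 0:
        # No delay requested: leave the command untouched.
        return send_cmd
    # `sleep N && (...)` starts the send pipeline only after the pause,
    # giving the remote listener time to come up.
    return "sleep {} && ({})".format(shlex.quote(str(delay_seconds)), send_cmd)


# Shortened, made-up send pipeline for the example.
send = "zfs send hotrod/crypt/gitlab@snap | nc -v remote.example.com 1234"
print(delayed_send_command(send, 1))
```

The delay is a guess at how long the listener needs to start, so it narrows the race window rather than closing it.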

This PR implements the --send-delay option, which does just this, and consistently solves the race condition for me with just a 1-second delay:

$ zfs-autobackup -v --ssh-target root@remote.example.com --progress --strip-path=1 --force --send-pipe "nc -v remote.example.com 1234" --recv-pipe "nc -nvlp 1234" --send-delay 1 aws data
...
  #### Synchronising
  [Source] zfs send custom pipe   : nc -v remote.example.com 1234
  [Target] zfs recv custom pipe   : nc -nvlp 1234
  [Source] hotrod: sending to data
  [Source] hotrod/crypt: sending to data/crypt
  [Source] hotrod/crypt/gitlab: sending to data/crypt/gitlab
  [Source] hotrod/crypt/gitlab@aws-20230816112547: Destroying
  [Target] data/crypt/gitlab@aws-20230816112547: Destroying
  [Target] data/crypt/gitlab@aws-20230816122959: receiving incremental
! [Target] STDERR > Listening on 0.0.0.0 1234
! [Source] STDERR > Connection to remote.example.com (X.X.X.X) 1234 port [tcp/*] succeeded!
! [Target] STDERR > Connection received on X.X.X.X 15081
  [Target] data/crypt/gitlab@aws-20230816123127: receiving incremental
! [Target] STDERR > Listening on 0.0.0.0 1234
! [Source] STDERR > Connection to remote.example.com (X.X.X.X) 1234 port [tcp/*] succeeded!
! [Target] STDERR > Connection received on X.X.X.X 3051
  [Target] data/crypt/gitlab@aws-20230816123226: receiving incremental
! [Target] STDERR > Listening on 0.0.0.0 1234
! [Source] STDERR > Connection to remote.example.com (X.X.X.X) 1234 port [tcp/*] succeeded!
! [Target] STDERR > Connection received on X.X.X.X 33524
...

psy0rz commented 1 year ago

I'm not sure what to think of this, it feels kind of hackish. It would be better if the send tool kept retrying (perhaps some tool other than nc).

coveralls commented 1 year ago

coverage: 86.334% (-0.1%) from 86.449% when pulling c0f66c253dd033f6a16e3b006f47ad17a45062fb on dismantl:send-delay into ee1d17b6ff942d6a2d151c3bdaa664643cd80413 on psy0rz:master.

mduller commented 1 year ago

I agree that it is a bit hacky. socat, as an alternative to nc, provides a flexible retry option: http://www.dest-unreach.org/socat/doc/socat.html#OPTION_RETRY
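
For comparison, the retry approach could be sketched in Python as below. This is an assumption-laden illustration of "keep retrying until the listener is up", not how socat or zfs-autobackup implement it: connect_with_retry and its parameters are hypothetical names.

```python
import socket
import time


def connect_with_retry(host, port, retries=10, interval=1.0):
    """Open a TCP connection, retrying while the listener is not up yet.

    Hypothetical helper for illustration only.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            # Succeeds as soon as something is listening on (host, port).
            return socket.create_connection((host, port), timeout=5)
        except OSError as exc:
            # "Connection refused" (ConnectionRefusedError) is an OSError.
            last_error = exc
            if attempt < retries:
                time.sleep(interval)
    # Every attempt failed: surface the last error to the caller.
    raise last_error


# Example use (not run here):
#   sock = connect_with_retry("remote.example.com", 1234)
#   ... stream the zfs send output into sock ...
```

Unlike a fixed delay, this waits only as long as the listener actually needs and fails with a clear error if it never appears.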

dismantl commented 1 year ago

> I agree that it is a bit hacky. socat, as an alternative to nc, provides a flexible retry option: http://www.dest-unreach.org/socat/doc/socat.html#OPTION_RETRY

oh good to know, I'll give that a try. Feel free to close this PR if you like.

psy0rz commented 1 year ago

alright let me know how it goes, otherwise it was still a good PR :)

psy0rz / zfs_autobackup

Added --send-delay option, fixes race condition with direct TCP transfer #211