theraser / blocksync

Block device sync between remote hosts. Based off http://www.bouncybouncy.net/programs/blocksync.py

Errors with --workers via cron #4

klemens-u commented 6 years ago

If I run a script from crontab that uses blocksync with --workers, it always fails:

Traceback (most recent call last):
  File "/root/bin/blocksync.py", line 415, in <module>
    sync(i, srcdev, dsthost, dstdev, options)
  File "/root/bin/blocksync.py", line 318, in sync
    p_in.write(DIFF)
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/root/bin/blocksync.py", line 431, in <module>
    os.kill(pid, signal.SIGTERM)
OSError: [Errno 3] No such process

Environment:

ct16k commented 6 years ago

Can you do a tcpdump while the cronjob is running? And what is your network setup like? Seems like the script believes it successfully connected to the remote system but then can't talk to it.
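
For reference, a capture along these lines should show what happens to the SSH session while the cronjob runs (interface name, destination host and output path are placeholders):

tcpdump -i eth0 -w /tmp/blocksync.pcap host dsthost and port 22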

ct16k commented 6 years ago

Is this related to #5 and can it be closed, or is it a different issue?

klemens-u commented 6 years ago

This is a different issue. I'll try to send you the requested info soon.

davidbartonau commented 5 years ago

I had the same issue, with the error: Error copying blocksync to the remote host! However, --splay 2 fixed it. PS: this is a brilliant tool @ct16k, thanks for sharing!
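
For context, a crontab entry along these lines reproduces the setup described here (schedule, paths and devices are made-up examples; only the --splay 2 part matters):

0 2 * * * /usr/bin/python /root/bin/blocksync.py --splay 2 /dev/src dsthost /dev/dst >> /var/log/blocksync.log 2>&1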

davidbartonau commented 5 years ago

@ct16k that fix doesn't always work. I have a replication that works 100% of the time when run by hand with 6 workers and fails 100% of the time on the first status update (I run with --interval=60, so it takes a minute to fail):

[worker 2] Remote fadvise: None
[worker 2] Start syncing 1195 blocks...
[worker 1] Remote fadvise: None
[worker 1] Start syncing 1195 blocks...
[worker 0] Remote fadvise: None
[worker 0] Start syncing 1195 blocks...
[worker 4] Remote fadvise: None
[worker 5] Remote fadvise: None
[worker 3] Remote fadvise: None
[worker 4] Start syncing 1195 blocks...
[worker 5] Start syncing 1195 blocks...
[worker 3] Start syncing 1195 blocks...

Worker #4 exited with 256
Worker #1 exited with 15
Worker #5 exited with 15
Worker #2 exited with 15
Worker #3 exited with 15
Worker #0 exited with 15

davidbartonau commented 5 years ago

Running with multiple workers definitely helps overcome network latency, even when your disks don't increase throughput with multiple workers.

davidbartonau commented 5 years ago

I have a replication that works 100% of the time when run by hand with 6 workers and fails 100% of the time on the first status update (I run with --interval=60, so it takes a minute to fail).

To be clear, I mean it works when run in a terminal or even in a disconnected tmux session. When run via cron or at, it fails with multiple workers; switching to a single worker works. As the network has high(ish) latency, that is a bit of a pain.

davidbartonau commented 5 years ago

@theraser I have tried adding debug logging both on the sender and the receiver.

The sender ends up failing with the following (line numbers may differ, as I have added logging):

Traceback (most recent call last):
  File "blocksync.py", line 439, in <module>
    sync(i, srcdev, dsthost, dstdev, options)
  File "blocksync.py", line 334, in sync
    p_in.write(DIFF)

To enable logging on the receiver, I used a custom python: mypython.txt

The worker that fails always has much more logging to stdout (maybe coincidental?) and nothing in stderr, whereas the other workers always have a broken pipe error (presumably because the ssh is killed).

-rw-r--r-- 1 root root   222 Sep 15 16:33 tmp.eb7YkKE9zD.stderr
-rw-r--r-- 1 root root   222 Sep 15 16:33 tmp.L2OVGAD4uM.stderr
-rw-r--r-- 1 root root   222 Sep 15 16:33 tmp.YAG8eaL2mp.stderr
-rw-r--r-- 1 root root   222 Sep 15 16:33 tmp.77tm1d6hgK.stderr
-rw-r--r-- 1 root root     0 Sep 15 16:33 tmp.LN62mPrfFD.stderr
-rw-r--r-- 1 root root 53883 Sep 15 16:33 tmp.OdhwrXRMka.stdout
-rw-r--r-- 1 root root 46139 Sep 15 16:33 tmp.eb7YkKE9zD.stdout
-rw-r--r-- 1 root root 42427 Sep 15 16:33 tmp.YAG8eaL2mp.stdout
-rw-r--r-- 1 root root 51515 Sep 15 16:33 tmp.77tm1d6hgK.stdout
-rw-r--r-- 1 root root 41147 Sep 15 16:33 tmp.L2OVGAD4uM.stdout
-rw-r--r-- 1 root root 76667 Sep 15 16:33 tmp.LN62mPrfFD.stdout

davidbartonau commented 5 years ago

With some further digging, it looks like the sender process is attempting to write after the receiver thinks it is finished:

[worker 0] stderr: debug1: Sending environment.^M
[worker 0] stderr: debug1: Sending env LANG = en_AU.UTF-8^M
[worker 0] stderr: debug1: Sending command: /root/mypython /tmp/tmp.T963KHKglF tmpserver /dev/vg-decrypted-ssd/prod-REMOVED-bastion-disk -b 1048576 -d 3 -1 sha512 ^M
[worker 0] stderr: debug1: client_input_channel_req: channel 0 rtype exit-status reply 0^M
[worker 0] stderr: debug1: client_input_channel_req: channel 0 rtype eow@openssh.com reply 0^M
[worker 0] stderr: debug1: channel 0: free: client-session, nchannels 1^M
[worker 0] stderr: debug1: fd 0 clearing O_NONBLOCK^M
[worker 0] stderr: debug1: fd 1 clearing O_NONBLOCK^M
[worker 0] stderr: debug1: fd 2 clearing O_NONBLOCK^M
[worker 0] stderr: Transferred: sent 5382364, received 107348 bytes, in 48.3 seconds^M
[worker 0] stderr: Bytes per second: sent 111324.0, received 2220.3^M
[worker 0] stderr: debug1: Exit status 0^M
[worker 0] stderr: debug1: compress outgoing: raw data 24142745, compressed 5336923, factor 0.22^M
[worker 0] stderr: debug1: compress incoming: raw data 90032, compressed 87626, factor 0.97^M

It looks like the ssh finishes normally, but the sender is continuing to send.

davidbartonau commented 5 years ago

The error seems to be on the very last block:

[worker 0] Unexpected error on block 1195

It loops over and writes every other block successfully, but on the last block the receiver has already closed its end, presumably right after writing the hash.
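
That failure mode is easy to reproduce outside blocksync; here is a minimal sketch (not blocksync's actual code) where the receiver exits after one block while the sender keeps writing:

import subprocess

# Minimal sketch (not blocksync's actual code): the "receiver" reads a
# single line and exits; the "sender" then tries to send one more block.
p = subprocess.Popen(
    ["python3", "-c", "import sys; sys.stdin.readline()"],
    stdin=subprocess.PIPE,
)
p.stdin.write(b"block 1194\n")
p.stdin.flush()
p.wait()                        # receiver has exited; the pipe has no reader
p.stdin.write(b"block 1195\n")  # this write/flush now fails with
p.stdin.flush()                 # BrokenPipeError: [Errno 32] Broken pipe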

ndusart commented 3 years ago

It also fails when sending a file to a localhost path, so it is not tied to SSH.

Once a worker has finished, all the other workers fail immediately.

Here is the output I receive when using 8 workers (issuing python /root/blocksync.py -w 8 -T "/var/lib/libvirt/images/dev.qcow2" localhost "/nas/vm/.working.dev.qcow2"):

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!                                          !!!
!!! DESTINATION WILL BE PERMANENTLY CHANGED! !!!
!!!         PRESS CTRL-C NOW TO EXIT         !!!
!!!                                          !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Starting worker #7 (pid: 953229)
[worker 7] Block size is 1.0 MB
[worker 7] Local fadvise: DONTNEED
[worker 7] Chunk size is 32004.9 MB, offset is 234917093376
[worker 7] Hash 1: sha512
[worker 7] Running: python2 /root/blocksync.py server /nas/vm/.working.dev.qcow2 -b 1048576 -d 3 -1 sha512
[worker 7] Remote fadvise: None
[worker 7] Start syncing 32005 blocks...
[worker 7] same: 1944, diff: 30061, 32005/32005,   6.6 MB/s
[worker 7] Completed in 1:20:18

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!                                          !!!
!!! DESTINATION WILL BE PERMANENTLY CHANGED! !!!
!!!         PRESS CTRL-C NOW TO EXIT         !!!
!!!                                          !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Starting worker #0 (pid: 953206)
[worker 0] Block size is 1.0 MB
[worker 0] Local fadvise: DONTNEED
[worker 0] Chunk size is 32004.9 MB, offset is 0
[worker 0] Hash 1: sha512
[worker 0] Running: python2 /root/blocksync.py server /nas/vm/.working.dev.qcow2 -b 1048576 -d 3 -1 sha512
[worker 0] Remote fadvise: None
[worker 0] Start syncing 32005 blocks...
Traceback (most recent call last):
  File "/root/blocksync.py", line 415, in <module>
    sync(i, srcdev, dsthost, dstdev, options)
  File "/root/blocksync.py", line 318, in sync
    p_in.write(DIFF)
BrokenPipeError: [Errno 32] Broken pipe

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!                                          !!!
!!! DESTINATION WILL BE PERMANENTLY CHANGED! !!!
!!!         PRESS CTRL-C NOW TO EXIT         !!!
!!!                                          !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Worker #7 exited with 0
Worker #0 exited with 256
Worker #0 caused ABORT
Terminating worker #1
Terminating worker #2
Terminating worker #3
Terminating worker #4
Terminating worker #5
Terminating worker #6
Terminating worker #7
Traceback (most recent call last):
  File "/root/blocksync.py", line 384, in <module>
    server(dstdev, False, options)
  File "/root/blocksync.py", line 130, in server
    stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/root/blocksync.py", line 384, in <module>
    server(dstdev, False, options)
  File "/root/blocksync.py", line 130, in server
    stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/root/blocksync.py", line 384, in <module>
    server(dstdev, False, options)
  File "/root/blocksync.py", line 130, in server
    stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/root/blocksync.py", line 384, in <module>
    server(dstdev, False, options)
  File "/root/blocksync.py", line 130, in server
    stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/root/blocksync.py", line 431, in <module>
    os.kill(pid, signal.SIGTERM)
ProcessLookupError: [Errno 3] No such process
Traceback (most recent call last):
  File "/root/blocksync.py", line 384, in <module>
    server(dstdev, False, options)
  File "/root/blocksync.py", line 130, in server
    stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/root/blocksync.py", line 384, in <module>
    server(dstdev, False, options)
  File "/root/blocksync.py", line 130, in server
    stdout.flush()
IOError: [Errno 32] Broken pipe

This happens when starting the process with nohup or in a cron job. When started directly in a shell, it completes successfully.
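
The common factor is that under cron and nohup stdout is not a terminal; a one-liner to see what the workers inherit (a sketch, not part of blocksync):

import sys

# Under cron or nohup, stdout is a pipe or a file rather than a tty, so
# Python uses full buffering instead of line buffering, and all forked
# workers inherit that same descriptor.
print("stdout is a tty:", sys.stdout.isatty())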

ndusart commented 3 years ago

Actually, it only takes a stdout redirection to reproduce the issue.

This command line has the exact same problem:

python /root/blocksync.py -w 4 -T src localhost dest >/dev/null

Seems like the first fork that finishes its blocks closes its file descriptors, and the other forks end up with broken pipes on their stdin/stdout.

@ct16k @theraser any clue on this?
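
For what it's worth, the shared-descriptor theory can be checked in isolation; a minimal sketch (assuming nothing about blocksync's internals) of two writers sharing one pipe whose only reader goes away:

import os
import time

# Two forked "workers" share the write end of one pipe. Once the only
# read end is closed, every writer gets a broken pipe, not just the
# worker that finished first.
r, w = os.pipe()
for i in range(2):
    if os.fork() == 0:      # child: a worker writing results
        os.close(r)
        time.sleep(1)       # by now the parent has closed the read end
        try:
            os.write(w, b"done\n")
        except BrokenPipeError:
            print("worker %d: broken pipe" % i)
        os._exit(0)
os.close(w)
os.close(r)                 # the only reader goes away
for _ in range(2):
    os.wait()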

ndusart commented 9 months ago

As this project seems unmaintained, I'll allow myself to promote a tool of mine that I wrote to replace blocksync and that solves this very issue. Unfortunately I didn't find a way to fix it by patching this project, so I needed to start a new one.

So if this issue still impacts people, you can check out deltasync.

It also has some performance improvements to maximize I/O bandwidth and CPU usage when computing checksums, so you should see an improvement in effective bandwidth too.

marcin-github commented 9 months ago

There is also https://github.com/nethappen/blocksync-fast