klemens-u opened this issue 6 years ago
If I run a script in crontab using blocksync with --workers, it always fails.
Environment:
Can you do a tcpdump while the cronjob is running? And what is your network setup like? Seems like the script believes it successfully connected to the remote system but then can't talk to it.
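For example, something like this on the source host while the job runs (interface and destination host are placeholders):

tcpdump -i any -s 0 -w /tmp/blocksync.pcap host backup.example.com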
Is this related to #5 (and can be closed), or is it a different issue?
This is a different issue. I'll try to send you the desired infos soon.
I had the same issue, with the error: Error copying blocksync to the remote host!
However, --splay 2 fixed it. PS: this is a brilliant tool, @ct16k, thanks for sharing!
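For reference, an invocation along these lines is what I mean (devices and host are illustrative; the option placement follows the other examples in this thread):

python blocksync.py --splay 2 -w 4 /dev/vg0/lv-src backup.example.com /dev/vg0/lv-dst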
@ct16k that fix doesn't always work. I have a replication that works 100% when run by hand with 6 workers and fails 100% on the first status update (I run with --interval=60, so it takes a minute to fail):
[worker 2] Remote fadvise: None
[worker 2] Start syncing 1195 blocks...
[worker 1] Remote fadvise: None
[worker 1] Start syncing 1195 blocks...
[worker 0] Remote fadvise: None
[worker 0] Start syncing 1195 blocks...
[worker 4] Remote fadvise: None
[worker 5] Remote fadvise: None
[worker 3] Remote fadvise: None
[worker 4] Start syncing 1195 blocks...
[worker 5] Start syncing 1195 blocks...
[worker 3] Start syncing 1195 blocks...
Worker #4 exited with 256
Worker #1 exited with 15
Worker #5 exited with 15
Worker #2 exited with 15
Worker #3 exited with 15
Worker #0 exited with 15
Running with multiple workers definitely helps overcome network latency, even when your disks don't increase throughput with multiple workers.
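As a rough back-of-the-envelope illustration (the round-trip time is an assumed number; the block size matches the logs in this thread):

rtt = 0.1                    # assume a 100 ms round trip per block exchange
block_mb = 1.0               # 1.0 MB blocks, as in the logs below
one_worker = block_mb / rtt  # ~10 MB/s ceiling if each block waits out a full RTT
print(one_worker, 6 * one_worker)  # six workers keep six blocks in flight: ~60 MB/s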
> I have a replication that works 100% when run by hand with 6 workers and fails 100% on the first status update (I run with --interval=60, so it takes a minute to fail)
To be clear, I mean it works when run in a terminal, or even in a disconnected tmux session. When run via cron or at, it fails with multiple workers; switching to a single worker then works. As the network is high(ish) latency, that is a bit of a pain.
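For context, this is the kind of crontab entry I mean (schedule, paths and host are illustrative). Note that under cron there is no tty, and a redirection like the one below also turns stdout into a file:

*/30 * * * * /usr/bin/python /root/blocksync.py -w 6 --interval=60 /dev/vg0/lv-src backup.example.com /dev/vg0/lv-dst >>/var/log/blocksync.log 2>&1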
@theraser I have tried adding debug logging both on the sender and the receiver.
The sender ends up failing with (line numbers may differ because of the added logging):
Traceback (most recent call last):
File "blocksync.py", line 439, in <module>
sync(i, srcdev, dsthost, dstdev, options)
File "blocksync.py", line 334, in sync
p_in.write(DIFF)
To enable logging on the receiver, I used a custom python: mypython.txt
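The attachment isn't inlined here, but a wrapper along these lines (my sketch, not necessarily the exact attachment) produces the per-worker tmp.*.stderr files listed below; stdout would need a tee-style copy rather than a plain redirect, because the server's stdout is the data channel back to the client:

#!/usr/bin/env python
# Sketch of a logging wrapper: duplicate stderr into a mkstemp file,
# then exec the real interpreter with the original arguments.
import os, sys, tempfile
fd, path = tempfile.mkstemp(suffix=".stderr", dir="/tmp")
os.dup2(fd, 2)  # fd 2 now points at the temp file and survives the exec
os.execvp("python2", ["python2"] + sys.argv[1:])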
The worker that fails always has much more logging to stdout (maybe coincidental?) and nothing in stderr, whereas the other workers always have a broken-pipe error (presumably because the ssh is killed).
-rw-r--r-- 1 root root 222 Sep 15 16:33 tmp.eb7YkKE9zD.stderr
-rw-r--r-- 1 root root 222 Sep 15 16:33 tmp.L2OVGAD4uM.stderr
-rw-r--r-- 1 root root 222 Sep 15 16:33 tmp.YAG8eaL2mp.stderr
-rw-r--r-- 1 root root 222 Sep 15 16:33 tmp.77tm1d6hgK.stderr
-rw-r--r-- 1 root root 0 Sep 15 16:33 tmp.LN62mPrfFD.stderr
-rw-r--r-- 1 root root 53883 Sep 15 16:33 tmp.OdhwrXRMka.stdout
-rw-r--r-- 1 root root 46139 Sep 15 16:33 tmp.eb7YkKE9zD.stdout
-rw-r--r-- 1 root root 42427 Sep 15 16:33 tmp.YAG8eaL2mp.stdout
-rw-r--r-- 1 root root 51515 Sep 15 16:33 tmp.77tm1d6hgK.stdout
-rw-r--r-- 1 root root 41147 Sep 15 16:33 tmp.L2OVGAD4uM.stdout
-rw-r--r-- 1 root root 76667 Sep 15 16:33 tmp.LN62mPrfFD.stdout
With some further digging, it looks like the sender process is attempting to write after the receiver thinks it is finished:
[worker 0] stderr: debug1: Sending environment.^M
[worker 0] stderr: debug1: Sending env LANG = en_AU.UTF-8^M
[worker 0] stderr: debug1: Sending command: /root/mypython /tmp/tmp.T963KHKglF tmpserver /dev/vg-decrypted-ssd/prod-REMOVED-bastion-disk -b 1048576 -d 3 -1 sha512 ^M
[worker 0] stderr: debug1: client_input_channel_req: channel 0 rtype exit-status reply 0^M
[worker 0] stderr: debug1: client_input_channel_req: channel 0 rtype eow@openssh.com reply 0^M
[worker 0] stderr: debug1: channel 0: free: client-session, nchannels 1^M
[worker 0] stderr: debug1: fd 0 clearing O_NONBLOCK^M
[worker 0] stderr: debug1: fd 1 clearing O_NONBLOCK^M
[worker 0] stderr: debug1: fd 2 clearing O_NONBLOCK^M
[worker 0] stderr: Transferred: sent 5382364, received 107348 bytes, in 48.3 seconds^M
[worker 0] stderr: Bytes per second: sent 111324.0, received 2220.3^M
[worker 0] stderr: debug1: Exit status 0^M
[worker 0] stderr: debug1: compress outgoing: raw data 24142745, compressed 5336923, factor 0.22^M
[worker 0] stderr: debug1: compress incoming: raw data 90032, compressed 87626, factor 0.97^M
It looks like the ssh finishes normally, but the sender is continuing to send.
The error seems to be on the very last block:
[worker 0] Unexpected error on block 1195
It loops over and writes all the other blocks fine; by the last block, the receiver has already closed its end, presumably after writing the hash.
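A minimal standalone sketch (nothing blocksync-specific) reproduces the same failure, the sender writing after the receiver has exited:

import subprocess
# The receiver exits without ever reading its stdin.
recv = subprocess.Popen(["python3", "-c", "pass"], stdin=subprocess.PIPE)
recv.wait()
try:
    recv.stdin.write(b"x" * 65536)  # like p_in.write(DIFF) on the last block
    recv.stdin.flush()
except BrokenPipeError as err:      # IOError with EPIPE on Python 2
    print("sender sees:", err)      # [Errno 32] Broken pipe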
It also fails when sending a file to a localhost path, so it is not tied to SSH.
Once a worker has finished, all the other workers fail immediately.
Here is the output I get when using 8 workers (from running python /root/blocksync.py -w 8 -T "/var/lib/libvirt/images/dev.qcow2" localhost "/nas/vm/.working.dev.qcow2"):
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!! !!!
!!! DESTINATION WILL BE PERMANENTLY CHANGED! !!!
!!! PRESS CTRL-C NOW TO EXIT !!!
!!! !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Starting worker #7 (pid: 953229)
[worker 7] Block size is 1.0 MB
[worker 7] Local fadvise: DONTNEED
[worker 7] Chunk size is 32004.9 MB, offset is 234917093376
[worker 7] Hash 1: sha512
[worker 7] Running: python2 /root/blocksync.py server /nas/vm/.working.dev.qcow2 -b 1048576 -d 3 -1 sha512
[worker 7] Remote fadvise: None
[worker 7] Start syncing 32005 blocks...
[worker 7] same: 1944, diff: 30061, 32005/32005, 6.6 MB/s
[worker 7] Completed in 1:20:18
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!! !!!
!!! DESTINATION WILL BE PERMANENTLY CHANGED! !!!
!!! PRESS CTRL-C NOW TO EXIT !!!
!!! !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Starting worker #0 (pid: 953206)
[worker 0] Block size is 1.0 MB
[worker 0] Local fadvise: DONTNEED
[worker 0] Chunk size is 32004.9 MB, offset is 0
[worker 0] Hash 1: sha512
[worker 0] Running: python2 /root/blocksync.py server /nas/vm/.working.dev.qcow2 -b 1048576 -d 3 -1 sha512
[worker 0] Remote fadvise: None
[worker 0] Start syncing 32005 blocks...
Traceback (most recent call last):
File "/root/blocksync.py", line 415, in <module>
sync(i, srcdev, dsthost, dstdev, options)
File "/root/blocksync.py", line 318, in sync
p_in.write(DIFF)
BrokenPipeError: [Errno 32] Broken pipe
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!! !!!
!!! DESTINATION WILL BE PERMANENTLY CHANGED! !!!
!!! PRESS CTRL-C NOW TO EXIT !!!
!!! !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Worker #7 exited with 0
Worker #0 exited with 256
Worker #0 caused ABORT
Terminating worker #1
Terminating worker #2
Terminating worker #3
Terminating worker #4
Terminating worker #5
Terminating worker #6
Terminating worker #7
Traceback (most recent call last):
File "/root/blocksync.py", line 384, in <module>
server(dstdev, False, options)
File "/root/blocksync.py", line 130, in server
stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
File "/root/blocksync.py", line 384, in <module>
server(dstdev, False, options)
File "/root/blocksync.py", line 130, in server
stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
File "/root/blocksync.py", line 384, in <module>
server(dstdev, False, options)
File "/root/blocksync.py", line 130, in server
stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
File "/root/blocksync.py", line 384, in <module>
server(dstdev, False, options)
File "/root/blocksync.py", line 130, in server
stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
File "/root/blocksync.py", line 431, in <module>
os.kill(pid, signal.SIGTERM)
ProcessLookupError: [Errno 3] No such process
Traceback (most recent call last):
File "/root/blocksync.py", line 384, in <module>
server(dstdev, False, options)
File "/root/blocksync.py", line 130, in server
stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
File "/root/blocksync.py", line 384, in <module>
server(dstdev, False, options)
File "/root/blocksync.py", line 130, in server
stdout.flush()
IOError: [Errno 32] Broken pipe
This happens when starting the process with nohup or in a cron job. When started directly in a shell, it completes successfully.
Actually, it only takes a stdout redirection to reproduce the issue.
This command line has the exact same problem:
python /root/blocksync.py -w 4 -T src localhost dest >/dev/null
Seems like the first fork that finishes its blocks closes its file descriptors, and the other forks end up with broken pipes on their stdin/stdout.
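I can't prove that is the mechanism, but redirection definitely changes the semantics for forked workers. One concrete example: with stdout on a tty, the sketch below prints "before fork" once; redirected to a file or pipe, stdout becomes fully buffered, fork() duplicates the unflushed buffer, and the line comes out twice:

import os, sys
sys.stdout.write("before fork\n")  # flushed immediately on a tty (line-buffered)
pid = os.fork()
sys.stdout.flush()                 # when redirected, both processes emit the line
if pid:
    os.waitpid(pid, 0)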
@ct16k @theraser any clue on this?
As this project seems unmaintained, allow me to promote a tool I wrote to replace blocksync, which solves this issue in particular. Unfortunately, I didn't find a way to fix it by patching this project, so I needed to start a new one.
If this issue still impacts some people, you can check deltasync.
It also has some performance improvements to maximize IO bandwidth and the CPU usage spent computing checksums, so you should see an improvement in effective bandwidth too.
There is also https://github.com/nethappen/blocksync-fast