novosirj opened this issue 4 years ago
Hi Ryan,
Sorry for the delay. I've tried to replicate your problem on a couple of machines using different interfaces and networks, and I'm afraid I can't.
The child rsyncs always end before the parent pfp. If I kill the parent pfp manually (via ^C) or with an explicit kill from another term, all the rsyncs die with it:

[[
INFO: Starting rsync for chunkfile [/home/hjm/.parsyncfp/fpcache/f.5]..
INFO: Starting rsync for chunkfile [/home/hjm/.parsyncfp/fpcache/f.6]..
INFO: Starting rsync for chunkfile [/home/hjm/.parsyncfp/fpcache/f.7]..
X11 forwarding request failed on channel 0
 | Elapsed |   1m   | [ wlp3s0]  MB/s  | Running || Susp'd |     Chunks      [2020-05-18]
   Time   | time(m) |  Load  | TCP / RDMA out |  PIDs  ||  PIDs  | [UpTo] of [ToDo]
12.31.59     0.05      1.01     1.28 / 0.00       8    <>    0       [8] of [24]
12.32.03     0.12      1.01     1.12 / 0.00       8    <>    0       [14] of [24]
12.32.06     0.17      1.01     1.14 / 0.00       8    <>    0       [14] of [24]
12.32.09     0.22      1.25     1.11 / 0.00       8    <>    0       [14] of [24]
12.32.12     0.27      1.25     1.12 / 0.00       8    <>    0       [14] of [24]
12.32.15     0.32      1.15     1.12 / 0.00       8    <>    0       [14] of [24]
^C
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(644) [sender=3.1.3]
rsync: [sender] write error: Broken pipe (32)
(the same error/broken-pipe pair repeats for each of the running rsyncs)
]]
This is the way these processes are supposed to work, and in my hands that's the way they do work. Could the processes that you're detecting be from another pfp? I can't match the log IDs.
The fpart error is explained by its author like this:

[[ Hmmm... That error can only be triggered when using arbitrary values (fpart's option -a), which asks fpart not to crawl a FS but instead read lines containing something like:

size path

values. The error is then triggered when sscanf() fails reading a line. Is pfp using that option? It might be interesting to check the exact log line with a tool that displays special characters (e.g. 'cat -bet logfile') to see if there is a whitespace or something else. Finally, it is possible to build fpart with the '--enable-debug' option. It will display info while crawling the filesystem. It may help us better understand what's happening. ]]
And that option is used when fpart is taking lists of files to generate chunks. It's quite possible that there's something in your list that it doesn't like, such as a weird file name (like '^s' or one of the many wocko names that it's possible to create via random mouse events; I have several like that).
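As a rough pre-screen for such entries, you could mimic fpart's "size path" parsing over the list before feeding it to pfp. This is only a sketch; the 'filelist' name and its contents are invented for illustration:

```shell
#!/bin/sh
# Sketch: flag lines in a "size path" list (fpart -a input) whose first
# field is not a numeric size -- the kind of line sscanf() would fail on.
# 'filelist' and its contents are invented for this demo.
printf '123 /data/ok\n\t\n456 /data/also ok\n' > filelist

# Print any line whose first field isn't purely digits; a tab-only line
# collapses to an empty first field, which also fails the test.
awk '$1 !~ /^[0-9]+$/ { printf "line %d unparseable: [%s]\n", NR, $0 }' filelist
```

Combined with 'cat -bet', this should localize exactly which line fpart is rejecting.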
You can recompile fpart to enable the extended debugging if you want to try to track that down. It is supposed to print out the offending name, so the fact that it isn't printed implies that it's a non-printable character or whitespace.
Let me know if you see the pfp errors in other contexts or if you verify that the surviving rsyncs are children of the parent pfp that exited.
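For that parentage check, something like the following should do it. This is only a sketch, using a background 'sleep' as a stand-in for a surviving rsync and this shell as a stand-in for pfp; with the real processes you'd substitute the rsync PIDs and the pfp PID:

```shell
#!/bin/sh
# Sketch: confirm that a process's parent PID (PPID) matches the PID of
# the pfp instance you think spawned it. A background 'sleep' stands in
# for an rsync child here; this shell stands in for pfp.
sleep 5 &
child=$!
sleep 1                     # give the child a moment to start
ppid=$(ps -o ppid= -p "$child" | tr -d ' ')
echo "child $child has parent $ppid (this shell is $$)"
```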
harry
--
Harry Mangalam
Yes, the child rsyncs are absolutely children of the PFP process that's still running. We don't run multiples at all on this system, and on the system that does run more than one, we have otherwise only been running one at a time. You can see that the command lines for the rsync processes above correspond to the parsyncfp command line.
I'll be starting up a new round tomorrow night. It looks like the critical line here is 928, so if there's some instrumentation that would be helpful to add there, let me know.
Re: fpart, yes, we use the PFP option to read file sizes from a list. Here is our full command line for PFP:
/usr/local/bin/parsyncfp-1.67 -i ens6 --checkperiod 1800 --nowait --altcache /root/.parsyncfp-backups-$SNAPSHOT --dispose c -NP 12 --rsyncopts '-a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no"' --maxload 96 --chunksize=5T --fromlist=$HOMEDIR/$SNAPSHOT.list.allfiles.clean --trimpath=/$MOUNTPOINT/.snapshots/$SNAPSHOT --trustme $BACKUPHOST:/zdata/gss/$SNAPSHOT
(I notice --dispose c doesn't seem to work either, but maybe I'm not specifying that correctly?)
I do know that we have some filenames with junk in them -- mostly carriage returns at the end. I tried what you suggested on the fpart log and he's right:
[root@quorum03 .parsyncfp-backups-projectsc]# cat -bet fpcache/fpart.log.23.01.04_2020-05-12 | grep error
33 error parsing input values: ^I$
Do you know which part should be the problem part? The output makes it unclear whether it's the one before or after, but neither the f.30 file nor the f.31 file seems to contain a ^I:
...
32 Filled part #30: size = 5502396364451, 52938 file(s)$
33 error parsing input values: ^I$
34 Filled part #31: size = 4576145248049, 96973 file(s)$
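Since the bad entry apparently never makes it into a chunkfile, the place to look is probably the input list itself. Here is a rough sketch (the list name and contents are invented for the demo) that prints the line numbers of whitespace-only lines, which is what fpart's '^I$' complaint suggests:

```shell
#!/bin/sh
# Sketch: locate whitespace-only lines (e.g. a lone tab, shown by
# 'cat -bet' as "^I$") in the list handed to --fromlist. The file name
# and contents below are invented for illustration.
printf '100 /data/a\n\t\n200 /data/b\n' > allfiles.clean
grep -n '^[[:space:]]*$' allfiles.clean
```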
Thanks for your assistance!
You're right. Line 928:

if ( $rPIDs eq "" && $sPIDs eq "" && $CUR_FPI >= $nbr_cur_fpc_fles && $FPART_RUNNING == 0 ) {

is the exit test. Could you dump all the variables inside that loop, to make sure that I'm using the right tests?
I could easily see that I messed up the test, but I have trouble seeing how the child PIDs could escape the death of their parent, unless the parent is started with nohup or something like it, as described here: http://morningcoffee.io/killing-a-process-and-all-of-its-descendants.html In that case the children also inherit the immunity to SIGHUP and can survive the death of their parent. Are you starting your script with something like that?
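To illustrate the mechanism that article describes: a child moved into its own session no longer shares the parent's process group, so group-directed signals (like the SIGINT a terminal ^C sends) don't reach it. A sketch using setsid(1) and a throwaway 'sleep':

```shell
#!/bin/sh
# Sketch: a child started under setsid(1) gets its own session and
# process group, detaching it from the parent's group -- one way
# children end up surviving signals aimed at the parent's group.
setsid sleep 30 &
child=$!
sleep 1                                   # let setsid take effect
pgid=$(ps -o pgid= -p "$child" | tr -d ' ')
mypgid=$(ps -o pgid= -p $$ | tr -d ' ')
echo "shell pgid: $mypgid, detached child pgid: $pgid"
kill "$child"                             # clean up the demo child
```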
That same article does clarify that children DO NOT always die with their parents (so the universe remains intact given your results ;), but it's generally unusual.
I'll check what's happening to the '--dispose' option. I haven't been paying any attention to it since I added it.
Re: the fpart error - I assume that you're not going to see the weird filename in a chunkfile because it was an error and therefore was not included in the chunking process. If that's a consideration, you can probably cause fpart to print the whole fully-qualified filename and then find out where it should have gone. Or bet that it won't crash rsync and bypass that name-checking...?
Harry
I'll add prints of those PIDs, etc. before I run this again.
When this has gone well, I typically run my script via at, and have a & on the invocation of the script. On weeks where I'm manually waiting for last week's run to finish, or I'm trying to keep a closer eye on this, I'll run it like myscript.sh & and then run disown %% so that if my shell is dropped, the command will not be. I suspect that may be similar to what you're seeing. I also notice that it's not typical that the rsync processes will die if I kill PFP, so that seems to agree. We don't run PFP itself inside that script with an & or anything.
I know at one time there was a bug where PFP wasn't careful to confirm that one of the rsync PIDs wasn't reused by something else. Since this runs for several days on our system -- I believe this week Tuesday at 23:00 to sometime late Saturday -- there is more chance for that to happen. But I believe you already made changes in that area.
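For what it's worth, a guard against that kind of reuse can simply re-check the process name behind a remembered PID before trusting it. A sketch, with a 'sleep' standing in for a remembered rsync PID:

```shell
#!/bin/sh
# Sketch: before treating a stored PID as a live rsync, verify that the
# process currently holding that PID really is an rsync. On a transfer
# that runs for days, the kernel has plenty of time to recycle PIDs.
sleep 5 &              # stand-in for a remembered rsync PID
pid=$!
sleep 1                # let the child exec
comm=$(ps -o comm= -p "$pid" | tr -d ' ')
if [ "$comm" = "sleep" ]; then
    echo "PID $pid still belongs to the process we remembered ($comm)"
else
    echo "PID $pid has been reused by: $comm"
fi
```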
We'll see what it prints out this go 'round.
Re: the fpart error, the filenames don't appear to crash rsync. Another interesting thing that happens in this area (probably better for another ticket -- I only raised the fpart problem at all in case it related to this early-exit problem somehow) is that the rsync processes that run within PFP complain about problem filenames, whereas the final rsync that we run with --delete does not. But I think the reason there is that somewhere upstream -- likely the fpart step -- is dropping special characters, and the file doesn't exist at a name that does not contain them. The final rsync is silent on these/does appear to transfer them.
This didn't happen on this run, though I ran it via at without the &. I may go back to the other way next week to see if that exposes whatever is going on.
Here's the output I did capture, at any rate:
Time to debug weird exit:
rPIDs: 14315 22586
sPIDs:
CUR_FPI: 33
nbr_cur_fpc_fles: 33
FPART_RUNNING: 0
Time to debug weird exit:
rPIDs: 14315 22586
sPIDs:
CUR_FPI: 33
nbr_cur_fpc_fles: 33
FPART_RUNNING: 0
02.54.30 262.62 4.96 60.26 / 0.00 2 <> 0 [33] of [33]
Time to debug weird exit:
rPIDs:
sPIDs:
CUR_FPI: 33
nbr_cur_fpc_fles: 33
FPART_RUNNING: 0
INFO: Done. Please check the target to make sure
expected files are where they're supposed to be.
INFO:
The parsyncfp cache dir takes up [439M /root/.parsyncfp-backups-projectsc]
Don't forget to delete it, but wait until you are sure that your job
completed correctly, so you don't need the log files anymore.
Here's some output from our most recent run on one of our three campuses; I mentioned in Issue 34 that there seems to be a bug where sometimes PFP exits while rsync processes are still running, which is incorrect. You can see in the below output that PFP exits, and then there are these lines:

What is happening during that time is a pgrep -x rsync loop to ensure that there are no longer any rsyncs running, which was added by my colleague Bill Abbott, probably to deal with this problem. As you can see, it was several hours before all of the rsync processes completed. Here's the full output:

While this rsync loop was running, I checked for running rsync processes. As you can see, they are numerous.

We still have an error message in the fpart.log -- it's not clear to me whether it's related, or how I would go about figuring out what is upsetting it:

But we have all of the cache files from the run, so we can look them over and run tests in the interim (the transfer finished very fast this week so I have till Tuesday night to tinker).
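For reference, the wrapper's wait loop presumably looks something like the following. This is only a sketch; the polling interval and the final message are invented, and the real script's details may differ:

```shell
#!/bin/sh
# Sketch of the wrapper's wait loop: block until no rsync processes
# remain before treating the transfer as finished. The 5-second polling
# interval and the message below are invented for this demo.
while pgrep -x rsync >/dev/null 2>&1; do
    sleep 5
done
done_msg="no rsyncs remain"
echo "$done_msg"
```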