novosirj opened this issue 4 years ago
Hi Ryan,
Sorry for the delay. I've tried to replicate your problem on a couple of machines using different interfaces and networks, and I'm afraid I can't.
The child rsyncs always end before the parent pfp. If I kill the parent pfp manually (via ^C) or with an explicit kill from another term, all the rsyncs die with it:

[[
INFO: Starting rsync for chunkfile [/home/hjm/.parsyncfp/fpcache/f.5]..
INFO: Starting rsync for chunkfile [/home/hjm/.parsyncfp/fpcache/f.6]..
INFO: Starting rsync for chunkfile [/home/hjm/.parsyncfp/fpcache/f.7]..
X11 forwarding request failed on channel 0
 | Elapsed |   1m   | [ wlp3s0]  MB/s  | Running || Susp'd |     Chunks      [2020-05-18]
   Time   | time(m) |  Load  | TCP / RDMA out |  PIDs  ||  PIDs  | [UpTo] of [ToDo]
12.31.59     0.05      1.01     1.28 / 0.00       8    <>    0       [8] of [24]
12.32.03     0.12      1.01     1.12 / 0.00       8    <>    0       [14] of [24]
12.32.06     0.17      1.01     1.14 / 0.00       8    <>    0       [14] of [24]
12.32.09     0.22      1.25     1.11 / 0.00       8    <>    0       [14] of [24]
12.32.12     0.27      1.25     1.12 / 0.00       8    <>    0       [14] of [24]
12.32.15     0.32      1.15     1.12 / 0.00       8    <>    0       [14] of [24]
^C
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(644) [sender=3.1.3]
rsync: [sender] write error: Broken pipe (32)
(the same error/broken-pipe pair repeats for each of the running rsyncs)
]]
This is the way these processes are supposed to work, and in my hands that's the way they do work. Could the processes that you're detecting be from another pfp? I can't match the log IDs.
The fpart error is explained by its author like this:

[[ Hmmm... That error can only be triggered when using arbitrary values (fpart's option -a), which asks fpart not to crawl a FS but instead read lines containing something like:

size path

values. The error is then triggered when sscanf() fails reading a line. Is pfp using that option? It might be interesting to check the exact log line with a tool that displays special characters (e.g. 'cat -bet logfile') to see if there is a whitespace or something else. Finally, it is possible to build fpart with the '--enable-debug' option. It will display info while crawling the filesystem. It may help us better understand what's happening. ]]
And that option is used when fpart is taking lists of files to generate chunks. It's quite possible that there's something in your list that it doesn't like, such as a weird file name (like '^s' or one of the many wocko names that it's possible to create via random mouse events; I have several like that).
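As a rough pre-screen for such entries, you could mimic fpart's "size path" parsing over the list before feeding it to pfp. This is only a sketch; the 'filelist' name and its contents are invented for illustration:

```shell
#!/bin/sh
# Sketch: flag lines in a "size path" list (fpart -a input) whose first
# field is not a numeric size -- the kind of line sscanf() would fail on.
# 'filelist' and its contents are invented for this demo.
printf '123 /data/ok\n\t\n456 /data/also ok\n' > filelist

# Print any line whose first field isn't purely digits; a tab-only line
# collapses to an empty first field, which also fails the test.
awk '$1 !~ /^[0-9]+$/ { printf "line %d unparseable: [%s]\n", NR, $0 }' filelist
```

Combined with 'cat -bet', this should localize exactly which line fpart is rejecting.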
You can recompile fpart to enable the extended debugging if you want to try to track that down. It is supposed to print out the offending name, so the fact that it isn't printed implies that it's a non-printable character or whitespace.
Let me know if you see the pfp errors in other contexts or if you verify that the surviving rsyncs are children of the parent pfp that exited.
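For that parentage check, something like the following should do it. This is only a sketch, using a background 'sleep' as a stand-in for a surviving rsync and this shell as a stand-in for pfp; with the real processes you'd substitute the rsync PIDs and the pfp PID:

```shell
#!/bin/sh
# Sketch: confirm that a process's parent PID (PPID) matches the PID of
# the pfp instance you think spawned it. A background 'sleep' stands in
# for an rsync child here; this shell stands in for pfp.
sleep 5 &
child=$!
sleep 1                     # give the child a moment to start
ppid=$(ps -o ppid= -p "$child" | tr -d ' ')
echo "child $child has parent $ppid (this shell is $$)"
```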
harry
--
Harry Mangalam
Yes, the child rsyncs are absolutely children of the PFP process that's still running. We don't run multiples at all on this system, and on the system that does run more than one, we have otherwise only been running one at a time. You can see that the command lines for the rsync processes above correspond to the parsyncfp command line.
I'll be starting up a new round tomorrow night. It looks like the critical line here is 928, so if there's some instrumentation that would be helpful to add there, let me know.
Re: fpart, yes, we use the PFP option to read file sizes from a list. Here is our full command line for PFP:
/usr/local/bin/parsyncfp-1.67 -i ens6 --checkperiod 1800 --nowait --altcache /root/.parsyncfp-backups-$SNAPSHOT --dispose c -NP 12 --rsyncopts '-a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no"' --maxload 96 --chunksize=5T --fromlist=$HOMEDIR/$SNAPSHOT.list.allfiles.clean --trimpath=/$MOUNTPOINT/.snapshots/$SNAPSHOT --trustme $BACKUPHOST:/zdata/gss/$SNAPSHOT
(I notice --dispose c doesn't seem to work either, but maybe I'm not specifying that correctly?)
I do know that we have some filenames with junk in them -- mostly carriage returns at the end. I tried what you suggested on the fpart log and he's right:
[root@quorum03 .parsyncfp-backups-projectsc]# cat -bet fpcache/fpart.log.23.01.04_2020-05-12 | grep error
33 error parsing input values: ^I$
Do you know which part should be the problem part? The output makes it unclear whether it's the one before or after, but neither the f.30 file nor the f.31 file seems to contain a ^I:
...
32 Filled part #30: size = 5502396364451, 52938 file(s)$
33 error parsing input values: ^I$
34 Filled part #31: size = 4576145248049, 96973 file(s)$
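Since the bad entry apparently never makes it into a chunkfile, the place to look is probably the input list itself. Here is a rough sketch (the list name and contents are invented for the demo) that prints the line numbers of whitespace-only lines, which is what fpart's '^I$' complaint suggests:

```shell
#!/bin/sh
# Sketch: locate whitespace-only lines (e.g. a lone tab, shown by
# 'cat -bet' as "^I$") in the list handed to --fromlist. The file name
# and contents below are invented for illustration.
printf '100 /data/a\n\t\n200 /data/b\n' > allfiles.clean
grep -n '^[[:space:]]*$' allfiles.clean
```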
Thanks for your assistance!
You're right. Line 928:

if ( $rPIDs eq "" && $sPIDs eq "" && $CUR_FPI >= $nbr_cur_fpc_fles && $FPART_RUNNING == 0 ) {

is the exit test. Could you dump all the variables inside that loop, to make sure that I'm using the right tests?
I could easily see that I messed up the test, but I have trouble seeing how the child PIDs could escape the death of their parent, unless the parent is started with nohup or something like it, as described here: http://morningcoffee.io/killing-a-process-and-all-of-its-descendants.html In that case the children also inherit the immunity to SIGHUP and can survive the death of their parent. Are you starting your script with something like that?
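To illustrate the mechanism that article describes: a child moved into its own session no longer shares the parent's process group, so group-directed signals (like the SIGINT a terminal ^C sends) don't reach it. A sketch using setsid(1) and a throwaway 'sleep':

```shell
#!/bin/sh
# Sketch: a child started under setsid(1) gets its own session and
# process group, detaching it from the parent's group -- one way
# children end up surviving signals aimed at the parent's group.
setsid sleep 30 &
child=$!
sleep 1                                   # let setsid take effect
pgid=$(ps -o pgid= -p "$child" | tr -d ' ')
mypgid=$(ps -o pgid= -p $$ | tr -d ' ')
echo "shell pgid: $mypgid, detached child pgid: $pgid"
kill "$child"                             # clean up the demo child
```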
That same article does clarify that children DO NOT always die with their parents (so the universe remains intact given your results ;), but it's generally unusual.
I'll check what's happening to the '--dispose' option. I haven't been paying any attention to it since I added it.
Re: the fpart error - I assume that you're not going to see the weird filename in a chunkfile because it was an error and therefore was not included in the chunking process. If that's a consideration, you can probably cause fpart to print the whole fully-qualified filename and then find out where it should have gone. Or bet that it won't crash rsync and bypass that name-checking...?
Harry
I'll add prints of those PIDs, etc. before I run this again.
When this has gone well, I typically run my script via at, and have a & on the invocation of the script. On weeks where I'm manually waiting for last week's run to finish, or I'm trying to keep a closer eye on this, I'll run it like myscript.sh & and then run disown %% so that if my shell is dropped, the command will not be. I suspect that may be similar to what you're seeing. I also notice that it's not typical that the rsync processes will die if I kill PFP, so that seems to agree. We don't run PFP itself inside that script with an & or anything.
I know at one time there was a bug where PFP wasn't careful to confirm that one of the rsync PIDs wasn't reused by something else. Since this runs for several days on our system -- I believe this week Tuesday at 23:00 to sometime late Saturday -- there is more chance for that to happen. But I believe you already made changes in that area.
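For what it's worth, a guard against that kind of reuse can simply re-check the process name behind a remembered PID before trusting it. A sketch, with a 'sleep' standing in for a remembered rsync PID:

```shell
#!/bin/sh
# Sketch: before treating a stored PID as a live rsync, verify that the
# process currently holding that PID really is an rsync. On a transfer
# that runs for days, the kernel has plenty of time to recycle PIDs.
sleep 5 &              # stand-in for a remembered rsync PID
pid=$!
sleep 1                # let the child exec
comm=$(ps -o comm= -p "$pid" | tr -d ' ')
if [ "$comm" = "sleep" ]; then
    echo "PID $pid still belongs to the process we remembered ($comm)"
else
    echo "PID $pid has been reused by: $comm"
fi
```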
We'll see what it prints out this go 'round.
Re: the fpart error, the filenames don't appear to crash rsync. Another interesting thing that happens in this area (probably better for another ticket -- I only raised the fpart problem at all in case it related to this early-exit problem somehow) is that the rsync processes that run within PFP complain about problem filenames, whereas the final rsync that we run with --delete does not. But I think the reason there is that somewhere upstream -- likely the fpart step -- is dropping special characters, and the file doesn't exist at a name that does not contain them. The final rsync is silent on these/does appear to transfer them.
This didn't happen on this run, though I ran it via at without the &. I may go back to the other way next week to see if that exposes whatever is going on.
Here's the output I did capture, at any rate:
Time to debug weird exit:
rPIDs: 14315 22586
sPIDs:
CUR_FPI: 33
nbr_cur_fpc_fles: 33
FPART_RUNNING: 0
Time to debug weird exit:
rPIDs: 14315 22586
sPIDs:
CUR_FPI: 33
nbr_cur_fpc_fles: 33
FPART_RUNNING: 0
02.54.30 262.62 4.96 60.26 / 0.00 2 <> 0 [33] of [33]
Time to debug weird exit:
rPIDs:
sPIDs:
CUR_FPI: 33
nbr_cur_fpc_fles: 33
FPART_RUNNING: 0
INFO: Done. Please check the target to make sure
expected files are where they're supposed to be.
INFO:
The parsyncfp cache dir takes up [439M /root/.parsyncfp-backups-projectsc]
Don't forget to delete it, but wait until you are sure that your job
completed correctly, so you don't need the log files anymore.
Here's some output from our most recent run on one of our three campuses; I mentioned in Issue 34 that there seems to be a bug where sometimes PFP exits while rsync processes are still running, which is incorrect. You can see in the below output that PFP exits, and then there are these lines:

What is happening during that time is a pgrep -x rsync loop to ensure that there are no longer any rsyncs running, which was added by my colleague Bill Abbott, probably to deal with this problem. As you can see, it was several hours before all of the rsync processes completed. Here's the full output:

While this rsync loop was running, I checked for running rsync processes. As you can see, they are numerous.

We still have an error message in the fpart.log -- it's not clear to me whether it's related, or how I would go about figuring out what is upsetting it:

But we have all of the cache files from the run, so we can look them over and run tests in the interim (the transfer finished very fast this week so I have till Tuesday night to tinker).
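For reference, the wrapper's wait loop presumably looks something like the following. This is only a sketch; the polling interval and the final message are invented, and the real script's details may differ:

```shell
#!/bin/sh
# Sketch of the wrapper's wait loop: block until no rsync processes
# remain before treating the transfer as finished. The 5-second polling
# interval and the message below are invented for this demo.
while pgrep -x rsync >/dev/null 2>&1; do
    sleep 5
done
done_msg="no rsyncs remain"
echo "$done_msg"
```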