hjmangalam / parsyncfp

follow-on to parsync (parallel rsync) with better startup perf

fpart and parasync mismatching parts #30

Open ADiEmme opened 4 years ago

ADiEmme commented 4 years ago

Hi all,

We are facing a weird problem. Our parsync job has been running for 18 h (which might be OK, considering the amount of data we're trying to back up: 440 TB). Since it looked suspicious, I ran an strace on parsync and this is what I noticed:

[root@filler002 fpcache]# strace -ff -tt -s 4096 -p 2858510
strace: Process 2858510 attached
09:05:02.861749 restart_syscall(<... resuming interrupted nanosleep ...>) = 0
09:05:04.483017 stat("/root/.parsyncfp-TMC/fpcache/f.26", 0x603138) = -1 ENOENT (No such file or directory)
09:05:04.483408 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
09:05:04.483648 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
09:05:04.483803 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
09:05:04.483928 nanosleep({2, 0}, 0x7fffffffbfe0) = 0
09:05:06.484248 stat("/root/.parsyncfp-TMC/fpcache/f.26", 0x603138) = -1 ENOENT (No such file or directory)
09:05:06.484480 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
09:05:06.484641 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
09:05:06.484823 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
09:05:06.484931 nanosleep({2, 0}, 0x7fffffffbfe0) = 0
09:05:08.485355 stat("/root/.parsyncfp-TMC/fpcache/f.26", 0x603138) = -1 ENOENT (No such file or directory)
09:05:08.485584 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
09:05:08.485735 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
09:05:08.485849 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
09:05:08.485963 nanosleep({2, 0},
strace: Process 2858510 detached

So I said, OK, maybe it's waiting for the generation of the parts. Then I checked the fpart processes and logs:

[root@filler002 fpcache]# pwd
/root/.parsyncfp-TMC/fpcache
[root@filler002 fpcache]# ps aux | grep -i fpart
root 2992692 0.0 0.0 112716 984 pts/1 S+ 09:05 0:00 grep --color=auto -i fpart
[root@filler002 fpcache]# ls -lah
total 35M
drwxr-xr-x 2 root root 4.0K Nov 11 14:05 .
drwxr-xr-x 3 root root 4.0K Nov 11 14:05 ..
-rw-r--r-- 1 root root 351K Nov 11 13:54 f.0
-rw-r--r-- 1 root root 930K Nov 11 13:54 f.1
-rw-r--r-- 1 root root 1.1M Nov 11 13:58 f.10
-rw-r--r-- 1 root root 2.4M Nov 11 13:59 f.11
-rw-r--r-- 1 root root 2.0M Nov 11 13:59 f.12
-rw-r--r-- 1 root root 1.6M Nov 11 13:59 f.13
-rw-r--r-- 1 root root 2.7M Nov 11 13:59 f.14
-rw-r--r-- 1 root root 1.9M Nov 11 14:00 f.15
-rw-r--r-- 1 root root 1.2M Nov 11 14:00 f.16
-rw-r--r-- 1 root root 3.0M Nov 11 14:00 f.17
-rw-r--r-- 1 root root 1.8M Nov 11 14:00 f.18
-rw-r--r-- 1 root root 803K Nov 11 14:01 f.19
-rw-r--r-- 1 root root 12K Nov 11 13:55 f.2
-rw-r--r-- 1 root root 1.9M Nov 11 14:02 f.20
-rw-r--r-- 1 root root 567K Nov 11 14:02 f.21
-rw-r--r-- 1 root root 1.8M Nov 11 14:03 f.22
-rw-r--r-- 1 root root 2.7M Nov 11 14:03 f.23
-rw-r--r-- 1 root root 1.1M Nov 11 14:04 f.24
-rw-r--r-- 1 root root 1.6M Nov 11 14:05 f.25
-rw-r--r-- 1 root root 1.2M Nov 11 13:56 f.3
-rw-r--r-- 1 root root 493K Nov 11 13:57 f.4
-rw-r--r-- 1 root root 721K Nov 11 13:57 f.5
-rw-r--r-- 1 root root 14K Nov 11 13:57 f.6
-rw-r--r-- 1 root root 64K Nov 11 13:58 f.7
-rw-r--r-- 1 root root 452K Nov 11 13:58 f.8
-rw-r--r-- 1 root root 2.6M Nov 11 13:58 f.9
-rw-r--r-- 1 root root 1.4K Nov 11 14:06 fpart.log.13.54.12_2019-11-11
-rw-r--r-- 1 root root 8 Nov 11 13:54 FP_PIDFILE13.54.12_2019-11-11
-rw-r--r-- 1 root root 208 Nov 11 14:05 rsync-PIDs-13.54.12_2019-11-11

[root@filler002 fpcache]# cat fpart.log.13.54.12_2019-11-11
Examining filesystem...
Filled part #0: size = 488118230349, 119714 file(s)
Filled part #1: size = 488126329044, 117869 file(s)
Filled part #2: size = 488119026333, 329636 file(s)
Filled part #3: size = 488195328885, 189742 file(s)
Filled part #4: size = 489718855986, 126551 file(s)
Filled part #5: size = 523602512613, 92303 file(s)
Filled part #6: size = 507470117001, 26582 file(s)
Filled part #7: size = 564502186104, 115597 file(s)
Filled part #8: size = 552879767274, 66996 file(s)
Filled part #9: size = 488970973281, 39277 file(s)
Filled part #10: size = 506558236575, 120349 file(s)
Filled part #11: size = 488124669226, 60410 file(s)
Filled part #12: size = 507594913183, 64456 file(s)
Filled part #13: size = 488121789469, 71924 file(s)
Filled part #14: size = 488117693970, 65408 file(s)
Filled part #15: size = 546032339031, 66658 file(s)
Filled part #16: size = 493759970422, 73142 file(s)
Filled part #17: size = 489034017302, 154668 file(s)
Filled part #18: size = 492438162214, 178504 file(s)
Filled part #19: size = 488127491874, 109397 file(s)
Filled part #20: size = 488210804913, 136003 file(s)
Filled part #21: size = 488125298794, 98261 file(s)
Filled part #22: size = 505097570795, 113938 file(s)
Filled part #23: size = 488117458189, 227673 file(s)
Filled part #24: size = 577134074354, 234131 file(s)
Filled part #25: size = 386946098762, 157991 file(s)
3157180 file(s) found.

It looks like there's a mismatch here: parsyncfp is waiting for f.26, but fpart never created it (the last part it filled is f.25). Have you ever faced this kind of situation before?

Something about our environment:

[root@filler002 fpcache]# ps aux | grep -i parsyncfp
root 2858510 0.0 0.0 137324 6100 ? S Nov11 0:04 perl /usr/bin/parsyncfp --verbose=2 --nowait --NP=8 --ro -aAX --chunksize=-488116876476 --altcache=/root/.parsyncfp-TMC --startdir=/mnt/beegfs/data TMC /mnt/cephfs/backup/data
root 2992869 0.0 0.0 112716 984 pts/1 S+ 09:06 0:00 grep --color=auto -i parsyncfp
[root@filler002 fpcache]# uname -r
3.10.0-1062.4.1.el7.x86_64
[root@filler002 fpcache]# cat /etc/redhat-release
CentOS Linux release 7.7.1908 (Core)
[root@filler002 fpcache]# parsyncfp -V
Option v requires an argument

parsyncfp version 1.60 (Dullnig) June 24, 2019 by Harry Mangalam hjmangalam@gmail.com

How do you suggest we move forward from here?

Thank you in advance!

hjmangalam commented 4 years ago

Hmm - that's a steaming big chunk of data. For my own info, what kind of network are you using and what kind of bandwidth are you seeing? And is this the first time you've used it or is this an unexpected glitch in an otherwise routine use of it?

Also, thank you for providing such a nicely informative question. As someone who spent most of his life answering poorly worded questions, I appreciate it.

There are a few reasons this might be the case.

1 - parsyncfp (pfp) does check whether it has overrun fpart and will wait for it - you may have seen the message indicating this. It's often seen on mostly-rsync'ed transfers because there are NP rsync processes waiting to consume the output of a single fpart. There could be a bug in this check, but I haven't seen it yet.

2 - if it's an active filesystem and something happened to the files during the fpart recursive descent, fpart may have gotten confused and coughed up a digital hairball instead of the final chunkfile - i.e., the files that were supposed to go into the final chunkfile were deleted or moved while it was being prepped. I have seen one case where this (supposedly) happened. This isn't a bug per se, but a result of file jitter.

3 - there was an off-by-one error in an older version of pfp where this might have been the case. You're using version 1.60; I have it up to 1.61.

Just in case, I'm including the latest version I have. Try that and see what happens.

Because it's rsync-based, you shouldn't have lost any time except for the (long) fpart recursion; i.e., it should be 25/26 finished.

Best harry

ADiEmme commented 4 years ago

Hi Harry,

Thank you for your fast reply! :)

Getting back to the issue, I checked the logs, but I don't see anything wrong:

[root@bezavrdat-master01 logs]# tac parsync-60784

(I used tac instead of cat, so it's backwards; the log file was so big.) You will see that in the end it completed - that's because I did a touch f.x so it would quit the wait loop.

Regarding the chunk size: maybe some advice would be good. We have ~400 TB of data to move, so we set the chunk size to 400 TB / 1000 (the number of parts). Do you think that's the right approach? At 2k parts we get the warning, so better not to risk it?
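(For reference, the back-of-the-envelope arithmetic behind that choice - a rough sketch only, assuming decimal TB and that --chunksize is given as a plain byte count, as in the parsyncfp command line in my first post; the exact value depends on whether "TB" is decimal or binary:)

perl -e '
  # hypothetical numbers: ~440 TB total, split into ~1000 parts
  my $total_bytes = 440e12;
  my $parts       = 1000;
  my $chunk       = $total_bytes / $parts;
  printf "--chunksize ~ %.0f bytes (~%.0f GB per part)\n", $chunk, $chunk / 1e9;
'
# prints: --chunksize ~ 440000000000 bytes (~440 GB per part)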

Another thing: we actually have both shares mounted locally, so we don't use the network at all.

Regarding your point 2 (fpart getting confused because files were deleted or moved during its recursive descent): what can we do in that case?

Meanwhile, I am trying another run using version 1.61 - hope this works!

ADiEmme commented 4 years ago

Short update: I just restarted using 1.61 and the outcome didn't change. But changing the chunk size to ~440 TB / 2000 instead of 1000 seems to have improved the situation. Any idea why this could happen?

hjmangalam commented 4 years ago

The size and number of chunk files should be in a range that allows pfp to start up fast and keep going, but not so many that you end up with a bazillion chunks and huge logs. It depends on the size of your data and the number of files; the count also has implications for the number of stat()s that have to be done to track the chunk files, so that argues for a lower number as well.

Obviously, if you define only a few chunk files for a large data source, each chunk will take a long time for fpart to fill, leading to delayed startup.

So I somewhat arbitrarily defined 2K as the limit for reasons of efficiency in scanning the single dir.

But changing the # of chunks to 2K from 1K (if that's what you meant) should not have made a big change.

Harry Mangalam, Info[1]


[1] http://moo.nac.uci.edu/~hjm/hjm.sig.html

hjmangalam commented 4 years ago

So your job (using 1.61) still hung at the end of the run? That's very odd. I'm traveling and short of time today, but after I get home, I'd like to take this up again.

Harry Mangalam, Info[1]


[1] http://moo.nac.uci.edu/~hjm/hjm.sig.html

novosirj commented 4 years ago

I'm seeing something a little similar, and it appears to have only started in the last couple of weeks. I can't recall if I upgraded or not recently; I should have kept track. Here's my fpart.log:

[root@quorum03 fpcache]# cat fpart.log.23.01.14_2020-04-21
Examining filesystem...
Filled part #0: size = 11033969843194, 137378 file(s)
Filled part #1: size = 11010341095020, 96042 file(s)
Filled part #2: size = 10995456710513, 99478 file(s)
Filled part #3: size = 10997042249668, 57445 file(s)
Filled part #4: size = 10996205514131, 83592 file(s)
Filled part #5: size = 11235699669681, 72848 file(s)
Filled part #6: size = 10999193108455, 78821 file(s)
Filled part #7: size = 10999539432427, 35181 file(s)
Filled part #8: size = 11006485913550, 81898 file(s)
Filled part #9: size = 10995307950334, 109058 file(s)
Filled part #10: size = 11003985383714, 151601 file(s)
Filled part #11: size = 11123773319829, 99136 file(s)
Filled part #12: size = 10999398939451, 63104 file(s)
Filled part #13: size = 10997135371653, 47733 file(s)
Filled part #14: size = 10995837809948, 111756 file(s)
error parsing input values:
Filled part #15: size = 7276234302256, 105608 file(s)
1430679 file(s) found.

I'm not sure I checked this file the previous times, so I can't tell you if the problem correlates with the above error. It seems to me I've seen the latter without the former, but I'm not sure.

Then I'm seeing this in the parsyncfp output:

        | Elapsed |   1m   |    [   ens6]   MB/s  | Running || Susp'd  |      Chunks       [2020-04-21]
  Time  | time(m) |  Load  |     TCP / RDMA  out  |   PIDs  ||  PIDs   | [UpTo] of [ToDo]
05.34.05   422.72     1.72       5.00 / 0.00             3    <>   0          [16] of [16]
 INFO: Waiting [31502]s for next chunkfile..

In my experience, it will never finish waiting, and will increment the time forever. I've been just killing off the parsyncfp script at the end when there are no more rsyncs, but obviously it would be better to avoid that.

hjmangalam commented 4 years ago

Hi Ryan, apologies for the delay. I think that error is coming from fpart, not parsyncfp, and as such I'm unfamiliar with the conditions that generated it. I'll get in touch with Ganael (the fpart author) and see what he says.

There are filename conditions that will cause even rsync to hack up a hairball occasionally, so it's probable that a similar situation could cause fpart to stutter. The fact that it started a little while ago and keeps repeating seems to imply that you have a file or dir somewhere that's causing fpart to die on it. The log entry you included (thank you):

Filled part #13: size = 10997135371653, 47733 file(s)
Filled part #14: size = 10995837809948, 111756 file(s)
error parsing input values:                              <<<<<
Filled part #15: size = 7276234302256, 105608 file(s)
1430679 file(s) found.

indicates a problem at the '<<<<<' line, but I'll have to debug with Ganael to see what could be causing it. The log does imply that fpart continued to run and fill another chunkfile correctly, but I'm not sure whether the errors are being emitted and captured synchronously. The last chunkfile that filled correctly (14 or 15 in the log above) should give an indication of which dir was being traversed when it died. Can you examine that dir and see if there's an oddball filename or damaged file lurking there?
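For example, here's a quick, generic Perl scan for suspicious names (control characters, leading/trailing whitespace, non-ASCII bytes) - /path/to/suspect/dir is just a placeholder for the dir implicated above, and this is only a sketch, not something parsyncfp or fpart does itself:

perl -MFile::Find -e '
  # walk the tree and print any entry whose basename looks odd
  find(sub {
      print "$File::Find::name\n"
        if /[\x00-\x1f\x7f]/        # control characters (incl. newline, tab)
        or /^\s/ or /\s$/           # leading or trailing whitespace
        or /[^\x00-\x7f]/;          # non-ASCII bytes
  }, $ARGV[0]);
' /path/to/suspect/dir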

Best harry

novosirj commented 4 years ago

Thanks, Harry -- I hadn't considered that a new file could be causing the problem. It really does seem to be parsyncfp that is having the problem, though. This counter keeps incrementing:

INFO: Waiting [239372]s for next chunkfile..

...and fpart is no longer running:

[root@quorum03 fpcache]# ps -ef | grep fpart
root     25927  4342  0 00:39 pts/1    00:00:00 grep --color=auto fpart

...or am I misunderstanding what you were saying?

hjmangalam commented 4 years ago

Hmm - well, since pfp launched fpart, pfp should deal with fpart's unexpected death smoothly. It probably does not; I'll look into that. I thought it checked on every loop that the fpart PID was still running, but I might have buggered that check. It's too late for coding for a geezer like me - I'll check after morning coffee. Again, thanks for using it and especially for taking the time to write up the bug. Best, Harry

novosirj commented 4 years ago

Thank you! We use this software a lot (Bill Abbott is a colleague of mine -- I know you and he worked on reading from a list vs. walking the FS).

hjmangalam commented 4 years ago

Some additional info, post-coffee, from meandering through the parsyncfp code. That pfp warning:

INFO: Waiting [31502]s for next chunkfile..

is only emitted if the specific fpart PID is still running:

$FPART_RUNNING = `ps ux | grep fpar[t] | grep $FPART_PID | wc -l`; chomp $FPART_RUNNING;
while (! -e $CUR_FP_FLE && $FPART_RUNNING eq '1'){
    if ($VERBOSE >= 2) {INFO("Waiting [$cfw]s for next chunkfile..\r"); sleep 2; $cfw += 2;}
}

The next time you run pfp, you should be able to verify this by 'ps'. If that's the case, it may be an fpart-specific issue - fpart still running, but hung for some problem. You could check this by doing a strace on the fpart PID.

Also, checking the fpart code, that error is supposed to name the file that it's having trouble with, but in your case, there's nothing, suggesting that possibly your file mistakenly got named as whitespace (space, tab, newline, etc). Is that a possibility?

hjm

novosirj commented 4 years ago

I'm still running the copy that I wrote about (this is a large transfer), so I can check whatever else you want without waiting until next time. There are still rsyncs running, so it doesn't hold me up any. You might have seen that I mentioned above that nothing even containing "fpart" is running. I assume $FPART_PID can also be found in the file called FP_PIDFILE*?

If so:

total 129464
drwxr-xr-x 2 root root      300 Apr 21 23:42 .
drwxr-xr-x 3 root root     4096 Apr 21 23:42 ..
-rw-r--r-- 1 root root 12754535 Apr 21 23:01 f.0
-rw-r--r-- 1 root root  8882982 Apr 21 23:01 f.1
-rw-r--r-- 1 root root 13988256 Apr 21 23:01 f.10
-rw-r--r-- 1 root root  9335152 Apr 21 23:01 f.11
-rw-r--r-- 1 root root  5728710 Apr 21 23:42 f.12
-rw-r--r-- 1 root root  4113466 Apr 21 23:42 f.13
-rw-r--r-- 1 root root 10384947 Apr 21 23:42 f.14
-rw-r--r-- 1 root root  9944264 Apr 21 23:42 f.15
-rw-r--r-- 1 root root  9261208 Apr 21 23:01 f.2
-rw-r--r-- 1 root root  5349997 Apr 21 23:01 f.3
-rw-r--r-- 1 root root  7775218 Apr 21 23:01 f.4
-rw-r--r-- 1 root root  6749460 Apr 21 23:01 f.5
-rw-r--r-- 1 root root  7301965 Apr 21 23:01 f.6
-rw-r--r-- 1 root root  3237238 Apr 21 23:01 f.7
-rw-r--r-- 1 root root  7605602 Apr 21 23:01 f.8
-rw-r--r-- 1 root root 10107389 Apr 21 23:01 f.9
-rw-r--r-- 1 root root      935 Apr 21 23:01 fpart.log.23.01.14_2020-04-21
-rw-r--r-- 1 root root        5 Apr 21 23:01 FP_PIDFILE23.01.14_2020-04-21
-rw-r--r-- 1 root root       84 Apr 21 23:42 rsync-PIDs-23.01.14_2020-04-21
[root@quorum03 fpcache]# ps ux | grep fpar[t] | grep $(cat FP_PIDFILE23.01.14_2020-04-21) | wc -l
0
[root@quorum03 fpcache]# ps aux | grep fpar
root     15263  0.0  0.0 112708   976 pts/1    S+   18:57   0:00 grep --color=auto fpar
[root@quorum03 fpcache]# ps aux | grep fpar[t]
[root@quorum03 fpcache]#

Just for fun, I looked at what parsyncfp is doing:

strace: Process 5439 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
stat("/root/.parsyncfp-backups/fpcache/f.16", 0x25f5138) = -1 ENOENT (No such file or directory)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({2, 0}, 0x7ffee5b72460)       = 0
stat("/root/.parsyncfp-backups/fpcache/f.16", 0x25f5138) = -1 ENOENT (No such file or directory)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({2, 0}, 0x7ffee5b72460)       = 0
stat("/root/.parsyncfp-backups/fpcache/f.16", 0x25f5138) = -1 ENOENT (No such file or directory)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0

Looks like about what you'd expect, but why it thinks there's going to be an f.16 at this point, I don't know.

hjmangalam commented 4 years ago

I think I found the reason you're seeing that bug - the fpart PID status wasn't refreshed during the test, allowing the old PID to remain 'valid' indefinitely, hence the long waiting period.
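(A minimal sketch of that kind of fix, re-using the variable names from the excerpt I quoted earlier - this is an illustration, not the actual 1.64 code:)

while (! -e $CUR_FP_FLE) {
    # re-check on every pass whether the fpart PID is still alive,
    # instead of testing a value computed once before the loop
    my $fpart_running = `ps ux | grep fpar[t] | grep $FPART_PID | wc -l`;
    chomp $fpart_running;
    last if $fpart_running ne '1';   # fpart has exited: no more chunkfiles are coming
    if ($VERBOSE >= 2) { INFO("Waiting [$cfw]s for next chunkfile..\r"); }
    sleep 2;
    $cfw += 2;
}

(Perl's kill(0, $FPART_PID) would be a cheaper liveness test than shelling out to ps, though it can't tell a re-used PID from the original process.)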

Try the new version (1.64 / Covid) to see if it solves your problem. It is odd that you are the only person to report this - I've never seen this problem before, so it may be a function of the particular set of files and dirs that you have that allowed this bug to manifest. Let me know.. Harry

ADiEmme commented 4 years ago

Hi all,

Sorry to chime in, but I think this might (maybe) help. We've been seeing the same behavior when copying many TB, where most of the data consisted of very many small files. I think that, for some reason, this breaks fpart.

I know this isn't much, but it's all I remember for now :)

Wish you all a pleasant day!

Regards, Andrea

hjmangalam commented 4 years ago

Hi Andrea, Thanks very much for chiming in. (In reading back thru the thread, I see that I neglected to answer some of your points, which may have addressed this bug as well - I'll take another look at them today or tomorrow).

In talking to Ganael about the possible fpart bug, I think it's more likely a bug in the pfp code - specifically the one I referred to yesterday that could introduce a 'wait-forever' bug. I think I fixed it, but because it tends to show up on very large transfers, I'm still trying to test it in situations that would otherwise fail. If you want to try the latest version on your large transfers, that would help as well. The only significant change was that fix. hjm

hjmangalam commented 4 years ago

Hi All,

Pretty sure (touch wood) that this last set of changes has addressed the 'failure to terminate' bug. I've tested it on some multi-TB, 8M-file dirs and it seems to be behaving itself. In examining the code, I've also identified a number of changes that will speed up the start and, especially, increase the speed of syncing dirs that have already been substantially rsync'ed. Those will require some significant changes, so I've not included them in the current 1.65.

Please give it a try and see if this fixes the bugs you identified. Again, thanks for your help in making it better. Harry

novosirj commented 4 years ago

I haven't had a chance to try the newer version yet, which I should be able to do for tomorrow's backup.

FYI, this week running 1.61, all three campuses eventually wound up in the endless "waiting for chunkfile" state:

02.40.06   6569.20     1.03      42.09 / 0.00            10    <>   0          [43] of [43]
03.24.38   6613.73     1.59      28.85 / 0.00            10    <>   0          [43] of [43]
04.09.12   6658.30     1.15      48.02 / 0.00            10    <>   0          [43] of [43]
04.53.54   6703.00     1.74      40.93 / 0.00            10    <>   0          [43] of [43]
05.38.39   6747.75     1.58      30.90 / 0.00            10    <>   0          [43] of [43]
06.23.00   6792.10     1.15      28.70 / 0.00            10    <>   0          [43] of [43]

==> /var/log/asb-parsyncfp-20200429.err <==
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]

==> /var/log/asb-parsyncfp-20200429.log <==
 INFO: Waiting [16872]s for next chunkfile..

As you can see, this one ran for a really long time before that happened. I think that this problem appeared between 1.55 and 1.61, but it's hard to be sure, since the content could have changed instead and caused the problem (maybe a little less likely given that it happened on all campuses).