hjmangalam / parsyncfp

follow-on to parsync (parallel rsync) with better startup perf

A couple of possibly related problems with parsyncfp 1.52 #11

Open novosirj opened 6 years ago

novosirj commented 6 years ago

Have noticed a little bit of strange behavior recently with parsyncfp 1.52:

  1. We use parsyncfp on a very large filesystem (585TB) and run the following command:

/usr/local/bin/parsyncfp -NP 4 --chunksize=100G --rsyncopts '--stats -e "ssh -T -c arcfour -o Compression=no -x"' --maxbw=1250000 --startdir '/gpfs/home/.snapshots' home nas1:/zdata/gss

File sizes on this FS vary widely, and this arrangement typically yields in the neighborhood of 2500 chunks, containing anywhere between 2 and 675347 files apiece. However, I've noticed that the number of chunks reported by parsyncfp is larger than the number of chunks that actually exist, at least during this run and a number of runs before it. From the output:

[root@quorum fpcache]# grep -B23 "INFO: Done" /var/log/asb.log
        | Elapsed |   1m   |    MB/s   | Running || Susp'd  |      Chunks       [2018-06-21]
  Time  | time(m) |  Load  | [  bond0] |   PIDs  ||  PIDs   | [UpTo] of [ToDo]
16.41.47   3037.82     0.11       0.00        1    <>   0          [2472] of [2472]
16.41.51   3037.88     0.11       0.00        1    <>   0          [2472] of [2472]
16.41.54   3037.93     0.10       0.00        1    <>   0          [2472] of [2472]
16.41.57   3037.98     0.09       0.00        1    <>   0          [2472] of [2472]
16.42.00   3038.03     0.09       0.00        1    <>   0          [2472] of [2472]
16.42.03   3038.08     0.08       0.00        1    <>   0          [2472] of [2472]
16.42.06   3038.13     0.08       0.00        1    <>   0          [2472] of [2472]
16.42.10   3038.20     0.08       0.00        1    <>   0          [2472] of [2472]
16.42.13   3038.25     0.07       0.00        1    <>   0          [2472] of [2472]
16.42.16   3038.30     0.07       0.00        1    <>   0          [2472] of [2472]
16.42.19   3038.35     0.06       0.00        1    <>   0          [2472] of [2472]
16.42.22   3038.40     0.06       0.00        1    <>   0          [2472] of [2472]
16.42.25   3038.45     0.06       0.00        1    <>   0          [2472] of [2472]
16.42.29   3038.52     0.05       0.00        1    <>   0          [2472] of [2472]
16.42.32   3038.57     0.05       0.00        1    <>   0          [2472] of [2472]
16.42.35   3038.62     0.05       0.00        1    <>   0          [2472] of [2472]
16.42.38   3038.67     0.05       0.00        1    <>   0          [2472] of [2472]
16.42.41   3038.72     0.05       0.00        1    <>   0          [2472] of [2472]
16.42.44   3038.77     0.04       0.00        1    <>   0          [2472] of [2472]
16.42.48   3038.83     0.04       0.00        0    <>   0          [2472] of [2472]

INFO: Done.  Please check the target to make sure 

...but then the fpart.log says there aren't that many:

[root@quorum fpcache]# tail -5 fpart.log.13.49.48_2018-06-19 
Filled part #2468: size = 107551197313, 231318 file(s)
Filled part #2469: size = 107514041195, 393 file(s)
Filled part #2470: size = 117863687441, 853224 file(s)
Filled part #2471: size = 50532682569, 132844 file(s)
141925735 file(s) found.
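
As a rough cross-check (just a sketch -- it assumes the chunk files fpart wrote are the f.<N> files in the fpcache directory, so adjust paths if your layout differs), the two counts can be compared directly:

# Sketch only: compare the on-disk chunk count against the fpart.log part count.
# Assumes the fpart chunk files are named f.<N> in the current fpcache dir.
echo "chunk files on disk: $(ls f.* 2>/dev/null | wc -l)"
echo "parts in fpart.log:  $(grep -c '^Filled part' fpart.log.13.49.48_2018-06-19)"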

In at least one instance, I've found that it was actually an earlier chunk that it was still working on -- for example, I discovered that one chunk had a 95T sparse file in it (chunk #2462), but the final output before I terminated the rsync said:

         | Elapsed |   1m   |    MB/s   | Running || Susp'd  |      Chunks       [2018-06-07]
   Time  | time(m) |  Load  | [  bond0] |   PIDs  ||  PIDs   | [UpTo] of [ToDo]
16.44.37   3219.03     1.33       0.00        1    <>   0          [2483] of [2483]

INFO: Done.  Please check the target to make sure
  2. Today, parsyncfp appears to have been running for most of the day -- since 09:06 this morning -- without there actually being any rsync processes associated with it:
        | Elapsed |   1m   |    MB/s   | Running || Susp'd  |      Chunks       [2018-06-21]
  Time  | time(m) |  Load  | [  bond0] |   PIDs  ||  PIDs   | [UpTo] of [ToDo]
09.06.34   2582.60     0.39     101.50        2    <>   0          [2472] of [2472]
09.06.37   2582.65     0.44     207.66        2    <>   0          [2472] of [2472]
09.06.40   2582.70     0.44     232.15        2    <>   0          [2472] of [2472]
09.06.43   2582.75     0.49     138.49        2    <>   0          [2472] of [2472]

Number of files: 873431
Number of files transferred: 266
Total file size: 117863698434 bytes
Total transferred file size: 975369047 bytes
Literal data: 847069151 bytes
Matched data: 128299896 bytes
File list size: 17157260
File list generation time: 1301.935 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 864491626
Total bytes received: 388752

sent 864491626 bytes  received 388752 bytes  618434.31 bytes/sec
total size is 117863698434  speedup is 136.28
09.06.47   2582.82     0.53      25.52        2    <>   0          [2472] of [2472]
09.06.50   2582.87     0.53       0.00        1    <>   0          [2472] of [2472]
09.06.53   2582.92     0.49       0.00        1    <>   0          [2472] of [2472]
09.06.56   2582.97     0.49       0.00        1    <>   0          [2472] of [2472]
09.06.59   2583.02     0.45       0.00        1    <>   0          [2472] of [2472]
09.07.02   2583.07     0.41       0.00        1    <>   0          [2472] of [2472]
09.07.06   2583.13     0.41       0.00        1    <>   0          [2472] of [2472]
09.07.09   2583.18     0.38       0.00        1    <>   0          [2472] of [2472]
09.07.12   2583.23     0.35       0.00        1    <>   0          [2472] of [2472]
09.07.15   2583.28     0.35       0.00        1    <>   0          [2472] of [2472]
09.07.18   2583.33     0.32       0.00        1    <>   0          [2472] of [2472]
09.07.21   2583.38     0.37       0.00        1    <>   0          [2472] of [2472]
09.07.25   2583.45     0.37       0.00        1    <>   0          [2472] of [2472]
09.07.28   2583.50     0.50       0.00        1    <>   0          [2472] of [2472]
09.07.31   2583.55     0.50       0.00        1    <>   0          [2472] of [2472]
09.07.34   2583.60     0.46       0.00        1    <>   0          [2472] of [2472]
09.07.37   2583.65     0.43       0.00        1    <>   0          [2472] of [2472]

...

16.42.38   3038.67     0.05       0.00        1    <>   0          [2472] of [2472]
16.42.41   3038.72     0.05       0.00        1    <>   0          [2472] of [2472]
16.42.44   3038.77     0.04       0.00        1    <>   0          [2472] of [2472]
16.42.48   3038.83     0.04       0.00        0    <>   0          [2472] of [2472] 

I started looking into this sometime around 16:40 and there were no rsyncs running, despite what this output said. I'm not entirely sure what caused it to stop at 16:42. I looked through the log for the period during that chunk when only 1 rsync was reported running; only 3 lines show above 0.00 MB/s:

[root@quorum fpcache]# grep "\\[2472\\] of \\[2472\\]" /var/log/asb.log | grep "1    <>" | sort -n -k4 | tail -5
16.42.41   3038.72     0.05       0.00        1    <>   0          [2472] of [2472]
16.42.44   3038.77     0.04       0.00        1    <>   0          [2472] of [2472]
12.57.57   2813.98     0.09       0.04        1    <>   0          [2472] of [2472]
12.58.00   2814.03     0.09       0.04        1    <>   0          [2472] of [2472]
12.13.13   2769.25     0.09       0.25        1    <>   0          [2472] of [2472]

Am I onto something here -- either two separate problems, or one underlying issue with a couple of symptoms?

Just want to end by saying thank you for your great tool!

novosirj commented 6 years ago

I'm experiencing the problem right now: it sits around waiting for the last rsync to exit even though not a single one is running, and it also reports that it's working on a chunk that doesn't exist:

[root@quorum ~]# ps -ef | grep rsync
root      6906  6904  3 Jul18 pts/0    01:23:03 perl /usr/local/bin/parsyncfp -NP 4 --chunksize=100G --rsyncopts --stats -e "ssh -T -c arcfour -o Compression=no -x" --maxbw=1250000 --startdir /gpfs/home/.snapshots home nas1:/zdata/gss
root     22357  9236  0 22:32 pts/6    00:00:00 grep --color=auto rsync
[root@quorum ~]# 

Any idea? What additional info might be helpful?

novosirj commented 6 years ago

Looks like maybe a couple of PIDs in the rsync PID list either got reused or were incorrectly picked up for inclusion in the list?

[root@quorum ~]# cat rsync-PIDs-02.30.22_2018-07-18 | while read pid; do ps -fp $pid | grep -v UID.*PID ; done
root     24309   930  0 22:43 ?        00:00:00 sleep 60
root     20472     2  0 21:46 ?        00:00:00 [kworker/8:2]
root     24848     2  0 22:11 ?        00:00:00 [kworker/9:1]
[root@quorum ~]# ps -fp 930
UID        PID  PPID  C STIME TTY          TIME CMD
root       930     1  0 Jun27 ?        00:00:36 /bin/bash /usr/sbin/ksmtuned

Looks more like the former. A few minutes later, the list has changed some:

[root@quorum ~]# cat rsync-PIDs-02.30.22_2018-07-18 | while read pid; do ps -fp $pid | grep -v UID.*PID ; done
root      9954     2  0 22:47 ?        00:00:00 [kworker/1:1]
root     20472     2  0 21:46 ?        00:00:00 [kworker/8:2]
root     24848     2  0 22:11 ?        00:00:00 [kworker/9:1]
hjmangalam commented 6 years ago

On Thursday, July 19, 2018 7:45:38 PM PDT novosirj wrote:

Looks like maybe a couple of PIDs in the rsync PID list either got reused or were incorrectly picked up for inclusion in the list?


[root@quorum ~]# cat rsync-PIDs-02.30.22_2018-07-18 | while read pid; do ps -fp $pid | grep -v UID.*PID ; done
root     24309   930  0 22:43 ?        00:00:00 sleep 60
root     20472     2  0 21:46 ?        00:00:00 [kworker/8:2]
root     24848     2  0 22:11 ?        00:00:00 [kworker/9:1]
[root@quorum ~]# ps -fp 930
UID        PID  PPID  C STIME TTY          TIME CMD
root       930     1  0 Jun27 ?        00:00:36 /bin/bash /usr/sbin/ksmtuned

that's certainly a possibility, especially if the pfp run lasts long enough for PID numbers to wrap around. The OS keeps track of reused PID numbers, but pfp could get confused. I've noticed this a few times myself, but was never able to pin it on anything.

However, I'll check this more carefully and add a db structure to track the PIDs more accurately.
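
Roughly what I have in mind (just a sketch of the idea, not actual pfp code) is to record the kernel start time next to each PID, so a recycled PID no longer looks like a live rsync:

# Sketch only, not parsyncfp code.  Record each rsync's PID together with its
# start time (field 22 of /proc/<pid>/stat); a reused PID gets a different
# start time, so it fails the "still running" test below.  (The field counting
# assumes the command name in /proc/<pid>/stat contains no spaces.)
start_time() { awk '{print $22}' "/proc/$1/stat" 2>/dev/null; }

# at launch:   rsync ... &  pid=$!
#              echo "$pid $(start_time "$pid")" >> rsync-PIDs-with-starttime

# later, count only entries that are still the same process and still rsync:
while read -r pid st; do
    [ "$(start_time "$pid")" = "$st" ] &&
    [ "$(cat "/proc/$pid/comm" 2>/dev/null)" = "rsync" ] &&
    echo "$pid still running"
done < rsync-PIDs-with-starttime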

thanks for this new idea. hjm

Harry Mangalam, Info[1]


[1] http://moo.nac.uci.edu/~hjm/hjm.sig.html

novosirj commented 6 years ago

I've still got this condition going on right now, so if there's any more data that would be helpful for you to have, I can provide it while it's still stuck. The system always seems to manage to start a new process that collides with the PID list.

If there's no information you can use, any tips on how to get it to successfully complete? :-D

hjmangalam commented 6 years ago

On Friday, July 20, 2018 1:58:41 PM PDT novosirj wrote:

I've still got this condition going on right now, so if there's any more data that would be helpful for you to have, I can provide it while it's still stuck. The system always seems to manage to start a new process that collides with the PID list.

If there's no information you can use, any tips on how to get it to successfully complete? :-D

Not really at this point - I'll need to generate a full set of files myself (but luckily I have lots of test cases to pull from). The only way to force a finish is the horrible one: kill pfp and start an rsync to finish the job. If the sync has mostly gone to completion, it should be pretty fast.

You could also try fpsync to clean up after pfp - Ganael would love that ;)

Best to you (and Ganael) Harry

Harry Mangalam, Info[1]


[1] http://moo.nac.uci.edu/~hjm/hjm.sig.html

novosirj commented 6 years ago

In our case, it's really no big deal. Since the transfer actually completed and it's acting on a snapshot, it makes no real difference whether parsyncfp itself exits cleanly (in fact, looking at the logfile now after I killed it, it's hard to tell the difference), and you really need to run an rsync --delete outside of parsyncfp to pick up deletions anyway.
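
For concreteness, that cleanup pass is just a plain rsync with --delete over the same tree; a rough sketch (the source/target paths are my guess from the parsyncfp command above, so treat them as placeholders):

# Sketch only: a separate cleanup pass with --delete, run outside parsyncfp.
# The paths below are guesses based on the parsyncfp invocation earlier in
# this issue; adjust them to the real layout before using.
rsync -a --delete --stats \
      -e "ssh -T -c arcfour -o Compression=no -x" \
      /gpfs/home/.snapshots/home/ nas1:/zdata/gss/home/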

Thanks for your help and for this great tool!

novosirj commented 6 years ago

I have found a new way in which this PID situation causes problems: right now, there are only 2 rsync processes -- fewer than the allowed 4:

[root@quorum ~]# pgrep rsync
20803
20814

However, parsyncfp reports 6:

00.06.35 2176.58 2.07 0.01 6 <> 0 [2717] of [2822]

...which I assume is hurting the transfer speed (I generally allow for 4 of them).

I suspect that this is related:

[root@quorum ~]# wc -l ~/.parsyncfp-backups/fpcache/rsync-PIDs-11.36.53_2018-08-28 
2717 /root/.parsyncfp-backups/fpcache/rsync-PIDs-11.36.53_2018-08-28

I haven't read that part of the source code carefully, but is it really checking to see if any of those 2717 PIDs are running? If so, I bet that's going to be pretty frequently true.
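
For comparison, a stricter count (just a sketch against the PID file above) would only accept entries that still resolve to a live rsync:

# Sketch only: of all the PIDs ever recorded in the PID file, count just the
# ones that currently belong to a live rsync process.
PIDFILE=~/.parsyncfp-backups/fpcache/rsync-PIDs-11.36.53_2018-08-28
count=0
while read -r pid; do
    [ "$(cat "/proc/$pid/comm" 2>/dev/null)" = "rsync" ] && count=$((count+1))
done < "$PIDFILE"
echo "$count of $(wc -l < "$PIDFILE") recorded PIDs are live rsyncs"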