Open novosirj opened 6 years ago
I'm experiencing the problem right now where it sits around waiting for the last rsync to exit, but there isn't a single one running, and it also reports that it's working on a step that doesn't exist:
```
[root@quorum ~]# ps -ef | grep rsync
root 6906 6904 3 Jul18 pts/0 01:23:03 perl /usr/local/bin/parsyncfp -NP 4 --chunksize=100G --rsyncopts --stats -e "ssh -T -c arcfour -o Compression=no -x" --maxbw=1250000 --startdir /gpfs/home/.snapshots home nas1:/zdata/gss
root 22357 9236 0 22:32 pts/6 00:00:00 grep --color=auto rsync
[root@quorum ~]#
```
Any idea? What additional info might be helpful?
Looks like maybe a couple of PIDs that are in the rsync PID list either got reused or incorrectly picked up for inclusion in the rsync PID list?
```
[root@quorum ~]# cat rsync-PIDs-02.30.22_2018-07-18 | while read pid; do ps -fp $pid | grep -v UID.*PID ; done
root 24309 930 0 22:43 ? 00:00:00 sleep 60
root 20472 2 0 21:46 ? 00:00:00 [kworker/8:2]
root 24848 2 0 22:11 ? 00:00:00 [kworker/9:1]
[root@quorum ~]# ps -fp 930
UID PID PPID C STIME TTY TIME CMD
root 930 1 0 Jun27 ? 00:00:36 /bin/bash /usr/sbin/ksmtuned
```
Looks more likely to be the former. A few minutes later, the list has changed somewhat:
```
[root@quorum ~]# cat rsync-PIDs-02.30.22_2018-07-18 | while read pid; do ps -fp $pid | grep -v UID.*PID ; done
root 9954 2 0 22:47 ? 00:00:00 [kworker/1:1]
root 20472 2 0 21:46 ? 00:00:00 [kworker/8:2]
root 24848 2 0 22:11 ? 00:00:00 [kworker/9:1]
```
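The mix of kworkers and a stray `sleep` in that listing can be reproduced in miniature: a recorded PID that is merely *alive* is not necessarily still an rsync. A minimal sketch of the distinction (the PID file here is a stand-in for the fpcache `rsync-PIDs-*` file, and the command-name check is my suggestion, not parsyncfp's actual logic):

```shell
# Stand-in PID file: our own shell (alive, but not an rsync) and a PID
# that almost certainly doesn't exist at all.
PID_FILE=$(mktemp)
echo $$ > "$PID_FILE"
echo 99999999 >> "$PID_FILE"

naive=0; real=0
while read -r pid; do
    # naive test: "is SOME process with this PID alive?"
    kill -0 "$pid" 2>/dev/null && naive=$((naive + 1))
    # stricter test: "is the process occupying that PID actually an rsync?"
    [ "$(ps -p "$pid" -o comm= 2>/dev/null)" = "rsync" ] && real=$((real + 1))
done < "$PID_FILE"

echo "naive=$naive real=$real"
rm -f "$PID_FILE"
```

The naive count picks up the recycled PID; the command-name check does not.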
> On Thursday, July 19, 2018 7:45:38 PM PDT novosirj wrote:
> Looks like maybe a couple of PIDs that are in the rsync PID list either got reused or incorrectly picked up for inclusion in the rsync PID list?
That's certainly a possibility, especially if the pfp run lasts long enough for PID numbers to wrap around. The OS tracks the reuse of PID numbers, but pfp could get confused. I've noticed this a few times myself, but was never able to pin it on anything.
However, I'll check this more carefully and add a db structure to track the PIDs more accurately.
Thanks for this new idea. hjm
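One way to make that tracking robust (just a sketch of the idea, not parsyncfp code) is to record the kernel's start time next to each PID at launch: a recycled PID number will carry a different start time. Linux-only, via /proc:

```shell
# Field 22 of /proc/PID/stat is the process start time in clock ticks.
# Caveat: a comm name containing spaces would shift awk's fields; that's
# fine for a sketch, since rsync's comm has none.
start_of() { awk '{print $22}' "/proc/$1/stat" 2>/dev/null; }

pid=$$                      # stand-in for a freshly launched rsync's PID
stamp=$(start_of "$pid")    # remember (pid, starttime) at launch time

# Later: the PID still belongs to "our" process only if BOTH match.
if [ -n "$stamp" ] && [ "$(start_of "$pid")" = "$stamp" ]; then
    status=same-process
else
    status=reused-or-gone
fi
echo "$status"
```

Storing the (PID, starttime) pair instead of the bare PID makes the "is my rsync still running?" test immune to wraparound.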
Harry Mangalam, Info[1]
I've still got this condition going on right now. So if there's any more data that would be helpful for you to have, I can provide it now, as it's still stuck. It always seems to manage to start a new process that collides with the process list.
If there's no information you can use, any tips on how to get it to successfully complete? :-D
> On Friday, July 20, 2018 1:58:41 PM PDT novosirj wrote:
> I've got this condition going on right now still. So if there's any more data it would be helpful for you to have, I can provide it now as it's still stuck. Seems always to manage to start a new process to collide with the process list.
> If there's no information you can use, any tips on how to get it to successfully complete? :-D
Not really at this point - I'll need to generate a full set of files myself (though luckily I have lots of test cases to pull from). The only way to force a finish is the horrible one: kill pfp and start an rsync to finish the job. If the sync has mostly gone to completion, it should be pretty fast.
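That cleanup can be demonstrated safely with local directories standing in for the real source and the nas1 destination (assuming plain `rsync -a` semantics; the actual run would reuse the original rsync options):

```shell
# Stand-ins for /gpfs/home/.snapshots/home and nas1:/zdata/gss.
src=$(mktemp -d); dst=$(mktemp -d)
printf 'done\n'   > "$src/a"   # pretend a killed pfp chunk already copied this
printf 'missed\n' > "$src/b"   # ...but died before reaching this one
cp "$src/a" "$dst/a"           # the partial state the killed run left behind

# One plain rsync over the whole tree finishes the job; files already in
# sync are skipped, so a mostly-complete run ends quickly.
rsync -a "$src/" "$dst/"
ls "$dst"
```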
You could also try fpsync to clean up after pfp - Ganael would love that ;)
Best to you (and Ganael) Harry
In our case, it's really no big deal. Since it had actually completed, and it's acting on a snapshot, it makes no real difference whether it exits successfully (in fact, looking at the logfile now, after I killed it, it's hard to tell the difference), and you really need to run an rsync --delete outside of parsyncfp to pick up deletions anyway.
Thanks for your help and for this great tool!
I have found a new way this PID situation causes problems: right now there are only 2 rsync processes running -- fewer than the allowed 4:
```
[root@quorum ~]# pgrep rsync
20803
20814
```
However parsyncfp reports 6:
```
00.06.35 2176.58 2.07 0.01 6 <> 0 [2717] of [2822]
```
...which I assume is hurting the transfer speed (I generally allow for 4 of them).
I suspect that this is related:
```
[root@quorum ~]# wc -l ~/.parsyncfp-backups/fpcache/rsync-PIDs-11.36.53_2018-08-28
2717 /root/.parsyncfp-backups/fpcache/rsync-PIDs-11.36.53_2018-08-28
```
I haven't read that part of the source code carefully, but is it really checking to see whether any of those 2717 PIDs are running? If so, I bet that's going to be true pretty frequently.
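Instead of asking "is PID N alive?" for each of the 2717 recorded PIDs, the test could be inverted: ask the kernel which PIDs are rsyncs right now, and intersect that set with the cache file, so recycled PIDs can't inflate the count. A sketch (again with a stand-in PID file, not parsyncfp's code):

```shell
# Stand-in for fpcache/rsync-PIDs-<timestamp>: init, this shell, a bogus PID.
PID_FILE=$(mktemp)
printf '%s\n' 1 "$$" 99999999 > "$PID_FILE"

# pgrep -x matches the exact process name; grep -Fxc -f counts which of the
# currently running rsync PIDs appear verbatim in the PID file. ('|| true'
# keeps the count at 0 instead of a failure when nothing matches.)
live=$(pgrep -x rsync | grep -Fxc -f "$PID_FILE" || true)
echo "live rsyncs from this run: $live"
rm -f "$PID_FILE"
```

None of the three stand-in PIDs is a running rsync, so the count here is 0.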
Have noticed a little bit of strange behavior recently with parsyncfp 1.52:
```
/usr/local/bin/parsyncfp -NP 4 --chunksize=100G --rsyncopts '--stats -e "ssh -T -c arcfour -o Compression=no -x"' --maxbw=1250000 --startdir '/gpfs/home/.snapshots' home nas1:/zdata/gss
```
We have rather widely varying file sizes on this FS, and this arrangement typically yields in the neighborhood of 2500 chunks, containing anywhere between 2 and 675347 files apiece. However, I've noticed that the number of chunks reported by parsyncfp is larger than the number of chunks that actually exist, at least during this run and a number of runs before it. From the output:
...but then the fpart.log says there aren't that many:
In at least one instance, I've found that it was actually an earlier chunk it was still working on: for example, I discovered that one chunk had a 95T sparse file in it (chunk #2462), but the final output before I terminated the rsync said:
...
I started looking into this sometime around 16:40, and there were no rsyncs running, despite what this output said. I'm not entirely sure what caused it to stop at 16:42. I took a look through the log for the period during that chunk when only 1 rsync was running. Only 3 lines are above 0.00 MB/s:
Am I onto something here, either as two separate problems or one related issue with a couple of symptoms?
Just want to end by saying thank you for your great tool!