novosirj opened this issue 4 years ago (status: Open)
Here is STDOUT (.log.log) and STDERR (.err.log), both trimmed in the same places to the relevant portion.
One thing that does jump out at me is the "[0m" that appears in both; I'm not sure of its origin.
But you can see in the logs that the first time, parsyncfp maybe just exits way too soon (also evidenced by the fact that it spends a lot of time in that pgrep -x rsync loop), while the second time it works as expected?
asb-parsyncfp-20200429.err.log (https://github.com/hjmangalam/parsyncfp/files/4554664/asb-parsyncfp-20200429.err.log) asb-parsyncfp-20200429.log.log (https://github.com/hjmangalam/parsyncfp/files/4554665/asb-parsyncfp-20200429.log.log)
Hi Ryan, Something is indeed weird here. If you're feeding fpart a list of files, especially one that includes the sizes (and yours does, since you're using the '--trustme' option), it should be processed very quickly, as it apparently is in the second (projects) run, which adds chunks every few seconds.
In the first (cache) run, you're completing a chunk only about every minute. Are there huge differences in the distribution of file sizes between the 2 filesystems?
Here are some suggestions that might show where things are going bad:
The odd "[0m" characters you noted in both logs are probably part of the pfp terminal color-change escape sequence. I don't think that would introduce such a difference, but I can't guarantee that it couldn't. I'll keep trying to come up with an explanation as I answer your other email. hjm
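For what it's worth, "[0m" is the tail of the ANSI "reset attributes" escape sequence (ESC [ 0 m); the leading ESC byte is usually invisible when a log is opened in an editor, leaving just "[0m". A quick, generic way to confirm and strip such sequences from a captured log (a sed one-liner sketch, not anything parsyncfp-specific):

```shell
#!/usr/bin/env bash
# Simulate a log line containing the ANSI reset sequence ESC[0m,
# then strip all color/attribute escape sequences with sed.
# ($'...' is a bashism that turns \033 into a literal ESC byte.)
printf 'chunk [17] of [17] \033[0mdone\n' | sed $'s/\033\[[0-9;]*m//g'
```

The same sed expression works on a whole log file, which makes diffing the two runs' logs cleaner.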
--
Harry Mangalam
The "cache" filesystem and the "projects" filesystem likely do have much different file-size histograms. "cache" has 152970416 inodes in use, while "projects" has 115009601, and "projects" is ~2.5 times larger in disk space than cache. Lots of people have build directories for software in there, etc.; one git repo can really up the numbers (we've been thinking we need to move to inode quotas on that filesystem).
I'll have to save the .parsyncfp-backups directory or write them to a unique ID or something, since the second run clears the files of the first run. I can also do another run if it completes; it shouldn't hurt anything since it's already been synced for the week. Most of the other suggestions hinge on having the files around from both runs to compare, it seems.
Hmm - even so, that's an average of ~3.6s per chunk for projects and ~60s per chunk for cache. Re: the overwriting of the cache files, if you can afford the space, use the FILESYSTEM variable to name the altcache so the files don't get overwritten. hjm
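For anyone following along: parsyncfp's --altcache option points its cache at a directory other than the default ~/.parsyncfp. A minimal sketch of what Harry is suggesting (the FILESYSTEM variable and the path scheme are assumptions about Ryan's backup script, not parsyncfp features; check your version's options):

```shell
#!/usr/bin/env bash
# Derive a per-filesystem cache dir so the second run ("projects")
# doesn't overwrite the chunk files left by the first run ("cache").
FILESYSTEM="${FILESYSTEM:-cache}"   # hypothetical variable from the backup script
ALTCACHE="${HOME}/.parsyncfp-${FILESYSTEM}"
# Hypothetical invocation; substitute your usual flags:
echo "would run: parsyncfp --altcache=${ALTCACHE} ...other flags..."
```

With this scheme, each weekly run leaves its chunk files behind for post-mortem comparison.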
Yup; and I should have space. FYI, the file list for the "cache" filesystem is itself 20GB, while the "projects" filesystem list is 16GB. Of course that size also depends on the length of the PATH names, etc., but here are some line counts:
```
[root@quorum01 policy]# wc -l cache.list.allfiles.clean projectsp.list.allfiles.clean
164657014 cache.list.allfiles.clean
109127324 projectsp.list.allfiles.clean
273784338 total
```
OK - I found a few problems, now corrected. The worst one was that in (partially) adding another option, I broke the list processing (and I found some other minor problems along the way).
It now seems to work with lists (--fromlist), with lists with sizes (--trustme), and with the --trimpath option the way it did in the past (and perhaps with a little better error checking).
Please try the new version 1.67 and tell me how it goes. hjm
Great! Will try it tomorrow, or sooner if the current runs complete.
So far so good! Will keep you posted, but the runs that have gotten to this point in the script are running well (the list generation takes a while).
Seems like the filesystem that wasn't showing status at normal verbosity routinely has a smaller number of file-list "parts" than the number of rsyncs I have configured. Today it's 6 parts, as a result of a high chunk size and a relatively small filesystem. It may be that something else previously obscured that problem.
Any particular suggestions for the number of "parts" to ultimately shoot for? I guess too small and you risk a slow directory becoming the choke point for all of them simultaneously (a lot more likely for, say, 4 rsyncs than for 10).
I generally try to shoot for about 3-5 times the number of rsyncs. That way the rsyncs are kept busy without too much churn, and fpart can generate chunks quickly enough that you're not waiting forever to start. Are you using fpart to recurse the dirs, or are you using GPFS to generate the lists that you want to transfer? Generating via lists (esp. with the --trustme option, where you provide the sizes) should be much faster than a full recursion.
I realize that startup time is significant on large FSs and I'm still thinking about how to improve on this delay; fpart does a very good job already. The only thing that might improve this performance is multi-threading the recursion, which I believe qdirstat does. But then all those threads are stat-hammering at once, and it's not clear that would lead to a huge improvement. Harry
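To make the 3-5x guideline concrete, here is a back-of-the-envelope sketch (the 40 TiB figure is hypothetical, and parsyncfp's actual --chunksize handling may differ from this naive division):

```shell
#!/usr/bin/env bash
# Aim for ~4 chunks per rsync, per the 3-5x rule of thumb above.
NP=10                               # number of parallel rsyncs (--NP)
TOTAL_BYTES=$((40 * 1024**4))       # 40 TiB to move (hypothetical)
TARGET_CHUNKS=$((NP * 4))           # 40 chunks total
CHUNK_SIZE=$((TOTAL_BYTES / TARGET_CHUNKS))
echo "try --chunksize=${CHUNK_SIZE} for ~${TARGET_CHUNKS} chunks"
```

Too few chunks (fewer than NP, as in Ryan's 6-parts-for-more-rsyncs case) leaves rsyncs idle; far too many adds per-chunk startup overhead.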
By the way, please let me know if your pfp runs terminate correctly. They have on all my tests of the latest version, but it's always /yours/ that are the critical ones.
harry
Oh, I'm sorry that wasn't clear: the list generation happens via GPFS, so well before we run PFP -- probably 90 mins of that before we even get to PFP on the "cache" or "projects" filesystem on the one campus. But so far the runs on both "cache" and "projects" have run and displayed the transfer output, and the "cache" FS has already completed normally. None of the other ones have completed yet, but they are behaving and relatively close to the end.
I did notice that the list generation inside PFP did seem to take a bit longer -- like I could watch it generate list parts, and it took over a minute each on one of the sites and about a minute total on the others (those used to be nearly instantaneous) -- but it's not an amount of time that matters given how long this runs for. It also seems like this frmlist.tmp file is new, large, and potentially the source of some of the slowdown, as that directory doesn't have the best write performance. It also seems to contain exactly the same information as the file we feed in, but I'm assuming it does some sort of massaging/validation on that file?
This was a little weird, but it got past it.
```
17.34.14   1127.03  5.63  3.04 / 0.00    4 <> 0    [17] of [17]
 | Elapsed |   1m   | [ ens6]   MB/s  | Running || Susp'd  |      Chunks       [2020-05-06]
   Time    | time(m) | Load  | TCP / RDMA  out |   PIDs  ||  PIDs   | [UpTo] of [ToDo]
INFO: Waiting [0]s for next chunkfile..
18.30.18   1183.10  0.99  2.43 / 0.00    4 <> 0    [17] of [17]
```
We can leave this one open for another go-round, but I suspect it's solved in 1.67. No issue this time, and it seems likely it was caused by having a number of parts that was smaller than the number of rsync processes.
Had a different sort of issue this week. As you may recall, we've had the problem where PFP runs forever, waiting for a new chunkfile. This week, on the same habitual trouble campus, PFP exited way before the rsyncs it spawned exited. I have all of the cache files, so we can attempt to isolate the problem if that's helpful. Also, we have no backup currently running (the tuning we did improved the speed greatly), so I have a little more flexibility than usual.
Hmm. Since this is a new problem, could you start it in a new thread?
I'll wait for the new thread to answer substantially. hjm
No problem; they sort of interrelate, and it was in the opening to this ticket:
Possibly worth noting that that last while loop, I believe, is there to deal with another bug someplace where parsyncfp exits but for some reason hasn't waited for all of the rsync processes to exit. This might be related to some of the previous PID-related problems. I do believe I've seen it have to wait at this point recently, but that may be fixed with 1.65+.
But I agree, it's not really related to the title anyway. This past run did show the status output while it was running. I'll close this one if I don't see it reappear.
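For context, the "last while loop" being discussed is a safety net in the backup script that blocks until no rsync processes remain, matching the pgrep -x rsync loop visible in the logs. A hedged reconstruction of what such a loop typically looks like (the function name and sleep interval are illustrative, not the script's actual code):

```shell
#!/usr/bin/env bash
# Block until no process with the exact name "$1" is left running.
# parsyncfp itself should wait for its rsyncs; this guards the case
# where it exits early (the bug being discussed in this thread).
wait_for_procs() {
  local name="${1:-rsync}"
  while pgrep -x "$name" >/dev/null 2>&1; do
    sleep 5
  done
}

# Demo with a process name that cannot exist, so the loop returns at once:
wait_for_procs "no-such-proc-xyz" && echo "no matching processes left"
```

One caveat with this approach: pgrep -x matches any rsync on the host, not just the ones parsyncfp started, so an unrelated rsync can keep the loop waiting.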
Sorry to be burying you in issues. I just want to get these reported so you know about them and can help if I'm just doing something wrong.
Our backup script runs parsyncfp twice on the only campus where I've been seeing this. The function does the following (for brevity I've stripped out all of the lines that are commented out or just echo status, and have only shown that function and the place where it is called -- other stuff above, below, and where the ellipses are has been omitted):
Possibly worth noting that that last while loop, I believe, is there to deal with another bug someplace where parsyncfp exits but for some reason hasn't waited for all of the rsync processes to exit. This might be related to some of the previous PID-related problems. I do believe I've seen it have to wait at this point recently, but that may be fixed with 1.65+.
Anyhow, the weird behavior is this: as you can see above, that parallel() function is called twice, only with different arguments, so the parsyncfp command uses all of the same flags. On this most recent run, the first run, for the cache filesystem, printed no output whatsoever about progress -- no header line, and no status lines after it. However, the second run, against projects, did print the header and is outputting status lines at the expected interval.
I'm not entirely sure where to look at this for debugging. I don't think this version uses status. At some point I was running it without scut, but I now have that, so unless the PATH is not working correctly, I'm not sure what's going wrong.
I'll attach the relevant output (we log STDOUT and STDERR to two different files).