hjmangalam / parsyncfp

follow-on to parsync (parallel rsync) with better startup perf

Intermittent problem displaying status output at normal verbosity, parsyncfp-1.61 #34

Open novosirj opened 4 years ago

novosirj commented 4 years ago

Sorry to be burying you in issues. I just want to get these reported so you know about them/so you can help if I'm just doing something wrong.

Our backup script runs parsyncfp twice on the only campus where I've been seeing this. The function does the following (for brevity I've stripped out the lines that are commented out or just echo status, and I'm only showing the function and the place where it is called -- other material above, below, and at the ellipsis is omitted):

parallel() {

    export HOMEDIR=/gpfs/cache/home/root/policy
    export FILESYSTEM=$1
    export SNAPSHOT=$2
    export MOUNTPOINT=$3

    # Generate the file list for this filesystem via a GPFS policy scan
    /usr/lpp/mmfs/bin/mmapplypolicy $FILESYSTEM -S $SNAPSHOT -f $HOMEDIR/$SNAPSHOT -P $HOMEDIR/listfiles.rule -I defer

    # Reduce the policy list to the fields parsyncfp needs (size and path) for --fromlist/--trustme
    /usr/bin/cut -d' ' -f'5 7-' $HOMEDIR/$SNAPSHOT.list.allfiles > $HOMEDIR/$SNAPSHOT.list.allfiles.clean

    # Clear the parsyncfp cache dir from any previous run
    rm -rf /root/.parsyncfp-backups

    /usr/local/bin/parsyncfp-1.61 -i ens6 --checkperiod 1800 --nowait \
        --altcache '/root/.parsyncfp-backups' -NP 12 \
        --rsyncopts '-a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no"' \
        --maxload 96 --chunksize=10T \
        --fromlist=$HOMEDIR/$SNAPSHOT.list.allfiles.clean \
        --trimpath=/$MOUNTPOINT/.snapshots/$SNAPSHOT --trustme \
        $BACKUPHOST:/zdata/gss/$SNAPSHOT

    sleep 5

    # Workaround: wait for any remaining rsync processes before returning
    while pgrep -x "rsync" >/dev/null 2>&1
    do
        sleep 60
    done

}

...

parallel cache cache gpfs/cache
parallel projects projectsp projects

Possibly worth noting: that last while loop, I believe, is there to deal with another bug somewhere, where parsyncfp exits without having waited for all of the rsync processes to exit. This might be related to some of the previous PID-related problems. I do believe I've seen it have to wait at this point recently, but that may be fixed in 1.65+.
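For what it's worth, a lightly instrumented variant of that wait loop (just a sketch; pgrep's -a flag, which prints the full command line, needs a reasonably recent procps-ng) makes it easier to see which rsyncs are actually being waited on, and whether any of them belong to something else on the box:

    # Same wait as above, but log a sample of the surviving rsync command
    # lines each minute so stray, unrelated rsyncs are easy to spot.
    while pgrep -x rsync >/dev/null 2>&1
    do
        pgrep -a -x rsync | head -5
        sleep 60
    done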

Anyhow, the weird behavior is this: as you can see above, the parallel() function is called twice, just with different arguments, so the parsyncfp command uses all of the same flags. On this most recent run, the first invocation, for the cache filesystem, printed no progress output whatsoever -- no header line, and no status lines after it. However, the second invocation, against projects, did print the header and is outputting status lines at the expected interval.

I'm not entirely sure where to look to debug this. I don't think this version uses status. At some point I was running it without scut, but I have that now, so unless the PATH isn't being picked up correctly, I'm not sure what's going wrong.
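(As a quick sanity check on the PATH theory -- just a sketch, and the exact set of helpers a given parsyncfp version needs is an assumption beyond the scut mention above -- something like this verifies the external tools are resolvable from the script's environment, since cron jobs often run with a minimal PATH:)

    # Check that the helpers parsyncfp shells out to are on PATH; which
    # ones are actually required depends on the parsyncfp version.
    for tool in fpart scut rsync; do
        command -v "$tool" >/dev/null 2>&1 || echo "WARNING: $tool not found in PATH" >&2
    done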

I'll attach the relevant output (we log STDOUT and STDERR to two different files).

novosirj commented 4 years ago

Here is STDOUT (.log.log) and STDERR (.err.log), both trimmed in the same places to the relevant portion.

One thing that does jump out at me is the "[0m" that appears in both; I'm not sure of its origin.

But you can see in the logs that the first time, parsyncfp seems to exit way too soon (also evidenced by the fact that it then spends a lot of time in that pgrep -x rsync loop), while the second time it works as expected.

asb-parsyncfp-20200429.err.log asb-parsyncfp-20200429.log.log

hjmangalam commented 4 years ago

Hi Ryan, Something is indeed weird here. If you're feeding fpart a list of files (especially if you're including the sizes, and you are, since you're using the '--trustme' option), it should be processed very quickly, as it apparently is in the second (projects) run, which adds chunks every few seconds.

In the first (cache) run, you're completing a chunk only about every minute. Are there huge differences in the distribution of file sizes between the two filesystems?

Here are some suggestions that might show where things are going bad:

The odd "[0m" characters you noted in both logs are probably part of the pfp terminal color-change escape sequence. I don't think that would introduce such a difference, but I can't guarantee that it couldn't. I'll keep trying to come up with an explanation as I answer your other email. hjm
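(If those escape codes make the captured logs harder to read or diff, something like this GNU sed one-liner -- just a sketch, using the log name attached above -- strips the ANSI color/reset sequences:)

    # Strip ANSI escape sequences (e.g. the "[0m" reset code) from a log
    sed -r 's/\x1B\[[0-9;]*[mK]//g' asb-parsyncfp-20200429.log.log > asb-parsyncfp-20200429.log.clean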


novosirj commented 4 years ago

The "cache" filesystem and the "projects" filesystem do likely have a much different file size histogram. "cache" has 152970416 inodes in use, while "projects" has 115009601, and "projects" is ~2.5 times larger in disk space than cache. Lots of people have build directories for software in there, etc. One git repo can really up the numbers (we've been thinking we need to move to inode quotas on that filesystem).

I'll have to save the .parsyncfp-backups directory, or write them out under a unique ID or something -- the second run clears the files of the first run. I can also do another run once this one completes; it shouldn't hurt anything since it's already been synced for the week. Most of the other suggestions hinge on having the files around from both runs to compare, it seems.

hjmangalam commented 4 years ago

Hmm -- even so, that's an average of ~3.6 s per chunk for projects and ~60 s per chunk for cache. Re: the overwriting of the cache files: if you can afford the space, use the FILESYSTEM variable to name the altcache so they don't get overwritten. hjm
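(In the function from the top of this issue, that would look something like the following sketch -- only the cache path changes, everything else is as already posted:)

    # Keep a separate parsyncfp cache per filesystem so the second run
    # doesn't clobber the first run's chunk files and logs.
    rm -rf "/root/.parsyncfp-backups-${FILESYSTEM}"

    /usr/local/bin/parsyncfp-1.61 -i ens6 --checkperiod 1800 --nowait \
        --altcache "/root/.parsyncfp-backups-${FILESYSTEM}" -NP 12 \
        --rsyncopts '-a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no"' \
        --maxload 96 --chunksize=10T \
        --fromlist=$HOMEDIR/$SNAPSHOT.list.allfiles.clean \
        --trimpath=/$MOUNTPOINT/.snapshots/$SNAPSHOT --trustme \
        $BACKUPHOST:/zdata/gss/$SNAPSHOT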


novosirj commented 4 years ago

Yup; and I should have space. FYI, the list for the "cache" filesystem is itself 20 GB, while the "projects" filesystem list is 16 GB. Of course that size also depends on the length of the path names, etc., but here are some line counts:

[root@quorum01 policy]# wc -l cache.list.allfiles.clean projectsp.list.allfiles.clean 
  164657014 cache.list.allfiles.clean
  109127324 projectsp.list.allfiles.clean
  273784338 total

hjmangalam commented 4 years ago

OK - I found a few problems, now corrected. The worst one was that in (partially) adding another option, I broke the list processing (and I found some other minor problems in the process).

It now seems to work with lists (--fromlist), with lists with sizes (--trustme), and with the --trimpath option the way it did in the past (and with perhaps a little better error checking).

Please try the new version 1.67 and tell me how it goes. hjm


novosirj commented 4 years ago

Great! Will try it tomorrow/sooner if the current runs complete.

novosirj commented 4 years ago

So far so good! Will keep you posted, but the runs that have gotten to this point in the script are running well (the list generation takes a while).

novosirj commented 4 years ago

Seems like the filesystem that wasn't showing status at normal verbosity routinely has a smaller number of file-list "parts" than the number of rsyncs I have configured. Today it's 6 parts, the result of a high chunk size and a relatively small filesystem. It may be that something else previously obscured that problem.

Any particular suggestions for the number of "parts" to ultimately shoot for? I guess with too few you risk a slow directory becoming the choke point for all of them simultaneously (a lot more likely for, say, 4 rsyncs than for 10).

hjmangalam commented 4 years ago

I generally try to shoot for about 3-5 times the number of rsyncs. That way the rsyncs are kept busy without too much churn, and fpart can generate chunks quickly enough that you're not waiting forever to start.

Are you using fpart to recurse the dirs, or are you using GPFS to generate the lists that you want to transfer? Generating via lists (especially with the --trustme option, where you provide the sizes) should be much faster than a full recursion.

I realize that startup time is significant on large filesystems and I'm still thinking about how to improve on that delay; fpart already does a very good job. The only thing that might improve this performance is multi-threading the recursion, which I believe qdirstat does. But then all those threads are stat-hammering at once, and it's not clear that would lead to a huge improvement. Harry
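(A sketch of that 3-5x rule of thumb, again assuming the first column of the .clean list is the size in bytes: sum the sizes and pick --chunksize so the total lands at roughly 4x the number of rsyncs:)

    NP=12                                     # number of parallel rsyncs (-NP)
    LIST=$HOMEDIR/$SNAPSHOT.list.allfiles.clean
    TOTAL=$(awk '{ sum += $1 } END { printf "%.0f", sum }' "$LIST")
    CHUNKSIZE=$(( TOTAL / (4 * NP) ))         # aim for ~4 x NP chunks
    echo "total ${TOTAL} bytes -> --chunksize=${CHUNKSIZE}"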


hjmangalam commented 4 years ago

By the way, please let me know if your pfp runs terminate correctly. They have in all my tests of the latest version, but it's always /yours/ that are the critical ones.

harry


novosirj commented 4 years ago

Oh, I'm sorry that wasn't clear: the list generation happens via GPFS, well before we run PFP -- probably 90 minutes of that before we even get to PFP on the "cache" or "projects" filesystem on the one campus. So far the runs on both "cache" and "projects" have run and displayed the transfer output, and the "cache" FS has already completed normally. None of the others have completed yet, but they are behaving and relatively close to the end.

I did notice that the list generation inside PFP seems to take a bit longer -- I could watch it generate list parts, and it was over a minute each at one of the sites and about a minute total at the others (those used to be nearly instantaneous) -- but it's not an amount of time that matters given how long the whole thing runs. It also seems like this frmlist.tmp file is new, large, and potentially the source of some of the slowdown, as that directory doesn't have the best write performance. It also seems to contain exactly the same information as the file we feed in, but I'm assuming it does some sort of massaging/validation on that file?
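(If it would help to confirm that, a quick byte-for-byte comparison -- just a sketch; that frmlist.tmp lives under the altcache dir is an assumption here:)

    # Compare parsyncfp's working copy of the list against what we fed in
    cmp -s /root/.parsyncfp-backups/frmlist.tmp "$HOMEDIR/$SNAPSHOT.list.allfiles.clean" \
        && echo "frmlist.tmp is byte-identical to the input list" \
        || echo "frmlist.tmp differs from the input list"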

novosirj commented 4 years ago

This was a little weird, but it got past it.

17.34.14   1127.03     5.63       3.04 / 0.00             4    <>   0          [17] of [17]
        | Elapsed |   1m   |    [   ens6]   MB/s  | Running || Susp'd  |      Chunks       [2020-05-06]
  Time  | time(m) |  Load  |     TCP / RDMA  out  |   PIDs  ||  PIDs   | [UpTo] of [ToDo]
INFO: Waiting [0]s for next chunkfile..
18.30.18   1183.10     0.99       2.43 / 0.00             4    <>   0          [17] of [17]

novosirj commented 4 years ago

We can leave this one open for another go-round, but I suspect it's solved in 1.67. No issue this time, and it seems likely it was caused by having a number of parts that was smaller than the number of rsync processes.

novosirj commented 4 years ago

Had a different sort of issue this week. As you may recall, we've had the problem where PFP runs forever waiting for a new chunkfile. This week, at the same habitual trouble campus, PFP exited well before its rsyncs did. I have all of the cache files, so we can attempt to isolate the problem if that's helpful. Also, we have no backup currently running (the tuning we did improved the speed greatly), so I have a little more flexibility than usual.

hjmangalam commented 4 years ago

Hmm. Since this is a new problem, could you start it in a new thread?

I'll wait for the new thread to answer substantially. hjm


novosirj commented 4 years ago

No problem; they sort of interrelate, and it was in the opening to this ticket:

Possibly worth noting: that last while loop, I believe, is there to deal with another bug somewhere, where parsyncfp exits without having waited for all of the rsync processes to exit. This might be related to some of the previous PID-related problems. I do believe I've seen it have to wait at this point recently, but that may be fixed in 1.65+.

But I agree, it's not really related to the title anyway. This past run did show the status output while it was running. I'll close this one if I don't see it reappear.