hjmangalam / parsyncfp

follow-on to parsync (parallel rsync) with better startup perf

Can't seem to get parsyncfp 1.63 or newer to work right #33

Open novosirj opened 4 years ago

novosirj commented 4 years ago

Here's an example of what happens, tested with parsyncfp 1.65 with the -d flag so we could see what is going on. It doesn't correctly do the fpart section, and it seems like it never generates a chunk list from my input list?

[root@quorum03 bin]# env | grep -ie homedir -ie snapshot -ie mountpoint -ie backuphost
BACKUPHOST=nas1
SNAPSHOT=projectsc
MOUNTPOINT=projects
HOMEDIR=/gpfs/cache/home/root/policy
[root@quorum03 bin]# /usr/local/bin/parsyncfp-1.65 -d -i ens6 --checkperiod 1800 --nowait --altcache '/root/.parsyncfp-backups' -NP 12 --rsyncopts '-a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no"' --maxload 96 --chunksize=10T --fromlist=$HOMEDIR/$SNAPSHOT.list.allfiles.clean --trimpath=/$MOUNTPOINT/.snapshots/$SNAPSHOT --trustme $BACKUPHOST:/zdata/gss/$SNAPSHOT
DEBUG[359]: FPARTSIZE set to: [10T]
FPARTSIZE_N set to [10995116277800]
Press [ENTER] to continue.

DEBUG[380]: Using network interface [ens6] with connection quality [ 100000Mb/s]
Press [ENTER] to continue.

DEBUG[385]: 1m load is [0.85] and the 1m Load:#CPU ratio is [0.0354166666666667] ( [24] CPU cores).
    OK to continue.
Press [ENTER] to continue.

WARN: About to remove all the old cached chunkfiles from [/root/.parsyncfp-backups/fpcache].
  Enter ^C to stop this.
  If you specified '--nowait', cache will be cleared in 3s regardless.
  Otherwise, hit [Enter] and I'll clear them. 

INFO: The fpart chunk files [/root/.parsyncfp-backups/fpcache/f*] are cleared .. continuing.
INFO: The TRIMPATH you specified exists, is a dir, and is readable.
INFO: Alternative file list is readable; converting list to chunks.
DEBUG[624]: fpart fork cmd:
[cd /projects/.snapshots/projectsc; fpart -v -L -s 10995116277800 -a  -i /tmp/pfp.tmp -o /root/.parsyncfp-backups/fpcache/f 2> /root/.parsyncfp-backups/fpcache/fpart.log.00.41.53_2020-04-29 & echo "${!}" > /root/.parsyncfp-backups/fpcache/FP_PIDFILE00.41.53_2020-04-29]
Press [ENTER] to continue.
INFO: Forking fpart with PID = [6639].  Check [/root/.parsyncfp-backups/fpcache/fpart.log.00.41.53_2020-04-29] for errors if it hangs.
INFO: Waiting for fpart to be forked.
INFO: [/root/.parsyncfp-backups/fpcache/f.0] visible.
INFO: Waiting [1]s for chunk files to be written.
INFO: Starting the 1st [12] rsyncs ..
DEBUG[685]: Complete rsync cmd = [cd /projects/.snapshots/projectsc && rsync  --bwlimit=1000000  -a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no" -a -s --log-file=/root/.parsyncfp-backups/rsync-logfile-00.41.53_2020-04-29_0 --files-from=/root/.parsyncfp-backups/fpcache/f.0  /projects/.snapshots/projectsc  nas1:/zdata/gss/projectsc  & echo "${!}" >> /root/.parsyncfp-backups/fpcache/rsync-PIDs-00.41.53_2020-04-29]
Press [ENTER] to continue.

DEBUG[704]: OUT OF RSYNC STARTUP LOOP
Press [ENTER] to continue.

WARN: 
    The number of chunk files generated by fpart [1] < the # of rsync 
    processes you specified [12].  
    Did you check the dir tree / file list to make sure you're setting the chunk 
    size appropriately (--chunksize) ?  It's currently set to [10T].

        | Elapsed |   1m   |    [   ens6]   MB/s  | Running || Susp'd  |      Chunks       [2020-04-29]
  Time  | time(m) |  Load  |     TCP / RDMA  out  |   PIDs  ||  PIDs   | [UpTo] of [ToDo]
DEBUG[1105]: sPIDs string in get_rPIDs() = []
Press [ENTER] to continue.
^C
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(638) [sender=3.1.2]

[root@quorum03 bin]#

I'm not sure why this happens. Here's what the fpart log looks like (this is from a different run, but the output looked the same; I didn't save the original):

[root@quorum03 ~]# cat /root/.parsyncfp-backups/fpcache/fpart.log.00.40.37_2020-04-29
Examining filesystem...
error parsing input values:
Filled part #0: size = 0, 1 file(s)
1 file(s) found.

I'm not sure what's going on here, or how to debug it. Originally, I didn't have `stats` and `scut` installed. One complication on my system is that Confluent installs its own `stats` command, and it's in the PATH. So I change the PATH with `export PATH=...` so that PATH no longer contains the Confluent directory with its `stats`, but does contain `/usr/local/bin`, which is where I have the `stats`, `scut`, and `fpart` binaries required by `parsyncfp`. But I'm not sure whether that workaround is failing somehow, or whether something else is going wrong.
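For what it's worth, a quick generic check (not anything parsyncfp itself runs) to confirm which copy of each helper the shell actually resolves after adjusting PATH:

```shell
# Print the first PATH match for each required helper;
# each should resolve to /usr/local/bin on this setup.
for tool in stats scut fpart; do
  command -v "$tool" || echo "MISSING: $tool"
done
```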

The only other thing I can think of is that the file specified by `--fromlist=` is actually formatted with a space between the file size and the path instead of the `<tab>` that's specified in the documentation/usage info. This has never been a problem before, however.
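If the delimiter really is the problem, a space-separated list could be converted before handing it to `--fromlist=` — a rough sketch, assuming the `size path` layout described above (the output filename is made up):

```shell
# Replace only the first space on each line (the size/path
# separator) with a tab; any later spaces inside paths are kept.
awk '{ sub(/ /, "\t"); print }' \
    "$HOMEDIR/$SNAPSHOT.list.allfiles.clean" \
    > "$HOMEDIR/$SNAPSHOT.list.allfiles.tabbed"
```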

For the moment, I've dropped back to 1.61 to continue my backups. Never tried 1.62.

hjmangalam commented 4 years ago

Hi Ryan, It'll take me some time to set up a similar situation on our cluster and I have some things that need doing in the daylight, so I'll try to get to this tonight. Thanks for such a complete error report. Harry


novosirj commented 4 years ago

Any help is welcome whenever it comes; no real rush for us. I won't be trying anything until next Tuesday most likely (we run this weekly at 23:00 EST/EDT on Tuesdays).

hjmangalam commented 4 years ago

Hi Ryan, Hmm - I haven't duplicated your bug, but I can see where this kind of situation could go wrong. If you ask for 12 rsyncs and only provide enough work for 1, it might very well be that it hangs waiting for enough work for 12. It does give the warning, but I can't quite remember how it handles that case. I'll check in the morning. Again, too late for touching code here. Actually, this is part of the same code refactoring I was thinking about when I mentioned the faster startup, so it's a timely bug. More tomorrow. hjm
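To end up with at least as many chunks as rsync processes, the chunk size would need to be no larger than total source size divided by NP — a back-of-the-envelope sketch with an illustrative total (the real figure depends on the week's snapshot):

```shell
# Assumed total source size (~40 TiB here, purely illustrative)
# and the NP=12 from the report above.
TOTAL_BYTES=43980465111040
NP=12
CHUNK=$(( TOTAL_BYTES / NP ))   # largest --chunksize giving >= NP chunks
echo "chunk size <= ${CHUNK} bytes for at least ${NP} chunks"
```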


novosirj commented 4 years ago

I'll have to try that transfer again when it wraps up work on that filesystem, so I can see what the fpart chunk files contain; the amount of source data changes from week to week, so it might not reproduce next week. I hadn't considered what might happen when the files are broken up into 10TB chunks and there are only a couple of them.

novosirj commented 4 years ago

1.67 seems to have solved this problem.