hjmangalam / parsyncfp

follow-on to parsync (parallel rsync) with better startup perf

Check for fparts_already_running warns about unrelated fpart processes and does not respect --nowait #15

Open novosirj opened 6 years ago

novosirj commented 6 years ago

I run parsyncfp from a script. For the first time this week, we ran into a scenario where we wanted to run two of them at once, requiring us to use --altcache to separate the cache directories. There are some minor bugs in this functionality. For one:

[root@quorum ~]# ls -al /root/.parsyncfp-backups/
ls: cannot access /root/.parsyncfp-backups/: No such file or directory

[root@quorum ~]# /usr/local/bin/parsyncfp --altcache '/root/.parsyncfp-backups' -NP 4 --chunksize=100G --rsyncopts '--stats -e "ssh -T -c arcfour -o Compression=no -x"' --maxbw=1250000 --startdir '/gpfs/home/.snapshots' home nas1:/zdata/gss

  WARN: about to remove all the old cached chunkfiles from [/root/.parsyncfp-backups/fpcache].
  Enter ^C to stop this.
        If you specified '--nowait', cache will be cleared in 3s regardless.
  Otherwise, hit [Enter] and I'll clear them.
Press [ENTER] to continue.

However, as we've seen, there were no old cachefiles there. The cause is the way --altcache works: it creates the directory at the time it sets $parsync_dir, on lines 83-84:

if (! defined $ALTCACHE) {$parsync_dir = $HOME . "/.parsyncfp";} else {$parsync_dir = $ALTCACHE; mkdir $parsync_dir;}
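A minimal shell sketch of the ordering problem, with an invented demo path (not a real parsyncfp path): the cache directory is created as a side effect of setting $parsync_dir, before the "old chunkfiles" check runs, so the warning fires on a directory that did not exist a moment earlier.

```shell
# Demo of the ordering bug, using a throwaway path under mktemp.
cache="$(mktemp -d)/parsyncfp-demo"

# Before startup: the alt-cache directory does not exist yet.
if [ -d "$cache" ]; then echo "exists before"; else echo "absent before"; fi

# What the --altcache branch effectively does while setting $parsync_dir:
mkdir -p "$cache"

# By the time the "old cached chunkfiles" warning runs, the dir is there,
# so the check can never distinguish a fresh dir from a stale one.
if [ -d "$cache" ]; then echo "exists after"; fi
```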

That can be avoided with --nowait; it still does a slightly wrong thing, but it doesn't torpedo the script. And really, I like running without --nowait because it catches my errors re: not removing the cachefiles or other unusual circumstances.

The part that really causes a problem, though, is the section where fparts_already_running is checked. Even though an alternate cache directory has been specified, the check does not exclude fpart processes belonging to other cache directories from the process list, on line 265:

my $fparts_already_running = `ps aux | grep 'fpar[t]'`; chomp $fparts_already_running;
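As an aside on the idiom: the bracketed pattern 'fpar[t]' keeps grep from matching its own entry in the process list. A quick demonstration with canned input standing in for ps output (the process lines are made up):

```shell
# 'fpar[t]' matches the literal text "fpart", but the grep command line
# itself contains "fpar[t]" -- which the pattern does not match -- so grep
# never reports its own process. Simulated ps lines:
printf 'root 100 grep fpar[t]\nroot 200 fpart -o /root/.parsyncfp/fpcache/f .\n' \
  | grep 'fpar[t]'
# -> only the real fpart line survives
```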

I guess my inclination would be to add a second grep along the lines of the following:

my $fparts_already_running = `ps aux | grep 'fpar[t]' | grep -- "-o $parsync_dir/"`; chomp $fparts_already_running;
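Under that approach, the second grep would keep only the fpart writing into this instance's cache directory. A sketch with two simulated concurrent fpart processes (the ps lines and paths are invented for the demo):

```shell
# Two concurrent fpart runs, each writing chunkfiles under its own cache
# dir. Filtering on "-o $parsync_dir/" keeps only the fpart that belongs
# to this parsyncfp instance; the other one is ignored.
parsync_dir=/root/.parsyncfp-backups
printf 'fpart -o /root/.parsyncfp/fpcache/f .\nfpart -o /root/.parsyncfp-backups/fpcache/f .\n' \
  | grep 'fpar[t]' | grep -- "-o $parsync_dir/"
# -> only the /root/.parsyncfp-backups run is reported
```

The `--` is needed so grep treats the `-o …` pattern as a pattern rather than an option.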

Though I'm reluctant to just suggest this as a patch without having read more of the code. HTH, anyway!

hjmangalam commented 6 years ago

Thanks for this bug report. I'll try to look at it later today. This feature hasn't been exercised very much -- you're probably the first person to use it IRL, but it's good to get a bug report.

harry


novosirj commented 6 years ago

It remains to be seen whether running two of these at once would actually be faster than running them sequentially, given how the two fparts fight over resources, but it would be good to have it behave properly.

hjmangalam commented 6 years ago

Could you try the new version? I think I've fixed (or addressed) the alt-cache problem and the already running fparts.

I still have to do some better checking around losing track of running PIDs and, relatedly, around how to cycle more rsyncs when the job is mostly already transferred.

Thanks again for the bug reports.

hjm


novosirj commented 6 years ago

Will test this tonight on our next round of backups. Thanks!

novosirj commented 6 years ago

This seems to have worked well this time, and when the final rsync ended, so did parsyncfp. Did you make any changes to the PID code, or is that just good luck?

Thanks.