hjmangalam / parsyncfp

follow-on to parsync (parallel rsync) with better startup perf

parsyncfp

a parallel rsync wrapper in Perl. Released under GPL v3.

(Version changes moved to the bottom of this file)


PARSYNCFP2 ANNOUNCEMENT

Feb 24, 2022 Please check out the latest in the parsync family: parsyncfp2 (aka pfp2 - note the '2'). Version 2, besides benefiting from a number of bug-fixes and improvements vs parsyncfp (aka pfp), now allows MultiHost sends. It's still a single executable, still pretty simple, but now can spread itself over multiple hosts to send to single or multiple hosts with identical or multiple endpoints.
Please check it out.

Background

NB: If you don't want to transfer at least 10s of GB across a network, this is probably not the tool you want. Use rsync alone for a sync, 'tar and netcat' for a fresh copy, or scp if the copy needs to be encrypted.

parsyncfp is the offspring of my parsync (more info here) and Ganael LaPlanche's fpart, which collects files based on size or number into chunkfiles which can be fed to rsync on a chunk by chunk basis. This allows pfp to begin transferring files before the complete recursive descent of the source dir is complete. This feature can save many hours of prep time on very large dir trees.
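To make the idea of size-based chunking concrete, here is a minimal sketch of a greedy grouping pass. It is NOT fpart's actual algorithm; the function name `chunk_by_size` and the "size<TAB>path" input format are assumptions made for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of fpart or pfp): assign files to numbered
# chunks so that each chunk holds at most $1 bytes. A single file larger
# than the limit still gets a chunk of its own.
# Input on stdin:  size<TAB>path, one file per line.
# Output:          chunk_number<TAB>path
chunk_by_size() {
  awk -F'\t' -v limit="$1" '
    BEGIN { chunk = 0; used = 0 }
    {
      # start a new chunk when this file would overflow a non-empty one
      if (used > 0 && used + $1 > limit) { chunk++; used = 0 }
      used += $1
      printf "%d\t%s\n", chunk, $2
    }'
}

# Five files, 1000-byte chunks: a | b c | d | e
printf '600\ta\n500\tb\n300\tc\n2000\td\n100\te\n' | chunk_by_size 1000
```

Because each chunk is complete as soon as the limit is reached, a consumer (here, an rsync) can start on the first chunk while the rest of the tree is still being walked.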

If your use involves transit over IB networks, parsyncfp requires 'perfquery' and 'ibstat', Infiniband utilities written by Hal Rosenstock < hal.rosenstock [at] gmail.com >

pfp is tested on Linux. The MacOSX port is in hibernation.

pfp needs to be installed only on the SOURCE end of the transfer and only works in local SOURCE -> remote TARGET mode (it won't allow a remote SOURCE -> local TARGET transfer; it emits an error and exits if attempted). It requires that ssh shared keys be set up prior to operation (see here). If it detects that ssh keys are NOT set up correctly, it will ask for permission to try to remedy that situation. Check your local and remote ssh keys to make sure that it has done so correctly. Typically, they're in your ~/.ssh dir.

It uses whatever rsync is available on the TARGET. It uses a number of Linux-specific utilities so if you're transferring between Linux and a FreeBSD host, install parsyncfp on the Linux side.

The only native rsync options that pfp uses are '-a' (archive) & '-s' (respect odd characters in filenames). If you need to define the rsync options differently, then it's up to you to define ALL of them via '--rsyncopts' (the default '-a -s' flags will NOT be provided automatically).

pfp checks to see if the current system load is too heavy and tries to throttle the rsyncs during the run by monitoring and suspending / continuing them as needed. There is a 'suspend.log' file in the ~/.parsyncfp dir that tracks these transitions.
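The load check described above can be sketched roughly as follows. This is an illustration, not pfp's code: the function names are made up, and the use of SIGSTOP/SIGCONT to suspend and continue an rsync is an assumption (the text above does not say which mechanism pfp uses).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) when the load exceeds maxload.
# $1 = current 1-minute load average, $2 = maxload (both may be fractional).
load_too_high() {
  awk -v load="$1" -v max="$2" 'BEGIN { exit !(load + 0 > max + 0) }'
}

# On Linux, the 1-minute load average is the first field of /proc/loadavg.
current_load() {
  cut -d' ' -f1 /proc/loadavg
}

# ASSUMED mechanism: SIGSTOP pauses a process, SIGCONT resumes it.
# $1 = PID of one rsync, $2 = maxload.
throttle_once() {
  if load_too_high "$(current_load)" "$2"; then
    kill -STOP "$1"
  else
    kill -CONT "$1"
  fi
}

# Decision only, with fixed numbers (no process is touched):
if load_too_high 6.2 5.5; then echo "suspend"; else echo "continue"; fi
```

A real monitor would call something like `throttle_once` for each rsync PID once per check period.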

It appropriates rsync's bandwidth throttle mechanism, using '--maxbw' as a passthru to rsync's 'bwlimit' option, but divides it by NP so as to keep the total bw the same as the stated limit. It monitors and shows network bandwidth, but can't change the bw allocation mid-job. It can only suspend rsyncs until the load decreases below the cutoff. (In the single host version of pfp, if you suspend parsyncfp (^Z), all rsync children will suspend as well, regardless of current state.) In the Multihost version, the SEND hosts are independent so ^C & ^Z will have no effect on them. A socket-based control system is next.
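The division of '--maxbw' over NP is simple arithmetic; a small sketch (the function name `per_rsync_bwlimit` is hypothetical, and integer truncation is an assumption about the rounding):

```shell
#!/usr/bin/env bash
# Hypothetical helper: $1 = total --maxbw in KB/s, $2 = NP (number of rsyncs).
# Prints the per-rsync value that would be handed to rsync's --bwlimit.
per_rsync_bwlimit() {
  echo $(( $1 / $2 ))   # shell integer division truncates any remainder
}

per_rsync_bwlimit 100000 4
```

So with '--maxbw=100000' and '--NP=4', each rsync would be limited to 25000 KB/s and the total stays at the stated limit.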

Unless changed by '--interface', it tries to figure out how to set the interface to monitor. The transfer will use whatever interface routing provides, normally set by the name of the target. It can also be used for non-host-based transfers (between mounted filesystems) but the network bandwidth continues to be (usually pointlessly) shown.

NB: Between mounted filesystems, parsyncfp sometimes works very poorly for reasons still mysterious. In such cases (monitor with 'ifstat'), use 'cp' or tnc for the initial data movement and a single rsync to finalize. I believe the multiple rsync chatter is interfering with the transfer.

It only works on dirs and files that originate from the current dir (or specified via "--startdir"). You cannot include dirs and files from discontinuous or higher-level dirs.

the ~/.parsyncfp files

The cache dir (~/.parsyncfp by default) contains the fpcache dir which holds the fpart log and all the PID files as well as the chunk files (f*). parsyncfp no longer provides cache reuse because the fpart chunking is so fast. The log files are datestamped and are NOT overwritten. A new option allows you to specify alternative locations for the cache as well as specifying locations for multiple instances so that many parsyncfps can run at the same time, altho they will detect each others' fparts running and question this situation at startup. The multihost version will test for rsyncs running on the SEND hosts and alert you.

Odd characters in names

parsyncfp will sometimes refuse to transfer some oddly named files, altho recent versions of rsync allow the '-s' flag (now a parsyncfp default) which tries to respect names with spaces and properly escaped shell characters. Filenames with embedded newlines, DOS EOLs, and other odd chars will be recorded in the log files in the ~/.parsyncfp dir.

Debugging

To see where you (or I) have gone wrong, look at the rsync-logfile* logs in the ~/.parsyncfp dir. If you're a masochist, use the '--debug' flag which will spew lots of gratuitous info to the screen and pause multiple times to allow checking of intermediate results.

Options

[i] = integer number
[f] = floating point number
[s] = "quoted string"
( ) = the default if any
The syntax 'longarg|shortarg' means that either the long or short form may be used to denote that option.


 --NP|np [i] (sqrt(#CPUs)) : number of rsync processes to start. Optimal NP depends on many vars.   Try the default and incr as needed
 --altcache|ac (~/.parsyncfp) : alternative cache dir, for placing it on another FS or for running multiple  parsyncfps simultaneously
 --startdir|sd [s] (`pwd`) : the directory it starts at(*)
 --maxbw [i] (unlimited) : in KB/s max bandwidth to use (--bwlimit  passthru to rsync).  maxbw is the total BW to be used, NOT per rsync.
 --maxload|ml [f] (NP+2)  :  max system load - if sysload > maxload, an rsync proc will sleep for 10s
 --chunksize|cs [s] (10G) : aggregate size of files allocated to one rsync  process.  Can specify in 'human' terms [100M, 50K, 1T] as well as integer bytes.
 --rsyncopts|ro [s] : options passed to rsync as quoted string (CAREFUL!) this opt triggers a pause before executing to verify the command
 --fromlist|fl [s]  +
 --trimpath|tp [s]  +-- see "Options for using filelists" below
 --trustme|tm       +
 --interface|i [s] : network interface to monitor (not use; see above)
 --checkperiod|cp [i] (5) : sets the period in seconds between updates
 --verbose|v [0-3] (2) : sets chattiness. 3=debug; 2=normal; 1=less; 0=none.    This only affects verbosity post-start; warning & error messages will still be printed.
 --dispose|d [s] (l) : what to do with the cache files. (l)eave untouched, (c)ompress to a tarball, (d)elete.
 --email [s] : email address to send completion message
 --nowait  :  for scripting, sleep for a few seconds instead of pausing & requiring user input.
 --udr : wrap rsync in the UDT lib for UDP transmissions. UDP is more filtered than TCP and may fail depending on whose networks the packets have to pass.
 --hosts [s]     +
 --hostcheck     +-- see "Options for MultiHost transfers" below
 --commondir [s] +
 --rpathpfx [s]  +
 --version : dumps version string and exits
 --help : this help
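
The defaults for '--NP' (sqrt(#CPUs)) and '--maxload' (NP+2) listed above can be worked out by hand. A minimal sketch, with hypothetical function names; how pfp rounds a non-integer square root is not stated here, so rounding UP is an assumption:

```shell
#!/usr/bin/env bash
# Hypothetical helper: $1 = number of CPUs; prints sqrt($1) rounded up.
# (The rounding direction is an assumption, not taken from pfp.)
default_np() {
  awk -v n="$1" 'BEGIN { r = int(sqrt(n)); if (r * r < n) r++; print r }'
}

# Hypothetical helper: $1 = NP; prints the default maxload, NP+2.
default_maxload() {
  echo $(( $1 + 2 ))
}

# 'nproc' (GNU coreutils) reports the number of available CPUs.
np=$(default_np "$(nproc)")
echo "NP=$np maxload=$(default_maxload "$np")"
```

On a 16-CPU host this sketch gives NP=4 and maxload=6.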

Options for using filelists

(thanks to Bill Abbott for the inspiration/guidance).

The following 3 options provide a means of explicitly naming the files you wish to transfer by means of filelists, whether by 'find' or other means. Typically, you will provide a list of files, for example generated by a DB lookup (GPFS or Robinhood) with full path names. If you use this list directly with rsync, it will remove the leading '/' but then place the file with that otherwise full path inside the target dir. So '/home/hjm/DL/hello.c' would be placed in '/target/home/hjm/DL/hello.c'.
If this result is OK, then simply use the '--fromlist' option to specify the file of files.

If the list of files is NOT fully qualified then you should make sure that the command is run from the correct dir so that the rsyncs can find the designated dirs & files.

If you want the file 'hello.c' to end up as '/target/DL/hello.c' (ie remove the original '/home/hjm'), you would use the --trimpath option as follows: '--trimpath=/home/hjm'. This will remove the given path before transferring it and assure that the file ends up in the right place. This should work even if the command is executed away from the directory where the files are rooted. If you have already modified the file list to remove the leading dir path, then of course you don't need to use '--trimpath' option.
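
To see what trimming does to a file list, here is a small sketch that applies the same prefix removal by hand. It only illustrates the effect; `trim_paths` is a hypothetical name and this is not how pfp implements '--trimpath'.

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip the leading dir $1 (and the '/' after it) from
# every path read on stdin. Paths that don't start with $1 pass unchanged.
trim_paths() {
  local prefix="${1%/}"          # tolerate a trailing '/' on the prefix
  local path
  while IFS= read -r path; do
    printf '%s\n' "${path#"$prefix"/}"
  done
}

printf '/home/hjm/DL/hello.c\n' | trim_paths /home/hjm
```

The result, 'DL/hello.c', is the relative path that ends up under the target dir, ie '/target/DL/hello.c'.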


- --fromlist|fl [s] : take explicit input file list from given file, 
                        1 path name per line.
- --trimpath|tp [s] : path to trim from front of full path name if 
                        '--fromlist' file contains full path names and 
                        you want to trim them.
- --trustme|tm : with '--fromlist' above allows the use of file lists of the form:

                        size in bytes<tab>/fully/qualified/filename/path  
                        825692            /home/hjm/nacs/hpc/movedata.txt
                        87456826          /home/hjm/Downloads/xme.tar.gz
                        etc

This allows lists to be produced elsewhere to be fed directly to pfp without a file stat() or complete recursion of the dir tree. So if you're using an SQL DB to track your filesystem usage like Robinhood or a filesystem like GPFS that can emit such data, it can save some startup time on gigantic file trees.
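
If you have no DB to query, a list in the same 'size<tab>path' form can be produced with GNU find. A minimal sketch; the function name `make_trustme_list` is hypothetical and the '-printf' action requires GNU find (normal on Linux):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "size in bytes<TAB>path" for every regular file
# under $1. Give $1 as a fully qualified dir so the paths come out fully
# qualified too. Requires GNU find for -printf.
make_trustme_list() {
  find "$1" -type f -printf '%s\t%p\n'
}

# Small demo in a scratch dir:
demo_dir="$(mktemp -d)"
printf 'hello' > "$demo_dir/hello.txt"
make_trustme_list "$demo_dir"
rm -rf "$demo_dir"
```

Of course this still walks the tree, so it saves nothing over letting pfp do the recursion; the point is only to show the format that '--fromlist' together with '--trustme' expects.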

Options for MultiHost transfers


--hosts [s] ....... specify the SEND and REC (receive) hosts, optionally
                    supplying REC hosts with individual alternate paths to
                    store data. If you specify /different/ REC paths, the
                    SEND data will be split over those host:/path
                    combinations, so they will have to be manually combined
                    afterwards. (This is to allow different remote
                    filesystems to accept (hopefully) very high bandwidth
                    transmission without impacting other processes.)
                    The required last element in a multihost command is
                    'POD::/path', which is the default path for any REC
                    hosts that haven't been defined in the '--hosts' option.
                    (see Good Example 5 below)
--hostcheck ....... pre-check to make sure that the SEND & REC hosts
                    specified with '--hosts' do not have any rsyncs running.
                    If they do, the number of them is reported. Those rsyncs
                    may be valid and independent of pfp, but they may be
                    evidence of a failed pfp which may interfere with a new
                    pfp launch.
                    In order to prevent mistakes, I rec using the '--ro'
                    flag rather than the full '--rsyncopts'.
                    This option also pushes the required utilities to the
                    REC slave hosts to make sure that they have the exe's
                    necessary to run with full functionality.
--commondir [s] ... the shared, common dir in which all chunk files and
                    rsync logs will be stored. Similar to '--altcache' but
                    MUST be readable by all SEND hosts.
--rpath [s] ....... the remote PATH prefix on the SLAVES to check for the
                    bits needed to run this. Prefixed to the remote ssh cmd
                    as 'export PATH=<rpath string>:\$PATH;'
                    The rpath string can contain as many paths as you'd
                    like, separated by ':', tho vars have to be escaped
                    appropriately:
                      --rpath="~/bin:\$HOME/pfp/bin"
                    The default is ~/bin:~/.pfp, and ':\$PATH' is also
                    appended, so
                      --rpath="~/bin:\$HOME/pfp/bin"
                    is transmitted as:
                      --rpath="~/bin:\$HOME/pfp/bin:\$PATH"
                    (This option may be removed as the --hostcheck option
                    obviates most of the utility of --rpath.)

The host string format is a comma-delimited set of 'Send=Receive' hosts, for example: "s1=r1:/path1,s2=r2:/path2,s3=r3:/path3,s4=r4,s5=r5", where each 's#' and 'r#' implies a full "login@host" string, altho if the user on the master server is the same as on the other hosts, the 'user@' is not required (ssh rules). Also, each 'r#' can have a storage path appended (r2:/path2). If the REC path is not given, the path from the final 'POD::/path' target is appended, ie: pfp [option option option..] POD::/common/default/receive/target
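
To check a host string before launching, you can pull it apart by hand. The sketch below follows the format described above; it is not pfp's parser, `split_hosts` is a hypothetical name, and passing the default path as a second argument (instead of reading it from 'POD::/path') is a simplification.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split a '--hosts' style string into one line per pair,
# printed as: send<TAB>receive<TAB>path
# $1 = host string, $2 = default REC path (what follows 'POD::').
split_hosts() {
  local default_path="$2" pair send rec path
  local IFS=','                      # pairs are comma-delimited
  for pair in $1; do
    send="${pair%%=*}"               # left of the first '='
    rec="${pair#*=}"                 # right of the first '='
    case "$rec" in
      *:*) path="${rec#*:}"; rec="${rec%%:*}" ;;   # explicit REC path
      *)   path="$default_path" ;;                 # fall back to the default
    esac
    printf '%s\t%s\t%s\n' "$send" "$rec" "$path"
  done
}

split_hosts "s1=r1:/path1,s2=r2:/path2,s4=r4" /common/default/receive/target
```

Here s1 and s2 keep their own paths, while s4=r4 picks up the default '/common/default/receive/target'.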

The multihost command will exit once the 'fpart' chunking process is finished and leave the SEND hosts running independently on the remote hosts. They will continue to send output back to the originating terminal prefixed/suffixed with the SEND hostname so you can decipher which SEND host is saying what. However, unlike the single-host version, killing or suspending the originating program will have no effect on the SEND hosts. The remote rsyncs will have to be killed manually. The following command executed on the originating node will kill all the rsyncs on the SEND and REC hosts defined:

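   hosts2hit="send1 send2 send3 rec1 rec2"
   for host in $hosts2hit; do
     echo "[ processing $host ]"
     # kill every rsync owned by the login user on $host;
     # the 'rsyn[c]' pattern keeps grep from matching itself
     ssh "$host" "kill -9 \$(ps ux | grep 'rsyn[c]' | awk '{print \$2}')"
   done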

(The above bash loop has been edited to render correctly in rendered markdown.) (I think.) This SEND host independence should be addressed shortly via socket-based controls.

See 'Good Example 5', below.

Examples

Good Example 1

% parsyncfp  --maxload=5.5 --NP=4 \
--chunksize=$((1024 * 1024 * 4)) \
--startdir='/home/hjm' dir[123]  \
hjm@remotehost:~/backups

where:

It uses 4 instances to rsync dir1 dir2 dir3 to hjm@remotehost:~/backups

Good Example 2

% parsyncfp --interface eth0  --chunksize=87682352 \
--rsyncopts="--exclude='[abc]*'"  nacs/fabio   \
hjm@moo:~/backups

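where the network monitoring is set to 'eth0', the chunk size is given in integer bytes, and rsync gets only the '--exclude' pattern (names starting with 'a', 'b', or 'c'); since '--rsyncopts' is used, the default '-a -s' flags are NOT added.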

Good Example 3

% parsyncfp -v 1 --nowait --ac pfpcache1 --NP 4 \
--cp=5 --cs=50M --ro '-az' linux-4.8.4 moo:~/test

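where '-v 1' reduces the chattiness, '--nowait' sleeps instead of pausing for user input, '--ac pfpcache1' puts the cache in 'pfpcache1', '--cp=5' updates the display every 5s, '--cs=50M' sets the chunk size to 50M, and "--ro '-az'" passes '-az' to rsync in place of the default '-a -s'.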

Good Example 4

parsyncfp --NP=8 --chunksize=500M --fromlist=/home/hjm/dl550 \
hjm@moo:/home/hjm/test

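where '--fromlist' names a file (/home/hjm/dl550) that lists the files to transfer, 1 path name per line, so no source dirs are given on the command line.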

Good Example 5

parsyncfp --verbose=2 --ro='-aslz' \
--hosts="bigben=bridgit.ure.edu:/d1/in, \
          pooki=bridgit.ure.edu:/d2/in, \
        stunted=bridgit.ure.edu:/d3/in" \
--hostcheck  --ro='-aslz' --NP 4 --chunk 15G --check 5 --dispo=c \
--interface=wlp3s0 --commondir=/home/hjm/pfp --startdir /home/hjm/pfp \
zeba motif-2.3.6 tree-1.8.0 singularity-2.3.1 dmtcp-2.5.2  POD::/

The above multihost command shows 3 SEND hosts (bigben, pooki, stunted) all sending data to a single REC host bridgit.ure.edu altho the data is being split among 3 filesystems (mounted on /d1, /d2, /d3). You could also define 3 REC hosts, writing data to the SAME PATH if that had a better performance profile. It shows the preferred way of defining the rsyncopts with "--ro='-aslz'" and the '--dispo=c' option requests that the cachefiles be compressed and tar'ed afterwards. The 'POD::/' terminal element is the (required) default path for any undefined REC hosts (of which there aren't any in this example)

Error Example 1

% pwd
/home/hjm  # executing parsyncfp from here

% parsyncfp --NP4 --compress /usr/local  /media/backupdisk

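Why this is an error:

'--NP4' is not a valid option (it needs '--NP=4' or '--NP 4'), '--compress' is an rsync option and so has to be passed via '--rsyncopts', and '/usr/local' does not originate from the current dir (/home/hjm), so it has to be given as '--startdir=/usr' plus 'local'.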

The correct version of the above command is:

% parsyncfp --NP=4 --rsyncopts='--compress' --startdir=/usr local /media/backupdisk

Error Example 2

Why this is an error:

The correct version of the above command is:

% parsyncfp  --startdir=/usr  local  hjm@remote:/home/hjm/mooslocal

Error Example 3

Why this is an error:

The correct version of the above command is:

Changes

2.00 (as pfppod)

(Multihost ITERation) Apr 22, 2021. Lots of changes...

1.72

(California Lockdown) Dec 6, 2020. No option changes. Intercepted rsync options to forbid those that increase verbosity, to avoid collision with pfp's IO handling. Including: -v/--verbose, --version, -h/--help

1.71

(GoVote) Nov 2, 2020. No option changes. Changed how checking for external utilities works. Separates the required from recommended utilities and now continues with a WARN if it doesn't find the recommended utils.

1.70

(Silverado Fire), Oct 27, 2020. No option changes. Fixed bug about setting up the fpart command (not a problem with fpart, just coercing names with spaces to be represented correctly).

1.69

(Covid Synchronicity), Aug 17, 2020. No option changes, but included a significant change in the way that pfp reads the chunk files that fpart provides. Before this version, pfp checked only for the existence of the chunk files and could therefore launch an rsync instance on a filelist that had not been completed. This might happen with a fileset that had already been mostly transferred: rsync could overrun the file list and theoretically exit before fpart finished writing to the file, leaving some files unsync'ed.

pfp now uses fpart's '-W' option to run a post-file-close script to move the finished and closed file to the processing directory, assuring that the chunk files are not read before fpart is finished writing to them. Thanks again to Ganael Laplanche (fpart author) for discussion and suggesting the simplest way to address the problem.

Also some better checking for nonsense or non-existent files/dirs.

1.65

1.64

1.63

1.61

1.60

1.58

IMPORTANT NOTE (May 31, 2019)

Thanks to the long-suffering efforts of Jeff Dullnig, I've discovered that when parsyncfp goes thru multiple suspend/unsuspend cycles, it fails to correctly rsync all the src files to the target.

If the '--maxload' option is kept high enough to avoid any suspensions, it syncs correctly.

If you're using parsyncfp now, please be aware that if forked rsyncs cycle thru suspend / unsuspends you will probably not end up with a correct target. I'll be working on this to determine if it can be fixed or if that 'feature' has to be removed.

1.57

1.56

1.54

1.53

1.50

1.47

1.46