hjmangalam / parsyncfp2

MultiHost parallel rsync wrapper

parsyncfp2

A MultiHost parallel rsync wrapper writ in Perl by Harry Mangalam (hjmangalam@gmail.com). Released under GPL v3.

(Changes moved to the bottom of this file)

Background

NB: If you don't want to transfer at least 10s of GB across a network, this is probably not the tool you want. Use rsync alone if you need or will need a sync operation, or scp if the data needs to be encrypted.

parsyncfp2 (aka pfp2) is the next generation of the family that started with parsync, which, with Ganael Laplanche's fpart, begat parsyncfp (aka pfp), which has further mutated into the MultiHost multi-send, multi-receive organism unimaginatively called parsyncfp2.

Like parsyncfp, pfp2 uses fpart to aggregate files into chunks (or partitions) that are allocated to individual rsyncs. The main difference between them is that pfp2 can spread the send and receive functions among multiple hosts (a shared filesystem is required on the sending side). As with pfp, it collects files based on aggregate size into chunkfiles which can be fed to rsync on a chunk-by-chunk basis. This allows pfp2 to begin transferring files before the recursive descent of the source dir is complete, which can save many hours of prep time on very large dir trees. In addition, pfp2 can re-use the chunkfiles it has generated, so after an interruption you can skip the re-generation of the chunkfile list (which is pretty fast, but for a PB filesystem can still take a long time and generate a lot of competing IO).
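To make the chunking idea concrete, here is a minimal Python sketch of grouping files by aggregate size while the directory walk is still in progress. It is not how fpart or pfp2 is implemented; the function names `iter_files` and `chunk_by_size` and the `max_bytes` parameter are hypothetical and exist only for illustration.

```python
import os
from typing import Iterable, Iterator, List, Tuple


def iter_files(root: str) -> Iterator[Tuple[str, int]]:
    """Lazily walk root, yielding (path, size) for each non-directory entry."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            yield path, size


def chunk_by_size(files: Iterable[Tuple[str, int]], max_bytes: int) -> Iterator[List[str]]:
    """Yield lists of paths whose combined size stays within max_bytes.

    A chunk is yielded as soon as the next file would overflow it, so a
    consumer can start working before the walk has finished. A single file
    larger than max_bytes ends up in a chunk of its own.
    """
    chunk: List[str] = []
    total = 0
    for path, size in files:
        if chunk and total + size > max_bytes:
            yield chunk
            chunk, total = [], 0
        chunk.append(path)
        total += size
    if chunk:
        yield chunk  # final, possibly partial chunk
```

Because both functions are generators, something like `for chunk in chunk_by_size(iter_files("/some/dir"), 10 * 2**30): ...` would receive the first chunk long before a very large tree has been fully traversed, which is the property the paragraph above describes.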

NB: fpart recently changed the numbering of its chunk files to start at 1 instead of 0, and this version of pfp2 is the first github release that tracks that change. Using fpart 1.5.1 works fine, as do the last couple of releases.

If your use involves transit over IB networks, parsyncfp requires 'perfquery' and 'ibstat', Infiniband utilities written by Hal Rosenstock < hal.rosenstock [at] gmail.com >

pfp2 is tested on Linux. The MacOSX port is in hibernation.

pfp2 needs to be installed only on the SOURCE end of the transfer and only works in local SOURCE -> remote TARGET mode (it won't allow the reverse, remote SOURCE -> local TARGET, and emits an error and exits if that is attempted). It requires that ssh shared keys be set up prior to operation. If it detects that ssh keys are NOT set up correctly, it will ask for permission to try to remedy that situation. Check your local and remote ssh keys to make sure that it has done so correctly. Typically, they're in your ~/.ssh dir.
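The direction restriction described above can be illustrated with a small Python sketch. This is not pfp2's own check (pfp2 is written in Perl); `is_remote` and `check_direction` are hypothetical helpers, and the rule used (a colon before the first slash marks a `host:path` spec) is an assumption modelled on how rsync distinguishes remote from local paths.

```python
def is_remote(spec: str) -> bool:
    """Return True if spec looks like [user@]host:path or an rsync:// URL."""
    if spec.startswith("rsync://"):
        return True
    # Only a colon that appears before the first slash marks a remote host;
    # "./odd:name" or "/data/odd:name" are therefore treated as local paths.
    head = spec.split("/", 1)[0]
    return ":" in head


def check_direction(source: str, target: str) -> None:
    """Raise ValueError if the SOURCE is remote, i.e. a pull was requested."""
    if is_remote(source):
        raise ValueError(
            f"remote SOURCE '{source}' is not supported; "
            "run from the host that holds the data and push to the TARGET"
        )
```

For example, `check_direction("/data/run1", "user@host:/backup")` passes silently, while swapping the two arguments raises an error.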

It uses whatever rsync is available on the TARGET. It uses a number of Linux-specific utilities so if you're transferring between Linux and a FreeBSD host, install pfp2 on the Linux side.

Installation

Installation of 'parsyncfp2' is fairly simple. There's not yet a deb or rpm package, but the bits to make it work that are not part of a fairly standard Linux distro are the Perl scripts parsyncfp2, scut (like cut but a bit more flexible), and stats (spits out descriptive statistics of whatever is fed to it).
The rest of the dependencies are listed here:

Required utilities and packages

Should the above commands not fulfill the requirements or be missing from your set of repositories, the utilities are listed below.

Recommended Utilities

Changes

stats

2.59

2.571

2.57

2.56

2.55

2.51

2.44

2.43

2.42

2.41

2.40

2.39

2.38

2.37

2.3

2.00 - 2.39

2.00 (as pfppod)

(Multihost ITERation) Apr 22, 2021. Lots of changes...

1.72

(California Lockdown) Dec 6, 2020. No option changes. Intercepted rsync options to forbid those that increase verbosity, to avoid collision with pfp's IO handling. These include: -v/--verbose, --version, -h/--help
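As a rough illustration of this kind of interception, the following Python sketch scans a user-supplied rsync option string for the options named above. It is not pfp's Perl code; the function name `forbidden_rsync_options` is hypothetical, and treating `-h` as forbidden only when it stands alone is an assumption of this sketch.

```python
import shlex
from typing import List

FORBIDDEN_LONG = {"--verbose", "--version", "--help"}


def forbidden_rsync_options(opt_string: str) -> List[str]:
    """Return the tokens in opt_string that would be rejected."""
    offenders = []
    for tok in shlex.split(opt_string):
        if tok.startswith("--"):
            # Compare only the option name, ignoring any "=value" suffix.
            if tok.split("=", 1)[0] in FORBIDDEN_LONG:
                offenders.append(tok)
        elif tok.startswith("-") and len(tok) > 1:
            # Short options may be bundled, e.g. "-av". Limitation: a bundle
            # with an attached argument containing "v" is a false positive.
            if "v" in tok[1:] or tok == "-h":
                offenders.append(tok)
    return offenders
```

A caller could refuse to start if the returned list is non-empty, printing the offending tokens so the user can remove them.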

1.71

(GoVote) Nov 2, 2020. No option changes. Changed how checking for external utilities works. Separates the required from recommended utilities and now continues with a WARN if it doesn't find the recommended utils.
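The required-versus-recommended split can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not pfp's implementation; `check_utilities` and its `which` parameter are hypothetical, and which utilities belong in which list is left to the caller.

```python
import shutil
import sys
from typing import Callable, Iterable, List, Optional


def check_utilities(
    required: Iterable[str],
    recommended: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Exit if a required utility is missing; only warn for recommended ones.

    Returns the list of missing recommended utilities.
    """
    missing_required = [u for u in required if which(u) is None]
    missing_recommended = [u for u in recommended if which(u) is None]
    for util in missing_recommended:
        print(f"WARN: recommended utility '{util}' not found; continuing",
              file=sys.stderr)
    if missing_required:
        raise SystemExit("missing required utilities: " + ", ".join(missing_required))
    return missing_recommended
```

Passing the lookup function as a parameter keeps the check easy to exercise without depending on what happens to be installed on the machine.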

1.70

(Silverado Fire), Oct 27, 2020. No option changes. Fixed bug about setting up the fpart command (not a problem with fpart, just coercing names with spaces to be represented correctly).

1.69

(Covid Synchronicity), Aug 17, 2020. No option changes, but included a significant change in the way that pfp reads the chunk files that fpart provides. Before this version, pfp checked only for the existence of the chunk files and could therefore launch an rsync instance on a filelist that had not been completed. This might happen with a fileset that had already been mostly transferred: rsync could overrun the filelist and theoretically exit before fpart finished writing to the file, leaving some files unsync'ed.

pfp now uses fpart's '-W' option to run a post-file-close script to move the finished and closed file to the processing directory, assuring that the chunk files are not read before fpart is finished writing to them. Thanks again to Ganael Laplanche (fpart author) for discussion and suggesting the simplest way to address the problem.

Also some better checking for nonsense or non-existent files/dirs.
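The general pattern behind the 1.69 fix (write in one place, then move the closed file to where the consumer looks) can be sketched in Python as follows. This is not the script that pfp passes to fpart's '-W' option; `publish_chunk`, `ready_chunks`, and the two-directory layout are hypothetical names used only to show the idea.

```python
import os
from typing import List


def publish_chunk(staging_path: str, processing_dir: str) -> str:
    """Move a fully written, closed chunk file into the processing directory."""
    dest = os.path.join(processing_dir, os.path.basename(staging_path))
    # os.replace is atomic when both paths are on the same filesystem, so a
    # reader sees either no file or the complete file, never a partial one.
    os.replace(staging_path, dest)
    return dest


def ready_chunks(processing_dir: str) -> List[str]:
    """List the names of chunk files that are safe to hand to a consumer."""
    return sorted(os.listdir(processing_dir))
```

The consumer only ever lists the processing directory, so the mere existence of a file there implies that the writer has finished with it.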

1.65

1.64

1.63

1.61

1.60

1.58

IMPORTANT NOTE (May 31, 2019)

Thanks to the long-suffering efforts of Jeff Dullnig, I've discovered that when parsyncfp goes thru multiple suspend/unsuspend cycles, it fails to correctly rsync all the src files to the target.

If the '--maxload' option is kept high enough to avoid any suspensions, it syncs correctly.

If you're using parsyncfp now, please be aware that if forked rsyncs cycle thru suspend / unsuspends you will probably not end up with a correct target. I'll be working on this to determine if it can be fixed or if that 'feature' has to be removed.

1.57

1.56

1.54

1.53

1.50

1.47

1.46