Closed bmschmidt closed 7 years ago
Thanks Ben, I'll implement a version of this into htrc_features.utils
. I've been struggling with rsync
support: I really want to integrate it, but am apprehensive about subprocesses. There's no simple path to a native Python rsync to make life easier for Windows users, so better a subprocess than nothing.
There is experimental HTTP download support that is currently on David Bainbridge's server (https://bedrock.resnet.cms.waikato.ac.nz/vol-checker/VolumeCheck?download-id={{volume-id}}); that could be the fallback for Windows users, but it wouldn't be appropriate for large downloads.
The first version is in the experimental branch: https://github.com/htrc/htrc-feature-reader/commit/ad3d9630b771e1ff97fcbe443813a1544b78ced7
Still needs docs, tests, a check for whether the software on the system, and an error on non-zero return codes (where Python <3.5). Otherwise, should be useful without being overengineered.
Example:
htids = ['nyp.33433042068894', 'nyp.33433074943592', 'nyp.33433074943600']
utils.download_file(htids)
Changes from your code:
rsync --files-from=/tmp/tmpfilelocation
root_dir
is outdir
./
Example of non-default args:
In [1]: from htrc_features import utils
In [2]: htids = ['nyp.33433042068894', 'nyp.33433074943592', 'nyp.33433074943600']
In [3]: returncode, output = utils.download_file(htids, outdir='test_download', keep_dirs=True, silent=False)
In [4]: print(output)
[sandbox] Welcome to the HathiTrust Research Center rsync server.
receiving file list ... done
nyp/
nyp/pairtree_root/
nyp/pairtree_root/33/
nyp/pairtree_root/33/43/
nyp/pairtree_root/33/43/30/
nyp/pairtree_root/33/43/30/42/
nyp/pairtree_root/33/43/30/42/06/
nyp/pairtree_root/33/43/30/42/06/88/
nyp/pairtree_root/33/43/30/42/06/88/94/
nyp/pairtree_root/33/43/30/42/06/88/94/33433042068894/
nyp/pairtree_root/33/43/30/74/
nyp/pairtree_root/33/43/30/74/94/
nyp/pairtree_root/33/43/30/74/94/35/
nyp/pairtree_root/33/43/30/74/94/35/92/
nyp/pairtree_root/33/43/30/74/94/35/92/33433074943592/
nyp/pairtree_root/33/43/30/74/94/36/
nyp/pairtree_root/33/43/30/74/94/36/00/
nyp/pairtree_root/33/43/30/74/94/36/00/33433074943600/
sent 342 bytes received 542 bytes 252.57 bytes/sec
total size is 377,248 speedup is 426.75
Whoops, broke Python 2.7. https://travis-ci.org/htrc/htrc-feature-reader/jobs/210651580#L766
Will fix shortly.
Bug fixed. The polishing tasks still outstanding.
Is the bugfix for 2.7 in the online branch? I've had to workaround on 2.7, but maybe it's just not pushed?
It's still checked into the experimental branch; that build isn't throwing errors on Travis, as the syntax error previously was, but without tests I don't if less blatant errors exist.
Carving out an hour right now to work on tests and documentation, toward merging to master.
Yeah you'll see this then, but it's broken then because you use subprocess.call only if
if major <= 3 and minor < 5:
but that conditions isn't met for 2.7.
Easiest probably just
if major < 3 or minor < 5:
Ah, thanks! I'm struggling to get a version of 2.7 running on Bash for Windows, whereas I don't have rsync on regular Windows.
I reversed the order of the check to be more clear. if (major >=3 and minor >=5)
, then use the recommended subprocess.run, else use call
.
I'm not sure what the best way to proceed with the tests is because I've now made the tests only runnable on systems with rsync
. Ultimately, if it's fine on Travis (seems to be) then maybe it's not worth fretting over.
Done in 9e1b13d.
I find it useful to be able to download from inside python to a local store organized in a pairtree structure. Seems like it might be nice to allow this via a function in
htrc_features.utils
I'm currently just doing this:
Calling a subprocess is not the best way to do this, prob. But TBH I don't really understand how you're using rsync (i.e., what
::features
is doing).