htrc / htrc-feature-reader

Tools for working with HTRC Feature Extraction files
39 stars 12 forks source link

feature request: wrap rsync downloads inside module. #9

Closed bmschmidt closed 7 years ago

bmschmidt commented 7 years ago

I find it useful to be able to download from inside python to a local store organized in a pairtree structure. Seems like it might be nice to allow this via a function in htrc_features.utils

I'm currently just doing this:

def download_file(htid, root_dir):
    if not root_dir.endswith("/"): root_dir += "/"
    dest_file = htrc_features.utils.id_to_rsync(htid)
    args = ["rsync", "-R","-a","-v", "data.analytics.hathitrust.org::features/" + dest_file, root_dir]
    subprocess.call(args)

Calling a subprocess is not the best way to do this, prob. But TBH I don't really understand how you're using rsync (i.e., what ::features is doing).

organisciak commented 7 years ago

Thanks Ben, I'll implement a version of this into htrc_features.utils. I've been struggling with rsync support: I really want to integrate it, but am apprehensive about subprocesses. There's no simple path to a native Python rsync to make life easier for Windows users, so better a subprocess than nothing.

organisciak commented 7 years ago

There is experimental HTTP download support that is currently on David Bainbridge's server (https://bedrock.resnet.cms.waikato.ac.nz/vol-checker/VolumeCheck?download-id={{volume-id}}); that could be the fallback for Windows users, but it wouldn't be appropriate for large downloads.

organisciak commented 7 years ago

The first version is in the experimental branch: https://github.com/htrc/htrc-feature-reader/commit/ad3d9630b771e1ff97fcbe443813a1544b78ced7

Still needs docs, tests, a check for whether the software on the system, and an error on non-zero return codes (where Python <3.5). Otherwise, should be useful without being overengineered.

Example:

htids = ['nyp.33433042068894', 'nyp.33433074943592', 'nyp.33433074943600']
utils.download_file(htids)

Changes from your code:

Example of non-default args:

In [1]: from htrc_features import utils

In [2]: htids = ['nyp.33433042068894', 'nyp.33433074943592', 'nyp.33433074943600']

In [3]: returncode, output = utils.download_file(htids, outdir='test_download', keep_dirs=True, silent=False)

In [4]: print(output)
[sandbox] Welcome to the HathiTrust Research Center rsync server.

receiving file list ... done
nyp/
nyp/pairtree_root/
nyp/pairtree_root/33/
nyp/pairtree_root/33/43/
nyp/pairtree_root/33/43/30/
nyp/pairtree_root/33/43/30/42/
nyp/pairtree_root/33/43/30/42/06/
nyp/pairtree_root/33/43/30/42/06/88/
nyp/pairtree_root/33/43/30/42/06/88/94/
nyp/pairtree_root/33/43/30/42/06/88/94/33433042068894/
nyp/pairtree_root/33/43/30/74/
nyp/pairtree_root/33/43/30/74/94/
nyp/pairtree_root/33/43/30/74/94/35/
nyp/pairtree_root/33/43/30/74/94/35/92/
nyp/pairtree_root/33/43/30/74/94/35/92/33433074943592/
nyp/pairtree_root/33/43/30/74/94/36/
nyp/pairtree_root/33/43/30/74/94/36/00/
nyp/pairtree_root/33/43/30/74/94/36/00/33433074943600/

sent 342 bytes  received 542 bytes  252.57 bytes/sec
total size is 377,248  speedup is 426.75
organisciak commented 7 years ago

Whoops, broke Python 2.7. https://travis-ci.org/htrc/htrc-feature-reader/jobs/210651580#L766

Will fix shortly.

organisciak commented 7 years ago

Bug fixed. The polishing tasks still outstanding.

bmschmidt commented 7 years ago

Is the bugfix for 2.7 in the online branch? I've had to workaround on 2.7, but maybe it's just not pushed?

organisciak commented 7 years ago

It's still checked into the experimental branch; that build isn't throwing errors on Travis, as the syntax error previously was, but without tests I don't if less blatant errors exist.

Carving out an hour right now to work on tests and documentation, toward merging to master.

bmschmidt commented 7 years ago

Yeah you'll see this then, but it's broken then because you use subprocess.call only if

if major <= 3 and minor < 5:

but that conditions isn't met for 2.7.

Easiest probably just

if major < 3 or minor < 5:
organisciak commented 7 years ago

Ah, thanks! I'm struggling to get a version of 2.7 running on Bash for Windows, whereas I don't have rsync on regular Windows.

I reversed the order of the check to be more clear. if (major >=3 and minor >=5), then use the recommended subprocess.run, else use call.

organisciak commented 7 years ago

I'm not sure what the best way to proceed with the tests is because I've now made the tests only runnable on systems with rsync. Ultimately, if it's fine on Travis (seems to be) then maybe it's not worth fretting over.

organisciak commented 7 years ago

Done in 9e1b13d.