htrc / htrc-feature-reader

Tools for working with HTRC Feature Extraction files

Online initialization #13

Closed by organisciak 7 years ago

organisciak commented 7 years ago

When the Rsync subprocess is done (#9), it would be nice to initialize volumes that haven't been downloaded yet.

For example:

from htrc_features import FeatureReader

fr = FeatureReader(ids=['nyp.33433042068894', 'nyp.33433074943592', 'nyp.33433074943600'])
for volume in fr.volumes():
    volume.do_something()  # placeholder for whatever per-volume work is needed

In the generator, each time a volume is requested, the file for its ID can be downloaded to a temporary location, read into memory, and deleted. If HTRC implements an HTTP download, that would be better, as the download can go straight into memory.
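A minimal sketch of the download, read, delete cycle described above, not the library's actual implementation. It assumes feature files are bz2-compressed JSON; `iter_volumes`, `fetch`, and `fake_fetch` are hypothetical names introduced for illustration, with `fetch` standing in for whatever transfer mechanism (Rsync or otherwise) ends up being used.

```python
import bz2
import json
import os
import tempfile


def iter_volumes(ids, fetch):
    """Yield one parsed volume per ID, leaving no file on disk afterwards.

    `fetch` is a hypothetical callable: volume ID -> bz2-compressed JSON bytes.
    """
    for volume_id in ids:
        # Download to a temporary location.
        fd, path = tempfile.mkstemp(suffix=".json.bz2")
        try:
            with os.fdopen(fd, "wb") as tmp:
                tmp.write(fetch(volume_id))
            # Read the file into memory.
            with bz2.open(path, "rt", encoding="utf-8") as handle:
                volume = json.load(handle)
        finally:
            # Delete the temporary file, even if reading failed.
            os.remove(path)
        yield volume


def fake_fetch(volume_id):
    # Stand-in for a real download: builds a tiny compressed payload.
    return bz2.compress(json.dumps({"id": volume_id}).encode("utf-8"))


volumes = list(iter_volumes(["nyp.33433042068894", "nyp.33433074943592"], fake_fetch))
```

Because the work happens inside a generator, only one temporary file exists at a time, and it is removed before the caller ever sees the volume.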

organisciak commented 7 years ago

Code done and new tests passing: https://github.com/htrc/htrc-feature-reader/tree/online_read

Rather than using the Rsync subprocess, I implemented it around an HTTP download endpoint. The one blocker is that HTRC doesn't yet officially support the web downloader, so the URL is temporary. @borice, let me know when we have a permanent one.
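For illustration, an HTTP-based read can skip the temporary file entirely. This is a rough sketch under assumptions, not the code on the `online_read` branch: `read_volume`, `url_template`, and `fake_open` are hypothetical, the example URL is a placeholder rather than the HTRC endpoint, and the payload is assumed to be bz2-compressed JSON.

```python
import bz2
import io
import json
from urllib.request import urlopen


def read_volume(volume_id, url_template, open_url=urlopen):
    """Fetch one volume over HTTP and parse it without touching disk.

    `url_template` is a hypothetical format string containing `{id}`.
    """
    url = url_template.format(id=volume_id)
    with open_url(url) as response:
        compressed = response.read()
    # Decompress and parse entirely in memory.
    return json.loads(bz2.decompress(compressed).decode("utf-8"))


def fake_open(url):
    # Stand-in for urlopen so the sketch runs offline; echoes the URL back.
    payload = json.dumps({"url": url}).encode("utf-8")
    return io.BytesIO(bz2.compress(payload))


volume = read_volume(
    "nyp.33433042068894",
    "https://example.org/{id}.json.bz2",  # placeholder, not the real endpoint
    open_url=fake_open,
)
```

Passing the opener as a parameter keeps the temporary URL out of the function body, so swapping in a permanent endpoint later would only change the template.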

borice commented 7 years ago

This has been done. The mashup and volume checker from David are here: http://data.analytics.hathitrust.org/htrc-mashup/VolumeCheck