jjjake / internetarchive

A Python and Command-Line Interface to Archive.org
GNU Affero General Public License v3.0

Feature request: download multiple files from a collection in parallel #412

Open LogicalKnee opened 3 years ago

LogicalKnee commented 3 years ago

The docs provide examples for using GNU Parallel to perform tasks simultaneously. However, this appears to be limited to operations at the item level. For the use case of downloading a single item containing many large files, performing the downloads in parallel would provide a significant speed boost. While it is currently possible to achieve this with external tools (e.g. obtaining a file list with ia, then using curl/wget with parallel), it would be nice if ia supported this natively.

While this feature could be implemented with the existing requests library, I assume it would likely tie in with any effort to port to pycurl (https://github.com/jjjake/internetarchive/issues/244, https://github.com/jjjake/internetarchive/issues/247).

JustAnotherArchivist commented 3 years ago

You can already do that, though the docs don't give an example. ia list produces output with one filename per line, which you can then parallelise with ia download and GNU Parallel, xargs, or whichever tool you prefer. For example:

ia list identifier | xargs -P 8 -n 1 ia download identifier

Downsides: you need to repeat the item identifier, and it may be very inefficient if the item has many small files.
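One way to soften the small-files downside (a sketch, not something from the thread) is to batch several filenames into each invocation with xargs -n, so the per-process startup cost is amortized. The snippet below demonstrates the pattern with echo standing in for ia download identifier; with the real tool it would read ia list identifier | xargs -P 8 -n 20 ia download identifier, assuming ia download accepts multiple filenames after the identifier.

```shell
# Pattern: -n 2 packs two filenames per invocation, -P 2 runs two
# invocations at a time. echo stands in for `ia download identifier`.
printf 'file1\nfile2\nfile3\nfile4\n' | xargs -P 2 -n 2 echo ia download identifier
```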

If this were to be implemented directly in ia, I'd argue that aiohttp or similar is the least terrible route. Parallel requests aren't trivial with requests or PycURL as they fundamentally lack parallelism and you need to use threads (though there are of course packages implementing that, at least for requests). I'm not sure that's worth the effort though.
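For reference, the thread-based route mentioned above can be sketched with the standard library alone. This is a minimal illustration, not anything in internetarchive: fetch is a hypothetical callable (e.g. a wrapper around requests.get for one file) supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(filenames, fetch, max_workers=8):
    """Run fetch(name) for each filename on a pool of worker threads.

    fetch is any callable that downloads one file and returns a result;
    results are returned in the same order as filenames.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, filenames))

# Stand-in fetch function, just to show the call shape:
print(download_all(["a.txt", "b.txt"], lambda name: f"got {name}"))
```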

LogicalKnee commented 3 years ago

You can already do that, though the docs don't give an example.

Yes, that's what I was getting at in the original post; there are already ways to download in parallel with a list generated by ia. The heart of the feature request was an integrated way to achieve the same thing, something akin to curl's --parallel (and --parallel-max) flags.

Parallel requests aren't trivial with requests

Having had a quick look around at how to achieve this with requests, I'd tend to agree. The general consensus seems to be "don't" or "use multiprocessing"; the latter requires careful handling of the worker processes/threads.

On the other hand, pycurl contains a CurlMulti class, which is a wrapper around libcurl's parallel features. pycurl provides sample usage of this functionality.

laptopsftw commented 3 years ago

This reminds me of youtube-dl, which can use external downloaders (like aria2c, etc.) to do the actual downloading.

idk if that's complicated though

jjjake commented 3 years ago

Here's an example of how you could download the files within an item concurrently as well:

ia list nasa | parallel 'ia download nasa {}'

I'll leave this open in case others have feedback, but I personally think this is best handled with external tools like parallel or xargs.