**LogicalKnee** opened this issue 3 years ago
You can already do that, though the docs don't give an example. `ia list` produces output with one filename per line, which you can then parallelise with `ia download` and GNU Parallel, `xargs`, or whichever tool you prefer. For example:

```shell
ia list identifier | xargs -P 8 -n 1 ia download identifier
```

Downsides: you need to repeat the item identifier, and it may be very inefficient if the item has many small files.
If this were to be implemented directly in `ia`, I'd argue that `aiohttp` or similar is the least terrible route. Parallel requests aren't trivial with `requests` or PycURL, as they fundamentally lack parallelism, so you need to use threads (though there are of course packages implementing that, at least for `requests`). I'm not sure that's worth the effort though.
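To sketch what the async route might look like, here is a minimal bounded-concurrency download loop using only the stdlib `asyncio` (in a real implementation `fetch` would issue an `aiohttp` GET and write the body to disk; `download_all_async`, `fetch`, and `limit` are hypothetical names, not anything in `ia` today):

```python
import asyncio

async def download_all_async(names, fetch, limit=8):
    """Run fetch(name) for each name, with at most `limit` in flight.

    `fetch` stands in for a real coroutine that would perform an
    aiohttp request and save the response (assumption for this sketch).
    """
    sem = asyncio.Semaphore(limit)

    async def bounded(name):
        async with sem:  # cap the number of concurrent requests
            return await fetch(name)

    # gather() returns results in the same order as the input names
    return await asyncio.gather(*(bounded(n) for n in names))
```

With `aiohttp`, `fetch` would typically wrap a `session.get(url)` call on a shared `ClientSession`.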
> You can already do that, though the docs don't give an example.
Yes, that's what I was getting at in the original post; there are already ways to download in parallel with a list generated by `ia`. The heart of the feature request was an integrated method to achieve the same thing, something akin to `curl`'s `--parallel` (and `--parallel-max`) flags.
> Parallel requests aren't trivial with `requests`
Having a quick look around at how to achieve this with `requests`, I'd tend to agree. The general consensus seems to be "don't" or "use `multiprocessing`", with the latter requiring careful consideration to ensure threads are handled correctly.
On the other hand, `pycurl` contains a `CurlMulti` class, which is a wrapper around `libcurl`'s parallel-transfer features. The `pycurl` docs provide sample usage of this functionality.
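The general shape of that pattern looks roughly like this (a sketch after the perform/select loop in the `pycurl` docs; `fetch_parallel` is a hypothetical name, and error handling and progress reporting are omitted):

```python
import pycurl
from io import BytesIO

def fetch_parallel(urls):
    """Fetch several URLs concurrently via libcurl's multi interface."""
    multi = pycurl.CurlMulti()
    buffers = {}
    for url in urls:
        easy = pycurl.Curl()
        buf = BytesIO()
        easy.setopt(pycurl.URL, url)
        easy.setopt(pycurl.WRITEDATA, buf)  # collect body in memory
        multi.add_handle(easy)
        buffers[url] = (easy, buf)

    # Drive all transfers until libcurl reports none are still running.
    num_active = len(urls)
    while num_active:
        while True:
            ret, num_active = multi.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break
        if num_active:
            multi.select(1.0)  # wait for socket activity

    results = {}
    for url, (easy, buf) in buffers.items():
        results[url] = buf.getvalue()
        multi.remove_handle(easy)
        easy.close()
    return results
```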
This reminds me of youtube-dl, which can actually use external downloaders (like aria2c, etc.) to do the downloading. idk if that's complicated though.
Here's an example of how you could download files from an item concurrently as well:

```shell
ia list nasa | parallel 'ia download nasa {}'
```

I'll leave this open in case others have feedback, but I personally think this is best handled with external tools like `parallel` or `xargs`.
The docs provide examples of using GNU Parallel to perform tasks simultaneously. However, this appears to be limited to operations at an item level. For the use case of downloading an entire item containing many large files, performing the downloads in parallel would provide a significant speed boost. While it is currently possible to achieve this with external tools (e.g. obtaining a file list with `ia`, then using `curl`/`wget` with `parallel`), it would be nice if `ia` supported this natively.

While this feature could be implemented with the existing `requests` library, I assume it would likely tie in with any effort to port to `pycurl` (https://github.com/jjjake/internetarchive/issues/244, https://github.com/jjjake/internetarchive/issues/247).