Open tokee opened 1 year ago
Addendum: The timeouts may not be an issue after all. Throttling takes place at the inner paging stage of the export, so starting 20 concurrent downloads simply means 20 slowly trickling downloads rather than x active downloads and 20-x waiting downloads.
Exporting a WARC that takes up hundreds of gigabytes is infeasible: tool support is dubious, and the risk of an aborted transfer due to timeouts is real.
As the export size of the individual parts of a WARC is approximately known, it should be possible to generate a list of download links, each resulting in a WARC of a given size, e.g. 1 gigabyte. This would require underlying support for exporting subsets of a result set, as well as GUI support for presenting such lists of download links. The situation where the user manually starts all the downloads at the same time should also be handled: if downloads are queued, some of them are likely to time out due to a long period with no activity. Possibly, subsequent links could stay inactive until the previous parts have been fully downloaded?
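
To make the idea concrete, here is a rough sketch of how the link list could be planned. It assumes the approximate size in bytes of each record in the result set is available up front, and it groups consecutive records into parts of at most roughly 1 gigabyte. The function names, the base URL and the `query`, `start` and `end` parameters are all hypothetical and only illustrate the shape of the idea; they are not an existing API.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlencode


def plan_parts(record_sizes: Iterable[int],
               max_part_bytes: int = 10**9) -> List[Tuple[int, int]]:
    """Group consecutive records into parts of at most max_part_bytes.

    Returns (start, end) index ranges with end exclusive. A single record
    that is larger than the limit gets a part of its own.
    """
    parts: List[Tuple[int, int]] = []
    start = 0    # index of the first record in the current part
    current = 0  # accumulated size of the current part in bytes
    count = 0    # number of records seen so far
    for i, size in enumerate(record_sizes):
        # Close the current part if adding this record would exceed the
        # limit, but never emit an empty part.
        if i > start and current + size > max_part_bytes:
            parts.append((start, i))
            start = i
            current = 0
        current += size
        count = i + 1
    if count > start:
        parts.append((start, count))
    return parts


def export_links(base_url: str, query: str,
                 parts: List[Tuple[int, int]]) -> List[str]:
    """Build one download link per part (hypothetical URL parameters)."""
    return [
        base_url + "?" + urlencode({"query": query, "start": s, "end": e})
        for s, e in parts
    ]


if __name__ == "__main__":
    sizes = [400, 400, 400, 1500, 100]  # made-up record sizes in bytes
    parts = plan_parts(sizes, max_part_bytes=1000)
    for link in export_links("https://example.org/export/warc", "foo", parts):
        print(link)
```

Since the sizes are only approximately known, the resulting WARCs would land near the target size rather than exactly on it. Cutting only at record boundaries keeps each part a self-contained set of whole records.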