inveniosoftware / product-rdm

InvenioRDM Product Roadmap

Files: Bulk download all #86

Closed Herrner closed 7 months ago

Herrner commented 4 years ago

We advise people not to upload their data in .zip files so that others can decide what to download, but this makes it harder to download the whole thing (without using a "download all" extension).

It would be nice to have a "download all" button or, even better, to let users select which files to download in bulk. I haven't found a way to download more than one file with a browser without packing them in some kind of archive beforehand, though...

Herrner commented 4 years ago

Well, there seem to be solutions for downloading multiple files at once

https://stackoverflow.com/a/29606450/4362759

tmorrell commented 4 years ago

I second this. We have a lot of records where we recommend .zip files because downloading files individually is too difficult. A download all button would solve this issue.

saragon02 commented 4 years ago

Seconding "download all" and "select files to download" options.

lnielsen commented 4 years ago

Related to zenodo/zenodo#210

I fully agree that there needs to be a download all button, but the issue is non-trivial to solve.

Technically this would involve:

  1. Pressing "download all" will issue a request that queues a background task to create a package, if one has not already been created. Once the package is ready the user should be notified.
  2. Background task:
    • would have to zip up all the files and put them in a temporary storage area that's large enough to host many datasets simultaneously.
    • would need to guard against one or more users starting so many tasks that the storage space is exceeded.
    • once ready, the user should be notified (by email or via websocket), in case packaging takes a very long time.
  3. Another background task:
    • Needs to run a clean-up task to remove zip'ed packages after ~48 hours.
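
The steps above could be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not an existing Invenio API: the function names, the quota value and the flat cache directory are all assumptions, and the task queue and user notification are left out.

```python
import os
import time
import zipfile
from pathlib import Path
from typing import List, Optional

MAX_CACHE_BYTES = 50 * 1024**3  # hypothetical quota for the temporary area
MAX_AGE_SECONDS = 48 * 3600     # clean-up threshold (~48 hours)


def cache_usage(cache_dir: Path) -> int:
    """Total size in bytes of all finished packages in the cache."""
    return sum(p.stat().st_size for p in cache_dir.glob("*.zip"))


def build_package(record_id: str, files: List[Path], cache_dir: Path,
                  max_bytes: int = MAX_CACHE_BYTES) -> Path:
    """Create the zip package for a record, or reuse an existing one."""
    target = cache_dir / f"{record_id}.zip"
    if target.exists():
        return target  # package already created

    # Guard: refuse to start if the new package would exceed the quota.
    needed = sum(f.stat().st_size for f in files)
    if cache_usage(cache_dir) + needed > max_bytes:
        raise RuntimeError("package cache is full, try again later")

    # Write to a temporary name first so readers never see a partial zip.
    tmp = cache_dir / f"{record_id}.zip.part"
    with zipfile.ZipFile(tmp, "w", compression=zipfile.ZIP_STORED) as zf:
        for f in files:
            zf.write(f, arcname=f.name)
    os.replace(tmp, target)
    return target


def clean_up(cache_dir: Path, max_age: float = MAX_AGE_SECONDS,
             now: Optional[float] = None) -> List[Path]:
    """Remove packages older than max_age seconds; return what was removed."""
    now = time.time() if now is None else now
    removed = []
    for p in cache_dir.glob("*.zip"):
        if now - p.stat().st_mtime > max_age:
            p.unlink()
            removed.append(p)
    return removed
```

In a real deployment `build_package` and `clean_up` would presumably run as background tasks, and the quota check would need to be made safe against concurrent tasks.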

Additional complexities to take into account:

Another idea would be to package everything up in a Submission Information Package and serve out this package, but if the SIP is moved to offline storage this idea wouldn't work too well.

All in all, this is a pretty self-contained task if anyone is up for grabbing it :-) Get in touch with me if you're interested :-) I imagine this would be a new invenio module, something like "invenio-files-packaging/downloader" or similar.

Herrner commented 4 years ago

@lnielsen I see your reservations. If doing it server-side with packaging, it would probably have to use tar (or uncompressed zip for Windows users) to keep the load on the I/O side rather than on the CPU. But seeing the issues this involves, I think this calls more for a client-side solution (the way multi-file-downloader extensions work).

(Just collecting while I search...)

lnielsen commented 4 years ago

I only have reservations about the timeline :-) A download all makes complete sense.

There are pros and cons with all solutions. Client-side could definitely work as a first version, and we could expand later.

The con with client-side is that you don't gain any speed-up: many small files will take a long time to download, whereas if you package them up you can make better use of the bandwidth (less HTTP protocol overhead).

Similarly, you could consider a streaming solution that doesn't need extra storage, but this one suffers from taking up connection slots on the application layer for a very long time (because you are restricted by the user's download speed, not by how fast you can actually send the data).
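
To make the streaming idea concrete, here is a minimal sketch using only the Python standard library: `zipfile` can write to a non-seekable sink, so the archive can be produced chunk by chunk without touching temporary storage. The names `_ChunkBuffer` and `stream_zip` are made up for this example, and wiring the generator into an actual HTTP response is left out.

```python
import io
import zipfile
from typing import BinaryIO, Iterable, Iterator, Tuple


class _ChunkBuffer(io.RawIOBase):
    """Write-only, non-seekable sink that collects bytes until drained."""

    def __init__(self):
        self._chunks = []

    def writable(self):
        return True

    def write(self, data):
        self._chunks.append(bytes(data))
        return len(data)

    def drain(self) -> bytes:
        """Return everything written since the last drain."""
        data = b"".join(self._chunks)
        self._chunks.clear()
        return data


def stream_zip(files: Iterable[Tuple[str, BinaryIO]],
               chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield the bytes of an uncompressed zip archive piece by piece.

    Only about one chunk is held in memory at a time. Entries above
    2 GiB would additionally need force_zip64=True in zf.open().
    """
    sink = _ChunkBuffer()
    with zipfile.ZipFile(sink, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, fileobj in files:
            with zf.open(name, "w") as entry:
                while True:
                    chunk = fileobj.read(chunk_size)
                    if not chunk:
                        break
                    entry.write(chunk)
                    yield sink.drain()  # local header and/or file data
            yield sink.drain()  # data descriptor written when entry closes
    yield sink.drain()  # central directory written when the archive closes
```

The generator only advances as fast as the consumer pulls chunks, which is exactly why a slow client keeps the connection (and the worker behind it) busy for the whole download.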

Perhaps there are ways to offload it to the external storage system.

egabancho commented 4 years ago

I have come upon this package, which might be useful for this task: https://github.com/BuzonIO/zipfly

github-actions[bot] commented 3 years ago

This issue was automatically marked as stale.

lnielsen commented 2 years ago

Could potentially be solved by having OCFL objects written to disk as zip files and serving out the OCFL object.

dfdan commented 2 years ago

The structure of a true OCFL object is likely to confuse end users (especially where it represents multiple versions of the record). The zip would have inventory file(s) and (potentially multiple) content sub-directories (for different versions), which would contain different files depending on which version they were introduced with. A user would either have to read and understand the inventory, or go searching manually for the file(s) they want.

However it is likely the same logic could produce a consolidated content folder for the required version, and return only that. I think this is probably more in line with what end users would expect?
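
As a rough illustration of that consolidation step: an OCFL inventory has a `manifest` (digest to content paths) and, per version, a `state` (digest to logical paths), so the files for one version can be resolved with a simple lookup. The function name and the example inventory below are made up for this sketch, and the digests are shortened placeholders rather than real checksums.

```python
from typing import Dict


def consolidated_files(inventory: dict, version: str) -> Dict[str, str]:
    """Map each logical path in `version` to the content path storing it."""
    # manifest: digest -> content paths inside the OCFL object
    manifest = inventory["manifest"]
    # state: digest -> logical paths as they appear in that version
    state = inventory["versions"][version]["state"]
    mapping = {}
    for digest, logical_paths in state.items():
        # All content paths under one digest hold identical bytes,
        # so the first one is enough.
        content_path = manifest[digest][0]
        for logical in logical_paths:
            mapping[logical] = content_path
    return mapping


# Hypothetical two-version object: data.csv is unchanged since v1,
# README.md was replaced in v2.
EXAMPLE_INVENTORY = {
    "head": "v2",
    "manifest": {
        "aaa111": ["v1/content/data.csv"],
        "bbb222": ["v1/content/README.md"],
        "ccc333": ["v2/content/README.md"],
    },
    "versions": {
        "v1": {"state": {"aaa111": ["data.csv"], "bbb222": ["README.md"]}},
        "v2": {"state": {"aaa111": ["data.csv"], "ccc333": ["README.md"]}},
    },
}

v2_files = consolidated_files(EXAMPLE_INVENTORY, "v2")
```

Here `v2_files` maps `data.csv` to its v1 content path and `README.md` to its v2 content path, which is the flat folder an end user would expect to get in the zip.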

lnielsen commented 7 months ago

Done