download of a large folder only starts after the archive has been created

butonic commented 1 month ago

When downloading a folder using the archive download, the web UI shows an activity visualization bar at the top after making the request to the server. Unfortunately, the download does not start streaming immediately; it seems to hang until the archive has been fully assembled on the server side. This may cause a proxy in between to kill the connection. Furthermore, a user might be confused because nothing appears to be happening (apart from the activity visualization bar at the top).

We should either start streaming immediately or show a notification that explains that the server is preparing the download ... but that should actually be async ... and would need a completely different API for downloads than what we currently have ... so ... we should just stream.

kulmann commented 1 month ago

I would also prefer it async... see https://github.com/owncloud/web/issues/10501#issuecomment-2357798447 (the issue contains quite a bit of input on what you wrote down here as well). It would make people, including myself, very happy if we had a proper solution here which doesn't die, actually compresses the content, and is async. Most of all, the harsh archiver limitations (number of files and total file size) make it pretty much unusable.

jvillafanez commented 1 month ago

Streaming seems like a short-term solution, assuming we can stream the archive right away. However, I don't think it's a good solution.

Let's say you want to archive a folder which contains 100 files spread across multiple folders. You start streaming the archive right away with no waiting time (as far as I know this should be possible, at least for ".zip" and ".tar" files). However, an error happens while streaming the 47th file (the file is locked, a random I/O error...). In that scenario, there is no good option:

- If we cut the stream, the user would have downloaded a corrupted 10GB archive and wasted hours.
- If we swallow the error and keep going, the user would download an archive with missing files. The archive itself might be fine (it isn't corrupted and can be opened normally), but the user will be forced to verify that all the files he wanted are present, which, at least, will be annoying.
- If we keep retrying, there is no guarantee that the client-server connection won't be idle enough to be cut. We don't know if the error can be solved just by retrying a couple of seconds later. If it can't be solved automatically in a timely manner, we're stuck with the 2 previous options: cut the stream, or swallow and keep going.
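To make the second option concrete, here is a minimal Python sketch (not ocis code) of "swallow the error and keep going". The function name `build_archive_skipping_errors` and the `flaky_reader` helper are made up for illustration, and the archive is built in memory rather than streamed over HTTP. It only demonstrates the outcome described above: the zip opens fine, but one file is silently absent unless the skipped names are reported somewhere.

```python
import io
import zipfile


def build_archive_skipping_errors(names, read_file):
    """Zip every readable file; return (zip_bytes, skipped_names).

    `read_file` is a hypothetical callback that returns the file's bytes
    or raises OSError (locked file, I/O error, ...).
    """
    buffer = io.BytesIO()
    skipped = []
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in names:
            try:
                data = read_file(name)
            except OSError:
                # "Swallow and keep going": the archive stays valid,
                # but this file will be missing from it.
                skipped.append(name)
                continue
            archive.writestr(name, data)
    return buffer.getvalue(), skipped


def flaky_reader(name):
    """Simulated storage where the 47th file is locked."""
    if name == "dir/file47.txt":
        raise OSError("file is locked")
    return f"content of {name}".encode()


names = [f"dir/file{i}.txt" for i in range(1, 101)]
archive_bytes, skipped = build_archive_skipping_errors(names, flaky_reader)
```

Here `skipped` is the only record that something went wrong; a client that just receives `archive_bytes` has no way to tell.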

A "JobQueue" service might be a nice solution, and could also provide additional features that could be interesting to implement in the future.

The job queue is intended to be per user, and limited to 2-5 running jobs per user, with a maximum limit of maybe 50 running jobs (all parameters configurable). The API can contain common methods such as "create/queue job", "list jobs", "check job status/progress", "remove job". Web can provide a nice UI for all these methods so the user can control his own job queue.
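A rough Python sketch of what such a queue could look like, just to illustrate the limits and the four methods. All names (`JobQueue`, `create_job`, `finish`, ...) are assumptions for this example and not an existing ocis API; jobs are only tracked in memory and nothing is actually executed.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    user: str
    kind: str
    status: str = "queued"  # queued -> running -> done / failed
    progress: float = 0.0


class JobQueue:
    """In-memory sketch: per-user and global limits on running jobs."""

    def __init__(self, per_user_limit=2, global_limit=50):
        self.per_user_limit = per_user_limit
        self.global_limit = global_limit
        self._jobs = {}  # job_id -> Job, insertion order gives FIFO
        self._ids = itertools.count(1)

    def _running(self, user=None):
        return [j for j in self._jobs.values()
                if j.status == "running" and (user is None or j.user == user)]

    def _start_eligible(self):
        # Promote queued jobs in FIFO order while both limits allow it.
        for job in self._jobs.values():
            if job.status != "queued":
                continue
            if len(self._running()) >= self.global_limit:
                break
            if len(self._running(job.user)) < self.per_user_limit:
                job.status = "running"

    def _owned(self, user, job_id):
        job = self._jobs[job_id]
        if job.user != user:
            raise PermissionError("job belongs to another user")
        return job

    def create_job(self, user, kind):
        job = Job(next(self._ids), user, kind)
        self._jobs[job.job_id] = job
        self._start_eligible()
        return job

    def list_jobs(self, user):
        return [j for j in self._jobs.values() if j.user == user]

    def status(self, user, job_id):
        job = self._owned(user, job_id)
        return job.status, job.progress

    def remove_job(self, user, job_id):
        # A real service would also cancel a running job; this only forgets it.
        del self._jobs[self._owned(user, job_id).job_id]
        self._start_eligible()

    def finish(self, job_id, ok=True):
        # Called by a (hypothetical) worker when the job ends.
        job = self._jobs[job_id]
        job.status = "done" if ok else "failed"
        if ok:
            job.progress = 1.0
        self._start_eligible()
```

With `per_user_limit=2`, a third archive job from the same user simply stays in the "queued" state until one of the first two finishes.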

As for this ticket, it could be solved by implementing an "archive" job that would archive the target folder and leave the result either in the same parent folder or in the requested one. An interaction example: right-click on the folder -> choose jobs in the menu -> choose archive -> fill in the popup with the requested options -> done. The user can then check the "jobs" menu to see the state of the job and do other things in the meantime. When the job finishes, he can go to the target folder and download the archive file like any other regular download.
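As an illustration of the "archive" job itself, here is a small Python sketch working on a local filesystem. The function name `run_archive_job` and the ".zip.part" temporary suffix are assumptions; a real implementation would go through the storage layer instead. The archive is written to a temporary file first and only renamed into place when complete, so a failure in the middle never leaves a half-written archive for the user to download.

```python
import os
import tempfile
import zipfile
from pathlib import Path


def run_archive_job(source: Path, dest_dir: Path) -> Path:
    """Zip `source` into `dest_dir/<source name>.zip` and return that path."""
    final_path = dest_dir / f"{source.name}.zip"
    fd, tmp_name = tempfile.mkstemp(suffix=".zip.part", dir=dest_dir)
    os.close(fd)
    tmp_path = Path(tmp_name).resolve()
    try:
        with zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as archive:
            for path in sorted(source.rglob("*")):
                # Skip directories, and our own temp file in case
                # dest_dir happens to live inside source.
                if not path.is_file() or path.resolve() == tmp_path:
                    continue
                arcname = path.relative_to(source.parent).as_posix()
                archive.write(path, arcname)
        # The rename only happens after the archive was written completely.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Any error (locked file, I/O error, ...) fails the whole job
        # and removes the partial archive.
        if tmp_path.exists():
            tmp_path.unlink()
        raise
    return final_path
```

If the 47th file fails here, the job ends in an error state that the user can see in the job list, instead of the user finding out after a long download.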

The good thing about this solution is that it can be extended for future use. We could implement on-demand virus scanning, AI image generation, on-demand thumbnail generation, massive auto-tagging based on content (which might require content analysis of the files)...

kulmann commented 1 month ago

Nice idea @jvillafanez! We already discussed a kind of Workflow Engine in the past. It seems to go in the same direction. A little bit of context: https://github.com/owncloud/ocis/issues/7437

jvillafanez commented 1 month ago

Yes, but at the same time no. There are a couple of big differences:

- The job queue is intended to be very user-driven, while the workflow seems mostly (if not fully) automated. It is the user who puts the jobs in the queue and who removes them when they're done.
- Visibility is important in the job queue. The user can check the status of the job at any time: whether the job has finished correctly or has failed with an error, and the progress of the job (if possible). On the other hand, the workflow just "shows" the final result: old versions deleted, notification email sent, etc. The affected user doesn't know when a workflow started or if it's running.
- The job queue is intended to be very limited per user because we don't want to overload the system. A user waiting to archive a second folder because the first one is still running shouldn't be a problem. However, we probably can't do the same with workflows.

We could merge both ideas by providing a system queue only accessible to admins, or by planning for the job queues to have permissions (probably just read and write permissions, to see and add jobs in the queue) so that admins are the only ones who can check the system queue. These system queues (maybe just one, but there could be more) could have their own limits, higher than the regular ones.

Note that with those changes, we'd need to track additional information, mainly for the system queue: who triggered the job, at what time the job was queued, at what time it started...
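A possible shape for that, as a Python sketch only. `JobRecord`, `QueuePermissions` and `PermissionedQueue` are hypothetical names, the record contains just the three pieces of information mentioned above, and the `max_running` limit is stored but not enforced here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _now():
    return datetime.now(timezone.utc)


@dataclass
class JobRecord:
    kind: str
    triggered_by: str                       # who triggered the job
    queued_at: datetime = field(default_factory=_now)
    started_at: Optional[datetime] = None   # set when a worker picks it up

    def start(self):
        self.started_at = _now()


@dataclass
class QueuePermissions:
    readers: set  # users allowed to see jobs in the queue
    writers: set  # users allowed to add jobs to the queue


class PermissionedQueue:
    def __init__(self, name, permissions, max_running):
        self.name = name
        self.permissions = permissions
        self.max_running = max_running  # not enforced in this sketch
        self._jobs = []

    def add(self, user, kind):
        if user not in self.permissions.writers:
            raise PermissionError(f"{user} may not add jobs to {self.name}")
        job = JobRecord(kind=kind, triggered_by=user)
        self._jobs.append(job)
        return job

    def list(self, user):
        if user not in self.permissions.readers:
            raise PermissionError(f"{user} may not read {self.name}")
        return list(self._jobs)


# A system queue that only admins can read or write, with a higher limit.
system_queue = PermissionedQueue(
    "system", QueuePermissions(readers={"admin"}, writers={"admin"}), max_running=20
)
```

A regular per-user queue would be the same structure with the owner as the only reader and writer and a lower `max_running`.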

In any case, these are just ideas that will need research and planning, as well as proper scoping.

owncloud / ocis

download of a large folder only starts after the archive has been created #10242