nextcloud / desktop

💻 Desktop sync client for Nextcloud
https://nextcloud.com/install/#install-clients
GNU General Public License v2.0

[Bug]: Large File Synchronizations Fail due to Hardcoded Timeout Value in Desktop Client #5394

Open Ourewaeller opened 1 year ago

Ourewaeller commented 1 year ago


Bug description

I am running a Nextcloud server 25.0.3 and use the Windows Desktop Client 3.6.6 on several Windows 10 installations. While working with the Nextcloud Virtual Drive / Files in Windows, I encountered issues with large virtual disk files that I wanted to synchronize via the virtual drive from my Windows clients to the Nextcloud server. These files are up to 120GB in size, but could be larger.

Regardless of what I tried, the Desktop Client aborted the synchronization of such files 30 minutes after the upload progress bar had reached 100%, reporting a "Connection timed out" error message.

So I started to dig deeper. This is what I came up with while testing with a 77GB file.

Synchronization of larger files via the Desktop Client consists of two major stages for files which do not yet exist on the Nextcloud server. In the first stage, the Desktop Client uploads the file in chunks to a temporary upload folder on the Nextcloud server. Once this is completed, the Desktop Client asks the Nextcloud server to assemble these chunks into one file at the destination folder.
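
For anyone not familiar with this flow, it boils down to a handful of WebDAV requests. The following is only a rough, self-contained illustration of those two stages against the chunked-upload endpoints as I understand them; the server URL, user, transfer id, chunk size, chunk naming and file name are placeholders, authentication and error handling are omitted, and this is not the Desktop Client's actual code:

```cpp
// Rough illustration of the two upload stages (not the Desktop Client's code).
// Server URL, user, transfer id, chunk size and chunk naming are placeholders;
// authentication and error handling are omitted for brevity.
#include <QCoreApplication>
#include <QEventLoop>
#include <QFile>
#include <QNetworkAccessManager>
#include <QNetworkReply>
#include <QNetworkRequest>

// Send an arbitrary WebDAV verb and block until the reply has finished.
static QNetworkReply *waitFor(QNetworkAccessManager &nam, const QNetworkRequest &req,
                              const QByteArray &verb, const QByteArray &body = {})
{
    QNetworkReply *reply = nam.sendCustomRequest(req, verb, body);
    QEventLoop loop;
    QObject::connect(reply, &QNetworkReply::finished, &loop, &QEventLoop::quit);
    loop.exec();
    return reply;
}

int main(int argc, char **argv)
{
    QCoreApplication app(argc, argv);
    QNetworkAccessManager nam;

    const QString dav = "https://cloud.example.com/remote.php/dav";   // placeholder server
    const QString uploadDir = dav + "/uploads/alice/upload-12345";    // temporary upload folder

    // Stage 1a: create the temporary upload folder on the server.
    waitFor(nam, QNetworkRequest(QUrl(uploadDir)), "MKCOL");

    // Stage 1b: PUT the file in fixed-size chunks (10 MiB here, purely illustrative).
    QFile file("huge-disk-image.vhdx");
    file.open(QIODevice::ReadOnly);
    const qint64 chunkSize = 10LL * 1024 * 1024;
    for (qint64 offset = 0, n = 1; offset < file.size(); offset += chunkSize, ++n) {
        QNetworkRequest put(QUrl(QStringLiteral("%1/%2").arg(uploadDir).arg(n, 5, 10, QChar('0'))));
        waitFor(nam, put, "PUT", file.read(chunkSize));
    }

    // Stage 2: one single MOVE asks the server to assemble the chunks at the
    // destination. This is the request the Desktop Client then waits on, and
    // the one that runs into the 30-minute timeout for very large files.
    QNetworkRequest move(QUrl(uploadDir + "/.file"));
    move.setRawHeader("Destination", (dav + "/files/alice/huge-disk-image.vhdx").toUtf8());
    QNetworkReply *moveReply = waitFor(nam, move, "MOVE");
    return moveReply->error() == QNetworkReply::NoError ? 0 : 1;
}
```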

The first stage of the upload works fine. The large file gets chunked and uploaded to the upload folder of the Nextcloud server. While this is ongoing, the Desktop Client continuously updates the remaining time and the progress bar on its "Settings" screen. Once all chunks have been uploaded, the status information of the Desktop Client changes to "A few seconds left". Then it starts the second stage of the synchronization run.

The Desktop Client sends a MOVE command to the Nextcloud server and starts waiting for the reply to this request. The Nextcloud server begins to assemble the chunks at the final folder. While the Nextcloud server is assembling, the Desktop Client keeps showing the "A few seconds left" status message and visually seems to be "stuck". However, it is still maintaining the connection to the Nextcloud server, waiting for the reply to the MOVE command.

Given its size and the speed of my Nextcloud server's disk drives, assembling my 76GB test file takes about 40 minutes (sometimes even more). If the file already exists on the Nextcloud server, the overall processing time roughly doubles, because the previous version needs to be copied by the Nextcloud server to the files_versions folder prior to the MOVE operation.

After waiting 30 minutes for the response to the MOVE command from the Nextcloud server, the Desktop Client terminates the connection to the Nextcloud server and displays a "Connection timed out" error message.

The Nextcloud server, however, does not mind and finishes the MOVE operation properly. Following the timeout, the Desktop Client marks the transfer as incomplete and starts the next attempt to synchronize the file. Because the file now already exists on the Nextcloud server, the server creates a new version of the file and starts to assemble the chunks of the new upload; while it is doing so, the Desktop Client runs into the next 30-minute timeout and the procedure starts all over again.

To make things even worse, after the second timeout the Desktop Client detects that there are chunks left over on the Nextcloud server which it believes belong to a failed synchronization. Because of that, it requests the deletion of those chunks from the Nextcloud server. The server deletes the chunks in a second thread while the first thread, initiated by the timed-out connection, is still assembling. As a result, the assembling thread fails in the middle of its execution, because the remaining chunks are no longer available. It stops, leaving a partially assembled fragment of the original file behind at the destination folder. Hence the second synchronization creates a corrupted new version of the file on the server.

While trying to find the origin of that 30-minute timeout, I checked the source code of the Desktop Client and found that it is caused by a hardcoded maximum value in the method PropagateUploadFileCommon::adjustLastJobTimeout in the file libsync\propagateupload.cpp.

To verify my assumption, I built my own version of the Desktop Client with that value set to 120 minutes (which would still cause issues with files larger than mine). I was able to confirm that this time my files synchronized as expected. The Desktop Client did not run into the 30-minute timeout; it waited until the MOVE operation finished and completed the synchronization successfully with the green check mark.

The related method contains a formula that calculates the MOVE timeout based on the size of the file. That calculated value would have worked for me, but the method caps it at a hardcoded 30 minutes. This cap might make sense to avoid "stuck" Desktop Client synchronization runs, but for larger files that need longer to synchronize it leads to exactly this problem.
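
For illustration, the logic has roughly the following shape (a paraphrased sketch, not a verbatim copy of libsync\propagateupload.cpp; the per-gigabyte scaling constant is my recollection and may differ, but the 30-minute ceiling is the value that bites here):

```cpp
// Rough shape of the MOVE timeout calculation (paraphrased, not the actual source):
// the timeout grows with the file size but is clamped to a hardcoded ceiling.
#include <QtGlobal>

qint64 adjustedMoveTimeoutMsec(qint64 fileSizeBytes, qint64 defaultTimeoutMsec)
{
    const double perGigabyteMsec = 3.0 * 60 * 1000;   // ~3 minutes per GB (assumed constant)
    const qint64 maxTimeoutMsec  = 30 * 60 * 1000;    // the hardcoded 30-minute cap
    const qint64 scaled = qRound64(perGigabyteMsec * fileSizeBytes / 1e9);
    // Never below the default job timeout, never above the 30-minute ceiling.
    return qBound(defaultTimeoutMsec, scaled, maxTimeoutMsec);
}
```

With a 76GB file, the size-based part of that formula alone would allow several hours, so it is really the hardcoded cap, not the per-size scaling, that cuts the MOVE short.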

To make a long story short: I would really appreciate it if the hardcoded limit were increased or, even better, could be set or disabled via a configuration parameter of the Desktop Client.

I am really sorry for the long post, but it took me almost a week to figure out why my uploads aborted, so I wanted to share as much information as possible.

Please consider changing this behaviour in one of the future releases. Thank you very much for all of your past and future contributions to this project.

Steps to reproduce

  1. Move a file larger than approx. 75GB to a folder of your virtual drive
  2. Wait for the synchronization to start
  3. Wait until the progress bar of the Settings screen is at 100% and the status message above the status bar changes to "A few seconds left". Note the time.
  4. Wait 30 minutes; the Desktop Client will report a "Connection timed out" error

Expected behavior

The synchronization will successfully finish without a timeout error.

Which files are affected by this bug

libsync\propagateupload.cpp - method PropagateUploadFileCommon::adjustLastJobTimeout

Operating system

Windows

Which version of the operating system you are running.

Windows 10

Package

Appimage

Nextcloud Server version

25.0.3

Nextcloud Desktop Client version

3.6.6

Is this bug present after an update or on a fresh install?

Fresh desktop client install

Are you using the Nextcloud Server Encryption module?

Encryption is Disabled

Are you using an external user-backend?

Nextcloud Server logs

No response

Additional info

No response

mzed2k commented 8 months ago

Hello, happy 2024! Any update on this? I have the exact same problem on our system. Benchmark is a 100GB file. Tested with all recent versions of the Windows Desktop Client, with v3.11 as the latest version. After the third attempt the desktop client marks the file with a green check and considers the transfer to be successful. Users have no means to detect a corrupt upload other than downloading the file again. Corrupting files and marking them as OK is REALLY BAD! The same file uploaded with the WebUI works fine, but that is not a solution. The same file uploaded with rclone works too: with chunking enabled it throws an error, but the file is usually merged successfully on the server; with chunking disabled the upload works fine. But this is also not a solution.

I fully support @Ourewaeller's suggestions for resolving this issue.

Best, Martin

tdebatty commented 5 months ago

My 2 cents... The problem seems to get worse when the Nextcloud app "Antivirus for files" is enabled.

I guess it's related, and occurs for smaller files if the app "Antivirus for files" is enabled...

mzed2k commented 5 months ago

Just a recent observation (v3.12.xx+): not only is the large file not fully uploaded, but the corrupt file is also synced back to the source, destroying the source file.

rkrig commented 2 weeks ago

Just wanted to chime in. I was evaluating Nextcloud E2EE for our organisation. Our use case was to have users be able to securely back up their Google Takeout data as well as their Thunderbird profiles to Nextcloud using E2EE.

Google Takeout in particular resulted in very large files, e.g. 50GB-150GB. I've been running some tests, and basically E2EE is useless for anything larger than a couple of MB.

I was trying to simulate certain scenarios, e.g. a user splitting their file into 5GB pieces. So in the case of the 150GB file, I created 30 5GB files to see how Nextcloud would handle this. Short story: complete failure. It borked my E2EE completely, and now I can't reset the E2EE feature.

The way the whole thing works is just not very well optimized. For example, you dump 30x5GB files into a folder on your system that is supposed to be E2EE-encrypted by the client. The client starts a sync run for the 30 files: it starts to encrypt one of those files and creates a temporary file in the /tmp folder while doing so. Once that file is encrypted, it starts to upload it. The file is split into chunks that land in the user's special temporary "uploads" folder on the server. Once all chunks have been transmitted, the server assembles them and moves the result into the actual target directory.

As mentioned before in this thread, here lies a big problem. The client has a hardcoded time limit for how long it waits after uploading the last chunk. The server takes its time, the client gets a 504 status back for the MOVE command and thinks the operation failed. So it starts over, creates ANOTHER temp file without deleting the old one, and starts from scratch. At some point the client just stops syncing because of missing files, sync conflicts, etc.

First of all, why is there a time limit in the client anyway? The client should poll the status from the server and be aware of the progress. I ran these tests as a single user; if that had been a real-world scenario with multiple users, it would have created chaos.

In this particular scenario the sync never properly finishes because of timeout issues and whatnot; your files end up in limbo, and your /tmp folder just keeps growing until you run out of space.

There needs to be a more intelligent way of handling this. E.g. once a file has been encrypted and transmitted, remove it from the /tmp folder. Secondly, the client shouldn't wait for the MOVE command to return, but simply issue the request and periodically query the server to check whether it has completed. Once it has, consider the file "properly" synced.
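
To make that second point concrete, here is a purely hypothetical sketch of such a polling loop. The current client does not work this way, and both callbacks below are invented placeholders that would need to be backed by real requests (e.g. firing the MOVE without blocking on it, and a PROPFIND on the destination path to see whether the assembled file has appeared):

```cpp
// Hypothetical polling-based replacement for "hold the MOVE connection open
// until a fixed timeout". Both callbacks are placeholders, not a real API.
#include <chrono>
#include <functional>
#include <thread>

// Returns true once the assembled file shows up at the destination, false if
// we give up. No single connection has to stay open for the whole assembly,
// so no hardcoded 30-minute ceiling is needed, and the chunks are never
// cleaned up while the server is still working.
bool waitForAssembly(const std::function<bool()> &issueMove,          // fire the MOVE, don't block on it
                     const std::function<bool()> &destinationExists,  // e.g. PROPFIND the target path
                     std::chrono::seconds pollInterval = std::chrono::seconds(30),
                     std::chrono::hours giveUpAfter = std::chrono::hours(6))
{
    if (!issueMove())
        return false;

    const auto deadline = std::chrono::steady_clock::now() + giveUpAfter;
    while (std::chrono::steady_clock::now() < deadline) {
        if (destinationExists())
            return true;          // only now mark the file as properly synced
        std::this_thread::sleep_for(pollInterval);
    }
    return false;                 // give up, but do not delete the server-side chunks here
}
```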

The way it is now, E2EE is useless for anything more than a handful of files, especially if you value your data.