Automatically recover from timeouts (QNetworkReply::OperationCanceledError)

pentix commented 3 years ago

Situation

The NextCloud sync client is installed on a Windows client, running mostly 24/7 connecting to a server over an unstable network. Sometimes the connection drops, however NextCloud enters a failed state which requires manual intervention, which is very unhandy, considering the fix is only to close and reopen the sync client.

Logs

The (anonymized) log shows the following timeout error:

2020-11-01 18:22:41:791 [ warning nextcloud.sync.networkjob ]:  QNetworkReply::OperationCanceledError "Zeitüberschreitung bei der Verbindung" QVariant(Invalid)
2020-11-01 18:22:41:791 [ warning nextcloud.sync.credentials.webflow ]: QNetworkReply::OperationCanceledError
2020-11-01 18:22:41:791 [ warning nextcloud.sync.credentials.webflow ]: "Operation abgebrochen"
2020-11-01 18:22:41:791 [ info nextcloud.sync.networkjob.lscol ]:       LSCOL of QUrl("https://cloud.xxxxxx.xx/remote.php/dav/files/company/FolderNameABCDE") FINISHED WITH STATUS "OperationCanceledError Zeitüberschreitung bei der Verbindung"
2020-11-01 18:22:41:791 [ warning nextcloud.sync.discovery ]:   LSCOL job error "Operation abgebrochen" 0 QNetworkReply::OperationCanceledError
2020-11-01 18:22:41:791 [ debug nextcloud.sync.discovery ]      [ OCC::DiscoveryMainThread::singleDirectoryJobFinishedWithErrorSlot ]:  5 "Operation abgebrochen"
2020-11-01 18:22:41:791 [ debug nextcloud.sync.networkjob ]     [ OCC::AbstractNetworkJob::slotFinished ]:      Network job OCC::LsColJob finished for "/FolderNameABCDE"
2020-11-01 18:22:41:791 [ debug nextcloud.sync.discovery ]      [ OCC::DiscoveryJob::remote_vio_opendir_hook ]: OCC::DiscoveryJob(0x263cdd7270)  ...Returned from main thread
2020-11-01 18:22:41:791 [ debug nextcloud.sync.discovery ]      [ OCC::DiscoveryJob::remote_vio_opendir_hook ]: 5 when opening  msg= "Operation abgebrochen"
2020-11-01 18:22:41:791 [ warning nextcloud.sync.csync.updater ]:       opendir failed for  - errno 5
2020-11-01 18:22:41:791 [ warning nextcloud.sync.engine ]:      ERROR during  csync_update :  "Beim Öffnen eines Ordners ist ein Fehler aufgetreten. Operation abgebrochen"
2020-11-01 18:22:41:791 [ warning nextcloud.gui.activity ]:     Item  "D:\\daten\\company_bk\\files\\files_folder"  retrieved resulted in  "Beim Öffnen eines Ordners ist ein Fehler aufgetreten. Operation abgebrochen"
2020-11-01 18:22:41:791 [ info nextcloud.gui.activity ]:        Error successfully added to the notification list:  "Beim Öffnen eines Ordners ist ein Fehler aufgetreten. Operation abgebrochen"
2020-11-01 18:22:41:791 [ info nextcloud.sync.database ]:       Closing DB "C:/Users/Administrator/AppData/Roaming/Nextcloud/._sync_574955ced5da.db-shm"
2020-11-01 18:22:41:791 [ debug nextcloud.sync.database ]       [ OCC::SyncJournalDb::commitTransaction ]:      No database Transaction to commit
2020-11-01 18:22:41:791 [ info nextcloud.sync.engine ]: CSync run took  7200002 ms
2020-11-01 18:22:41:791 [ info nextcloud.gui.folder ]:  Client version 3.0.2stable-Win64 (build 20200924)  Qt 5.12.9  SSL  OpenSSL 1.1.1g  21 Apr 2020
2020-11-01 18:22:41:791 [ warning nextcloud.gui.folder ]:       SyncEngine finished with ERROR
2020-11-01 18:22:41:791 [ info nextcloud.gui.folder ]:  Folder sync result:  1
2020-11-01 18:22:41:791 [ info nextcloud.gui.folder ]:  the last 8 syncs failed
2020-11-01 18:22:41:791 [ debug nextcloud.gui.folder ]  [ OCC::Folder::slotSyncFinished ]:      Sync failed, keeping last sync's local discovery path list
2020-11-01 18:22:41:791 [ info nextcloud.gui.application ]:     Sync state changed for folder  "https://cloud.xxxxxx.xx/remote.php/dav/files/company/FolderNameABCDE" :  "Error"

Steps to reproduce

Add one synced folder to NextCloud
Let it sync for a while
Let the connection fail for a short amount of time
Client reports "Operation cancelled: An error occured while opening a folder"
Right-clicking and "Enforce sync now" does not seem to work either, even if the connection is back. Need to restart the sync client.

This occurs even there are ridiculously high connection timeout values set.

Client configuration

Client version: Client version 3.0.2stable-Win64 (build 20200924) Qt 5.12.9 SSL OpenSSL 1.1.1g 21 Apr 2020

Operating system: Windows Server 2008

OS language: German

Installation path of client: Default path

pentix commented 3 years ago

Probably relevant to this:

A GET timeout should possibly only be considered a SoftError?

https://github.com/nextcloud/desktop/blob/6445a61d807974be2148ef424b3efd9ba291fe3e/src/libsync/propagatedownload.cpp#L326

Would it be required to set propagator()->_anotherSyncNeeded = true in such cases?

https://github.com/nextcloud/desktop/blob/6445a61d807974be2148ef424b3efd9ba291fe3e/src/libsync/propagatedownload.cpp#L614

er-vin commented 3 years ago

I'd say going to SoftError would be dangerous (it might be assumed down the line that the file is in a "better" state than it is). Maybe reclassifying as a NormalError would be doable but that'd need to be heavily tested.

Obviously I'd rather we find a proper fix to avoid those timeouts in the first place. They're supposed to be transient situation that's why the FatalError is a safe bet: when that happens it's likely everything else asked to the server will fail as well with further timeouts. Thus continuing the sync is generally not going to help get the server load down for instance.

If that's something happening often, it's better to fix the root cause which can be:

badly configured server or under powered server
a bug in some server component
something done by the client triggering a high load

For the last one there's obviously something at play around E2EE right now, spotted in 3.0.2, tried to workaround it for 3.0.3. I'm in discussion with the E2EE app maintainer to try to move this forward.

pentix commented 3 years ago

Obviously I'd rather we find a proper fix to avoid those timeouts in the first place.

Absolutely! I know the connectivity is not the best and I'm not blaming NextCloud for it, there is actually a different bunch of factors to this.

Thus continuing the sync is generally not going to help get the server load down for instance.

I agree constant retrying might not be a good approach, however I think a retry every few minutes would not hurt. Going to the FatalError state and waiting for a person to manually restart sync is not really reducing the workload either :grimacing:

I assume if the status is set to NormalError, the normal sync process will still be triggered? (i.e. 30s later?)

I'm in discussion with the E2EE app maintainer to try to move this forward.

Thanks a lot!

er-vin commented 3 years ago

I assume if the status is set to NormalError, the normal sync process will still be triggered? (i.e. 30s later?)

Got to admit I'm not 100% sure but that'd be my expectation though. It might ignore the file for a little while as well (it's generally not long though, for the next sync or so).

pentix commented 3 years ago

Thank you.

I'd love to try this (i.e. insert some debug logs for me to understand the error handling and classify such errors as NormalError) but I was struggling setting up the entire Windows Build Environment. In case that would work out eventually, I'll report my insights back to you.

er-vin commented 3 years ago

I've been working on a check list for windows builds. Paint is still fresh that's why I didn't commit it yet. If you want to try that alternative route and let me know how this goes for you here it is (in your context the dokany step is optional):

We're assuming the following to be installed before this procedure starts:

Windows 10 SDK
Visual Studio 2019

Steps:

In the Visual Studio installer pick the 'desktop development with c++' workload
Install Qt: https://download.qt.io/official_releases/online_installers/
- Select the latest LTS release, 64bit MSVC 2017 variant
- Select QtWebEngine
- Select source and debug symbols
Install Dokany: https://github.com/dokan-dev/dokany/releases (use a Dokan_x64.msi, check all the development options)
Install chocolatey: https://chocolatey.org/install
In an admin cmd:
- choco install -y git.install openssl cmake ninja
- RefreshEnv
Open a git cmd then:
- C:\Qt\5.12.9\msvc2017_64\bin\qtenv2.bat
- "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsall.bat" x64
- git clone https://github.com/madler/zlib
- mkdir zlib\build
- cd zlib\build
- cmake -GNinja -DCMAKE_INSTALL_PREFIX=../../root ..
- ninja
- ninja install
- cd ..\..
- git clone https://github.com/frankosterfeld/qtkeychain
- mkdir qtkeychain\build
- cd qtkeychain\build
- cmake -GNinja -DCMAKE_INSTALL_PREFIX=C:\Qt\5.12.9\msvc2017_64 ..
- ninja
- ninja install
- cd ..\..
Open a new admin cmd then:
- setx PATH "%PATH%;path\to\root"
- RefreshEnv

pentix commented 3 years ago

Thank you! I'll give this a shot as soon as I find some time and a clean Windows VM.

(When looking for the build process used I saw the appveyor.yml files but I guess they are outdated now, at least non-functional at the moment?)

er-vin commented 3 years ago

Yes... I'd love to see someone revive the appveyor so that we got Windows coverage. Didn't find time for it yet.

pentix commented 3 years ago

So far things went smooth, but what root are you referring to?

* Open a new admin cmd then:

  * setx PATH "%PATH%;path\to\root"

er-vin commented 3 years ago

So far things went smooth, but what root are you referring to?

The root folder in which zlib got installed. See a few steps above. Not necessarily nicely named... I'll change it.

pentix commented 3 years ago

Ah whoops, I totally missed this when copying the other instructions.

I assume we add the bin folder inside the root folder to the PATH, not the root folder itself?

Edit: Wrong assumption, it's only used later on to find zlib development files, I figured

0xpr03 commented 3 years ago

Any updates on this ? It's kinda bad if the client can't recover from connection failure which can happen through updates, outages etc and people may not notice or have enough experience to handle it. I've had this multiple times now.

0xpr03 commented 3 years ago

Also the task icon stays on error, it's also displayed in the main app. But the menu you can get when clicking on the connection with recent file transfers says that everything is ok.

github-actions[bot] commented 3 years ago

This bug report did not receive an update in the last 4 weeks. Please take a look again and update the issue with new details, otherwise the issue will be automatically closed in 2 weeks. Thank you!

0xpr03 commented 3 years ago

The issue is the same as in 2020. Same problems. Can we mark this as non-resolved or do I need a bot now to counter this bot ?

pentix commented 3 years ago

Any updates on this ? It's kinda bad if the client can't recover from connection failure which can happen through updates, outages etc and people may not notice or have enough experience to handle it. I've had this multiple times now.

Same! In the meantime, I've identified the error in the network that leads to those sudden disconnects. As I stated in the beginning, NextCloud is not the cause. It seems to reconnect sometimes, for example if the network just goes out, it reconnects later on. If, however, the requests just time out, or there's a temporary certificate error (due to e.g. SSL reverse proxy being down) it would just stop and not retry at all. I modified the nextcloud-client, going the "most aggressive" way, i.e. setting the error to SoftError and setting anotherSyncRequired = true, whenever there's an HTTP error. Sadly, the above mentioned errors persist :(

0xpr03 commented 3 years ago

Yeah the problem ist that if you get spurious network failures the client will just stop working, even though it may be only 1-2 minutes and working on the rest of the day. But people will then be disconnected until somebody tells them to restart their client.

pentix commented 3 years ago

@0xpr03 Exactly this!

Maybe @er-vin could check it out? Do you have a guess in what state the client might hang, once it received the QNetworkReply::OperationCanceledError?

er-vin commented 3 years ago

Maybe @er-vin could check it out? Do you have a guess in what state the client might hang, once it received the QNetworkReply::OperationCanceledError?

Unfortunately not... the problem with this kind of seemingly random events is that they're hard to reproduce at all.

0xpr03 commented 3 years ago

Maybe @er-vin could check it out? Do you have a guess in what state the client might hang, once it received the QNetworkReply::OperationCanceledError?

Unfortunately not... the problem with this kind of seemingly random events is that they're hard to reproduce at all.

Setup nextcloud, connect from the desktop client, kill the nextcloud VM, restart after ~5 minutes and watch it burn.

github-actions[bot] commented 3 years ago

This bug report did not receive an update in the last 4 weeks. Please take a look again and update the issue with new details, otherwise the issue will be automatically closed in 2 weeks. Thank you!

0xpr03 commented 3 years ago

Still active

github-actions[bot] commented 3 years ago

This bug report did not receive an update in the last 4 weeks. Please take a look again and update the issue with new details, otherwise the issue will be automatically closed in 2 weeks. Thank you!

0xpr03 commented 3 years ago

Still active

CarlosCastaneda commented 3 years ago

I can confirm the behavior. For me, the error occurs regularly and exactly when the router (Fritz!Box) closes and reopens the Internet connection before it is forcibly disconnected and re-established by the provider once a day. The Nextcloud client polls the server about every 30 seconds and if during this time the connection is being closed and re-established, the client does not get back on its feet by itself (usually with an SSL handshake error). The only solution is to terminate the client and restart it. The client should at least be fault-tolerant enough to try to reestablish the connection as if the client was restarted.

I have added a small part of the log. If desired I can provide more infos (log files, etc...).

2021-04-29 01:25:59:832 [ warning nextcloud.sync.networkjob ]: SslHandshakeFailedError: "Im Ablauf des SSL-Protokolls ist ein Fehler aufgetreten." : can be caused by a webserver wanting SSL client certificates 2021-04-29 01:25:59:832 [ warning nextcloud.sync.networkjob ]: QNetworkReply::SslHandshakeFailedError "Im Ablauf des SSL-Protokolls ist ein Fehler aufgetreten." QVariant(Invalid) 2021-04-29 01:25:59:832 [ warning nextcloud.sync.credentials.webflow ]: QNetworkReply::SslHandshakeFailedError 2021-04-29 01:25:59:832 [ warning nextcloud.sync.credentials.webflow ]: "Im Ablauf des SSL-Protokolls ist ein Fehler aufgetreten." 2021-04-29 01:25:59:832 [ info nextcloud.sync.networkjob.propfind ]: PROPFIND of QUrl("https://xxxxxxxxx.yyyyyyyyy.myfritz.net/remote.php/dav/files/XXXX/") FINISHED WITH STATUS "SslHandshakeFailedError Im Ablauf des SSL-Protokolls ist ein Fehler aufgetreten." 2021-04-29 01:25:59:832 [ warning nextcloud.sync.networkjob.propfind ]: not successful, http result code is 0 "" 2021-04-29 01:25:59:832 [ info nextcloud.gui.account.state ]: AccountState connection status change: OCC::ConnectionValidator::Connected -> OCC::ConnectionValidator::SslError 2021-04-29 01:25:59:832 [ info nextcloud.gui.account.state ]: AccountState state change: "Verbunden" -> "Abgemeldet" 2021-04-29 01:25:59:832 [ debug nextcloud.gui.socketapi ] [ OCC::SocketListener::sendMessage ]: Sending SocketAPI message --> "UNREGISTER_PATH:O:\Nextcloud" to QLocalSocket(0x1c55f17a5e0) 2021-04-29 01:25:59:895 [ info nextcloud.gui.folder.manager ]: Account "XXXX@xxxxxxxxx.yyyyyyyyy.myfritz.net" disconnected or paused, terminating or descheduling sync folders 2021-04-29 01:25:59:895 [ debug nextcloud.sync.networkjob ] [ OCC::AbstractNetworkJob::slotFinished ]: Network job OCC::PropfindJob finished for "/" 2021-04-29 01:25:59:895 [ info nextcloud.sync.credentials.webflow ]: request finished 2021-04-29 01:25:59:895 [ warning nextcloud.sync.networkjob ]: SslHandshakeFailedError: "Im Ablauf des SSL-Protokolls ist ein Fehler aufgetreten." : can be caused by a webserver wanting SSL client certificates 2021-04-29 01:25:59:895 [ warning nextcloud.sync.networkjob ]: QNetworkReply::SslHandshakeFailedError "Im Ablauf des SSL-Protokolls ist ein Fehler aufgetreten." QVariant(Invalid) 2021-04-29 01:25:59:895 [ warning nextcloud.sync.credentials.webflow ]: QNetworkReply::SslHandshakeFailedError 2021-04-29 01:25:59:895 [ warning nextcloud.sync.credentials.webflow ]: "Im Ablauf des SSL-Protokolls ist ein Fehler aufgetreten."

and the following never ends...

2021-04-29 01:26:05:151 [ info nextcloud.gui.folder.manager ]: Etag poll timer timeout 2021-04-29 01:26:05:151 [ info nextcloud.gui.folder.manager ]: Folders to sync: 1 2021-04-29 01:26:05:151 [ info nextcloud.gui.folder.manager ]: Number of folders that don't use push notifications: 1 2021-04-29 01:26:05:151 [ info nextcloud.gui.folder.manager ]: Run etag job on folder OCC::Folder(0x1c55ca95730) 2021-04-29 01:26:05:151 [ info nextcloud.gui.folder.manager ]: Can not run etag job: Folder is busy 2021-04-29 01:26:35:144 [ info nextcloud.gui.folder.manager ]: Etag poll timer timeout 2021-04-29 01:26:35:144 [ info nextcloud.gui.folder.manager ]: Folders to sync: 1 2021-04-29 01:26:35:144 [ info nextcloud.gui.folder.manager ]: Number of folders that don't use push notifications: 1 2021-04-29 01:26:35:145 [ info nextcloud.gui.folder.manager ]: Run etag job on folder OCC::Folder(0x1c55ca95730) 2021-04-29 01:26:35:145 [ info nextcloud.gui.folder.manager ]: Can not run etag job: Folder is busy

vm03 commented 3 years ago

Same tith if NC server behind VPN and VPN not up some time. 3.2.4 (macOS)

victualsquid commented 2 years ago

I have a similar problem with the client not recovering after timeout. Here the cause is that my desktop is not connected to the network while logging into X. If the client autostarts before the machine is connected to the network, even if only by a few seconds, it never recovers. Client version is 3.3.4, but it also happened with the previous version. Obviously the actual nextcloud install has no influence.

CarlosCastaneda commented 2 years ago

I have been observing this behavior for some time now and have noticed different scenarios during a planned short interruption of the fritzbox: sometimes the error occurs and the client reconnects, sometimes it doesn't. In both cases it can happen that a window "untrusted certificate" from different certificate issuers (e.g. Apple, Amazon,...) appears, but mine is from letsencrypt. If there are other logging or trace options, this could be very helpful to find out the cause of the problem. At least I could easily provide the client logfiles for each scenario.

user9931 commented 2 years ago

Dont want to create another very simular issue... but the problem are not in bad connection to the server (its happens, but its.. okay)..

Windows 10 -11 goes to sleep -> no network connection about couple of seconds -> client dont work until until click on "log in". this is very annoying to do every time the computer wakes up

johannesE commented 2 years ago

Please excuse me if this does not provide additional information. Facts I have right now:

I am clearly connected to the internet.
The nextcloud instance is reachable.
Sync client (on linux desktop) shows "Offline" without any way to change that or retry.
Restarting the sync client solves the issue. Nothing else that I know of helps.

The probably naive person in me thinks: Why is it not automatically retrying after like 60 seconds? If there is no technical reason, please point me to the file and I'll take a look as soon as I have time. It shouldn't be so hard to do even though I have no experience with the sync client and it's probably written in a language I haven't used in a while like Java.

This is important to me because my employees are not seeing the files and they are used to dropbox where "everything just works". I don't want them to think that dropbox is better just because of this bug.

0xpr03 commented 2 years ago

@johannesE https://github.com/nextcloud/desktop/issues/2610#issuecomment-722333417 https://github.com/nextcloud/desktop/issues/2610#issuecomment-722231019

Qt is a little bit more to learn.

thierrys commented 2 years ago

same problem here, on macos 10.13, tested different client versions from 2.4 to 3.4 that all end up on hold with error messages on the client logs. any idea how to overcome this major blockage?

pbrunier commented 2 years ago

Same issue on openSUSE Tumbleweed on GNOME and NetworkManager. Same issue occurs when I login and NetworkManager is not yet connected to the net. This last case can be solved using a delay on Nextcloud startup. But the sleep/suspend issue is truely breaking. I use Nextcloud for convenience but also for security by syncing to a regularly back upped location. Now the process requires me to intervene multiple times a day.

timinator2020 commented 1 year ago

This appears to be the same issue as (or rather, the root cause of) issue #4084.

I have 4 systems, 2 Linux Mint 20.4, 2 Windows 10. All of them have issues connecting after resuming from standby. To initiate a reconnect you must change the proxy settings, restart the client completely, or wait a while. It does eventually reconnect for me (most of the time), but others have reported that it never reconnects.

You can see in the logs, it appears to fail for about 3 minutes before it finally connects. In that time, I am able to use the network, including browsing to the Nextcloud web UI and logging in, so the server seems to be functioning. Other NC clients on my other systems are able to sync with the server during this time.

20221115_1357_owncloud.log.zip

bobbolous commented 1 year ago

Same here! Wake from standby, client is offline until I restart the client. At least a retry button would be nice.

user9931 commented 1 year ago

not about standby, but also when using proxy. for example, when using ssh socks or other to connect to home server, if proxy not available on client start, nextcloud client seems like not trying it again, only restart helps...

nextcloud / desktop