Nextcloud desktop removed all files from server

rbu commented 5 years ago

Expected behaviour

Nextcloud Desktop should not remove files from the server that are not gone from the machine.

Actual behaviour

Two days ago, Nextcloud desktop removed all my files from the server, including folders shared with others. It also removed me from shares that others had shared with me.

The files on the local system stayed intact. However, recovering from this took me significant amounts of time, sweat and conflict solving. Plus, all of the folder timestamps on the server are now gone.

Steps to reproduce

I hope I cannot reproduce this.

I'm seeing frequent crashes of the Nextcloud client. There's high memory usage that I attribute this to (#124), sometimes up to 1.5 GB of RAM when it's been running for a while.

Client configuration

Client version: 2.5.3

Operating system: Fedora 30

OS language: English

Qt version used by client package (Linux only, see also Settings dialog): 5.12.4

Client package (From Nextcloud or distro) (Linux only): nextcloud-client-2.5.3-1.fc30

Server configuration

Nextcloud version: 16.0.4.1

Logs

Access log:

91.XX.XX.XX - rbu [18/Sep/2019:13:57:29 +0000] "DELETE /remote.php/dav/files/rbu/projects HTTP/1.1" 204 0 "-" "Mozilla/5.0 (Linux) mirall/2.5.3git (Nextcloud)" "91.XX.XX.XX"
...
91.XX.XX.XX - rbu [18/Sep/2019:13:57:36 +0000] "DELETE /remote.php/dav/files/rbu/pictures HTTP/1.1" 204 0 "-" "Mozilla/5.0 (Linux) mirall/2.5.3git (Nextcloud)" "91.XX.XX.XX"
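For anyone who wants to check their own server for a similar burst, here is a minimal sketch that counts DELETE requests per user and minute in an access log. It assumes the log layout quoted above (IP, user, bracketed timestamp, quoted request line); the function name `delete_bursts` and the threshold are made up for illustration and are not part of any Nextcloud tooling.

```python
import re
from collections import Counter

# Matches the access-log layout quoted above:
#   <ip> - <user> [<timestamp>] "<METHOD> <path> <protocol>" <status> ...
# Not anchored at line start, so lines with an extra leading timestamp
# (e.g. from a container log) are matched too.
LINE_RE = re.compile(
    r'(?P<ip>\S+) - (?P<user>\S+) \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)


def delete_bursts(lines, threshold=2):
    """Return {(user, minute): count} for minutes with >= threshold DELETEs."""
    per_minute = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match or match.group("method") != "DELETE":
            continue
        # "18/Sep/2019:13:57:29 +0000" -> "18/Sep/2019:13:57"
        minute = match.group("ts")[:17]
        per_minute[(match.group("user"), minute)] += 1
    return {key: n for key, n in per_minute.items() if n >= threshold}


# The two log lines from the report above.
SAMPLE = [
    '91.XX.XX.XX - rbu [18/Sep/2019:13:57:29 +0000] "DELETE /remote.php/dav/files/rbu/projects HTTP/1.1" 204 0 "-" "Mozilla/5.0 (Linux) mirall/2.5.3git (Nextcloud)" "91.XX.XX.XX"',
    '91.XX.XX.XX - rbu [18/Sep/2019:13:57:36 +0000] "DELETE /remote.php/dav/files/rbu/pictures HTTP/1.1" 204 0 "-" "Mozilla/5.0 (Linux) mirall/2.5.3git (Nextcloud)" "91.XX.XX.XX"',
]

BURSTS = delete_bursts(SAMPLE)
```

A burst of DELETEs against top-level folders within the same minute, all from one client user agent, is the pattern reported here; a realistic threshold would be much higher than 2.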

rbu commented 5 years ago

This just happened to me AGAIN. It started removing everything in one subfolder and then crashed after removing a few thousand files from it. I have a complete (non debug) log of this process in the client. The first line that starts the delete is this:

[OCC::ActivityListModel::removeActivityFromActivityList         Activity/Notification/Error successfully dismissed:  "Could not remove folder '/home/rbu/Documents/X/2005' "
[OCC::ActivityListModel::removeActivityFromActivityList         Trying to remove Activity/Notification/Error from view... 
[OCC::ActivityListModel::removeActivityFromActivityList         Activity/Notification/Error successfully removed from the list.
[OCC::ActivityListModel::removeActivityFromActivityList         Updating Activity/Notification/Error view.
[_csync_merge_algorithm_visitor         INSTRUCTION_IGNORE             client dir:  X/.Y
[csync_reconcile        Reconciliation for local replica took  0.035 seconds visiting  33230  files.
[_csync_merge_algorithm_visitor         INSTRUCTION_REMOVE             server file: A/B/C
[_csync_merge_algorithm_visitor         INSTRUCTION_REMOVE             server file: A/B/D.xml
[_csync_merge_algorithm_visitor         INSTRUCTION_REMOVE             server file: A/B/E.js

Since the log file contains tons of private file names, I can't easily upload it here unfortunately.

rbu commented 5 years ago

This just happened to me a THIRD time within one week. It removed ALL the files from the server and crashed syncing the changes down.

I've disabled the client for now, as it's simply running amok on my files. I have a log file for this incident and the previous comment's incident of partial deletion that I'm happy to answer questions with.

DominiqueFuchs commented 5 years ago

Hi @rbu, really sorry for your experience with the client. However, to find the root cause for this it's absolutely crucial to be able to reproduce this over here (if I just set up the client on my standard Fedora box and connect it to my fresh NC16 Docker instance with a user having a random tree of online folders, nothing unexpected happens).

I understand you do not want to expose file names by posting a complete log. Few things though:

1. Could you use the exact same machine (/fedora installation) and use the affected client with a different testuser and throw in some dummy files? May be very helpful to see if the same issue appears after some time.
2. Any infos about the server instance differing from a standard installation? Any specific apps activated?
3. Which filesystem are you using on your client machine? XFS, ext4, btrfs?
4. Specific addendum to #2: is E2E activated on the server (if yes - is it activated on your folders?)

rbu commented 5 years ago

Hey @DominiqueFuchs, thanks for the response. Regarding the log, if you have specific questions regarding the log or a way to share it privately with an individual, I could do that.

1. That's a good idea, however I probably won't have the time to do this in the next 10 days (on vacation next week).
2. The server is privately hosted (on Docker/Fedora). I'll include a list of enabled plugins [1]. There's a (small) "Samba" remote share enabled in my files (which is the only thing that did not get deleted during any of those incidents).
3. The client is on btrfs, as is the server. There are about 40k files in over 1k directories in my personal folder.
4. E2E is not activated in the client or server.

Edit: I'd like to add one thing that I found odd. I remember seeing error messages about "too many open files" (in red, part of the sync status bar) before/during one of these incidents. I wasn't sure if it's too many network connections, too many open files, or maybe a failure to set up inotify listeners. Note that my client system allows a large number of inotify listeners: fs.inotify.max_user_watches=524288. I found a lot of lines like this in the log files (which could explain why the "Scans" samba share did not get deleted):

[_csync_merge_algorithm_visitor     INSTRUCTION_REMOVE             server file: Scans/2015/document2015-08-19-131546.pdf
[OCC::SyncEngine::checkForPermission    checkForPermission: RESTORING "Scans/2015/document2015-08-19-131546.pdf"
[OCC::PropagateItemJob::scheduleSelfOrChild     Starting INSTRUCTION_NEW propagation of "Scans/2015/document2015-08-19-131546.pdf" by OCC::PropagateDownloadFile(0x558b531cffd0)
[OCC::PropagateItemJob::done    Could not complete propagation of "Scans/2015/document2015-08-19-131546.pdf" by OCC::PropagateDownloadFile(0x558b531cffd0) with status 2 and error: "Not allowed to remove, restoring; Restoration Failed: Too many open files"
[OCC::ActivityWidget::slotItemCompleted     Item  "Scans/2015/document2015-08-19-131546.pdf"  retrieved resulted in  "Not allowed to remove, restoring; Restoration Failed: Too many open files"
[OCC::ActivityWidget::slotItemCompleted     Item  "Scans/2015/document2015-08-19-131546.pdf"  retrieved resulted in error  "Not allowed to remove, restoring; Restoration Failed: Too many open files"
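"Too many open files" is the standard message for the EMFILE error, which on Linux relates to the per-process file descriptor limit (and, for inotify, to the number of inotify instances) rather than to `max_user_watches`. The following is a small Linux-only sketch for checking those limits; the helper names are made up for illustration, and whether this limit was actually involved in the incidents above is not established.

```python
import resource
from pathlib import Path


def open_file_limits():
    """Soft and hard limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def read_int(path):
    """Read a single integer from a /proc file, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def count_open_fds(pid="self"):
    """Number of descriptors currently open in a process (Linux /proc)."""
    try:
        return sum(1 for _ in Path(f"/proc/{pid}/fd").iterdir())
    except OSError:
        return None


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("RLIMIT_NOFILE soft/hard:", soft, hard)
    print("inotify max_user_watches:",
          read_int("/proc/sys/fs/inotify/max_user_watches"))
    print("inotify max_user_instances:",
          read_int("/proc/sys/fs/inotify/max_user_instances"))
    print("open fds in this process:", count_open_fds())
```

Passing the PID of the running client to `count_open_fds` (instead of `"self"`) shows how close that process is to its descriptor limit, provided you have permission to read its `/proc` entry.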

I saw the first incident maybe 3 days after upgrading the fedora package from 2.5.2-1.fc30 to 2.5.3-1.fc30. I have since my last comment downgraded to 2.5.2-1.fc30 and haven't seen this problem again.

[1] List of enabled plugins

Accessibility
Activity
Auditing / Logging
Bookmarks
Brute-force settings
Calendar
Collaborative tags
Comments
Contacts
Deleted files
External storage support
Federation
File sharing
First run wizard
Gallery
Log Reader
Monitoring
Nextcloud announcements
Notes
Notifications
Password policy
PDF viewer
Plain text editor
Privacy
Recommendations
Right click
Share by mail
Support
Tasks
Theming
Update notification
Usage survey
Versions
Video player
Viewer
Group folders
Optical character recognition
Default encryption module
Full text search
Full text search - Elasticsearch Platform
Full text search - Files
Full text search - Files - Tesseract OCR
LDAP user and group backend
Social sharing via email
Talk

rbu commented 5 years ago

This has not happened in the 3 weeks since I downgraded from 2.5.3-1.fc30 to 2.5.2-1.fc30. I guess I'll just stay on that version for the foreseeable future :man_shrugging:

rbu commented 4 years ago

I thought I'd give this another chance and BOOM... all files are gone... again, on 2.6.2. That's after about 2 weeks of usage. I restored and had another few days of it not deleting all files.

luspi commented 4 years ago

Just wanted to say that this happened to me today. I caught the desktop client in the middle of deleting all files on the server, INCLUDING files that I do not have locally but only on the server. The activity log shows no errors but lists a selection of files that were supposedly deleted. Thankfully I'm backing my server up regularly...

Desktop client: 2.6.4 Nextcloud: 18.0.1 Server OS: Ubuntu 18.04

Bun-Bun commented 4 years ago

This also happened to me today. During the night while I was asleep, all the files on my server were deleted. The server was also full, because the files were being re-uploaded (and thus existed both in the main storage and in the trashbin) by the client that had deleted them from the server. Local files on that client appear to be OK. Local files on all other connected clients were deleted.

I am in the process of cleaning this up and restoring from a server backup, but would really like to figure out how to prevent this in the future as it has taken most of my day.

Confirmed that it was the one client by looking in the server's access logs:

xxx.xxx.xxx.xxx - Bun-Bun [14/Mar/2020:04:19:16 -0600] "DELETE /remote.php/dav/files/Bun-Bun/wallpapers HTTP/1.1" 204 - "-" "Mozilla/5.0 (Windows) mirall/2.6.2stable-Win64 (build 20191224) (Nextcloud)"

I only have one client of that version installed on my home IP. One of those lines for each of the items in the root of my nextcloud folder.

Server OS: Centos 7 Desktop client: Windows 2.6.2 (Win 10 Pro 1803) Nextcloud: 14

Matthiasfranck commented 4 years ago

The same happened to me today. Has anyone found a solution to this? I would like to use Nextcloud again, but with such a risk of losing data this is not possible.

Bun-Bun commented 4 years ago

My solution was to install Nextcloud-2.3.3.1 client and configure it to not check for updates.

That is the last version of the client before they started screwing everything up with it.

skjnldsv commented 4 years ago

Happened to me again today. Got a DELETE request for all files/folders in root.

2020-04-23T06:21:35.283574205Z 192.168.1.110 - skjnldsv [23/Apr/2020:06:21:35 +0000] "DELETE /remote.php/dav/files/skjnldsv/Wallet HTTP/2.0" 204 0 "-" "Mozilla/5.0 (Linux) mirall/2.6.4git (Nextcloud)" "192.168.1.110"
2020-04-23T06:21:35.312710439Z 192.168.1.110 - skjnldsv [23/Apr/2020:06:21:35 +0000] "DELETE /remote.php/dav/files/skjnldsv/Wallpapers HTTP/2.0" 204 0 "-" "Mozilla/5.0 (Linux) mirall/2.6.4git (Nextcloud)" "192.168.1.110"
2020-04-23T06:21:35.365646789Z 192.168.1.110 - skjnldsv [23/Apr/2020:06:21:35 +0000] "DELETE /remote.php/dav/files/skjnldsv/Ressources HTTP/2.0" 204 0 "-" "Mozilla/5.0 (Linux) mirall/2.6.4git (Nextcloud)" "192.168.1.110"
2020-04-23T06:21:35.707730232Z 192.168.1.110 - skjnldsv [23/Apr/2020:06:21:35 +0000] "DELETE /remote.php/dav/files/skjnldsv/Recipes HTTP/2.0" 204 0 "-" "Mozilla/5.0 (Linux) mirall/2.6.4git (Nextcloud)" "192.168.1.110"
2020-04-23T06:21:36.191836604Z 192.168.1.110 - skjnldsv [23/Apr/2020:06:21:36 +0000] "DELETE /remote.php/dav/files/skjnldsv/Projets HTTP/2.0" 204 0 "-" "Mozilla/5.0 (Linux) mirall/2.6.4git (Nextcloud)" "192.168.1.110"
... etc etc

Matthiasfranck commented 4 years ago

This is just unacceptable and I can no longer trust Nextcloud. I will never use it again. I can not understand that such a big fundamental issue is not solved right away.

skjnldsv commented 4 years ago

Client logs 2.6.4git

#=#=#=#=# Propagation starts 2020-04-23T06:20:32Z (last step: 473 msec, total: 473 msec)
06:20:33||pomodoro.md|INST_SYNC|Up|1587622830|57fae5758e5e15cdfb1f348a4b29a934|1595|00332907ocwwisqscdyi|4||204|1584|1587573309||||
#=#=#=# Syncrun finished 2020-04-23T06:20:33Z (last step: 622 msec, total: 1096 msec)
#=#=#=# Syncrun started 2020-04-23T06:21:32Z
#=#=#=#=# Propagation starts 2020-04-23T06:21:33Z (last step: 580 msec, total: 580 msec)
06:21:35||Wallet|INST_REMOVE|Up|1573118819|5dc3e3646e02c|0|00332902ocwwisqscdyi|4||204|0|0||||
06:21:35||Wallpapers|INST_REMOVE|Up|1573118964|5dc3e3f507b1f|0|00332904ocwwisqscdyi|4||204|0|0||||
06:21:35||Ressources|INST_REMOVE|Up|1586679129|5e92cd5944383|0|00245943ocwwisqscdyi|4||204|0|0||||
06:21:35||Recipes|INST_REMOVE|Up|1584547518|5e7246be0dffc|0|00372370ocwwisqscdyi|4||204|0|0||||
06:21:36||Projets|INST_REMOVE|Up|1573054036|5dc2e6555e104|0|00332782ocwwisqscdyi|4||204|0|0||||
06:21:36||ReactionGIFs|INST_REMOVE|Up|1586865570|5e95a5a2c15c5|0|00000165ocwwisqscdyi|4|Fichiers locaux et dossier partagé supprimés.|204|0|0||||
06:21:37||Notes|INST_REMOVE|Up|1585258126|5e7d1e9057409|0|00332779ocwwisqscdyi|4||204|0|0||||
06:21:37||Nextcloud|INST_REMOVE|Up|1579014247|5e1dd8680db6b|0|00332778ocwwisqscdyi|4||204|0|0||||
06:21:40||Photos|INST_REMOVE|Up|1584917324|5e77eb4c8122e|0|00332781ocwwisqscdyi|4||204|0|0||||
06:21:41||Images|INST_REMOVE|Up|1583069799|5e5bba6853237|0|00332777ocwwisqscdyi|4||204|0|0||||
06:21:42||Public|INST_REMOVE|Up|1587370555|5e9d5a3bdc9a8|0|00000180ocwwisqscdyi|4||204|0|0||||
06:21:43||Configs|INST_REMOVE|Up|1575530921|5de8b1aa00dc1|0|00332775ocwwisqscdyi|4||204|0|0||||
06:21:43||.Contacts-Backup|INST_REMOVE|Up|1587590604|5ea0b5ccbc5d0|0|00332774ocwwisqscdyi|4||204|0|0||||
06:21:46||emotionwheel.jpg|INST_REMOVE|Up|1577364107|d9ff8ef482f16a83c2f6ea0bd24c7344|317176|00363513ocwwisqscdyi|4||204|0|0||||
06:21:46||Documents|INST_REMOVE|Up|1586158591|5e8adbff18cb1|0|00332776ocwwisqscdyi|4||204|0|0||||
06:21:49||pomodoro.md|INST_REMOVE|Up|1587622830|57fae5758e5e15cdfb1f348a4b29a934|1595|00332907ocwwisqscdyi|4||204|0|0||||
06:21:51||InstantUpload|INST_REMOVE|Up|1587585809|5ea0a311dee9d|0|00332780ocwwisqscdyi|4||204|0|0||||
#=#=#=# Syncrun finished 2020-04-23T06:21:51Z (last step: 18518 msec, total: 19098 msec)
#=#=#=# Syncrun started 2020-04-23T06:22:32Z
#=#=#=#=# Propagation starts 2020-04-23T06:22:32Z (last step: 68 msec, total: 68 msec)
#=#=#=# Syncrun finished 2020-04-23T06:22:32Z (last step: 14 msec, total: 82 msec)
#=#=#=# Syncrun started 2020-04-23T07:20:33Z
#=#=#=#=# Propagation starts 2020-04-23T07:20:33Z (last step: 575 msec, total: 575 msec)
07:20:34||Configs|INST_NEW|Up|1575530918||4096|00381832ocwwisqscdyi|4||201|0|0||||
07:20:34||.Contacts-Backup|INST_NEW|Up|1587590615||12288|00381831ocwwisqscdyi|4||201|0|0||||
07:20:34||Images|INST_NEW|Up|1583069797||4096|00381833ocwwisqscdyi|4||201|0|0||||
07:20:34||Documents|INST_NEW|Up|1583938302||4096|00381834ocwwisqscdyi|4||201|0|0||||
07:20:34||InstantUpload|INST_NEW|Up|1578297635||135168|00381835ocwwisqscdyi|4||201|0|0||||
07:20:34||Nextcloud|INST_NEW|Up|1579014255||4096|00381836ocwwisqscdyi|4||201|0|0||||
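To make logs like the one above easier to scan, here is a rough sketch that groups the pipe-separated entries by sync run and counts instruction/direction pairs. The field positions (path in the third field, instruction in the fourth, direction in the fifth) are read off the lines quoted above and are an assumption about the format, not taken from the client's documentation; `summarize_sync_runs` is a made-up name.

```python
from collections import Counter


def summarize_sync_runs(lines):
    """Count (instruction, direction) pairs for each sync run in a summary log."""
    runs = []
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("#=#=#=# Syncrun started"):
            # The last token of the marker line is the start timestamp.
            current = {"started": line.split()[-1], "counts": Counter()}
            runs.append(current)
        elif not line or line.startswith("#"):
            continue  # other marker lines and blanks carry no item data
        else:
            fields = line.split("|")
            if len(fields) < 5:
                continue
            if current is None:  # entries before the first start marker
                current = {"started": None, "counts": Counter()}
                runs.append(current)
            current["counts"][(fields[3], fields[4])] += 1
    return runs


# A few lines copied from the log above.
SAMPLE = """\
#=#=#=# Syncrun started 2020-04-23T06:21:32Z
#=#=#=#=# Propagation starts 2020-04-23T06:21:33Z (last step: 580 msec, total: 580 msec)
06:21:35||Wallet|INST_REMOVE|Up|1573118819|5dc3e3646e02c|0|00332902ocwwisqscdyi|4||204|0|0||||
06:21:35||Wallpapers|INST_REMOVE|Up|1573118964|5dc3e3f507b1f|0|00332904ocwwisqscdyi|4||204|0|0||||
#=#=#=# Syncrun finished 2020-04-23T06:21:51Z (last step: 18518 msec, total: 19098 msec)
#=#=#=# Syncrun started 2020-04-23T07:20:33Z
#=#=#=#=# Propagation starts 2020-04-23T07:20:33Z (last step: 575 msec, total: 575 msec)
07:20:34||Configs|INST_NEW|Up|1575530918||4096|00381832ocwwisqscdyi|4||201|0|0||||
""".splitlines()

RUNS = summarize_sync_runs(SAMPLE)
```

A run dominated by `("INST_REMOVE", "Up")` entries for top-level names is the signature of the incident described here. A file name containing `|` would break the naive split, so treat the counts as approximate.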

skjnldsv commented 4 years ago

Just after moving all my files to the trashbin, the client started syncing again all my files to the server. Hope that helps! @misch7 @DominiqueFuchs, If I can be any more help, I have tons of other logs :pray:

skjnldsv commented 4 years ago

Any infos about the server instance differing from a standard installation? Any specific apps activated?

Which filesystem are you using on your client machine? XFS, ext4, btrfs?

Specific addendum to #2: is E2E activated on the server (if yes - is it activated on your folders?)

completely normal install, docker, 18.0.3. Never had an issue before

UUID=40671515 /boot ext4 defaults,relatime,data=ordered 0 0
UUID=30f0fed0 / ext4 defaults,relatime,data=ordered 0 1
UUID=9bb2b6f1 /home ext4 defaults,relatime,data=ordered 0 0
UUID=27d3bccd swap swap defaults 0 0

Nothing fancy, standard install, no E2E, no ServerSideEncrypt

DominiqueFuchs commented 4 years ago

Thanks @skjnldsv for providing the log snippets and answering the questions above! That helps a lot in trying to reproduce this. Based on the impact of this issue (extreme) and the comparatively low number of reports (4 people in this thread), there seems to be a pretty specific but exotic condition going on.

DominiqueFuchs commented 4 years ago

@skjnldsv Could you elaborate on the thing regarding the trashbin please? Not sure if I got your description right there. Do you mean the unintentionally/suddenly deleted files are synced back again after you put the local copies in the trash bin on your client side?

skjnldsv commented 4 years ago

Sure, observations are as follows:

See lots of network activity
Investigate, see the desktop client sending lots of data
See said client syncing the whole folder from scratch
Check the web UI and see that all previous files are in the trashbin, deleted at the same timestamps as the DELETE requests logged from client to server

Bun-Bun commented 4 years ago

@skjnldsv Could you elaborate on the thing regarding the trashbin please? Not sure if I got your description right there. Do you mean the unintentionally/suddenly deleted files are synced back again after you put the local copies in the trash bin on your client side?

Same scenario on my case. The bad client sent delete requests to the server, which moved everything to the NC trashbin, and then the bad client immediately started reuploading all the files back to the server. Which in my case caused it to fill up and stop working. Locally to the bad client the files were unchanged.

Matthiasfranck commented 4 years ago

In my case the files were gone on both server and client. Even in the nextcloud trash or the client trash (windows) were no files.

So I think that in my case the client started removing files from the computer and this propagated to the server who also started removing the files.

skjnldsv commented 4 years ago

@Bun-Bun linux too?

Bun-Bun commented 4 years ago

As my post earlier said

Server OS: Centos 7 Desktop client: Windows 2.6.2 (Win 10 Pro 1803) Nextcloud: 14

skjnldsv commented 4 years ago

Sorry! Thanks :)

bengtj commented 4 years ago

Maybe this is related. I had a 14GB folder with misc files that I yesterday:

Disconnected from nextcloud.
Moved from C:\ to D:\ with explorer.
Connected it again to nextcloud.

Today I discovered that the folder was empty on all clients. I did not find anything on the server in trash. I got desperate and now I'm restoring a backup that I fortunately did the 24th.

And just now I found that there was a restore button on the delete activity in the server that maybe could have fixed it even though I did not find the folders in the trash.

Anyhow, maybe this can be used for reproducing the issue. The client was "Version 2.6.2stable-Win64 (build 20191224)" and Nextcloud 17.0.2 running in a docker container on Fedora 30.

markc commented 4 years ago

Ubuntu 20.04 server with Nextcloud 18.0.4. A friend just had a similar experience. Everything on the server was in Trash. We tried to "bulk" Restore everything, but all the folders and files inside the half dozen top-level folders were restored up to the root folder, while the original top-level folders themselves remained in Trash. That was weird. Since the user's Linux client was crashing (not sure of the version), they had stopped the syncing, so they still had all their files locally. He deleted all files again on the server and I ran "php occ trashbin:cleanup user" and "php occ files:scan user" to clear everything out, so now he is re-syncing about 20GB, which will take about 24 hours. Said friend is not happy, to say the least, and after this experience I am not keen to encourage anyone else to use Nextcloud.

jvsalo commented 4 years ago

We had a similar case recently; a Windows client (2.6.2stable-Win64 (build 20191224)) suddenly asked the server (18.0.4) to delete all locally synced files on the server, while actually the client's local files were intact.

After that, it started uploading the local files back. Some of the deleted items were shares from other users (now un-shared; had to share again); the newly uploaded files now (unsurprisingly) showed up as the user's personal files instead of as shares from other users.

er-vin commented 4 years ago

Looks like a hard one to find and trigger (never triggered on my end for instance). What we would need is to collect as much information as possible to converge toward a scenario to just be able to reproduce it. This means we'd need something along those lines for each such mass removal event:

complete client log files around the time where it happened (so make sure the client runs with --logdir, --logdebug and probably --logexpire if you don't want said logs to fill up your disk too much);
the sync summary log for that particular sync which removed data;
the journal db files of the client (look in your nextcloud.cfg they are referenced there);
server log files containing the affected user logs for around the time of the event (probably one hour before and one hour after to be on the safe side).
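For the first item, a small sketch of how the three logging flags mentioned above might be combined into one command line. The binary name `nextcloud`, the log directory, and the assumption that `--logexpire` takes a number of hours are illustrative guesses; check the client's `--help` output on your platform before relying on them.

```python
import shlex


def logging_command(log_dir, expire_hours=48, binary="nextcloud"):
    """Build the argv for starting the client with debug logging enabled.

    Assumptions (not confirmed in this thread): the executable is called
    `nextcloud` and `--logexpire` expects a number of hours.
    """
    return [
        binary,
        "--logdir", str(log_dir),   # write log files into this directory
        "--logdebug",               # include debug-level messages
        "--logexpire", str(expire_hours),  # drop logs older than this
    ]


ARGV = logging_command("/tmp/nextcloud-logs")
COMMAND_LINE = shlex.join(ARGV)  # printable form for a shell or launcher
```

Quit the running client first so that the instance started this way is the one doing the syncing and the logging.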

Of course if someone has a minimal reproducible scenario already we're interested but from the comments in there I don't see anything which looks like it yet... so hopefully gathering richer data will lead us to find such a scenario. :crossed_fingers:

gdurin commented 4 years ago

Maybe mine is a good scenario to investigate this issue. As I have lost 90% of my fundamental data and folders (on the client side), I think it is worth explaining the situation. After my initial shock, I think I understood what happened.

Time0, at home: setup a brand new nas, with ubuntu 18 and nextcloud(0) from snap. Worked fine for 2 months during the lockdown with no problems

Time1, at work: Since I had to bring the NAS back to my workplace, I thought that changing the IP of nextcloud on my client would be enough. I was wrong: since the NAS had a new IP, I had to reinstall Ubuntu and nextcloud(1). So I deleted the original data on nextcloud(0), thinking I was going to sync them back from my client soon...

Time2, at work
Before the sync started, I had tons of data on the client and nothing on the server. I created the folders I was going to sync on the server (big error!). Nextcloud(1) synced ONLY the files made after the last sync at home, deleting all the older files and folders. The logs are full of lines like this...

||somethingi|INST_REMOVE|Down|1431508115||4096||4||0|0|0||||

I realized the db of nextcloud(0) was still in place on the client, so for sure it is the origin of the mess. In my opinion, this is a HUGE bug. Anyway, every time I mess up the db (broken/unsynced/old) I have problems (now I understand it).

By the way, where have all the INST_REMOVE files gone? Not in the client trash (unfortunately). I am still fighting to get them back using exotic data recovering, with no success at the moment. I can add any file you need, in case

Hope this helps Gianfranco

skjnldsv commented 4 years ago

@nextcloud/desktop anything we can do to help here?

christianpayer commented 4 years ago

I also experienced the same or at least a similar problem a couple of times. I will describe shortly what is happening:

I am running a Nextcloud server on Raspbian and mainly use three clients to connect to it: two running Arch Linux with v2.6.4git and one on a MacBook with client version 2.6.4stable (build 20200303). Without a specific reason (but probably due to a file/folder creation event), a client deletes almost every file of a synced folder on the server. On the client PC the files are still present, but when looking into the folders with a file manager, e.g., Dolphin, there are no icons (checkmarks, etc.) on the folders/files. When another client is running on another PC after this deletion event, all the files will be deleted locally on that PC as well. Now when restarting the client on the PC with the rogue client, the client recognizes that there are new files and syncs them again. Also the icons in Dolphin are now working again and show that the files missing on the server are syncing again.

This happened to me already a couple of months ago, but back then, I did not care that much, as it seemed that every file was still there, but I cannot confirm that. However, due to regular backups, it is not problematic for me.

Now this happened already 3 times during the last ~3 weeks, so it got really annoying. I started the clients with the advanced logging commands, and today, it happened again. I tried to copy all log files to be able to reproduce what I did before and after the deletion incident. I don't know what to look for in the log files, but maybe a developer can get some useful information out of it. As the files contain sensitive information, I won't share this publicly, but I am willing to share the files with a developer who wants to investigate in that.

DominiqueFuchs commented 4 years ago

I recently had some time to:

Take a real (real) close look again to the diff between 2.5.2 and 2.5.3 as a really basic way to spot some undefined behavior b/c @rbu reported that he was able to prevent this from happening by downgrading that way. I simply wasn't able to find anything related. Either I'm just unable to spot it (possible, maybe I'm just blind in this regard) or it is just coincidence
Look at the barest level of the sync engine (how csync is integrated in NC) to look at this from the bottom upwards.

AFAIK there is simply one single place that could trigger the false REMOVE instructions that can be seen in the logs from the client UP to server, happening in context of an iterative propagation phase at root level, which is this one:

https://github.com/nextcloud/desktop/blob/5db717d48c7ad32e24ba7f12aabcee09e2ae424c/src/csync/csync_reconcile.cpp#L135-L154

Now this is just a first glance and I may not have enough of the big picture in mind, but in general we seem to land here, meaning that the engine determined the current reconciliation item (folder) to be not on other tree, even though it should be:

https://github.com/nextcloud/desktop/blob/5db717d48c7ad32e24ba7f12aabcee09e2ae424c/src/csync/csync_reconcile.cpp#L113

Now the evil thing. If this is indeed the case (and again: not just shortsightedness of mine, yielding rubbish here) and the evaluation above is invalid for some reason, there is no second fallback: if so, CSYNC_INSTRUCTION_NONE or _UPDATE_METADATA will yield a REMOVE. AFAIK, CSYNC_INSTRUCTION_NONE could indeed be a possibility here, given the hypothetical circumstance that this is an invalid state.

tl;dr Above would be an ideal case to test/debug this (be it the findFile() function itself, which I doubt, or a possibly invalid path argument due to csync_file_stat_t *cur), but it is still necessary to actively reproduce the behaviour of affected users, which I wasn't able to.
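The fall-through described in this comment can be illustrated with a deliberately simplified model. This is not the csync code: the enum, the function `reconcile_item`, and the use of a plain set for the "other tree" are all invented for illustration, and only the one rule stated above is modelled (an item not found on the other tree whose instruction is NONE or UPDATE_METADATA becomes a REMOVE).

```python
from enum import Enum, auto


class Instruction(Enum):
    NONE = auto()             # item unchanged since the last sync
    UPDATE_METADATA = auto()  # only metadata changed
    REMOVE = auto()           # propagate a deletion to the other side


def reconcile_item(path, instruction, other_tree):
    """Toy version of the decision for one item of the current tree.

    `other_tree` is the set of paths believed to exist on the opposite
    replica. If the lookup fails, an unchanged item is interpreted as
    "deleted on the other side", with no second check.
    """
    if path in other_tree:
        return instruction  # found on both sides; other rules not modelled
    if instruction in (Instruction.NONE, Instruction.UPDATE_METADATA):
        return Instruction.REMOVE
    return instruction


# Correct other tree: the unchanged folder stays untouched.
OK_CASE = reconcile_item("Wallet", Instruction.NONE, {"Wallet", "Recipes"})

# Lookup wrongly fails (e.g. the other tree is invalid): same folder, same
# state, but now it is scheduled for removal.
BAD_CASE = reconcile_item("Wallet", Instruction.NONE, set())
```

The point of the sketch is that the decision depends entirely on the lookup being trustworthy; a false "not found" is indistinguishable from a real deletion.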

DominiqueFuchs commented 4 years ago

Another addendum/observation (rather obvious, but better state this here): Looking at the log files, the defect instruction trigger only happens for base folders ("only" as in "it doesn't fail from the top recursively, but for the root elements directly" not as in "whatever, only base folders, could be worse 🤨").
Finding the trigger for this bug may happen at a significantly higher code level, because right now I can't even imagine why the root folders should be processed here as elements (even more so without any reason while idling). There is nothing like a "regular check interval" for these elements as long as they (not their children, specifically the root elements themselves) do not get changed. Appearing in the reconciliation phase that way is probably itself a result of the bug.

skjnldsv commented 4 years ago

Thank you so much for investigating @DominiqueFuchs ! Maybe shall we add a safe if on line 153 then? (No knowledge on the desktop client whatsoever ! :see_no_evil: ) Just to be on the safe side?

DominiqueFuchs commented 4 years ago

@skjnldsv This would need a good condition for a deterministic case, i.e. "if file really deleted and not magical misinterpretation of the tree here". Because the current fallthrough case is the way the algorithm works (if the file is not present and all other cases before were false, it must have been deleted on this tree), I cannot think of a good prevention here. Finding the cause for the undefined state remains the goal.

One possible situation I'll try to check is that of a csync reconciliation entry with a faulty empty local tree:

Remote tree is correct and populated
Folders in remote tree also exist on client sync folder
Algorithm starts with an (incorrect) empty local tree, resulting in the assumption that local base folders all have been removed

This would match the cases described above where the files remained intact locally and get synced upwards again after the DELETE instructions (when algorithm starts again with correct local tree)

But like I said I have to test this, maybe these assumptions include misunderstandings of the big picture.
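The empty-local-tree hypothesis can be played through with a toy example. Everything here is hypothetical: `plan_removals` is a set difference standing in for the real reconciliation, and `looks_like_bogus_empty_tree` is one conceivable sanity check, not a description of what any actual fix does. The folder names are taken from the log posted earlier in this thread.

```python
def plan_removals(remote_tree, local_tree):
    """Remote top-level entries with no local counterpart get removed."""
    return sorted(set(remote_tree) - set(local_tree))


def looks_like_bogus_empty_tree(local_tree, entries_on_disk):
    """Hypothetical guard: discovery reported nothing locally although
    the sync folder on disk still contains entries."""
    return not local_tree and bool(entries_on_disk)


REMOTE = {"Wallet", "Wallpapers", "Recipes"}
ON_DISK = {"Wallet", "Wallpapers", "Recipes"}  # files are intact locally

# Normal run: local discovery matches the disk, nothing is removed.
NORMAL = plan_removals(REMOTE, local_tree=ON_DISK)

# Faulty run: local discovery yields an empty tree, so every remote base
# folder looks deleted and would be removed from the server.
FAULTY = plan_removals(REMOTE, local_tree=set())

# A guard of this kind would flag the faulty run before propagation.
SHOULD_ABORT = looks_like_bogus_empty_tree(set(), ON_DISK)
```

In the faulty run the next sync starts from a correct local tree again, sees the folders missing on the server, and uploads them, which matches the re-upload behaviour several people reported.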

skjnldsv commented 4 years ago

Okay, let us know if we can do anything to help! :)

DominiqueFuchs commented 4 years ago

I proposed a hotfix for this in #2183

Note that this is really just a hotfix and based on multiple assumptions due to the lack of reproducibility. However, the check introduced by this is generally a good idea, and with a little luck some of the affected people in here can test this as soon as it's merged. Negative or (hopefully) positive results will themselves be helpful in pinpointing it that way.

jospoortvliet commented 4 years ago

The hotfix made it into 2.6.5, go get the new version (website is updated, blog is out) and let us know right away if it happens again!

DominiqueFuchs commented 4 years ago

Another comment to clear some things up:

Even though your data loss is a very ugly one, @gdurin, the underlying issue is different from the bug discussed here. Your remove instructions went DOWNstream and are (AFAICT from your chronological report) due to the re-setups and config changes; I'm afraid the main reason was the creation of empty folders by hand before the client had ever reconnected to the new instance. Sync-logically, the downstream removal notifications were totally correct, assuming you created folders matching the older populated ones, with a newer timestamp but empty.

@christianpayer I still think your log files could be very valuable. When you were saying

a client deletes almost every file of a synced folder

do you mean the contents of a single, non-toplevel folder were erased, not the base folder itself? If needed, I'm happy to help you with some regex tooling or something similar if you want to replace specific information in your log file. By searching for INST_REMOVE in the log file you should be able to find fairly quickly the section that contains a rapid list of consecutive delete instructions belonging to your incident.
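As one possible starting point for such tooling, here is a sketch that replaces the path field of sync summary lines with short salted hashes, so that the sequence of instructions can be shared without exposing file names. The function names and the default salt are invented for illustration, the field positions are assumed from the log lines posted earlier in this thread, and debug-log lines (where paths appear inside quoted messages) would need separate handling.

```python
import hashlib


def pseudonymize_path(path, salt="change-me"):
    """Hash every path component, keeping depth and file extension."""
    parts = []
    for comp in path.split("/"):
        if not comp:
            parts.append(comp)
            continue
        stem, dot, ext = comp.rpartition(".")
        if not dot or not stem:
            # No extension, or a dotfile such as ".Contacts-Backup".
            stem, ext = comp, ""
        digest = hashlib.sha256((salt + stem).encode("utf-8")).hexdigest()[:8]
        parts.append(digest + ("." + ext if ext else ""))
    return "/".join(parts)


def scrub_summary_line(line, salt="change-me"):
    """Pseudonymize the path field of one pipe-separated summary line."""
    fields = line.split("|")
    if len(fields) > 3 and fields[3].startswith("INST_"):
        fields[2] = pseudonymize_path(fields[2], salt)
    return "|".join(fields)


LINE = ("06:21:46||emotionwheel.jpg|INST_REMOVE|Up|1577364107|"
        "d9ff8ef482f16a83c2f6ea0bd24c7344|317176|00363513ocwwisqscdyi|4||204|0|0||||")
SCRUBBED = scrub_summary_line(LINE, salt="my-secret")
```

Using the same salt for the whole file keeps identical names mapped to identical hashes, so renames and repeated entries stay recognizable. Keeping the extension leaks a little information; drop it if that matters.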

rbu commented 4 years ago

Thank you especially @DominiqueFuchs for keeping up with this bug better than I had. After I upgraded to 2.6.2 in January (my last reply), this happened two more times. However, a bit later I removed a remote share that I kept in sync in my root folder: This was a read-write Samba share that I exposed through nextcloud via External Storage, which lived on the top level of my Nextcloud folder (for Scans -- sidenote: I changed it so new documents created by the Scanner get copied to nextcloud via incrond). Ever since I made that change (on the server!), the client problem disappeared. I now keep updating nextcloud-client through whichever version the Fedora package has.

christianpayer commented 4 years ago

@DominiqueFuchs Many thanks for investigating in this issue! And sorry for the delayed reply.

To clarify my setup: On the server I have multiple top-level folders, and the folders that I want to sync with a specific client are set up via the button "Add Folder Sync Connection", e.g., on the server I have the folders /a /b and /c and I set up a client to have e.g. two "folder sync connections" /a and /c. When such a delete event occurs, only a single sync connection is affected, let's say it is /c. Within the folder /c I have a folder structure like this: /c/x, /c/y, and /c/z. Then, it can happen that all contents of /c/x and /c/y get deleted on the server, but /c/z stays intact. I cannot completely confirm, that all subfolders and -files of /c/z are unaffected, but I think so. On the harddrive, it seems that no files are deleted at all on the PC with the rogue client. (On the other synced PCs the files get deleted - which is to be expected.) After some time (or a client restart), the rogue client somehow detects that the files on the harddisk are not synced with the server and starts to upload them again.

Now to the bad news: I updated all my clients to 2.6.5, and the mass deletion event happened again today... I'm now unsure whether I'm affected by this bug or whether something else in my system is faulty... It happened on my PC running Arch Linux, while it was the only client connected and all my other PCs were turned off.

I unfortunately did not have time to look more deeply into my log files. I hope I find time over the weekend and will keep you updated if I spot something suspicious.

DominiqueFuchs commented 4 years ago

Hi @christianpayer, thanks for reporting back and sorry to hear that the issue persists for you 🙁

To clarify my setup: [...] When such a delete event occurs, only a single sync connection is affected, let's say it is /c. [...] Then, it can happen that all contents of /c/x and /c/y get deleted on the server, but /c/z stays intact.

Thanks for clarifying! This is indeed quite different from what I understood from this ticket so far. Unfortunately, I'm not surprised then that the newly introduced check didn't prevent the failure for you. Technical note: with the info you provided, I'm assuming that the underlying bug is to be found in invalid content of the sync trees (which is not the same as a totally empty tree). This will hopefully help with digging further. Given that @rbu reported that removing a remote share prevented the bug from being triggered, this may be the crucial thing that's common here and needed to reproduce it.

I unfortunately did not have time to look more deeply into my log files. I hope I find time over the weekend and will keep you updated if I spot something suspicious.

Thanks again!

@rbu Just to assure: Your synced/affected folders were regular User folder items, not specific sync connections as with the setup described above?

christianpayer commented 4 years ago

I did not fully investigate all the entries in the log files, but I have some insights that I want to share and that may help in understanding this issue. As for the others, there are lots of INSTRUCTION_REMOVE entries in my log files.

As I told you already, I again had a deletion event on Friday. In the log of the mass deletion, there are 329 INSTRUCTION_REMOVE events on both files and folders. Around two hours later, the client noticed on its own that there were "new" files on the hard disk, and it started to re-upload them. The client identified 11139 files and folders (INSTRUCTION_NEW) and started to sync them. I only noticed the deletion event after the client had already started uploading, so I let the client run to upload the files again. However, I did not restart the client after the upload finished.
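
Counts like these can be tallied with a short Python sketch. It assumes that every instruction appears in the log as a token starting with `INSTRUCTION_` followed by upper-case letters and underscores, as in the names quoted in this thread; the function name is a hypothetical helper.

```python
import re
from collections import Counter
from typing import Iterable

# Assumption: instruction names look like INSTRUCTION_REMOVE, INSTRUCTION_NEW, ...
INSTRUCTION_RE = re.compile(r"INSTRUCTION_[A-Z_]+")


def tally_instructions(lines: Iterable[str]) -> Counter:
    """Count how often each INSTRUCTION_* token occurs across the given lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(INSTRUCTION_RE.findall(line))
    return counts
```

Running it once per sync run (for example, on the slice of the log between two discovery phases) would show whether the same number of removals and re-uploads really recurs each time.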

Now it starts getting interesting: Every 3 to 4 hours over the weekend, the same procedure repeated in exactly the same way. This means that exactly the same 329 files/folders were deleted, and the same 11139 files/folders were uploaded again, every 3 to 4 hours. So the client stayed in this rogue state. This was the only client connected to my Nextcloud server, and I did not modify any file in the meantime. I observed this behavior and closed the rogue client. (I think the client did not quit successfully but crashed with a segfault. However, I am not completely sure about that...)

Then, I started the other clients such that everything on my computers is up-to-date.

Unfortunately, today another mass-deletion event occurred and all clients have to get synced again...

However, I might have an idea of what caused the client to start misbehaving. I'm running the client on a PC that does heavy calculations, and I run scripts that use lots of memory. From time to time the main memory runs out completely, so that the misbehaving script gets killed. Today, I again ran a script that had to be killed due to excessive memory usage. Note that I also do not have swap activated. So maybe this is a bug where a memory allocation fails and the failure is not detected. However, this is just my interpretation of the observed behavior.

I hope my insights can help in investigating this issue. If I should look for something specific in my log files, just tell me!

DominiqueFuchs commented 4 years ago

@christianpayer Thanks a lot for taking the time to look through the log and giving insight, also about your machine under heavy load! I assume you've taken a look at the lines above the first INSTRUCTION_REMOVE and there is nothing suspicious?

christianpayer commented 4 years ago

I just checked the files again and it seems that there are also lots of 'INSTRUCTION_UPDATE_METADATA' entries. In one of the deletion events there are 10565 of these entries. Before each 'INSTRUCTION_UPDATE_METADATA' there is a line like this:

2020-07-17 12:14:14:228 [ info nextcloud.sync.csync.updater ]:  Database entry found for *hidden*, compare: 1512551715 <-> 1512551715, etag: bde7c50dd0d676751e3b7701a661a253 <-> bde7c50dd0d676751e3b7701a661a253, inode: 0 <-> 7492986, size: 18090775 <-> 18090775, perms: 9f <-> 1, checksum:  <-> SHA1:*hidden* , ignore: 0,  e2e:

So it looks like the inode entry is different. Then, after the INSTRUCTION_UPDATE_METADATA block, there are these lines:
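
To check which fields differ without reading each line by eye, a small Python sketch can pull out every `name: left <-> right` pair and keep the unequal ones. It assumes the line layout of the single example quoted above (comma-separated pairs, values without spaces or commas); the function name is a hypothetical helper, and the sketch does not know which side is the database and which is the other replica.

```python
import re
from typing import Dict, Tuple

# Assumption: pairs look like "name: left <-> right", values contain no
# whitespace or commas, and either side may be empty.
PAIR_RE = re.compile(r"(\w+):\s*([^,\s]*)\s*<->\s*([^,\s]*)")


def differing_fields(line: str) -> Dict[str, Tuple[str, str]]:
    """Return {field: (left, right)} for every pair whose two sides differ."""
    return {
        name: (left, right)
        for name, left, right in PAIR_RE.findall(line)
        if left != right
    }
```

Applied to the example line quoted above, this reports `inode` (`0` vs `7492986`) and also `perms` (`9f` vs `1`) and `checksum` (empty vs `SHA1:*hidden*`), so the inode is not the only field that differs in that entry.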

2020-07-17 12:17:10:472 [ info nextcloud.sync.csync.csync ]:    Update detection for remote replica took 178.59 seconds walking 11461 files
2020-07-17 12:17:10:472 [ info nextcloud.sync.csync.utils ]:    Memory: 1383892K total size, 170068K resident, 41380K shared
2020-07-17 12:17:10:472 [ info nextcloud.sync.engine ]: #### Discovery end ####################################################  178598 ms
2020-07-17 12:17:10:472 [ info nextcloud.sync.csync.csync ]:    Reconciliation for local replica took  0 seconds visiting  0  files.

And then the block of 'INSTRUCTION_REMOVE' entries starts.

rbu commented 4 years ago

@rbu Just to assure: Your synced/affected folders were regular User folder items, not specific sync connections as with the setup described above?

Correct. The client triggered a remove of all files in my User's folder. Ironically, the files in the External Storage are the only ones that weren't touched.

Christophe31 commented 4 years ago

I had the same issue… Not sure it's linked, but my / partition (including /tmp and /var) is ~100% full (Ubuntu), while my home partition, which contains my synced Nextcloud folder, had space remaining.

jospoortvliet commented 4 years ago

I had same Issue… Not sure it's linked but my /tmp and my / partitions are ~100% full. (ubuntu). (while my home partition containing my synced nextcloud folder had remaining space)

I had it too today, and my /tmp is empty ;-) It happened after a fresh login after a fresh boot.

More details:

It seems a bit like the client moved existing files to a file named .[originalfilename].~[six-character code] so my files are now named, for example: .announcement schedule.md.~57f373ce - the date of this file is still the original date, so last modified weeks, months or even years ago. Then, that file was copied to the original file again: .announcement schedule.md -> with the date of course being today. All those files were marked as new. So the original files were all deleted on the server. New files appeared... Whether the new files would have been uploaded, I'm not sure - I stopped the sync at some point, after noticing the data loss, and started to restore things from trash while also locally making a copy of the affected folders to a backup location.
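
To list leftover files of this kind in a local sync folder, a Python sketch along these lines may help. It assumes the naming seen in the example above: a leading dot, the original name, then `.~` and a hexadecimal code of any length. The function names are hypothetical helpers, and the sketch only reads the file system.

```python
import re
from pathlib import Path
from typing import Iterator, Optional, Tuple

# Assumption: names look like ".<original name>.~<hex code>".
TEMP_NAME_RE = re.compile(r"\.(?P<original>.+)\.~(?P<code>[0-9a-fA-F]+)")


def split_temp_name(name: str) -> Optional[Tuple[str, str]]:
    """Return (original_name, code) if `name` follows the pattern, else None."""
    match = TEMP_NAME_RE.fullmatch(name)
    if match is None:
        return None
    return match.group("original"), match.group("code")


def find_temp_files(root: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (path, original_name) for every matching file below `root`."""
    for path in root.rglob("*"):
        if path.is_file():
            parsed = split_temp_name(path.name)
            if parsed is not None:
                yield path, parsed[0]
```

Comparing the modification time of each hit with that of the file carrying the original name would show which copy still has the old date.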

I restored everything by first restoring from trash, then copying from the backup folder (without overwrite).

Of course, all metadata (chat logs, comments, share links, favorites) were lost.

Both a colleague and I had this, just one day apart. She had just updated to 3.0.1, and so had I. She's on Mac, I'm on Linux. I'm now running the client with logging on and have over a dozen log files, but I'm guessing those aren't that helpful - the issue hasn't happened again...

robuswalk commented 4 years ago

Hello! I also have the same problem: in the last 4 months it has occurred 6 times with 4 different users (3 Mac, 1 PC). These clients randomly and automatically delete folders (on the server), even nested ones, and then automatically start re-uploading them from scratch ... it's really frustrating. I've tried several solutions (updating the client, updating the server, recreating the server from scratch, reinstalling everything) but the problem persists ...

er-vin commented 4 years ago

Hello! I also have the same problem: in the last 4 months it has occurred 6 times with 4 different users (3 Mac, 1 PC). These clients randomly and automatically delete folders, even nested ones, and then automatically start re-uploading them from scratch ... it's really frustrating. I've tried several solutions (updating the client, updating the server, recreating the server from scratch, reinstalling everything) but the problem persists ...

I think what you're describing is #260 which is a slightly different issue. #1433 has the client issuing removals on the server as well.

Christophe31 commented 4 years ago

Some of my teammates, who are more aware of this than I am, confirmed to me that in my case, too, some files were never re-uploaded.