Open mdusher opened 6 years ago
GitMate.io thinks the contributors most likely able to help are @ownclouders, and @PVince81.
Possibly related issues are https://github.com/owncloud/core/issues/240 (--- Removed ---), https://github.com/owncloud/core/issues/3260 (- Removed -), https://github.com/owncloud/core/issues/31165 (remove duplicated storages while checking the filecache corruption), and https://github.com/owncloud/core/issues/29708 (Case sensitive usernames when logging in with an app password via webdav).
I suspect a background job is cleaning them up when the folder no longer exists in the filecache
Yes, there's one.
Not sure why the filecache would clean itself.
The question is also "how" the storage is unavailable. Is it an NFS mount that is missing?
ownCloud has some checks in place to find out if the data folder is missing (when ".ocdata" does not exist) or for user homes if the "files" subdir is missing and the user already logged in before. In both these cases, it would throw a "StorageNotAvailableException" (mapped to 503) which tells the clients to go away when accessing over Webdav.
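The two checks described above can be modelled roughly as follows. This is a minimal Python sketch of the logic only, not ownCloud's actual PHP code: `check_storage`, `StorageNotAvailable` and the `has_logged_in_before` flag are hypothetical names, and the directory layout (`<data_dir>/<user>/files`) is an assumption made for the illustration.

```python
from pathlib import Path


class StorageNotAvailable(Exception):
    """Hypothetical stand-in for StorageNotAvailableException (mapped to HTTP 503)."""
    http_status = 503


def check_storage(data_dir: Path, user: str, has_logged_in_before: bool) -> None:
    # Check 1: the data directory must contain the ".ocdata" marker file.
    if not (data_dir / ".ocdata").exists():
        raise StorageNotAvailable("data directory marker .ocdata is missing")
    # Check 2: a returning user's home must still contain its "files" subdirectory.
    # A user who has never logged in legitimately has no home yet, so skip them.
    if has_logged_in_before and not (data_dir / user / "files").is_dir():
        raise StorageNotAvailable(f"home storage for {user} is missing")
```

Both checks only describe the state of the storage at the moment they run; they say nothing about what happens later in the same request.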
Maybe in your setup the outage looks different at the FS level, in a way that makes OC unable to detect it with the current mechanisms? I'm assuming you're not talking about external storages but the regular home storages.
@mdusher ^
@PVince81
In terms of storage, I'm referring to the user's home directories. We're running CERN's EOS as our underlying storage and interact with it via a FUSE mount. In terms of defining "unavailable", I mean when we've experienced an issue and the storage has crashed or we've taken it down on purpose for maintenance.
I've definitely encountered the StorageNotAvailableException before and narrowed it down to the .ocdata file being unavailable, so that part definitely works! My suspicion is that something checks for .ocdata at the start of its run and then goes through the rest of its logic on the assumption that the storage is still there (which is why I suspected a cron job).
I'm wondering if there is a way to possibly exclude the directory set in "share_folder" from being removed from the filecache?
In this specific scenario about unavailability, I think the filecache should stay untouched.
It is also likely that you ran into a race condition where some PHP request already went past the "ocdata" check and continued processing. Or maybe the cron job was already running and the test had already been done at the time the storage was suddenly gone.
This would only happen for a single cron job or single PHP request, so I'd find it strange if subsequent PHP requests would also bypass the check for whatever reason and result in empty folders.
Did everything disappear from file cache or really just that one "share_folder" ?
Things to verify:
@PVince81 I completely agree that it's quite likely it's already past the "ocdata" check. Due to the size of our installation, we are taking advantage of the ability to run cron.php multiple times asynchronously (we get terribly behind on the jobs otherwise).
As far as I can tell, it is just the "share_folder" that goes missing from the filecache (it's the only one that gets reported by our users).
Are there any updates on this issue?
@PVince @mdusher I just had a customer case where files disappeared from the /Shared folder during network issues with their primary storage (an NFS mount).
I wonder if the file_exists() check on .ocdata can get a cached result when NFS caching is on (which it is by default).
The customer reported that the nfs mount disappeared and the sync clients were able to upload files to a new folder at the same location where the nfs mount usually is.
@micbar I'm able to reproduce the problem, please contact me
I've been chasing problems with the Shared folder and shares being removed on our side for a while and think I have come up with a fairly reliable theory on what is happening on our side.
We are using CERN's EOS storage as our backend provider via a fuse mount provided by eosd.
After some testing, it seems that when that fuse mount is under high I/O load or is unavailable, file_exists() will return false (I'm guessing because it doesn't get a timely response). This also happens when you stop the eosd service (which stops providing the fuse mount).
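A boolean existence check cannot distinguish "the file is gone" from "the file system did not answer". As an illustration of that distinction, here is a small Python sketch (not ownCloud code; `probe` is a hypothetical helper) that looks at the error code of the underlying stat call instead. Which error code a stalled or stopped FUSE mount actually returns is an assumption here and would need to be checked against eosd.

```python
import errno
import os


def probe(path: str) -> str:
    """Return "present", "missing", or "unavailable" for a path."""
    try:
        os.stat(path)
        return "present"
    except OSError as exc:
        # ENOENT / ENOTDIR: the file system answered and the entry is not there.
        if exc.errno in (errno.ENOENT, errno.ENOTDIR):
            return "missing"
        # Anything else (for example EIO, ENOTCONN, ESTALE, ETIMEDOUT) means the
        # file system could not answer, which is not proof that the file is gone.
        return "unavailable"
```

Note that this does not help when the mount has disappeared entirely and the bare mount point directory is visible instead: the lookup then fails with a plain "missing", which is why a marker file check is still needed in that case.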
My theory is that when the OC\Files\ScanFiles background job runs to clean up the filecache table, it encounters this edge case: file_exists() returns false, so the file metadata is removed from the filecache table. Following that, the OC\Files_Sharing\DeleteOrphanedShares job runs and deletes the shares that point to file ids that no longer exist in the filecache table.
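To make the suspected cascade concrete, here is a toy Python model of it. It is only a sketch of the theory above, not the real ScanFiles or DeleteOrphanedShares code: the two functions, the dictionaries standing in for the filecache and share tables, and the sample paths are all made up for the illustration.

```python
def scan(filecache: dict, storage_exists) -> None:
    """Drop cache rows whose path the storage reports as absent."""
    for file_id, path in list(filecache.items()):
        if not storage_exists(path):
            del filecache[file_id]


def delete_orphaned_shares(shares: dict, filecache: dict) -> None:
    """Drop shares whose target file id no longer has a cache row."""
    for share_id, file_id in list(shares.items()):
        if file_id not in filecache:
            del shares[share_id]


# file id -> path, and share id -> file id (hypothetical sample data)
filecache = {1: "files/Shared", 2: "files/Shared/report.txt"}
shares = {10: 2}

# Simulate an outage: every existence check returns False.
scan(filecache, storage_exists=lambda path: False)
delete_orphaned_shares(shares, filecache)
print(filecache, shares)  # prints: {} {}
```

Neither step is wrong in isolation; the data loss comes from the first step treating "no answer" as "deleted" and the second step trusting the result.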
Now, my understanding is that when cron.php (or index.php) is run, it checks for the existence of .ocdata in the data directory. However, because the background tasks executed via cron.php can run for up to 15 minutes, it's entirely possible that the storage becomes unresponsive during that time and the job continues to perform tasks assuming that the storage is still there.
As a test, I've modified remove() and removeChildren() in lib/private/Files/Cache/Cache.php to run \OC_Util::checkDataDirectoryValidity() before performing the DELETE query and exit quietly if it reports any errors (see: https://github.com/mdusher/core/commit/e4dcf56c634ac640166170016d8c8e796608031c).
While this is technically a performance hit when a user deletes a file or folder, we are willing to take a hit in delete performance rather than having a user lose access to their data due to an edge case.
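The idea of re-validating the data directory immediately before each cache deletion can be sketched like this. It is a simplified Python model, not the linked PHP commit: `GuardedCache`, `storage_looks_valid` and the in-memory `rows` dictionary are hypothetical, and the validity check is reduced to testing for the .ocdata marker.

```python
from pathlib import Path


class GuardedCache:
    """Toy file cache that refuses to delete rows while the storage looks broken."""

    def __init__(self, data_dir: Path):
        self.data_dir = data_dir
        self.rows = {}  # path -> metadata

    def storage_looks_valid(self) -> bool:
        # Reduced stand-in for a data directory validity check.
        return (self.data_dir / ".ocdata").exists()

    def remove(self, path: str) -> bool:
        # Re-check right before deleting instead of relying on a check
        # made at the start of a long-running job.
        if not self.storage_looks_valid():
            return False  # exit quietly and keep the cache untouched
        self.rows.pop(path, None)
        prefix = path + "/"
        for child in [p for p in self.rows if p.startswith(prefix)]:
            del self.rows[child]
        return True
```

The guard narrows the window between check and delete but does not close it completely, since the storage can still fail between the two statements.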
We've been experiencing an issue where our "Shared" directory is removed from the filecache for some (not all) users when our underlying storage becomes unavailable (ie. we take it offline for an upgrade or an unplanned outage).
This causes the affected users to also lose all their current shares (I suspect a background job is cleaning them up when the folder no longer exists in the filecache), and it also appears in the activity log as the user deleting the folder (which is not the case).
I've been unable to pin down exactly what is causing this to happen as it is an event that occurs pretty irregularly and it seems pretty hard to replicate in our test environment.
My suspicion is that one of the following may be part of the cause, but I have been unable to confirm it:
Any suggestions to troubleshoot this are welcome!
Steps to reproduce
Unknown, the only consistent symptom is that it occurs when our file system becomes unavailable.
Expected behaviour
Shared directory is not removed from the filecache when the storage becomes unavailable.
Actual behaviour
Shared directory is removed from the filecache when the storage becomes unavailable.
Server configuration
Operating system: RHEL7
Web server: Apache 2.4.6
Database: MariaDB 10.0.28
PHP version: PHP-FPM 7.0.30
ownCloud version: 10.0.3
Updated from an older ownCloud or fresh install: Updated
Where did you install ownCloud from: TAR on the ownCloud website
Signing status (ownCloud 9.0 and above):
Integrity checker has been disabled. Integrity cannot be verified.
The content of config/config.php:
List of activated apps:
Are you using external storage, if yes which one: No
Are you using encryption: No
Are you using an external user-backend, if yes which one: No