Configured share_folder is removed from filecache when storage is unavailable

mdusher commented 6 years ago

We've been experiencing an issue where our "Shared" directory is removed from the filecache for some (not all) users when our underlying storage becomes unavailable (i.e. we take it offline for an upgrade, or there is an unplanned outage).

This causes the affected users to also lose all their current shares (I suspect a background job is cleaning them up when the folder no longer exists in the filecache), and it also appears in the activity log as the user deleting the folder (which is not the case).

I've been unable to pin down exactly what is causing this, as it occurs irregularly and is hard to replicate in our test environment.

My suspicion is that one of the following may be part of the cause, but I have been unable to confirm it:

Having 'share_folder' configured in config.php
ownCloud cron jobs are still running when the storage becomes unavailable and removes the share folder as it

Any suggestions to troubleshoot this are welcome!

Steps to reproduce

Unknown, the only consistent symptom is that it occurs when our file system becomes unavailable.

Expected behaviour

Shared directory is not removed from the filecache when the storage becomes unavailable.

Actual behaviour

Shared directory is removed from the filecache when the storage becomes unavailable.

Server configuration

Operating system: RHEL7

Web server: Apache 2.4.6

Database: MariaDB 10.0.28

PHP version: PHP-FPM 7.0.30

ownCloud version: 10.0.3

Updated from an older ownCloud or fresh install: Updated

Where did you install ownCloud from: TAR on the ownCloud website

Signing status (ownCloud 9.0 and above): Integrity checker has been disabled. Integrity cannot be verified.

The content of config/config.php:

{
    "system": {
        "instanceid": "5230042dc1897",
        "passwordsalt": "***REMOVED SENSITIVE VALUE***",
        "secret": "***REMOVED SENSITIVE VALUE***",
        "trusted_domains": {
            "0": "cloudstor.aarnet.edu.au",
        },
        "datadirectory": "\/cloudstor\/data\/owncloud\/data",
        "version": "10.0.3.3",
        "dbtype": "mysql",
        "dbname": "owncloudstable82",
        "dbhost": "127.0.0.1:6033",
        "dbuser": "***REMOVED SENSITIVE VALUE***",
        "dbpassword": "***REMOVED SENSITIVE VALUE***",
        "dbtableprefix": "",
        "installed": true,
        "operation.mode": "clustered-instance",
        "default_language": "en_GB",
        "defaultapp": "files",
        "knowledgebaseenabled": true,
        "enable_avatars": false,
        "allow_user_to_change_display_name": false,
        "session_lifetime": 86400,
        "session_keepalive": true,
        "token_auth_enforced": false,
        "mail_domain": "aarnet.edu.au",
        "mail_from_address": "cloudstor-noreply",
        "mail_smtpmode": "php",
        "overwriteprotocol": "https",
        "overwrite.cli.url": "https:\/\/cloudstor.aarnet.edu.au\/plus",
        "htaccess.RewriteBase": "\/plus",
        "trashbin_retention_obligation": "30, 60",
        "appcodechecker": false,
        "updatechecker": false,
        "has_internet_connection": true,
        "check_for_working_webdav": false,
        "check_for_working_htaccess": true,
        "log_type": "owncloud",
        "logfile": "\/cloudstor\/logs\/owncloud\/owncloud.log",
        "loglevel": 2,
        "logtimezone": "UTC",
        "log_query": false,
        "customclient_desktop": "https:\/\/cloudstor.aarnet.edu.au\/client-download\/",
        "customclient_android": "https:\/\/play.google.com\/store\/apps\/details?id=au.edu.aarnet.cloudstor.android",
        "customclient_ios": "https:\/\/itunes.apple.com\/au\/app\/cloudstor\/id1215476371?mt=8",
        "cron_log": true,
        "appstore.experimental.enabled": false,
        "apps_paths": [
            {
                "path": "\/cloudstor\/www\/owncloud\/apps",
                "url": "\/apps",
                "writable": true
            },
            {
                "path": "\/cloudstor\/www\/owncloud\/3rdparty-apps",
                "url": "\/3rdparty-apps",
                "writable": true
            }
        ],
        "enable_previews": true,
        "enabledPreviewProviders": [
            "OC\\Preview\\PNG",
            "OC\\Preview\\JPEG",
            "OC\\Preview\\GIF",
            "OC\\Preview\\BMP",
            "OC\\Preview\\XBitmap",
            "OC\\Preview\\TXT",
            "OC\\Preview\\MarkDown",
            "OC\\Preview\\Illustrator",
            "OC\\Preview\\Postscript",
            "OC\\Preview\\Photoshop",
            "OC\\Preview\\Movie"
        ],
        "maintenance": false,
        "singleuser": false,
        "memcache.local": "\\OC\\Memcache\\APCu",
        "memcache.distributed": "\\OC\\Memcache\\Redis",
        "redis.cluster": {
            "seeds": [
                "127.0.0.1:6379"
            ],
            "timeout": 0,
            "read_timeout": 0,
            "failover_mode": 2
        },
        "memcached_servers": [
            [
                "127.0.0.1",
                11211
            ]
        ],
        "blacklisted_files": [
            ".htaccess"
        ],
        "share_folder": "\/Shared",
        "cipher": "AES-256-CFB",
        "minimum.supported.desktop.version": "2.4.2",
        "quota_include_external_storage": false,
        "filesystem_check_changes": 0,
        "filesystem_cache_readonly": false,
        "forwarded_for_headers": [
            "HTTP_X_FORWARDED",
            "HTTP_FORWARDED_FOR"
        ],
        "filelocking.enabled": false,
        "memcache.locking": "\\OC\\Memcache\\Redis",
        "upgrade.disable-web": true,
        "upgrade.automatic-app-update": false,
        "integrity.check.disabled": true,
        "cache_path": "\/cloudstor\/data\/tmp",
        "tempdirectory": "\/cloudstor\/data\/tmp",
        "mail_smtpdebug": false,
        "mail_smtphost": "smtp.aarnet.edu.au",
        "mail_smtpport": "25",
        "mail_smtptimeout": 10,
        "preview_office_cl_parameters": "",
        "preview_max_scale_factor": 10,
        "preview_max_filesize_image": 100,
        "openssl": [],
        "activity_expire_days": 365,
    }
}

List of activated apps:

Enabled:
  - activity: 2.3.4
  - cloudstortheme: 1.0.0
  - collections: 1.1.1
  - comments: 0.3.0
  - configreport: 0.1.1
  - dav: 0.3.0
  - dicomviewer: 0.0.6
  - federatedfilesharing: 0.3.1
  - federation: 0.1.0
  - files: 1.5.1
  - files_clipboard: 0.6.4
  - files_external: 0.7.1
  - files_jmol: 0.0.1
  - files_pdfviewer: 0.8.2
  - files_sharing: 0.10.1
  - files_texteditor: 2.2
  - files_thingiview: 0.0.1
  - files_trashbin: 0.9.1
  - files_versions: 1.3.0
  - files_videoplayer: 0.9.8
  - filescan: 0.0.1
  - filesenderapp: 1.0
  - firstrunwizard: 1.1
  - gallery: 16.1.0
  - impersonate: 0.1.0
  - market: 0.2.2
  - music: 0.9.2
  - notifications: 0.3.1
  - onlyoffice: 1.3.0
  - password_policy: 2.0.0
  - provisioning_api: 0.5.0
  - security: 0.0.2
  - updatenotification: 0.2.1
  - user_saml: 0.4
Disabled:
  - encryption
  - external
  - files_antivirus
  - systemtags
  - templateeditor
  - user_external

Are you using external storage, if yes which one: No

Are you using encryption: No

Are you using an external user-backend, if yes which one: No

ownclouders commented 6 years ago

GitMate.io thinks the contributors most likely able to help are @ownclouders, and @PVince81.

Possibly related issues are https://github.com/owncloud/core/issues/240 (--- Removed ---), https://github.com/owncloud/core/issues/3260 (- Removed -), https://github.com/owncloud/core/issues/31165 (remove duplicated storages while checking the filecache corruption), and https://github.com/owncloud/core/issues/29708 (Case sensitive usernames when logging in with an app password via webdav).

PVince81 commented 6 years ago

I suspect a background job is cleaning them up when the folder no longer exists in the filecache

Yes, there's one.

Not sure why the filecache would clean itself.

The question is also "how" the storage is not available. Is it an NFS mount that is missing?

ownCloud has some checks in place to find out if the data folder is missing (when ".ocdata" does not exist) or for user homes if the "files" subdir is missing and the user already logged in before. In both these cases, it would throw a "StorageNotAvailableException" (mapped to 503) which tells the clients to go away when accessing over Webdav.
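As a rough illustration of the marker-file check described above, here is a minimal Python sketch. It is not ownCloud's PHP code: `StorageNotAvailable`, `assert_data_dir_available` and `handle_request` are hypothetical names made up for this example. Only the `.ocdata` marker file and the mapping to HTTP 503 are taken from the description.

```python
import os
import tempfile


class StorageNotAvailable(Exception):
    """Hypothetical stand-in for StorageNotAvailableException (reported as HTTP 503)."""
    http_status = 503


def assert_data_dir_available(data_dir):
    """Raise if the marker file is absent, i.e. the data directory looks unmounted."""
    marker = os.path.join(data_dir, ".ocdata")
    if not os.path.isfile(marker):
        raise StorageNotAvailable(f"{marker} not found")


def handle_request(data_dir):
    """Return an HTTP-like status code for a request that touches data_dir."""
    try:
        assert_data_dir_available(data_dir)
    except StorageNotAvailable as exc:
        return exc.http_status
    return 200


# Demo: a directory with the marker is accepted, one without it is rejected.
with tempfile.TemporaryDirectory() as mounted, tempfile.TemporaryDirectory() as empty:
    open(os.path.join(mounted, ".ocdata"), "w").close()
    status_mounted = handle_request(mounted)
    status_empty = handle_request(empty)
```

Note that a check like this only describes the storage at the moment it runs; it says nothing about whether the storage is still there a few seconds later.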

Maybe in your setup the outage looks different at the FS level, in a way that makes OC unable to detect it with the current mechanisms? I'm assuming you're not talking about external storages but the regular home storages.

@mdusher ^

mdusher commented 6 years ago

@PVince81

In terms of storage, I'm referring to the user's home directories. We're running CERN's EOS as our underlying storage and interact with it via a FUSE mount. In terms of defining "unavailable", I mean when we've experienced an issue and the storage has crashed or we've taken it down on purpose for maintenance.

I've definitely encountered the StorageNotAvailableException before and narrowed it down to the .ocdata file being unavailable, so that part definitely works! My suspicion is that something checks for .ocdata at the start of its run and then goes through the rest of its logic on the assumption that the storage is still there (which is why I suspected a cron job).

I'm wondering if there is a way to possibly exclude the directory set in "share_folder" from being removed from the filecache?

PVince81 commented 6 years ago

In this specific scenario about unavailability, I think the filecache should stay untouched.

It is also likely that you ran into a race condition where some PHP request already went past the "ocdata" check and continued processing. Or maybe the cron job was already running and the test had already been done at the time the storage was suddenly gone.

This would only happen for a single cron job or single PHP request, so I'd find it strange if subsequent PHP requests would also bypass the check for whatever reason and result in empty folders.

Did everything disappear from the file cache, or really just that one "share_folder"?

Things to verify:

[ ] whether cron jobs, especially "cleanup orphaned shares", properly trigger the "ocdata" check
[ ] what happens if such a cron job is running while, in parallel, the storage becomes unavailable
[ ] what other operations might behave badly if the storage becomes unavailable after the "ocdata" check
[ ] investigate "share_folder" code paths to find out what could cause that one folder alone to disappear

mdusher commented 6 years ago

@PVince81 I completely agree that it's quite likely it's already past the "ocdata" check. Due to the size of our installation, we are taking advantage of the ability to run cron.php multiple times asynchronously (we get terribly behind on the jobs otherwise).

As far as I can tell, it is just the "share_folder" that goes missing from the filecache (it's the only one that gets reported by our users).

mdusher commented 5 years ago

Are there any updates on this issue?

micbar commented 5 years ago

@PVince @mdusher I just had a customer case where files disappeared in the /Shared folder while the customer was having network issues with their primary storage (NFS mount).

I wonder if the file_exists() check on .ocdata can get a cached result when NFS caching is on (which it is by default).

The customer reported that the nfs mount disappeared and the sync clients were able to upload files to a new folder at the same location where the nfs mount usually is.
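The report above describes a mount point that silently became an ordinary directory, so writes landed on the local disk. One way to tell the two states apart is to ask the OS whether the path is currently a mount point. The following Python sketch shows the idea; `storage_is_mounted` and `safe_to_write` are hypothetical helpers, and it assumes the data directory is itself the mount point, which may not match every setup.

```python
import os
import tempfile


def storage_is_mounted(mount_point):
    """True only if mount_point is currently a mount point, not a plain directory."""
    return os.path.ismount(mount_point)


def safe_to_write(mount_point):
    """Refuse writes when the mount is gone, even though the directory still exists."""
    return os.path.isdir(mount_point) and storage_is_mounted(mount_point)


# A bare directory stands in for the mount point after the NFS mount vanished.
with tempfile.TemporaryDirectory() as bare_dir:
    bare_dir_exists = os.path.isdir(bare_dir)
    bare_dir_writable = safe_to_write(bare_dir)

# The filesystem root is always a mount point, so it passes the check.
root_writable = safe_to_write(os.sep)
```

A plain "does the directory exist" test passes in both states, which is consistent with clients being able to upload into the empty mount point.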

cdamken commented 5 years ago

@micbar I'm able to reproduce the problem, please contact me

mdusher commented 5 years ago

I've been chasing problems with the Shared folder and shares being removed for a while, and I think I have come up with a fairly reliable theory about what is happening on our side.

We are using CERN's EOS storage as our backend provider via a fuse mount provided by eosd.

After some testing, it seems when that fuse mount is under high I/O load or is unavailable - file_exists() will return false (I'm guessing because it doesn't get a timely response). This also happens when you stop the eosd service (which stops providing the fuse mount).
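The underlying problem is that a boolean existence check cannot distinguish "the file is gone" from "the storage did not answer". A minimal Python sketch of a three-state probe follows. `probe`, `naive_exists` and `broken_stat` are hypothetical names, the outage is simulated by injecting a failing `stat` function, and which error code a real FUSE or NFS outage produces is an assumption (EIO is used here purely as an example).

```python
import errno
import os
import tempfile


def naive_exists(path, stat=os.stat):
    """Boolean check that, like a plain exists() call, hides why stat failed."""
    try:
        stat(path)
        return True
    except OSError:
        return False


def probe(path, stat=os.stat):
    """Classify a path as 'present', 'absent', or 'unknown' (storage error)."""
    try:
        stat(path)
        return "present"
    except (FileNotFoundError, NotADirectoryError):
        return "absent"  # the storage answered: the path really is not there
    except OSError:
        return "unknown"  # any other failure: do not treat this as a deletion


def broken_stat(path):
    """Simulate a hung or disconnected mount (assumed error code)."""
    raise OSError(errno.EIO, "simulated I/O error from a hung mount")


with tempfile.TemporaryDirectory() as d:
    real = os.path.join(d, "file")
    open(real, "w").close()
    results = {
        "present": probe(real),
        "absent": probe(os.path.join(d, "missing")),
        "outage": probe(real, stat=broken_stat),
        "naive_outage": naive_exists(real, stat=broken_stat),
    }
```

With the naive check, the simulated outage is indistinguishable from a deleted file; with the three-state probe, a caller could skip any cleanup when the answer is "unknown".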

My theory is that when the OC\Files\ScanFiles background job runs to clean up the filecache table, it encounters this edge case: file_exists() returns false, so the file metadata is removed from the filecache table. Following that, the OC\Files_Sharing\DeleteOrphanedShares job runs and deletes the shares that point to file ids that no longer exist in the filecache table.
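The two-step chain in this theory can be modelled with a toy example. The Python sketch below is not the real ScanFiles or DeleteOrphanedShares code: the dictionaries standing in for the filecache and share tables, the function names, and the sample paths are all made up for illustration, and the outage is simplified to "every existence check returns false".

```python
def scan_files(filecache, exists):
    """Drop cache rows whose path the storage reports as missing; return removed ids."""
    removed = [fid for fid, path in filecache.items() if not exists(path)]
    for fid in removed:
        del filecache[fid]
    return removed


def delete_orphaned_shares(shares, filecache):
    """Drop shares whose file id no longer has a cache row; return removed share ids."""
    orphaned = [sid for sid, fid in shares.items() if fid not in filecache]
    for sid in orphaned:
        del shares[sid]
    return orphaned


def storage_down(path):
    """Simulated outage: every existence check comes back false."""
    return False


# Hypothetical state: file id -> path, and share id -> file id.
filecache = {1: "alice/files/Shared", 2: "alice/files/notes.txt"}
shares = {10: 1}

removed_rows = scan_files(filecache, storage_down)
removed_shares = delete_orphaned_shares(shares, filecache)
```

Neither step is wrong on its own; the damage comes from the first step trusting a false negative and the second step trusting the first.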

Now, my understanding is that when cron.php (or index.php) is run, it checks for the existence of .ocdata in the data directory. However, because the background tasks executed via cron.php can run for up to 15 minutes - it's entirely possible that the storage might become unresponsive during that time and it will continue to perform tasks assuming that the storage is still there.

As a test, I've modified remove() and removeChildren() in lib/private/Files/Cache/Cache.php to run \OC_Util::checkDataDirectoryValidity() before performing the DELETE query and to exit quietly if it reports any errors (see: https://github.com/mdusher/core/commit/e4dcf56c634ac640166170016d8c8e796608031c).
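The shape of that guard can be sketched as follows. This is a Python toy model of the idea as described in the paragraph above, not a transcription of the linked commit: the `Cache` class, its `rows` dictionary and the `storage_ok` callable are hypothetical stand-ins for the real cache and for the data directory validity check.

```python
class Cache:
    """Toy file cache; storage_ok is a hypothetical health-check callable."""

    def __init__(self, rows, storage_ok):
        self.rows = dict(rows)
        self.storage_ok = storage_ok

    def remove(self, path):
        """Delete the row for path and its children, unless storage looks unhealthy."""
        if not self.storage_ok():
            return False  # exit quietly; the rows stay in place
        doomed = [p for p in self.rows if p == path or p.startswith(path + "/")]
        for key in doomed:
            del self.rows[key]
        return True


# Simulate an outage first, then a healthy storage.
healthy = {"value": False}
cache = Cache(
    {"files/Shared": 1, "files/Shared/report.pdf": 2},
    lambda: healthy["value"],
)

skipped = cache.remove("files/Shared")
rows_during_outage = dict(cache.rows)

healthy["value"] = True
removed = cache.remove("files/Shared")
rows_after = dict(cache.rows)
```

The health check runs on every delete rather than once per job, which is where the extra cost comes from.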

While this is technically a performance hit when a user deletes a file or folder, we are willing to take a hit in delete performance rather than having a user lose access to their data due to an edge case.
