[Bug]: file is locked, existing lock on file: exclusive on big files upload

k-popov commented 2 years ago

⚠️ This issue respects the following points: ⚠️

[X] This is a bug, not a question or a configuration/webserver/proxy issue.
[X] This issue is not already reported on Github (I've searched it).
[X] Nextcloud Server is up to date. See Maintenance and Release Schedule for supported versions.
[X] I agree to follow Nextcloud's Code of Conduct.

Bug description

Getting error "zeros_1gb.new.dd" is locked, existing lock on file: exclusive when uploading large files (1Gb+) with browser.

Setup:

NextCloud 24.0.1 (official docker image).
PHP is run by static FPM pool.
Running in kubernetes (official helm chart).
NextCloud system files (/var/www/html) stored in RWO volume, 10 pods use this volume on one node.
User files stored in S3-like storage
File locking set up in Redis (external Redis server).
filelocking.debug set to true

Logs viewed through Kibana show (pod_name - log_message, notice MOVE request, second lock attempt is done by different pod if it matters):

app-nextcloud-int-6f8b6d568-v9lng - NOTICE: PHP message: [owncloud][locking][1] acquire shared lock on "files/zeros_1gb.new.dd" on storage "object::user:popov.k"
app-nextcloud-int-6f8b6d568-v9lng - NOTICE: PHP message: [owncloud][locking][1] change lock on "files/zeros_1gb.new.dd" to exclusive on storage "object::user:popov.k"
app-nextcloud-int-6f8b6d568-vbdbt - NOTICE: PHP message: [owncloud][locking][1] acquire shared lock on "files/zeros_1gb.new.dd" on storage "object::user:popov.k"
app-nextcloud-int-6f8b6d568-vbdbt - NOTICE: PHP message: [owncloud][no app in context][1] {"Exception":"OCP\\Lock\\LockedException","Message":"\"files/9faf980fca1fefa8ed8a83420b80642b\"(\"object::user:popov.k::files/zeros_1gb.new.dd\") is locked, existing lock on file: exclusive","Code":0,"Trace":[{"file":"/var/www/html/lib/private/Files/Storage/Common.php","line":758,"function":"acquireLock","class":"OC\\Lock\\MemcacheLockingProvider","type":"->","args":["files/9faf980fca1fefa8ed8a83420b80642b",1,"object::user:popov.k::files/zeros_1gb.new.dd"]},{"file":"/var/www/html/lib/private/Files/Storage/Wrapper/Wrapper.php","line":610,"function":"acquireLock","class":"OC\\Files\\Storage\\Common","type":"->","args":["files/zeros_1gb.new.dd",1,{"__class__":"OC\\Lock\\MemcacheLockingProvider"}]},{"file":"/var/www/html/lib/private/Files/Storage/Wrapper/Wrapper.php","line":610,"function":"acquireLock","class":"OC\\Files\\Storage\\Wrapper\\Wrapper","type":"->","args":["files/zeros_1gb.new.dd",1,{"__class__":"OC\\Lock\\MemcacheLockingProvider"}]},{"file":"/var/www/html/lib/private/Files/View.php","line":1961,"function":"acquireLock","class":"OC\\Files\\Storage\\Wrapper\\Wrapper","type":"->","args":["files/zeros_1gb.new.dd",1,{"__class__":"OC\\Lock\\MemcacheLockingProvider"}]},{"file":"/var/www/html/lib/private/Files/View.php","line":2076,"function":"lockPath","class":"OC\\Files\\View","type":"->","args":["//zeros_1gb.new.dd",1,false]},{"file":"/var/www/html/apps/dav/lib/Connector/Sabre/Node.php","line":406,"function":"lockFile","class":"OC\\Files\\View","type":"->","args":["//zeros_1gb.new.dd",1]},{"file":"/var/www/html/apps/dav/lib/Connector/Sabre/Directory.php","line":161,"function":"acquireLock","class":"OCA\\DAV\\Connector\\Sabre\\Node","type":"->","args":[1]},{"file":"/var/www/html/3rdparty/sabre/dav/lib/DAV/Tree.php","line":307,"function":"createFile","class":"OCA\\DAV\\Connector\\Sabre\\Directory","type":"->","args":["zeros_1gb.new.dd",null]},{"file":"/var/www/html/3rdparty/sabre/dav/lib/DAV/Tree.php","line":133,"function":"copyNode","class":"Sabre\\DAV\\Tree","type":"->","args":[{"__class__":"OCA\\DAV\\Upload\\FutureFile"},{"__class__":"OCA\\DAV\\Files\\FilesHome"},"zeros_1gb.new.dd"]},{"file":"/var/www/html/3rdparty/sabre/dav/lib/DAV/Tree.php","line":163,"function":"copy","class":"Sabre\\DAV\\Tree","type":"->","args":["uploads/popov.k/web-file-upload-ec4426fdee1b17f40277ca0514822f4b-1656342757986/.file","files/popov.k/zeros_1gb.new.dd"]},{"file":"/var/www/html/apps/dav/lib/Upload/ChunkingPlugin.php","line":94,"function":"move","class":"Sabre\\DAV\\Tree","type":"->","args":["uploads/popov.k/web-file-upload-ec4426fdee1b17f40277ca0514822f4b-1656342757986/.file","files/popov.k/zeros_1gb.new.dd"]},{"file":"/var/www/html/apps/dav/lib/Upload/ChunkingPlugin.php","line":76,"function":"performMove","class":"OCA\\DAV\\Upload\\ChunkingPlugin","type":"->","args":["uploads/popov.k/web-file-upload-ec4426fdee1b17f40277ca0514822f4b-1656342757986/.file","files/popov.k/zeros_1gb.new.dd"]},{"file":"/var/www/html/3rdparty/sabre/event/lib/WildcardEmitterTrait.php","line":89,"function":"beforeMove","class":"OCA\\DAV\\Upload\\ChunkingPlugin","type":"->","args":["uploads/popov.k/web-file-upload-ec4426fdee1b17f40277ca0514822f4b-1656342757986/.file","files/popov.k/zeros_1gb.new.dd"]},{"file":"/var/www/html/3rdparty/sabre/dav/lib/DAV/CorePlugin.php","line":603,"function":"emit","class":"Sabre\\DAV\\Server","type":"->","args":["beforeMove",["uploads/popov.k/web-file-upload-ec4426fdee1b17f40277ca0514822f4b-1656342757986/.file","files/popov.k/zeros_1gb.new.dd"]]},{"file":"/var/www/html/3rdparty/sabre/event/lib/WildcardEmitterTrait.php","line":89,"function":"httpMove","class":"Sabre\\DAV\\CorePlugin","type":"->","args":[{"__class__":"Sabre\\HTTP\\Request"},{"__class__":"Sabre\\HTTP\\Response"}]},{"file":"/var/www/html/3rdparty/sabre/dav/lib/DAV/Server.php","line":472,"function":"emit","class":"Sabre\\DAV\\Server","type":"->","args":["method:MOVE",[{"__class__":"Sabre\\HTTP\\Request"},{"__class__":"Sabre\\HTTP\\Response"}]]},{"file":"/var/www/html/3rdparty/sabre/dav/lib/DAV/Server.php","line":253,"function":"invokeMethod","class":"Sabre\\DAV\\Server","type":"->","args":[{"__class__":"Sabre\\HTTP\\Request"},{"__class__":"Sabre\\HTTP\\Response"}]},{"file":"/var/www/html/3rdparty/sabre/dav/lib/DAV/Server.php","line":321,"function":"start","class":"Sabre\\DAV\\Server","type":"->","args":[]},{"file":"/var/www/html/apps/dav/lib/Server.php","line":352,"function":"exec","class":"Sabre\\DAV\\Server","type":"->","args":[]},{"file":"/var/www/html/apps/dav/appinfo/v2/remote.php","line":35,"function":"exec","class":"OCA\\DAV\\Server","type":"->","args":[]},{"file":"/var/www/html/remote.php","line":166,"args":["/var/www/html/apps/dav/appinfo/v2/remote.php"],"function":"require_once"}],"File":"/var/www/html/lib/private/Lock/MemcacheLockingProvider.php","Line":80,"CustomMessage":"--"}
app-nextcloud-int-6f8b6d568-v9lng - NOTICE: PHP message: [owncloud][locking][1] change lock on "files/zeros_1gb.new.dd" to shared on storage "object::user:popov.k"
app-nextcloud-int-6f8b6d568-v9lng - NOTICE: PHP message: [owncloud][locking][1] release shared lock on "files/zeros_1gb.new.dd" on storage "object::user:popov.k"

From browser the process looks like the following:

file is being uploaded in chunks
after last chunk is uploaded, request "MOVE /remote.php" is done which completes with 423 (Locked) response. (I suspect this code making the request)
error message is displayed in NextCloud pop-up
file remains in NextCloud with .part suffix

When using desktop client to sync the same file server-side logs display the same sequence but the client (probably) does retry for this MOVE request which succeeds.

Steps to reproduce

set up NectCloud like described above
upload 1Gb+ file via browser

Expected behavior

Even large files uploaded successfully.

Solution (?): Locking retried on server side when executing the MOVE request.

Installation method

Official Docker image

Operating system

Debian/Ubuntu

PHP engine version

PHP 8.0

Web server

Nginx

Database engine version

PostgreSQL

Is this bug present after an update or on a fresh install?

Fresh Nextcloud Server install

Are you using the Nextcloud Server Encryption module?

Encryption is Disabled

What user-backends are you using?

[ ] Default user-backend (database)
[X] LDAP/ Active Directory
[ ] SSO - SAML
[ ] Other

Configuration report

{
    "system": {
        "htaccess.RewriteBase": "\/",
        "memcache.local": "\\OC\\Memcache\\APCu",
        "apps_paths": [
            {
                "path": "\/var\/www\/html\/apps",
                "url": "\/apps",
                "writable": false
            },
            {
                "path": "\/var\/www\/html\/custom_apps",
                "url": "\/custom_apps",
                "writable": true
            }
        ],
        "log_type": "errorlog",
        "loglevel": 1,
        "logfile": "",
        "log_type_audit": "errorlog",
        "maintenance_window_start": 100,
        "overwriteprotocol": "https",
        "trusted_proxies": "***REMOVED SENSITIVE VALUE***",
        "objectstore": {
            "class": "\\OC\\Files\\ObjectStore\\S3",
            "arguments": {
                "bucket": "nextcloud-int-sbmt-infra",
                "autocreate": false,
                "key": "***REMOVED SENSITIVE VALUE***",
                "secret": "***REMOVED SENSITIVE VALUE***",
                "hostname": "storage.yandexcloud.net",
                "region": "ru-central1",
                "use_ssl": true,
                "use_path_style": true
            }
        },
        "mail_smtpmode": "smtp",
        "mail_smtphost": "***REMOVED SENSITIVE VALUE***",
        "mail_smtpport": "465",
        "mail_smtpsecure": "ssl",
        "mail_smtpauth": false,
        "mail_smtpauthtype": "LOGIN",
        "mail_smtpname": "***REMOVED SENSITIVE VALUE***",
        "mail_smtppassword": "***REMOVED SENSITIVE VALUE***",
        "mail_from_address": "***REMOVED SENSITIVE VALUE***",
        "mail_domain": "***REMOVED SENSITIVE VALUE***",
        "passwordsalt": "***REMOVED SENSITIVE VALUE***",
        "secret": "***REMOVED SENSITIVE VALUE***",
        "trusted_domains": [
            "localhost",
            "nextcloud.sbmt.io",
            "nextcloud"
        ],
        "datadirectory": "***REMOVED SENSITIVE VALUE***",
        "dbtype": "pgsql",
        "version": "24.0.1.1",
        "overwrite.cli.url": "https:\/\/localhost",
        "dbname": "***REMOVED SENSITIVE VALUE***",
        "dbhost": "***REMOVED SENSITIVE VALUE***",
        "dbport": "",
        "dbtableprefix": "oc_",
        "dbuser": "***REMOVED SENSITIVE VALUE***",
        "dbpassword": "***REMOVED SENSITIVE VALUE***",
        "installed": true,
        "instanceid": "***REMOVED SENSITIVE VALUE***",
        "memcache.distributed": "\\OC\\Memcache\\Redis",
        "memcache.locking": "\\OC\\Memcache\\Redis",
        "redis": {
            "host": "***REMOVED SENSITIVE VALUE***",
            "port": "6379",
            "password": "***REMOVED SENSITIVE VALUE***"
        },
        "ldapProviderFactory": "OCA\\User_LDAP\\LDAPProviderFactory",
        "app_install_overwrite": [
            "files_external_gdrive",
            "drawio"
        ],
        "maintenance": false,
        "filelocking.debug": {
            "true": "",
            "1": "",
            "false": ""
        }
    }
}

List of activated Apps

Enabled:
  - activity: 2.16.0
  - bruteforcesettings: 2.4.0
  - circles: 24.0.0
  - cloud_federation_api: 1.7.0
  - comments: 1.14.0
  - contacts: 4.1.1
  - contactsinteraction: 1.5.0
  - dashboard: 7.4.0
  - dav: 1.22.0
  - drawio: 1.0.3
  - federatedfilesharing: 1.14.0
  - federation: 1.14.0
  - files: 1.19.0
  - files_external: 1.16.1
  - files_pdfviewer: 2.5.0
  - files_rightclick: 1.3.0
  - files_sharing: 1.16.2
  - files_trashbin: 1.14.0
  - files_versions: 1.17.0
  - files_videoplayer: 1.13.0
  - forms: 2.5.1
  - groupfolders: 12.0.1
  - integration_dropbox: 1.0.4
  - integration_google: 1.0.6
  - lookup_server_connector: 1.12.0
  - nextcloud_announcements: 1.13.0
  - notifications: 2.12.0
  - oauth2: 1.12.0
  - officeonline: 1.1.3
  - password_policy: 1.14.0
  - photos: 1.6.0
  - privacy: 1.8.0
  - provisioning_api: 1.14.0
  - serverinfo: 1.14.0
  - settings: 1.6.0
  - sharebymail: 1.14.0
  - systemtags: 1.14.0
  - text: 3.5.1
  - theming: 1.15.0
  - theming_customcss: 1.11.0
  - twofactor_backupcodes: 1.13.0
  - updatenotification: 1.14.0
  - user_ldap: 1.14.1
  - user_status: 1.4.0
  - viewer: 1.8.0
  - workflowengine: 2.6.0
Disabled:
  - accessibility: 1.8.0
  - admin_audit
  - encryption
  - firstrunwizard: 2.11.0
  - logreader: 2.7.0
  - recommendations: 1.2.0
  - support: 1.6.0
  - survey_client: 1.10.0
  - weather_status: 1.2.0

Nextcloud Signing status

No response

Nextcloud Logs

No response

Additional info

No response

Doriangaensslen commented 2 years ago

I think we have the same issue on multiple NC Versions(NC 22.2.3) the latest being NC 24.0.2.

Setup:

Nextcloud on K8s
Redis (6.2.7) as filelocking provider deployed via helm
Chunksize of 240Mb

It is possible to see, that the files are correctly stored on storage, before this happens.

From the nextcloud.log in debug mode and with 'filelocking.debug' => true. We see this at the end of the upload wit the 423 you mentioned in the Browser:

{ "reqId": "uDOo0WwTwNoFWaKL7lpH", "level": 1, "time": "2022-06-28T12:17:56+00:00", "remoteAddr": "111111.11111.11111.1111", "user": "john.doe@google.com", "app": "no app in context", "method": "MOVE", "url": "/remote.php/dav/uploads/john.doe@google.com/web-file-upload-d86cd6c9f67b3659df8b97bce3fdc5a4-1656417981137/.file", "message": "\"files/6c1a102c3da4d854a16bff1982a4b470\"(\"home::john.doe@google.com::files/john.doe@google.com/files/3/5GB.file.upload.part\") is locked, existing lock on file: exclusive", "userAgent": "Mozilla/5.0 (X11; Linux x86_64; rv:101.0) Gecko/20100101 Firefox/101.0", "version": "24.0.2.1", "exception": { "Exception": "OCP\Lock\LockedException", "Message": "\"files/6c1a102c3da4d854a16bff1982a4b470\"(\"home::john.doe@google.com::files/john.doe@google.com/files/3/5GB.file.upload.part\") is locked, existing lock on file: exclusive", "Code": 0, "Trace": [ { "file": "/var/www/html/lib/private/Files/Storage/Common.php", "line": 758, "function": "acquireLock", "class": "OC\Lock\MemcacheLockingProvider", "type": "->", "args": [ "files/6c1a102c3da4d854a16bff1982a4b470", 2, "home::john.doe@google.com::files/john.doe@google.com/files/3/5GB.file.upload.part" ] }, { "file": "/var/www/html/lib/private/Files/Storage/Wrapper/Wrapper.php", "line": 610, "function": "acquireLock", "class": "OC\Files\Storage\Common", "type": "->", "args": [ "files/john.doe@google.com/files/3/5GB.file.upload.part", 2, { "class": "OC\Lock\MemcacheLockingProvider" } ] }, { "file": "/var/www/html/lib/private/Files/Storage/Wrapper/Wrapper.php", "line": 610, "function": "acquireLock", "class": "OC\Files\Storage\Wrapper\Wrapper", "type": "->", "args": [ "files/john.doe@google.com/files/3/5GB.file.upload.part", 2, { "class": "OC\Lock\MemcacheLockingProvider" } ] }, { "file": "/var/www/html/lib/private/Files/View.php", "line": 1961, "function": "acquireLock", "class": "OC\Files\Storage\Wrapper\Wrapper", "type": "->", "args": [ "files/john.doe@google.com/files/3/5GB.file.upload.part", 2, { "class": "OC\Lock\MemcacheLockingProvider" } ] }, { "file": "/var/www/html/lib/private/Files/View.php", "line": 2076, "function": "lockPath", "class": "OC\Files\View", "type": "->", "args": [ "/john.doe@google.com/files/3/5GB.file.upload.part", 2, false ] }, { "file": "/var/www/html/apps/dav/lib/Connector/Sabre/Directory.php", "line": 162, "function": "lockFile", "class": "OC\Files\View", "type": "->", "args": [ "/john.doe@google.com/files/3/5GB.file.upload.part", 2 ] }, { "file": "/var/www/html/3rdparty/sabre/dav/lib/DAV/Tree.php", "line": 307, "function": "createFile", "class": "OCA\DAV\Connector\Sabre\Directory", "type": "->", "args": [ "5GB.file", null ] }, { "file": "/var/www/html/3rdparty/sabre/dav/lib/DAV/Tree.php", "line": 133, "function": "copyNode", "class": "Sabre\DAV\Tree", "type": "->", "args": [ { "class": "OCA\DAV\Upload\FutureFile" }, { "class": "OCA\DAV\Connector\Sabre\Directory" }, "5GB.file" ] }, { "file": "/var/www/html/3rdparty/sabre/dav/lib/DAV/Tree.php", "line": 163, "function": "copy", "class": "Sabre\DAV\Tree", "type": "->", "args": [ "uploads/john.doe@google.com/web-file-upload-d86cd6c9f67b3659df8b97bce3fdc5a4-1656417981137/.file", "files/john.doe@google.com/3/5GB.file" ] }, { "file": "/var/www/html/apps/dav/lib/Upload/ChunkingPlugin.php", "line": 94, "function": "move", "class": "Sabre\DAV\Tree", "type": "->", "args": [ "uploads/john.doe@google.com/web-file-upload-d86cd6c9f67b3659df8b97bce3fdc5a4-1656417981137/.file", "files/john.doe@google.com/3/5GB.file" ] }, { "file": "/var/www/html/apps/dav/lib/Upload/ChunkingPlugin.php", "line": 76, "function": "performMove", "class": "OCA\DAV\Upload\ChunkingPlugin", "type": "->", "args": [ "uploads/john.doe@google.com/web-file-upload-d86cd6c9f67b3659df8b97bce3fdc5a4-1656417981137/.file", "files/john.doe@google.com/3/5GB.file" ] }, { "file": "/var/www/html/3rdparty/sabre/event/lib/WildcardEmitterTrait.php", "line": 89, "function": "beforeMove", "class": "OCA\DAV\Upload\ChunkingPlugin", "type": "->", "args": [ "uploads/john.doe@google.com/web-file-upload-d86cd6c9f67b3659df8b97bce3fdc5a4-1656417981137/.file", "files/john.doe@google.com/3/5GB.file" ] }, { "file": "/var/www/html/3rdparty/sabre/dav/lib/DAV/CorePlugin.php", "line": 603, "function": "emit", "class": "Sabre\DAV\Server", "type": "->", "args": [ "beforeMove", [ "uploads/john.doe@google.com/web-file-upload-d86cd6c9f67b3659df8b97bce3fdc5a4-1656417981137/.file", "files/john.doe@google.com/3/5GB.file" ] ] }, { "file": "/var/www/html/3rdparty/sabre/event/lib/WildcardEmitterTrait.php", "line": 89, "function": "httpMove", "class": "Sabre\DAV\CorePlugin", "type": "->", "args": [ { "class": "Sabre\HTTP\Request" }, { "class": "Sabre\HTTP\Response" } ] }, { "file": "/var/www/html/3rdparty/sabre/dav/lib/DAV/Server.php", "line": 472, "function": "emit", "class": "Sabre\DAV\Server", "type": "->", "args": [ "method:MOVE", [ { "class": "Sabre\HTTP\Request" }, { "class": "Sabre\HTTP\Response" } ] ] }, { "file": "/var/www/html/3rdparty/sabre/dav/lib/DAV/Server.php", "line": 253, "function": "invokeMethod", "class": "Sabre\DAV\Server", "type": "->", "args": [ { "class": "Sabre\HTTP\Request" }, { "class": "Sabre\HTTP\Response" } ] }, { "file": "/var/www/html/3rdparty/sabre/dav/lib/DAV/Server.php", "line": 321, "function": "start", "class": "Sabre\DAV\Server", "type": "->", "args": [] }, { "file": "/var/www/html/apps/dav/lib/Server.php", "line": 352, "function": "exec", "class": "Sabre\DAV\Server", "type": "->", "args": [] }, { "file": "/var/www/html/apps/dav/appinfo/v2/remote.php", "line": 35, "function": "exec", "class": "OCA\DAV\Server", "type": "->", "args": [] }, { "file": "/var/www/html/remote.php", "line": 166, "args": [ "/var/www/html/apps/dav/appinfo/v2/remote.php" ], "function": "require_once" } ], "File": "/var/www/html/lib/private/Lock/MemcacheLockingProvider.php", "Line": 85, "CustomMessage": "--" } }

Other threads can be found with only one workaround, which is: Disable File locking or upping the Filelock ttl (did not work for us). Redis on Debug says nothing at all.

Some other Threads I've found: nextcloud/server#23615 nextcloud/server#9001

Update:

  'part_file_in_storage' => true,
  'filesystem_check_changes' => 1,
  'filelocking.ttl' => 43200,

didn't change anything...

k-popov commented 2 years ago

I've also noticed that desktop client (Linux version) and nextcloudcmd (nextcloud-desktop-cmd Ubuntu package name) are somehow able to sync these large files. Aren't these clients using WebDAV? Are they using it differently?

Doriangaensslen commented 1 year ago

Still exists on 24.0.6

Doriangaensslen commented 1 year ago

Turns out, that the MOVE Command send by the Browser fails every time exactly after 60 Seconds (seen by developer Tools in the Browser).

Our config was wrong and did not use redis at all. Now that redis is used for the filelocking, i see only entrys after 60 seconds after the Upload. The problem still persists. When I refresh the browser window, the most cases the files shows right away.

I tried:

Deactivating antivirus
Scale to 1 Pod
Tested performance on NFS (we use that as storage backend)
using \OC\Memcache\Redis</memcache.local> instead of apcu

... no luck so far

Doriangaensslen commented 1 year ago

The issue with the MOVE Command that fails after 60 Seconds has to be the Ingress (in our case nginx). We added the following annotations to the ingress and checked with the nginx pod in nginx.conf if all the options are applied.

    nginx.ingress.kubernetes.io/cors-allow-headers: X-Forwarded-For
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-buffer-size: 225m
    nginx.ingress.kubernetes.io/proxy-buffering: "on"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: 60s
    nginx.ingress.kubernetes.io/proxy-read-timeout: "1800"
    nginx.ingress.kubernetes.io/proxy-request-buffering: "on"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "1800"

Which seem to improve the situation and notably the speed of the uploads. But with the caveat that we could not upload files bigger than 10Gbs so far with our config (so surely this config is still not optimal).

However, with the MOVE command from the browser, the Pod is busy and using all its resources for that and Apache is failing to serve the status.php, which resulted in our case in a killed Pod (failing Readiness Probes). As a workaround, we upped the timeout time in the probes. It's unclear why the MOVE takes so many resources, that apache/php construct on the nextcloud container can't handle even an easy curl on status.php. It seems like that one locks out all other processing of PHP code in this time.

ondrejlohnisky commented 1 year ago

sudo redis-cli flushall

szaimen commented 1 year ago

Hi, please update to 24.0.9 or better 25.0.3 and report back if it fixes the issue. Thank you!

My goal is to add a label like e.g. 25-feedback to this ticket of an up-to-date major Nextcloud version where the bug could be reproduced. However this is not going to work without your help. So thanks for all your effort!

If you don't manage to reproduce the issue in time and the issue gets closed but you can reproduce the issue afterwards, feel free to create a new bug report with up-to-date information by following this link: https://github.com/nextcloud/server/issues/new?assignees=&labels=bug%2C0.+Needs+triage&template=BUG_REPORT.yml&title=%5BBug%5D%3A+

jorgegv commented 1 year ago

Hi, my installation was exactly as the above one, with NC 24.0.5, I was experimenting the problem and I stumbled upon this issue.

On 24.0.5 I tried to upload a 2.2GB file, and got the locking error mentioned above. The file was correctly uploaded though (checked: I downloaded it again and checked it matched the original). I upped the locking TTL to 12h (as suggested above), but still the locking error after uploading persists.

As suggested I upgraded my NC to 25.0.3 (Helm chart version 3.5.0), but the problem is still the same: upload a 2.2 GB file, lock error but file successfully uploaded.

szaimen commented 1 year ago

As suggested I upgraded my NC to 25.0.3 (Helm chart version 3.5.0), but the problem is still the same: upload a 2.2 GB file, lock error but file successfully uploaded.

Sounds like you are running into some kind of timeout. Can you make sure that all of the mentioned values in https://docs.nextcloud.com/server/latest/admin_manual/configuration_files/big_file_upload_configuration.html are correctly applied?

jorgegv commented 1 year ago

These are my current values:

php_value upload_max_filesize 5G
php_value post_max_size 5G
php_value max_input_time 3600
php_value max_execution_time 3600

And also:

'filelocking.enabled' => true,
'filelocking.ttl' => 43200

These values seem enough for a 2.2GB upload, right?

It does not seem a timeout to me... the upload finishes successfully (but shows the locking error), and the file appears in the web interface afterwards. Indeed, if I download the file, the content matches the original byte for byte.

szaimen commented 1 year ago

Did you also adjust web server or proxy timeouts?

jorgegv commented 1 year ago

My proxy is a nginx Ingress in Kubernetes, but the timeouts that can be configured are for establishing connections, not for limiting the duration of those connections.

Again, the upload finishes successfully (HTTP wise) so there seems not to be any timeout or maximum data size problems in the data path...?

Doriangaensslen commented 1 year ago

What we discovered on k8s with almost unlimited timeoutsettings:

File uploads normaly and finishes also uploading
browser sends MOVE
on the running pods you see in the data directory that the chunks are stiched together
when the operation is finished, the browser receives the error and prints the message
press F5 and the file is there
Different deployment methods (eg.docker) have this problem too

jorgegv commented 1 year ago

This is exactly my failure scenario.

jorgegv commented 1 year ago

Some more tests on this front: I installed a temporary VM with:

Rocky Linux 8 latest (RHEL clone)
PHP 8.1.16 running in FPM mode (this is the same PHP as the one in the Docker image which is used for deployment in K8S)
Apache
External MariaDB server
Vanilla Nextcloud 25.0.3 minimally installed via Webinstaller, with no optimizations (i.e. no opcache, no memcache, stock DB-based locking)
(No docker, no kubernetes, just plain old "hosted" installation)

With this setup, the same upload as indicated above (2.2GB) works flawlessly, and shows no lock errors after uploading.

After doing this, I modified the installation above:

Added and configured opcache
Added and configured local redis server
Configured Nextcloud memcache using Redis
Configured Nextcloud locking using Redis
Fixed all security and performance recommendations in the Admin dashboard (now it shows a green tick)

I retried the same 2.2GB upload, and again it worked flawlessly, with no locking errors shown after upload.

jorgegv commented 1 year ago

And another new test: another temporary VM with:

Rocky Linux 8 latest
Docker
Nextcloud installed using AIO image

With this setup, the same upload as indicated above (2.2GB) works flawlessly, and shows no lock errors after uploading.

So:

Webinstaller/ZIP install works OK
AIO install works OK

Clues so far point at some specific problem with the Helm/Kubernetes deployment.

One of my team mates suggested it may have something to do with Redis being deployed in cluster mode in Kubernetes? There are several Pods for it, while in a VM or AIO install only a single Redis instance is deployed... I'll test and report back.

jorgegv commented 1 year ago

New test: deployed in our K8S staging cluster adding redis configuration architecture: standalone to the Helm chart config. I verified that now only one redis master Pod is deployed instead of the 4 previous ones (1 master and 3 slaves).

The error still happens with single node Redis deployment.

szaimen commented 1 year ago

Rocky Linux 8 latest (RHEL clone)

PHP 8.1.16 running in FPM mode (this is the same PHP as the one in the Docker image which is used for deployment in K8S)

Apache

External MariaDB server

Vanilla Nextcloud 25.0.3 minimally installed via Webinstaller, with no optimizations (i.e. no opcache, no memcache, stock DB-based locking)

(No docker, no kubernetes, just plain old "hosted" installation)

With this setup, the same upload as indicated above (2.2GB) works flawlessly, and shows no lock errors after uploading.

After doing this, I modified the installation above:

Added and configured opcache

Added and configured local redis server

Configured Nextcloud memcache using Redis

Configured Nextcloud locking using Redis

Fixed all security and performance recommendations in the Admin dashboard (now it shows a green tick)

I retried the same 2.2GB upload, and again it worked flawlessly, with no locking errors shown after upload.

So basically it looks like the docker image or helm chart might have an issue, right?

jorgegv commented 1 year ago

So basically it looks like the docker image or helm chart might have an issue, right?

Well, according to my tests (see previous comments), the Docker image is fine, it installs and runs with no errors on a regular instance (Rocky VM). Error seems to be specific to K8S deployment.

szaimen commented 1 year ago

Error seems to be specific to K8S deployment.

All right. Shall I move over to the helm chart repo then?

jorgegv commented 1 year ago

All right. Shall I move over to the helm chart repo then?

That's my current line of investigation, yes.

BTW, @Doriangaensslen said that the problem happened also with the DOcker AIO image, but I did not reproduce it there.

szaimen commented 1 year ago

Ah, you tried the AIO image instead of the "normal" image? That could still lead to the docker image being the culprit here as AIO is using its own image. Moving to the docker repo then for now.

jorgegv commented 1 year ago

I did not know there were 2 docker images, now I know, thanks :-)

jorgegv commented 1 year ago

New test: deployed Nextcloud Helm chart with Redis disabled, and retried the 2.2GB upload.

An error about locking appears, though different from the previous one, and more terse: ""TEST20_3.bin" is locked"

As with the previous errors, if you refresh, the file has been correctly uploaded and appears in the web UI.

So it seems we can discard Redis as the source of the problem, it is somewhere else.

Doriangaensslen commented 1 year ago

It's also what we experienced. We also tried with the other cashing methods, no luck there.

Doriangaensslen commented 1 year ago

We noticed restarts, because the readiness probe failed when the file was stiched toghether, Because the pod was busy processing it. But still after resolving that, we still have this problem.

Y3Z0N commented 1 year ago

i have the same issue "file locked: exclusive" using iONOS as host.

pjft commented 1 year ago

I have intermittent issues on a "normal" (non-docker, non-k8s) setup, also using Redis. Didn't occur in 22.x, but since I installed 25 I get that every now and then, without meaningful ways to replicate it. It happens often when doing the auto-backup of photos via the iOS app - which is interesting as the files do not exist in the first place.

I am tempted to disable file-locking, as this is almost a single-user instance (well, the other user is a family member, and it's really only for auto-backups of photos). Still, it'd be better if this would help detecting any sort of issue here.

Happy to help, as always.

brainchild0 commented 1 year ago

I am experiencing essentially the same problem, running Nextcloud 27.0.1 directly on an Apache instance, without any containers.

I have kept the upload and post sizes at their low defaults of 50M, due to resource limitations on the server. Files are stored in S3.

Appearing in the log file...

OCP\Lock\LockedException: "<pathname>" is locked, existing lock on file: 3 shared locks

Similarly, shown as a message popup on the browser page...

"\<pathname>" is locked, existing lock on file: 3 shared locks

mueller-tobias commented 1 year ago

Hi there, we've the same problem with 27.0.1-apache image. We're using a K8S Cluster hosted at IONOS. We use an nginx-ingress controller in front of the pod with the following annotations:

    nginx.ingress.kubernetes.io/cors-allow-headers: X-Forwarded-For
    nginx.ingress.kubernetes.io/enable-cors: 'true'
    nginx.ingress.kubernetes.io/proxy-body-size: 0m
    nginx.ingress.kubernetes.io/proxy-buffer-size: 225m
    nginx.ingress.kubernetes.io/proxy-buffering: 'on'
    nginx.ingress.kubernetes.io/proxy-connect-timeout: 60s
    nginx.ingress.kubernetes.io/proxy-read-timeout: 900s
    nginx.ingress.kubernetes.io/proxy-request-buffering: 'on'
    nginx.ingress.kubernetes.io/proxy-send-timeout: '1800'

joshtrichards commented 11 months ago

I think this is an issue in the Files client embedded in server though I admit only can come up with speculation why this is mostly reported with k8s+docker image.

It appears the reason that the client reported in this thread as being okay (desktop) works is because it does treat 423 as temporary and simply retries:

https://github.com/nextcloud/desktop/blob/3b92054f0341040c6108945dfa559d54d8c557eb/src/libsync/owncloudpropagator_p.h#L102-L109

The ones that fail (web, iOS, Android) don't seem to from a cursory look, but I haven't dug deeply enough say with certainty yet:

web: https://github.com/nextcloud/server/blob/0fe7064fc4fbf15272031f852ffbdb6ac08bc6ef/apps/files/js/file-upload.js#L310
iOS?
Android?

Open questions:

Why is this occurring more often on the micro- services Docker image on k8s?
- Best guess, since the files seem to be getting actually uploaded successfully, is that the client is asking before the file chunks are reassembled and the locks released... maybe due to network filesystem / object storage latency?
The real question is: If the web Files client is modified to retry just like the desktop client, is the underlying root cause fixed or is it just pushing past the matter?

Assuming treating 423 as a soft failure is at least one part of the correct answer, changes/confirmation should be coordinated across all clients (other than Desktop).

pjft commented 11 months ago

I am under the impression that iOS will retry up to a point (or at least used to), but recently I found a set of files that were supposed to have been automatically backed up but weren't and I had to force a full photo library upload. Might not be related at all, but since I still see 423 blocking things every now and then I thought I'd chime in.

jorgegv commented 11 months ago

You may be on the good track wrt storage latency: the (most) problematic installations seem to be on docker, kubernetes and using S3 storage. All of those have higher (or much higher!) latency than local disks.

I hope you can get this sorted out soon. Good luck :-)

McGeaverBeaver commented 11 months ago

I am using lscr.io/linuxserver/nextcloud:latest , I have this error for the first time using the iOS client.

HammyHavoc commented 11 months ago

I am using lscr.io/linuxserver/nextcloud:latest , I have this error for the first time using the iOS client.

Seeing this problem with the LinuxServer Docker image too. Not sure if that's just coincidence though. Seeing it on the Windows desktop client too.

Orhideous commented 11 months ago

This is a bug, not a question or a configuration/webserver/proxy issue.

Wrong. This is a proxy, webserver and (most likely) configuration issue, albeit a subtle one. Manually removing file locking will result in messed up data.

tl;dr — increase web server timeouts and/or get rid of slow storage, at least for caching.

There is a lot to unpack.

There's a Chunked file upload v2: it's a way for NC to upload large files. It boils down to a simple concept: split big_chungus_16g.bin into chunks, upload them to any location, then reassemble them into a temporary file on the server side, and then move them to the desired location. Despite the complexity of the process, the above problem doesn't often manifest itself in «simple» installations; by «simple» I mean a one server, fast enough storage, no bells and whistles.

By default, NC uses a reasonable chunk size (~10M). Even in the worst case, with buffering enabled in the web server (e.g. nginx) AND buffering for FastCGI, a chunk will first end up in PHP's temporary location. Then that file moves to a write-only cache for uploads (.../UserName/uploads/1234567/00000, .../UserName/uploads/1234567/00001, and so on). — as it should be.

Once all the pieces are in place, the client issues the MOVE verb, signalling the server to 1) reassemble all the pieces from the staging dir for uploads, say .../UserName/uploads/1234567/*, into a single intermediate file by sequentially copying the contents of the chunks, and 2) move that intermediate file to the final destination. 3) return a successful response to the client.

So where and why can this process be broken?

First and foremost, all that server-side byte shuffling after the MOVE verb happens in the lifetime of a single request. If anything happens — the upload is screwed. If I/O takes too long (longer than the timeout the web server expects from the FastCGI socket) — upload is screwed. By «screwed» I mean inconsistent state, wasted space — such a PITA for large files, and... Enter, lock. NC is pretty paranoid about the atomicity of I/O operations, all those intermediate files, chunks and so on are locked by default, and that makes sense: if something messes with intermediate files during upload, then, as you might guess, it gets screwed up in unpredictable ways. So if for some reason reassembling the file took too long, nginx or some other web server will tell the client «504 Gateway Timeout, lol». There will be no one to release the lock on the intermediate file, it may or may not be moved to the target — upload is screwed. But there's more. Clients and retry logic. Clients are pretty ~~stupid~~ naїve, and will happily retry uploads even if there are server-side inconsistencies. Again and again, they will stumble upon a held lock until it expires or someone manually deletes it.

I reiterate: overall, this seemingly fragile process works as expected, and locking works as expected. Modern storage can easily deliver 300MB/s sequential I/O, so assembling, say, a 4G file will fit nicely even in default humble 30s timeout. If the NC data folder lives on a medium with scarce IOPS (NFS, Longhorn, whatever), then even trivial operations that normally happen in the boundaries of a disc will become network calls, so moving chunks back and forth will turn uploads of anything larger than 1G into PITA.

To fix this problem and make efficient use of all available bandwidth, one needs to

Tweak runtime: get rid of PHP upload limits, increase input/execution time to 1 hour (overkill, but still), AND use a fast location for upload_tmp_dir — NVMe or even tmpfs if you can afford it for >1Gbps upload speed.
Tweak NC itself: use big chunks up to 4GB instead of 10MB, or disable chunking altogether at your discretion;
Tweak webserver timeouts: reassembling a 16G file can take a few minutes on a consumer grade HDD, so set fastcgi_send_timeout/fastcgi_read_timeout in nginx to a reasonable value, say 600s. And don't forget to disable buffering at all.
In case of NFS baked data directory, there is a dirty hack: replacing upload/ directory with symlink to fast local storage ln -s /scratch uploads in conjunction with localstorage.allowsymlinks will save precious bandwidth and IOPS.

With all this tweaks NC can easily saturate a 1Gbps link, even on modest hardware. It just needs to be configured properly. Unfortunately, the default configuration isn't the best.

P.S. The mobile client defaults to small chunks with no way to configure them. That's why video uploads on 300Mbps LTE are painfully slow.

HammyHavoc commented 11 months ago

This is a bug, not a question or a configuration/webserver/proxy issue.

Wrong. This is a proxy, webserver and (most likely) configuration issue, albeit a subtle one. Manually removing file locking will result in messed up data.

tl;dr — increase web server timeouts and/or get rid of slow storage, at least for caching.

There is a lot to unpack.

There's a Chunked file upload v2: it's a way for NC to upload large files. It boils down to a simple concept: split big_chungus_16g.bin into chunks, upload them to any location, then reassemble them into a temporary file on the server side, and then move them to the desired location. Despite the complexity of the process, the above problem doesn't often manifest itself in «simple» installations; by «simple» I mean a one server, fast enough storage, no bells and whistles.

By default, NC uses a reasonable chunk size (~10M). Even in the worst case, with buffering enabled in the web server (e.g. nginx) AND buffering for FastCGI, a chunk will first end up in PHP's temporary location. Then that file moves to a write-only cache for uploads (.../UserName/uploads/1234567/00000, .../UserName/uploads/1234567/00001, and so on). — as it should be.

Once all the pieces are in place, the client issues the MOVE verb, signalling the server to

reassemble all the pieces from the staging dir for uploads, say .../UserName/uploads/1234567/*, into a single intermediate file by sequentially copying the contents of the chunks, and

move that intermediate file to the final destination.

return a successful response to the client.

So where and why can this process be broken?

First and foremost, all that server-side byte shuffling after the MOVE verb happens in the lifetime of a single request. If anything happens — the upload is screwed. If I/O takes too long (longer than the timeout the web server expects from the FastCGI socket) — upload is screwed. By «screwed» I mean inconsistent state, wasted space — such a PITA for large files, and... Enter, lock. NC is pretty paranoid about the atomicity of I/O operations, all those intermediate files, chunks and so on are locked by default, and that makes sense: if something messes with intermediate files during upload, then, as you might guess, it gets screwed up in unpredictable ways. So if for some reason reassembling the file took too long, nginx or some other web server will tell the client «504 Gateway Timeout, lol». There will be no one to release the lock on the intermediate file, it may or may not be moved to the target — upload is screwed. But there's more. Clients and retry logic. Clients are pretty ~stupid~ naїve, and will happily retry uploads even if there are server-side inconsistencies. Again and again, they will stumble upon a held lock until it expires or someone manually deletes it.

I reiterate: overall, this seemingly fragile process works as expected, and locking works as expected. Modern storage can easily deliver 300MB/s sequential I/O, so assembling, say, a 4G file will fit nicely even in default humble 30s timeout. If the NC data folder lives on a medium with scarce IOPS (NFS, Longhorn, whatever), then even trivial operations that normally happen in the boundaries of a disc will become network calls, so moving chunks back and forth will turn uploads of anything larger than 1G into PITA.

To fix this problem and make efficient use of all available bandwidth, one needs to

Tweak runtime: get rid of PHP upload limits, increase input/execution time to 1 hour (overkill, but still), AND use a fast location for upload_tmp_dir — NVMe or even tmpfs if you can afford it for >1Gbps upload speed.

Tweak NC itself: use big chunks up to 4GB instead of 10MB, or disable chunking altogether at your discretion;

Tweak webserver timeouts: reassembling a 16G file can take a few minutes on a consumer grade HDD, so set fastcgi_send_timeout/fastcgi_read_timeout in nginx to a reasonable value, say 600s. And don't forget to disable buffering at all.

In case of NFS baked data directory, there is a dirty hack: replacing upload/ directory with symlink to fast local storage ln -s /scratch uploads in conjunction with localstorage.allowsymlinks will save precious bandwidth and IOPS.

With all this tweaks NC can easily saturate a 1Gbps link, even on modest hardware. It just needs to be configured properly. Unfortunately, the default configuration isn't the best.

P.S. The mobile client defaults to small chunks with no way to configure them. That's why video uploads on 300Mbps LTE are painfully slow.

Thanks for the thoughtful and detailed response, really appreciative of your time and effort!

Just wanted to say that I saw it on a single very small file (a few hundred kilobyte sfx for a game I'm working on) out of a folder of much bigger several GB files. I copy-pasted the file elsewhere then simply deleted the file then added it again and it worked first-time, so not convinced that it's an IOPS problem, think it's a bug of some description. I've been running Nextcloud a few years now and never encountered this issue, even uploading much larger files (like backups/snapshots!).

So whilst I agree with a lot of what you're saying, I don't think it's necessarily applicable to every instance of people seeing file locking. Nothing in the log either that shines any light on it.

chainria commented 9 months ago

Thank you for all the helpful hints. However... I come bearing bad news.

I have NFS-backed storage, yes, it's not the fastest. However: I mounted the uploads/ directory on it's own 6 GB sized tmpfs. It gets used fine, but this doesn't change a thing at all. For me, this means Nextcloud is fundamentally broken for uploading big files via mobile (e.g. Videos) In this example, I managed to clog up my instance with a few Christmas and New Years' videos, 1-2GB in size. For the short term, I'm not syncing movies any more, for the long term I will migrate back to local storage. This however, is a very, very sore wound. And it's not the only one, but those are other bug reports entirely. Just my two cents.

No timeouts in Log, only 405 and 423 (Locked), btw.

pschichtel commented 8 months ago

I have this issue on a fixed set of rather small files (music files, so 20-100MB), while I had no issues uploading several multi-gigabyte files. I'm running the 28.0.2-apache image in k8s behind ingress-nginx (deployed with nextcloud/helm) with user files on ceph-rgw via S3. I've significantly increased the timeouts on the ingress, but that did not change anything.

pschichtel commented 8 months ago

also: when I copy one of the affected files, the copy is uploaded without issues.

pschichtel commented 8 months ago

Deleting the /chunking-v2 keys in redis gets rid of this shared lock issue (which seems odd), but the desktop client still tries to upload all affected files (~200). During that sync new /chunking-v2 keys show up which will lead to failed uploads due to a shared lock once again, however not as many.

henrythasler commented 6 months ago

I am experiencing the same issue ("file.bin" is locked, existing lock on file: 3 shared locks) on a Ionos Managed Nextcloud instance (Nextcloud 27.1.5 Enterprise).

I did some tests and I can reliably reproduce this with any file that is larger than exactly 10MB using the web-ui or Linux Client. Up to 10MB it's quite fast but larger files upload very slow, show http 500 responses and eventually lead to the lock error. I also observe that my instance is often very slow even to list the few files that I have stored.

I therefore conclude that @Orhideous's summary is correct in my case. I will get in touch with Ionos support.

DaDirnbocher commented 6 months ago

. I will get in touch with Ionos support.

@henrythasler Any solution from ionos? i think i have the same issue but only with the windows client, linux works fine.

henrythasler commented 6 months ago

@DaDirnbocher No, not yet. I talked to a very helpful and technically skilled support guy last week, explained the problem and provided some additional info/logs. I received some feedback yesterday that the "performance of my Nextcloud was increased" and I should try again. Unfortunately not much had changed. UI was a bit less sluggish but the error is still there.

For what it's worth for anyone else, here is a log entry that shows up with the error: lock.json

I should mention that this error only occurs when overwriting an already existing file. Fresh upload always work fine.

henrythasler commented 5 months ago

I'm beginning to suspect that this issue here is somehow connected to https://github.com/nextcloud/server/issues/6029 where people also report very bad server performance when incoming federated shares are not reachable (as is the case on my instance).

pschichtel commented 5 months ago

Btw small update from me: I switch from my ceph S3 to cephfs (everything else is the same) and now all issues are gone. To me it seems like the latency of the S3 backend might be too high for some reason. I don't think nextcloud/server#6029 is related.

henrythasler commented 5 months ago

It seems the nexcloud-client has some influence on the multipart-upload. As a workaround/hack I have set chunkSize=20000000 in nextcloud.cfg and now my 15MB files sync w/o any error.

nextcloud / docker