nextcloud / server

☁️ Nextcloud server, a safe home for all your data
https://nextcloud.com
GNU Affero General Public License v3.0
27.33k stars 4.06k forks source link

Share mounts are killing server disk IO #35044

Closed solracsf closed 1 year ago

solracsf commented 2 years ago

How to use GitHub


Steps to reproduce

I don't really know. Server was having, since a while, a grow in IO operations. Today, we've debugged it. There are thousands of disk lookup operations per second undergoing, related to Talk files (files shared in Talk rooms only, they're not present anywhere else, we know it by their filenames). All these lookup operations are failing. Small extract of not even one entire second:

2022.11.02 20:08:45.912606 [uid:1030,gid:1030,pid:270551] lookup (4494,IMG_1658908017919.jpg): no such file or directory <0.001306>
2022.11.02 20:08:45.913161 [uid:1030,gid:1030,pid:270601] lookup (4494,IMG_1658907913331.jpg): no such file or directory <0.001412>
2022.11.02 20:08:45.913298 [uid:1030,gid:1030,pid:270164] lookup (4494,IMG_03151.jpg): no such file or directory <0.001219>
2022.11.02 20:08:45.913971 [uid:1030,gid:1030,pid:270167] lookup (4494,IMG_1658908141644.jpg): no such file or directory <0.001577>
2022.11.02 20:08:45.914043 [uid:1030,gid:1030,pid:270425] lookup (4494,IMG_0316.jpg): no such file or directory <0.001315>
2022.11.02 20:08:45.914153 [uid:1030,gid:1030,pid:270551] lookup (4494,IMG_1658908017919.jpg): no such file or directory <0.001433>
2022.11.02 20:08:45.914703 [uid:1030,gid:1030,pid:270164] lookup (4494,IMG_03151.jpg): no such file or directory <0.001217>
2022.11.02 20:08:45.914723 [uid:1030,gid:1030,pid:270601] lookup (4494,IMG_1658907913331.jpg): no such file or directory <0.001437>
2022.11.02 20:08:45.915604 [uid:1030,gid:1030,pid:270425] lookup (4494,IMG_0316.jpg): no such file or directory <0.001477>
2022.11.02 20:08:45.915680 [uid:1030,gid:1030,pid:270167] lookup (4494,IMG_1658908141644.jpg): no such file or directory <0.001593>
2022.11.02 20:08:45.915944 [uid:1030,gid:1030,pid:270551] lookup (4494,IMG_1658908017919.jpg): no such file or directory <0.001620>
2022.11.02 20:08:45.916407 [uid:1030,gid:1030,pid:270164] lookup (4494,IMG_03151.jpg): no such file or directory <0.001554>
2022.11.02 20:08:45.916604 [uid:1030,gid:1030,pid:270601] lookup (4494,IMG_1658907905623.jpg): no such file or directory <0.001554>
2022.11.02 20:08:45.917162 [uid:1030,gid:1030,pid:270425] lookup (4494,IMG_0316.jpg): no such file or directory <0.001496>
2022.11.02 20:08:45.917267 [uid:1030,gid:1030,pid:270167] lookup (4494,IMG_1658908141644.jpg): no such file or directory <0.001525>
2022.11.02 20:08:45.917853 [uid:1030,gid:1030,pid:270164] lookup (4494,IMG_03151.jpg): no such file or directory <0.001351>
2022.11.02 20:08:45.918041 [uid:1030,gid:1030,pid:270601] lookup (4494,IMG_1658907905623.jpg): no such file or directory <0.001263>
2022.11.02 20:08:45.918177 [uid:1030,gid:1030,pid:270551] lookup (4494,IMG_1658908013005.jpg): no such file or directory <0.001756>
2022.11.02 20:08:45.918571 [uid:1030,gid:1030,pid:270425] lookup (4494,IMG_0316.jpg): no such file or directory <0.001361>
2022.11.02 20:08:45.918953 [uid:1030,gid:1030,pid:270167] lookup (4494,IMG_1658908141644.jpg): no such file or directory <0.001638>
2022.11.02 20:08:45.919242 [uid:1030,gid:1030,pid:270164] lookup (4494,IMG_03151.jpg): no such file or directory <0.001245>
2022.11.02 20:08:45.919298 [uid:1030,gid:1030,pid:270601] lookup (4494,IMG_1658907905623.jpg): no such file or directory <0.001143>
2022.11.02 20:08:45.919984 [uid:1030,gid:1030,pid:270551] lookup (4494,IMG_1658908013005.jpg): no such file or directory <0.001614>
2022.11.02 20:08:45.920420 [uid:1030,gid:1030,pid:270167] lookup (4494,IMG_1658908141644.jpg): no such file or directory <0.001391>
2022.11.02 20:08:45.920635 [uid:1030,gid:1030,pid:270425] lookup (4494,IMG_0316.jpg): no such file or directory <0.002010>
2022.11.02 20:08:45.921041 [uid:1030,gid:1030,pid:270164] lookup (4494,IMG_03101.jpg): no such file or directory <0.001439>
2022.11.02 20:08:45.921352 [uid:1030,gid:1030,pid:270601] lookup (4494,IMG_1658907905623.jpg): no such file or directory <0.001765>
2022.11.02 20:08:45.921787 [uid:1030,gid:1030,pid:270551] lookup (4494,IMG_1658908013005.jpg): no such file or directory <0.001682>
2022.11.02 20:08:45.922081 [uid:1030,gid:1030,pid:270167] lookup (4494,IMG_1658908439250.jpg): no such file or directory <0.001450>
2022.11.02 20:08:45.922136 [uid:1030,gid:1030,pid:270425] lookup (4494,IMG_03151.jpg): no such file or directory <0.001355>
2022.11.02 20:08:45.922697 [uid:1030,gid:1030,pid:270164] lookup (4494,IMG_03101.jpg): no such file or directory <0.001463>
2022.11.02 20:08:45.922803 [uid:1030,gid:1030,pid:270601] lookup (4494,IMG_1658907905623.jpg): no such file or directory <0.001317>
2022.11.02 20:08:45.923091 [uid:1030,gid:1030,pid:270551] lookup (4494,IMG_1658908013005.jpg): no such file or directory <0.001242>
2022.11.02 20:08:45.923569 [uid:1030,gid:1030,pid:270167] lookup (4494,IMG_1658908439250.jpg): no such file or directory <0.001328>
2022.11.02 20:08:45.923681 [uid:1030,gid:1030,pid:270425] lookup (4494,IMG_03151.jpg): no such file or directory <0.001439>
2022.11.02 20:08:45.924510 [uid:1030,gid:1030,pid:270164] lookup (4494,IMG_03101.jpg): no such file or directory <0.001729>
2022.11.02 20:08:45.924651 [uid:1030,gid:1030,pid:270601] lookup (4494,IMG_1658907894938.jpg): no such file or directory <0.001471>

Confirmed by searching these files on DB, example here (on oc_share table):

image

And on NC interface:

image

If it helps, one simple way we've found to fire these lookup calls (not the only one, as there are millions of these operations per day, 12 users on server, and only one admin can access this page) is browsing to /settings/users page.

You've guessed it; disabling Talk app disables all these calls.

We've run files:scan files:scan-app-data files:cleanup no help here.

Expected behaviour

This should not happen

Talk app

Talk app version: 13.0.9

Custom Signaling server configured: yes

Custom TURN server configured: yes

Custom STUN server configured: yes

Operating system: Ubuntu 22.04

Web server: Apache

Database: MariaDB

PHP version: 8.0

Nextcloud Version: 23.0.10

nickvergessen commented 2 years ago

Can you attach a system report? Especially storage information would be useful.

solracsf commented 2 years ago

Sure, what do you call a system report exactly?

nickvergessen commented 2 years ago

The support app can generate it. Otherwise describe your set up a bit, are you using external store? Which other apps are enabled? Do you have Redis/Memcache? Are you also using it for Locking?

Is there anything in your nextcloud.log file that indicates an error or results in data not being written or something?

solracsf commented 2 years ago

My setup is pretty classic; Nextcloud 23.0.10, no external storage at all. This is a small dedicated server, 4 Core CPU @ 3Ghz, 16Gib RAM, entirely dedicated to Nextcloud. Nextcloud is using a local Redis to both cache and locking, MariaDB as DB, PHP-FPM with Opcache, and Apache 2.4. Vanilla install from docs. 12 users on the instance.

Log is full of https://github.com/nextcloud/server/issues/33919 but nothing else is logged, errors or warnings of any kind.

Just to take an idea, here is the relationship between consumed CPU by these insane lookups.

image

Talk is by far the mostly used app on the server. While these lookups are going on, the Apache webserver logs these requests only:

GET "/ocs/v2.php/apps/spreed/api/v4/room?includeStatus=true"
POST "/ocs/v2.php/apps/spreed/api/v3/signaling/backend"
GET "/ocs/v2.php/apps/spreed/api/v1/chat/3coj2477?setReadMarker=0&lookIntoFuture=1&lastKnownMessageId=40258&includeLastKnown=0"
GET "/ocs/v2.php/apps/notifications/api/v2/notifications"
GET "/ocs/v2.php/apps/notifications/api/v2/notifications"
GET "/ocs/v2.php/apps/notifications/api/v2/notifications"
GET "/ocs/v2.php/apps/spreed/api/v4/room?includeStatus=true"
GET "/ocs/v2.php/apps/spreed/api/v1/chat/39dwbny4?setReadMarker=0&lookIntoFuture=1&lastKnownMessageId=40206&includeLastKnown=0"
POST "/ocs/v2.php/apps/spreed/api/v3/signaling/backend"
POST "/ocs/v2.php/apps/spreed/api/v3/signaling/backend"
GET "/ocs/v2.php/apps/spreed/api/v4/room?includeStatus=true"
GET "/ocs/v2.php/apps/spreed/api/v1/chat/39dwbny4?setReadMarker=0&lookIntoFuture=1&lastKnownMessageId=40206&includeLastKnown=0"
GET "/ocs/v2.php/apps/spreed/api/v4/room?includeStatus=true"
GET "/ocs/v2.php/apps/notifications/api/v2/notifications"
PUT "/apps/user_status/heartbeat"
GET "/ocs/v2.php/apps/notifications/api/v2/notifications"
GET "/ocs/v2.php/apps/notifications/api/v2/notifications"
POST "/ocs/v2.php/apps/spreed/api/v3/signaling/backend"
GET "/ocs/v2.php/apps/spreed/api/v4/room?includeStatus=true"
POST "/ocs/v2.php/apps/spreed/api/v3/signaling/backend"
POST "/ocs/v2.php/apps/spreed/api/v3/signaling/backend"

The PID on the lookups screenshot always corresponds to a php-fpm process.

Report here below.

Server configuration detail

Operating system: Linux 5.15.0-52-generic nextcloud/spreed#58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64

Webserver: Apache

Database: MariaDB 10.6

PHP version: 8.0.25

Modules loaded: Core, date, libxml, openssl, pcre, zlib, filter, hash, json, Reflection, SPL, session, standard, sodium, cgi-fcgi, mysqlnd, PDO, xml, bcmath, calendar, ctype, curl, dom, mbstring, FFI, fileinfo, ftp, gd, gettext, gmp, iconv, igbinary, imagick, intl, ldap, exif, msgpack, mysqli, pdo_mysql, pdo_sqlite, Phar, posix, readline, redis, shmop, SimpleXML, soap, sockets, sqlite3, sysvmsg, sysvsem, sysvshm, tokenizer, xmlreader, xmlwriter, xsl, zip, Zend OPcache

Nextcloud version: 23.0.10 - 23.0.10.1

Updated from an older Nextcloud/ownCloud or fresh install:

Where did you install Nextcloud from: unknown

Signing status Array ( )
List of activated apps ``` Enabled: - accessibility: 1.9.0 - activity: 2.15.0 - apporder: 0.15.0 - calendar: 3.5.2 - cloud_federation_api: 1.6.0 - comments: 1.13.0 - contactsinteraction: 1.4.0 - dashboard: 7.3.0 - dav: 1.21.0 - deck: 1.6.4 - external: 3.10.2 - federatedfilesharing: 1.13.0 - files: 1.18.0 - files_pdfviewer: 2.4.0 - files_rightclick: 1.2.0 - files_sharing: 1.15.0 - files_trashbin: 1.13.0 - files_versions: 1.16.0 - files_videoplayer: 1.12.0 - forms: 2.5.1 - impersonate: 1.10.0 - logreader: 2.8.0 - lookup_server_connector: 1.11.0 - notifications: 2.11.2 - oauth2: 1.11.0 - onlyoffice: 7.4.6 - password_policy: 1.13.0 - provisioning_api: 1.13.0 - settings: 1.5.0 - sharebymail: 1.13.0 - spreed: 13.0.9 - support: 1.6.0 - systemtags: 1.13.0 - theming: 1.14.0 - twofactor_backupcodes: 1.12.0 - twofactor_totp: 6.4.1 - user_status: 1.3.1 - viewer: 1.7.0 - workflowengine: 2.5.0 Disabled: - admin_audit - circles: 22.3.1 - encryption - federation: 1.7.0 - files_external: 1.11.1 - firstrunwizard: 2.6.0 - nextcloud_announcements: 1.6.0 - notify_push - photos: 1.0.0 - privacy: 1.4.0 - recommendations: 0.5.0 - serverinfo: 1.7.0 - side_menu: 2.5.1 - survey_client: 1.5.0 - text: 3.4.1 - updatenotification: 1.12.0 - user_ldap - weather_status: 1.0.0 ```
Configuration (config/config.php) ``` { "hashingThreads": 4, "hashingMemoryCost": 77824, "hashingTimeCost": 3, "passwordsalt": "***REMOVED SENSITIVE VALUE***", "secret": "***REMOVED SENSITIVE VALUE***", "trusted_domains": [ "***REMOVED SENSITIVE VALUE***" ], "datadirectory": "***REMOVED SENSITIVE VALUE***", "dbtype": "mysql", "version": "23.0.10.1", "overwrite.cli.url": "***REMOVED SENSITIVE VALUE***", "dbtableprefix": "oc_", "mysql.utf8mb4": true, "dbname": "***REMOVED SENSITIVE VALUE***", "dbhost": "***REMOVED SENSITIVE VALUE***", "dbuser": "***REMOVED SENSITIVE VALUE***", "dbpassword": "***REMOVED SENSITIVE VALUE***", "installed": true, "skeletondirectory": "", "default_language": "fr", "default_locale": "fr_FR", "activity_expire_days": 7, "auth.bruteforce.protection.enabled": false, "htaccess.RewriteBase": "\/", "knowledgebaseenabled": false, "logtimezone": "Europe\/Paris", "maintenance": false, "memcache.local": "\\OC\\Memcache\\Redis", "memcache.locking": "\\OC\\Memcache\\Redis", "updatechecker": false, "upgrade.disable-web": true, "overwriteprotocol": "https", "preview_max_scale_factor": 1, "quota_include_external_storage": false, "trashbin_retention_obligation": "auto, 14", "versions_retention_obligation": "auto, 30", "updater.release.channel": "stable", "mail_smtpmode": "smtp", "mail_smtpauthtype": "LOGIN", "mail_smtpsecure": "tls", "mail_sendmailmode": "smtp", "mail_from_address": "***REMOVED SENSITIVE VALUE***", "mail_domain": "***REMOVED SENSITIVE VALUE***", "mail_smtpauth": 1, "mail_smtphost": "***REMOVED SENSITIVE VALUE***", "mail_smtpport": "587", "mail_smtpname": "***REMOVED SENSITIVE VALUE***", "mail_smtppassword": "***REMOVED SENSITIVE VALUE***", "instanceid": "***REMOVED SENSITIVE VALUE***", "overwritehost": "***REMOVED SENSITIVE VALUE***", "enabledPreviewProviders": [ "OC\\Preview\\PNG", "OC\\Preview\\JPEG", "OC\\Preview\\GIF", "OC\\Preview\\BMP", "OC\\Preview\\XBitmap" ], "loglevel": 3, "default_phone_region": "FR", "tempdirectory": "\/srv\/users\/nextcloud\/tmp", "trusted_proxies": "***REMOVED SENSITIVE VALUE***", "redis": { "host": "***REMOVED SENSITIVE VALUE***", "port": 0 }, "preview_max_memory": 256, "filelocking.enabled": true, } ```

Cron Configuration: Array ( [backgroundjobs_mode] => cron [lastcron] => 1667463004 )

External storages: files_external is disabled

Encryption: no

User-backends:

Talk configuration:

STUN servers

TURN servers

Signaling servers (mode: internal):

Browser: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36

solracsf commented 2 years ago

I've asked someone with Nextcloud to see if he has the same problem, and i'ts confirmed, it affects all instances (here, running NC v24). Of course, he didn't noticed them because Talk is not much used so the problem is minimized, but he has these kind of lookups too.

Here again, all these failed lookups are related to Talk files.

2022.11.03 09:50:33.089694 [uid:1016,gid:1016,pid:475617] lookup (265684,Trainig_Julian Session 1 and 2.mp4): no such file or directory <0.001589>
2022.11.03 09:50:33.091269 [uid:1016,gid:1016,pid:475617] lookup (265684,Trainig_Julian Session 1 and 2.mp4): no such file or directory <0.001408>
2022.11.03 09:50:33.093168 [uid:1016,gid:1016,pid:475617] lookup (265684,Trainig_Julian Session 1 and 2.mp4): no such file or directory <0.001767>
2022.11.03 09:50:33.094554 [uid:1016,gid:1016,pid:475617] lookup (265684,Trainig_Julian Session 1 and 2.mp4): no such file or directory <0.001295>
2022.11.03 09:50:33.096858 [uid:1016,gid:1016,pid:475617] lookup (265684,Trainig_Julian Session 1 and 2.mp4): no such file or directory <0.002145>
2022.11.03 09:50:33.098697 [uid:1016,gid:1016,pid:475617] lookup (265684,IMPRIMANTE): no such file or directory <0.001495>
2022.11.03 09:50:33.100323 [uid:1016,gid:1016,pid:475617] lookup (265684,IMPRIMANTE): no such file or directory <0.001434>
2022.11.03 09:50:33.101689 [uid:1016,gid:1016,pid:475617] lookup (265684,IMPRIMANTE): no such file or directory <0.001269>
2022.11.03 09:50:33.103160 [uid:1016,gid:1016,pid:475617] lookup (265684,IMPRIMANTE): no such file or directory <0.001381>
2022.11.03 09:50:33.104925 [uid:1016,gid:1016,pid:475617] lookup (265684,IMPRIMANTE): no such file or directory <0.001629>
2022.11.03 09:50:33.106931 [uid:1016,gid:1016,pid:475617] lookup (265684,2201 Home office Jan 23- Feb 02  2022.xlsx): no such file or directory <0.001558>
2022.11.03 09:50:33.108593 [uid:1016,gid:1016,pid:475617] lookup (265684,2201 Home office Jan 23- Feb 02  2022.xlsx): no such file or directory <0.001443>

My first impression here is that Nextcloud is looking up for wrong $path somewhere. Because those files indeed exist on the filesystem.

We can also observe things like:

2022.11.03 10:27:08.291214 [uid:0,gid:0,pid:503906] open (9223372032559808513): OK [fh:70083] <0.000860>
2022.11.03 10:27:09.413262 [uid:1013,gid:1013,pid:503805] getattr (1): OK (1,[drwxrwxrwx:0040777,3,0,0,1661975216,1664646143,1664646143,4096]) <0.002136>
2022.11.03 10:27:09.415366 [uid:1013,gid:1013,pid:503805] lookup (1,data): OK (2,[drwxrwx---:0040770,135,1013,1013,1661975384,1666943920,1666943920,4096]) <0.002007>
2022.11.03 10:27:09.417468 [uid:1013,gid:1013,pid:503805] lookup (2,nextcloud.log): OK (405619,[-rw-r-----:0100640,1,1013,1013,1664646103,1667466288,1667466288,6485090]) <0.002038>
2022.11.03 10:27:09.457863 [uid:1013,gid:1013,pid:503805] lookup (2,.ocdata): OK (229,[-rw-rw-r--:0100664,1,1013,1013,1661975385,1643894018,1664640710,0]) <0.002026>

Which of course are expected and OK.

nickvergessen commented 2 years ago

Can you tell me how to check for such lookups? Where is this being logged so I can check if that happens on our 25 instance.

solracsf commented 2 years ago

@nickvergessen you can use https://github.com/rflament/loggedfs

nickvergessen commented 1 year ago

I can't install such tools on our production instance.

However I can't see any changes that could cause something like this in the end. If the problem increased recently or after an update it is mostlikely due to changes in the mount provider logic and not a problem with Talk.

That being said, I think there were improvements in Nextcloud 24 and 25 in that regard. Maybe you can update to those versions and retest it there?

solracsf commented 1 year ago

No problem; once upgraded, we'll report back.

solracsf commented 1 year ago

No improvements on NC 24.0.7.

PVince81 commented 1 year ago

@solracsf can you check if your oc_mounts table has a lot of NULL values for "oc_mounts.mount_provider_class" ? there was an issue related to this that is fixed with the upcoming 24.0.8, but am not sure if it would cause more I/O than necessary

@icewind1991 do you have any insights on the lookups ?

szaimen commented 1 year ago

Hi, please update to 24.0.8 or better 25.0.2 and report back if it fixes the issue. Thank you!

nextcloud-command commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity and seems to be missing some essential information. It will be closed if no further activity occurs. Thank you for your contributions.

meonkeys commented 1 year ago

Did we learn anything related to this issue from when we looked into #35311? (LMK if you want more data for that, btw. The oc_mounts MAX(id) value is still growing rapidly for me.)

szaimen commented 1 year ago

Lets handle this in https://github.com/nextcloud/server/issues/35311 then