Can you attach a system report? Especially storage information would be useful.
Sure, what exactly do you mean by a system report?
The support app can generate it. Otherwise, describe your setup a bit: are you using external storage? Which other apps are enabled? Do you have Redis/Memcache? Are you also using it for locking?
Is there anything in your nextcloud.log file that indicates an error or results in data not being written or something?
My setup is pretty standard: Nextcloud 23.0.10, no external storage at all. This is a small dedicated server (4-core CPU @ 3 GHz, 16 GiB RAM), entirely dedicated to Nextcloud. Nextcloud uses a local Redis for both caching and locking, MariaDB as the database, PHP-FPM with OPcache, and Apache 2.4. Vanilla install following the docs. 12 users on the instance.
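For reference, a quick way to double-check the cache and file-locking backends with occ (a sketch only, assuming the default www-data web user and that occ sits in the Nextcloud root):

```bash
# Print the configured cache/locking backends from config.php
sudo -u www-data php occ config:system:get memcache.local
sudo -u www-data php occ config:system:get memcache.locking
sudo -u www-data php occ config:system:get memcache.distributed
```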
The log is full of https://github.com/nextcloud/server/issues/33919, but nothing else is logged; no other errors or warnings of any kind.
Just to give an idea, here is the relationship between CPU consumption and these insane lookups (screenshot):
Talk is by far the most-used app on the server. While these lookups are going on, the Apache web server logs only these requests:
GET "/ocs/v2.php/apps/spreed/api/v4/room?includeStatus=true"
POST "/ocs/v2.php/apps/spreed/api/v3/signaling/backend"
GET "/ocs/v2.php/apps/spreed/api/v1/chat/3coj2477?setReadMarker=0&lookIntoFuture=1&lastKnownMessageId=40258&includeLastKnown=0"
GET "/ocs/v2.php/apps/notifications/api/v2/notifications"
GET "/ocs/v2.php/apps/notifications/api/v2/notifications"
GET "/ocs/v2.php/apps/notifications/api/v2/notifications"
GET "/ocs/v2.php/apps/spreed/api/v4/room?includeStatus=true"
GET "/ocs/v2.php/apps/spreed/api/v1/chat/39dwbny4?setReadMarker=0&lookIntoFuture=1&lastKnownMessageId=40206&includeLastKnown=0"
POST "/ocs/v2.php/apps/spreed/api/v3/signaling/backend"
POST "/ocs/v2.php/apps/spreed/api/v3/signaling/backend"
GET "/ocs/v2.php/apps/spreed/api/v4/room?includeStatus=true"
GET "/ocs/v2.php/apps/spreed/api/v1/chat/39dwbny4?setReadMarker=0&lookIntoFuture=1&lastKnownMessageId=40206&includeLastKnown=0"
GET "/ocs/v2.php/apps/spreed/api/v4/room?includeStatus=true"
GET "/ocs/v2.php/apps/notifications/api/v2/notifications"
PUT "/apps/user_status/heartbeat"
GET "/ocs/v2.php/apps/notifications/api/v2/notifications"
GET "/ocs/v2.php/apps/notifications/api/v2/notifications"
POST "/ocs/v2.php/apps/spreed/api/v3/signaling/backend"
GET "/ocs/v2.php/apps/spreed/api/v4/room?includeStatus=true"
POST "/ocs/v2.php/apps/spreed/api/v3/signaling/backend"
POST "/ocs/v2.php/apps/spreed/api/v3/signaling/backend"
The PID in the lookups screenshot always corresponds to a php-fpm process.

Report below.
Operating system: Linux 5.15.0-52-generic #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64
Webserver: Apache
Database: MariaDB 10.6
PHP version: 8.0.25
Modules loaded: Core, date, libxml, openssl, pcre, zlib, filter, hash, json, Reflection, SPL, session, standard, sodium, cgi-fcgi, mysqlnd, PDO, xml, bcmath, calendar, ctype, curl, dom, mbstring, FFI, fileinfo, ftp, gd, gettext, gmp, iconv, igbinary, imagick, intl, ldap, exif, msgpack, mysqli, pdo_mysql, pdo_sqlite, Phar, posix, readline, redis, shmop, SimpleXML, soap, sockets, sqlite3, sysvmsg, sysvsem, sysvshm, tokenizer, xmlreader, xmlwriter, xsl, zip, Zend OPcache
Nextcloud version: 23.0.10 - 23.0.10.1
Updated from an older Nextcloud/ownCloud or fresh install:
Where did you install Nextcloud from: unknown
Cron Configuration: Array ( [backgroundjobs_mode] => cron [lastcron] => 1667463004 )
External storages: files_external is disabled
Encryption: no
User-backends:
Talk configuration:
STUN servers
TURN servers
Signaling servers (mode: internal):
Browser: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36
I've asked someone else who runs Nextcloud to see if he has the same problem, and it's confirmed: it affects other instances too (his runs NC v24). Of course, he hadn't noticed it because Talk isn't used much there, so the problem is minimized, but he sees these kinds of lookups as well.
Here again, all these failed lookups are related to Talk files.
2022.11.03 09:50:33.089694 [uid:1016,gid:1016,pid:475617] lookup (265684,Trainig_Julian Session 1 and 2.mp4): no such file or directory <0.001589>
2022.11.03 09:50:33.091269 [uid:1016,gid:1016,pid:475617] lookup (265684,Trainig_Julian Session 1 and 2.mp4): no such file or directory <0.001408>
2022.11.03 09:50:33.093168 [uid:1016,gid:1016,pid:475617] lookup (265684,Trainig_Julian Session 1 and 2.mp4): no such file or directory <0.001767>
2022.11.03 09:50:33.094554 [uid:1016,gid:1016,pid:475617] lookup (265684,Trainig_Julian Session 1 and 2.mp4): no such file or directory <0.001295>
2022.11.03 09:50:33.096858 [uid:1016,gid:1016,pid:475617] lookup (265684,Trainig_Julian Session 1 and 2.mp4): no such file or directory <0.002145>
2022.11.03 09:50:33.098697 [uid:1016,gid:1016,pid:475617] lookup (265684,IMPRIMANTE): no such file or directory <0.001495>
2022.11.03 09:50:33.100323 [uid:1016,gid:1016,pid:475617] lookup (265684,IMPRIMANTE): no such file or directory <0.001434>
2022.11.03 09:50:33.101689 [uid:1016,gid:1016,pid:475617] lookup (265684,IMPRIMANTE): no such file or directory <0.001269>
2022.11.03 09:50:33.103160 [uid:1016,gid:1016,pid:475617] lookup (265684,IMPRIMANTE): no such file or directory <0.001381>
2022.11.03 09:50:33.104925 [uid:1016,gid:1016,pid:475617] lookup (265684,IMPRIMANTE): no such file or directory <0.001629>
2022.11.03 09:50:33.106931 [uid:1016,gid:1016,pid:475617] lookup (265684,2201 Home office Jan 23- Feb 02 2022.xlsx): no such file or directory <0.001558>
2022.11.03 09:50:33.108593 [uid:1016,gid:1016,pid:475617] lookup (265684,2201 Home office Jan 23- Feb 02 2022.xlsx): no such file or directory <0.001443>
My first impression here is that Nextcloud is looking up a wrong $path somewhere, because those files do exist on the filesystem.
We can also observe things like:
2022.11.03 10:27:08.291214 [uid:0,gid:0,pid:503906] open (9223372032559808513): OK [fh:70083] <0.000860>
2022.11.03 10:27:09.413262 [uid:1013,gid:1013,pid:503805] getattr (1): OK (1,[drwxrwxrwx:0040777,3,0,0,1661975216,1664646143,1664646143,4096]) <0.002136>
2022.11.03 10:27:09.415366 [uid:1013,gid:1013,pid:503805] lookup (1,data): OK (2,[drwxrwx---:0040770,135,1013,1013,1661975384,1666943920,1666943920,4096]) <0.002007>
2022.11.03 10:27:09.417468 [uid:1013,gid:1013,pid:503805] lookup (2,nextcloud.log): OK (405619,[-rw-r-----:0100640,1,1013,1013,1664646103,1667466288,1667466288,6485090]) <0.002038>
2022.11.03 10:27:09.457863 [uid:1013,gid:1013,pid:503805] lookup (2,.ocdata): OK (229,[-rw-rw-r--:0100664,1,1013,1013,1661975385,1643894018,1664640710,0]) <0.002026>
These are of course expected and OK.
Can you tell me how to check for such lookups? Where is this being logged, so I can check whether it happens on our Nextcloud 25 instance?
@nickvergessen you can use https://github.com/rflament/loggedfs
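For reference, a minimal sketch of how loggedfs can be used to capture traces like the ones above (paths are illustrative; check the loggedfs README for the exact flags on your version):

```bash
# Mount the Nextcloud data directory over itself with loggedfs so that
# every filesystem call (lookup, getattr, open, ...) is written to a log file
sudo loggedfs -l /var/log/loggedfs.log /path/to/nextcloud/data

# Reproduce the problem, then look for the failing lookups
grep 'no such file or directory' /var/log/loggedfs.log | head

# Unmount when done
sudo fusermount -u /path/to/nextcloud/data
```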
I can't install such tools on our production instance.
However, I can't see any changes that could cause something like this in the end. If the problem increased recently or after an update, it is most likely due to changes in the mount provider logic and not a problem with Talk.
That being said, I think there were improvements in Nextcloud 24 and 25 in that regard. Maybe you can update to those versions and retest it there?
No problem; once upgraded, we'll report back.
No improvements on NC 24.0.7.
@solracsf can you check if your oc_mounts table has a lot of NULL values for "oc_mounts.mount_provider_class"? There was an issue related to this that is fixed in the upcoming 24.0.8, but I'm not sure if it would cause more I/O than necessary.
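For reference, one way to check that directly in MariaDB (a sketch; it assumes the default oc_ table prefix and a database named nextcloud, so adjust names and credentials):

```bash
# Count oc_mounts rows that have no mount provider class
mysql -u nextcloud -p nextcloud -e \
  "SELECT COUNT(*) AS null_provider_rows
     FROM oc_mounts
    WHERE mount_provider_class IS NULL;"
```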
@icewind1991 do you have any insights on the lookups?
Hi, please update to 24.0.8 or better 25.0.2 and report back if it fixes the issue. Thank you!
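One possible way to apply that update on a manual installation (a sketch assuming the install lives in /var/www/nextcloud and runs as www-data; the web updater works just as well):

```bash
cd /var/www/nextcloud
sudo -u www-data php updater/updater.phar   # command-line updater
sudo -u www-data php occ upgrade            # run pending migrations if needed
```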
This issue has been automatically marked as stale because it has not had recent activity and seems to be missing some essential information. It will be closed if no further activity occurs. Thank you for your contributions.
Did we learn anything related to this issue from when we looked into #35311? (LMK if you want more data for that, btw. The oc_mounts MAX(id) value is still growing rapidly for me.)
Let's handle this in https://github.com/nextcloud/server/issues/35311 then.
Steps to reproduce
I don't really know. The server had been showing growing I/O operations for a while; today, we debugged it. There are thousands of disk lookup operations per second going on, related to Talk files (files shared in Talk rooms only; they're not present anywhere else, which we know from their filenames). All these lookup operations are failing. A small extract of not even one entire second is attached (screenshot). This is confirmed by searching for these files in the DB, for example in the oc_share table (screenshot), and in the NC interface (screenshot).

If it helps, one simple way we've found to trigger these lookup calls (not the only one, as there are millions of these operations per day, with 12 users on the server and only one admin able to access this page) is browsing to the /settings/users page. You've guessed it: disabling the Talk app stops all these calls.
We've run occ files:scan, files:scan-app-data, and files:cleanup.
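A sketch of how these occ commands are typically invoked (assuming the www-data web user and occ in the Nextcloud root; the exact flags we used may have differed):

```bash
sudo -u www-data php occ files:scan --all      # rescan the files of all users
sudo -u www-data php occ files:scan-app-data   # rescan the appdata directory
sudo -u www-data php occ files:cleanup         # remove orphaned file cache entries
```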
No help here.

Expected behaviour

This should not happen
Talk app
Talk app version: 13.0.9
Custom Signaling server configured: yes
Custom TURN server configured: yes
Custom STUN server configured: yes
Operating system: Ubuntu 22.04
Web server: Apache
Database: MariaDB
PHP version: 8.0
Nextcloud Version: 23.0.10