nextcloud / server

☁️ Nextcloud server, a safe home for all your data
https://nextcloud.com
GNU Affero General Public License v3.0

OC_Filecache is blowing Up!!!! #16834

Closed kaoschris closed 5 years ago

kaoschris commented 5 years ago

Steps to reproduce

Import the VM into VMware vCenter and set it up
Set up a Windows file server with 1 million files
Configure Nextcloud to use external CIFS shares from Windows
Import LDAP users from Windows

Expected behaviour

It works like it should, but see below.

Actual behaviour

There are SELECT requests held open for a long time in nextcloud_db.

Server configuration

Nextcloud server version (see your admin page): 16.0.2
Server OS (Ubuntu server is default): Ubuntu 18.04
How did you install the VM? (Scripted install from master OR Released version): From master

Network

Do you use DHCP? Yes
Is port 80 and/or 443 open? Yes

Logs / Screenshots

https://fileshare.kelobit.de/s/WLfJWbq2B246aNa

kesselb commented 5 years ago

Is there anything in your nextcloud.log? It looks like the cron job is running 3 times. Would you mind adding the other information from the issue template, like the configuration (without credentials) and the installed apps?
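
To see how the background job is triggered and whether several instances really run in parallel, something like the following could help (a sketch; it assumes the VM uses the system cron with the www-data web server user, which may differ on your setup):

# list the web server user's crontab; normally there is a single cron.php entry
sudo crontab -u www-data -l
# check how many cron.php processes are currently running
ps aux | grep -F cron.php | grep -v grep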

MichaIng commented 5 years ago

I am wondering why oc_file_locks and especially oc_storages tables are THAT huge. To compare with mine:

-rw-rw---- 1 mysql mysql  64K Dec  4  2018 oc_file_locks.ibd
-rw-rw---- 1 mysql mysql 220M Aug 24 01:48 oc_filecache.ibd
-rw-rw---- 1 mysql mysql  64K Dec  7  2018 oc_storages.ibd
# mysql -e 'select * from nextcloud.oc_file_locks'
# mysql -e 'select * from nextcloud.oc_storages'
+-------------------------+------------+-----------+--------------+
| id                      | numeric_id | available | last_checked |
+-------------------------+------------+-----------+--------------+
| home::Micha             |          3 |         1 |         NULL |
| local::/mnt/sda/ncdata/ |          4 |         1 |         NULL |
| home::<User2>           |          5 |         1 |         NULL |
+-------------------------+------------+-----------+--------------+

So yeah, the database seems to be stuck/looping, adding entries multiple times.

Besides filling in the info from the default issue template (check the Postgres logs as well): shutting down the web server, waiting for (or killing) the running Postgres processes, and then checking some rows of the oc_storages table could give a hint, e.g.:

select * from nextcloud.oc_storages limit 20;
select * from nextcloud.oc_storages limit 20 offset 100;

And try to replicate with a fresh instance with Redis-based file locking properly enabled first (which we could derive from your config.php settings as well).

enoch85 commented 5 years ago

@MichaIng wrote:

so it seems Nextcloud is not properly configured to use it

It is done by the book. Never tested with 1 million files though.

This is the config in the VM:

  'memcache.local' => '\\OC\\Memcache\\Redis',
  'filelocking.enabled' => true,
  'memcache.distributed' => '\\OC\\Memcache\\Redis',
  'memcache.locking' => '\\OC\\Memcache\\Redis',
  'redis' => 
  array (
    'host' => '/var/run/redis/redis-server.sock',
    'port' => 0,
    'timeout' => 0.5,
    'dbindex' => 0,
    'password' => 'RANDOMPASSWORD',
  ),

And this is the install script: https://github.com/nextcloud/vm/blob/master/static/redis-server-ubuntu.sh
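
For reference, a quick way to verify that Redis actually answers on that socket with the configured password would be (a sketch; RANDOMPASSWORD stands in for the real value from config.php):

# should reply with PONG
redis-cli -s /var/run/redis/redis-server.sock -a 'RANDOMPASSWORD' ping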

MichaIng commented 5 years ago

@enoch85 Ah, I didn't see that this is an official Nextcloud VM install. I couldn't find anything wrong in the install script/config.

If I remember right, memcache.locking fails silently (not causing any access issues) and automatically falls back to database locking, whereas memcache.local causes a 500 if it fails.

However, logs (the issue template info) are required to investigate. Additionally, since Redis seems to be involved:

journalctl -u redis-server
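
If Redis answers, one could also check whether lock keys actually end up there (a sketch; socket and password are taken from the config quoted above, and the key pattern is only an assumption about how the locking keys are named):

redis-cli -s /var/run/redis/redis-server.sock -a 'RANDOMPASSWORD' --scan --pattern '*lock*'
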
kaoschris commented 5 years ago

I am working on getting you the logs; I need some assistance from our customer to provide that information. Thank you in advance for your help. I will update this ASAP.

MichaIng commented 5 years ago

@kaoschris If you have no direct access to the affected system (only via the customer), then it would be best to collect the following logs all together (anonymized, of course):

This should cover the whole access chain, i.e. all parts that can play a role.

enoch85 commented 5 years ago

@kaoschris I just added APCu to the VM. If you want to use it, please create a file called whatever.sh and add this content:

. <(curl -sL https://raw.githubusercontent.com/nextcloud/vm/master/lib.sh)

# Enable igbinary for PHP
# https://github.com/igbinary/igbinary
if is_this_installed "php$PHPVER"-dev
then
    if ! yes no | pecl install -Z igbinary
    then
        msg_box "igbinary PHP module installation failed"
        exit
    else
        print_text_in_color "$IGreen" "igbinary PHP module installation OK!"
    fi
    {
        echo "# igbinary for PHP"
        echo "extension=igbinary.so"
        echo "session.serialize_handler=igbinary"
        echo "igbinary.compact_strings=On"
    } >> $PHP_INI
    restart_webserver
fi

# APCu (local cache)
if is_this_installed "php$PHPVER"-dev
then
    if ! yes no | pecl install -Z apcu
    then
        msg_box "APCu PHP module installation failed"
        exit
    else
        print_text_in_color "$IGreen" "APCu PHP module installation OK!"
    fi
    {
        echo "# APCu settings for Nextcloud"
        echo "extension=apcu.so"
        echo "apc.enabled=1"
        echo "apc.shm_segments=1"
        echo "apc.shm_size=32M"
        echo "apc.entries_hint=4096"
        echo "apc.ttl=0"
        echo "apc.gc_ttl=3600"
        echo "apc.mmap_file_mask=NULL"
        echo "apc.slam_defense=1"
        echo "apc.enable_cli=1"
        echo "apc.use_request_time=1"
        echo "apc.serializer=igbinary"
        echo "apc.coredump_unmap=0"
        echo "apc.preload_path"
    } >> $PHP_INI
    restart_webserver
fi

Run the file with sudo bash whatever.sh

Then change 'memcache.local' => '\\OC\\Memcache\\Redis', to 'memcache.local' => '\\OC\\Memcache\\APCu', in /var/www/nextcloud/config/config.php

MichaIng commented 5 years ago

@enoch85

Then change 'memcache.local' => '\\OC\\Memcache\\Redis', to 'memcache.local' => '\\OC\\Memcache\\Redis', in /var/www/nextcloud/config/config.php

I guess you mean:

to 'memcache.local' => '\\OC\\Memcache\\APCu',

😉

enoch85 commented 5 years ago

Yes ofc :D

kaoschris commented 5 years ago

Hello, I got some logs and uploaded them to https://fileshare.kelobit.de/s/WLfJWbq2B246aNa under the folder "logs". @enoch85 What does this user cache do? Could it help in this case? And does it come with your regular update script as well? Thanks in advance.

kesselb commented 5 years ago

@kaoschris please remove sensitive information from the logs (e.g. account names, paths) ...

MichaIng commented 5 years ago

@kesselb Indeed. Wasn't there a way to print the log a bit formatted, with sensitive information replaced by ****, via an occ command?

However, I just had a quick look:

"Message":"Failed to connect to the database: An exception occurred in driver: SQLSTATE[08006] [7] FATAL:  the database system is shutting down
"message":"Could not find resource core\/vendor\/marked\/marked.min.js to load"
"message":"unlink(\/mnt\/ncdata\/appdata_*****\/css\/core\/*****-css-variables.css): No such file or directory at \/var\/www\/nextcloud\/lib\/private\/Files\/Storage\/Local.php#227"

Some days earlier (old nextcloud.log.1):

"args":["An exception occurred while executing 'INSERT INTO \"oc_file_locks\" (\"key\", \"lock\", \"ttl\") VALUES(?, ?, ?) ON CONFLICT DO NOTHING' with params [\"files\\\/7ea46b601bdb7f5120e1218085d108f8\", -1, 1564917292]:\n\nSQLSTATE[53100]: Disk full: 7 ERROR:  could not access status of transaction 0\nDETAIL:  Could not write to file \"pg_subtrans\/0052\" at offset 147456: No space left on device.",{"errorInfo":["53100",7,"ERROR:  could not access status of transaction 0\nDETAIL:  Could not write to file \"pg_subtrans\/0052\" at offset 147456: No space left on device."]

And in the database logs, the current entries:

2019-08-26 15:54:54.558 CEST [22776] ncadmin@nextcloud_db ERROR:  duplicate key value violates unique constraint "oc_credentials_pkey"
2019-08-26 15:54:54.558 CEST [22776] ncadmin@nextcloud_db DETAIL:  Key ("user", identifier)=(B557DE07-CB7A-4F36-BF72-C44429C6F3D6, password::logincredentials/credentials) already exists.
2019-08-26 15:54:54.558 CEST [22776] ncadmin@nextcloud_db STATEMENT:  INSERT INTO "oc_credentials" ("user", "identifier", "credentials") VALUES($1, $2, $3)
2019-08-26 20:06:37.394 CEST [20573] ncadmin@nextcloud_db ERROR:  deadlock detected
2019-08-26 20:06:37.394 CEST [20573] ncadmin@nextcloud_db DETAIL:  Process 20573 waits for ShareLock on transaction 12499777; blocked by process 33429.
    Process 33429 waits for ShareLock on transaction 12499711; blocked by process 20573.
    Process 20573: INSERT INTO "oc_file_locks" ("key", "lock", "ttl") VALUES($1, $2, $3) ON CONFLICT DO NOTHING
    Process 33429: DELETE FROM "oc_file_locks" WHERE "ttl" < $1
2019-08-26 20:06:37.394 CEST [20573] ncadmin@nextcloud_db HINT:  See server log for query details.
2019-08-26 20:06:37.394 CEST [20573] ncadmin@nextcloud_db CONTEXT:  while inserting index tuple (0,2) in relation "oc_file_locks"
2019-08-26 20:06:37.394 CEST [20573] ncadmin@nextcloud_db STATEMENT:  INSERT INTO "oc_file_locks" ("key", "lock", "ttl") VALUES($1, $2, $3) ON CONFLICT DO NOTHING

Okay, so there is clearly a lot messed up already:

If the above is done and local tests work as expected, I would run occ maintenance:repair to clean up the database and force some rescans before opening the instance for general use again. occ files:scan --all also makes sense, since most likely not all files have been added to the database successfully.
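
On a standard VM install the occ calls would look roughly like this (a sketch; it assumes the default /var/www/nextcloud path and the www-data web server user):

sudo -u www-data php /var/www/nextcloud/occ maintenance:repair
sudo -u www-data php /var/www/nextcloud/occ files:scan --all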

enoch85 commented 5 years ago

run some occ maintenance:repair

@kaoschris In the VM that's occ_command maintenance:repair

MichaIng commented 5 years ago

@enoch85 Should this issue perhaps be moved to the VM repo? I don't think it is a general VM bug, more a particular edge case/support request, but it might provide ideas for general VM improvements/failsafe setup steps, like using APCu to split the load.

kaoschris commented 5 years ago

@kaoschris please remove sensitive information from the logs (e.g. account names, paths) ...

Yes, I didn't look over the info carefully; it's true that email addresses were in the logs, I didn't think they would be. Thanks for pointing it out.

The customer uses Windows Server 2016 with the SMB file service, and Nextcloud connects to those shares with the External Storage app. The users connect to Nextcloud with their LDAP / Active Directory credentials, and their permissions apply as they should: CEOs have access to all folders and employees don't. Maybe there is a conflict between handling massive amounts of data on SMB/CIFS shares and the filecache? The customer has about 1.2 million files and 70k folders; it is a media company which produces media (web, print, etc.).

enoch85 commented 5 years ago

The customer uses Windows Server 2016 with the SMB file service, and Nextcloud connects to those shares with the External Storage app; the users connect to Nextcloud with their LDAP / Active Directory credentials, and their permissions apply as they should

I can't find the issue now, but I've seen this before: customers having issues with the combination of LDAP, SMB and PostgreSQL in Nextcloud, especially when connecting LDAP while using externally mounted SMB. I had one case about a year ago where PostgreSQL created tons of files in its data directory.

I will try to find the issue and link it here.

enoch85 commented 5 years ago

Maybe not related after all, but here it is anyway: https://github.com/nextcloud/server/issues/11743

kesselb commented 5 years ago

I suspect that files/folders that are shared by the file server and configured as a user mount are added to oc_filecache for every user.

Example: a file share with 100,000 files. User A and user B can access the share with their personal credentials. Technically these are two different storages, so 200,000 rows are added to oc_filecache.

https://github.com/nextcloud/server/blob/adc2ab4bb2b6ab4f860ca1b5030994c6ad55dd59/apps/files_external/lib/Lib/Storage/SMB.php#L135-L140

I'm not 100% sure about that, but the code above creates the storage id, which contains the username, so the same file share accessed with different credentials leads to two storages in Nextcloud.

This could explain why your oc_filecache table contains a huge amount of data, given the number of users and the size of the file share.
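
A rough way to check this on the affected instance is to count oc_filecache rows per storage (a sketch; it assumes the default table layout and the nextcloud_db database seen in the logs):

sudo -u postgres psql -d nextcloud_db -c "SELECT s.id, count(f.fileid) AS files FROM oc_storages s LEFT JOIN oc_filecache f ON f.storage = s.numeric_id GROUP BY s.id ORDER BY files DESC LIMIT 20;"

If the same SMB share shows up once per user (the id string contains the username), that would confirm the duplication.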

Nevertheless, for such a big instance, Redis locking will decrease your database load.

"message":"Could not find resource core\/vendor\/marked\/marked.min.js to load"

This is fixed by some app or Nextcloud update.

High cpu load / long running selects

https://github.com/nextcloud/server/pull/16432 is part of 16.0.4 and helps reduce the SQL load.

enoch85 commented 5 years ago

@kaoschris

16432 is part of 16.0.4 and helps to reduce the sql load.

16.0.4 was just released, VM-wise. You could use the built-in script to update: sudo bash /var/scripts/update.sh

MichaIng commented 5 years ago

@kesselb

I suspect that files / folders are shared by the fileserver and configured as user mount are added to oc_filecache for every user.

That totally makes sense. How should Nextcloud know that it is the same share if it is added by multiple users separately?

This actually means: it would be better if one admin (or group admin) would add those network shares only once, and re-shares them to the other users or a group, Nextcloud internally. Am I right?

enoch85 commented 5 years ago

@MichaIng wrote

it would be better if one admin (or group admin) would add those network shares only once, and re-shares them to the other users or a group, Nextcloud internally. Am I right?

That's how I usually recommend doing things.

Makes sense to me at least.

kaoschris commented 5 years ago

Okay, I'll update to 16.0.4 and do some maintenance on the table itself to free some storage, and then we'll move the access rights to the Nextcloud level instead of the Windows level. I'll let you know if this helped.

Thanks for helping out, I'll catch up on this.

ghost commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity and seems to be missing some essential information. It will be closed if no further activity occurs. Thank you for your contributions.