nextcloud / server

☁️ Nextcloud server, a safe home for all your data
https://nextcloud.com
GNU Affero General Public License v3.0

OC_Filecache is blowing Up!!!! #16834

Closed kaoschris closed 5 years ago

kaoschris commented 5 years ago

Steps to reproduce

Import the VM into VMware vCenter and set it up
Set up a Windows file server with 1 million files
Configure Nextcloud to use external CIFS shares from Windows
Import LDAP users from Windows

Expected behaviour

It works like it should, but see below.

Actual behaviour

There are SELECT requests held open for a long time in nextcloud_db.

Server configuration

Nextcloud server version (see your admin page): 16.0.2
Server OS (Ubuntu server is default): Ubuntu 18.04
How did you install the VM? (Scripted install from master OR Released version): From master

Network

Do you use DHCP? Yes
Is port 80 and/or 443 open? Yes

Logs / Screenshots

https://fileshare.kelobit.de/s/WLfJWbq2B246aNa

kesselb commented 5 years ago

Is there anything in your nextcloud.log? It looks like the cron job is running 3 times. Would you mind adding the other information from the issue template, like the configuration (without credentials) and the installed apps?
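
To see how the background job is triggered and whether several instances really run in parallel, something like the following could help (a sketch; it assumes the VM uses the system cron with the www-data web server user, which may differ on your setup):

# list the web server user's crontab; normally there is a single cron.php entry
sudo crontab -u www-data -l
# check how many cron.php processes are currently running
ps aux | grep -F cron.php | grep -v grep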

MichaIng commented 5 years ago

I am wondering why oc_file_locks and especially oc_storages tables are THAT huge. To compare with mine:

-rw-rw---- 1 mysql mysql  64K Dec  4  2018 oc_file_locks.ibd
-rw-rw---- 1 mysql mysql 220M Aug 24 01:48 oc_filecache.ibd
-rw-rw---- 1 mysql mysql  64K Dec  7  2018 oc_storages.ibd
# mysql -e 'select * from nextcloud.oc_file_locks'
# mysql -e 'select * from nextcloud.oc_storages'
+-------------------------+------------+-----------+--------------+
| id                      | numeric_id | available | last_checked |
+-------------------------+------------+-----------+--------------+
| home::Micha             |          3 |         1 |         NULL |
| local::/mnt/sda/ncdata/ |          4 |         1 |         NULL |
| home::<User2>           |          5 |         1 |         NULL |
+-------------------------+------------+-----------+--------------+

So yeah, the database seems to be stuck/looping, adding entries multiple times.

Besides filling in the info from the default issue template (check the Postgres logs as well): shutting down the web server, waiting for (or killing) the running Postgres processes, and then checking some rows of the oc_storages table could give a hint, e.g.:

select * from nextcloud.oc_storages limit 20;
select * from nextcloud.oc_storages limit 20 offset 100;

And try to replicate with a fresh instance with Redis-based file locking properly enabled first (which we could derive from your config.php settings as well).

enoch85 commented 5 years ago

@MichaIng wrote:

so it seems Nextcloud is not properly configured to use it

It is done by the book. Never tested with 1 million files though.

This is the config in the VM:

  'memcache.local' => '\\OC\\Memcache\\Redis',
  'filelocking.enabled' => true,
  'memcache.distributed' => '\\OC\\Memcache\\Redis',
  'memcache.locking' => '\\OC\\Memcache\\Redis',
  'redis' => 
  array (
    'host' => '/var/run/redis/redis-server.sock',
    'port' => 0,
    'timeout' => 0.5,
    'dbindex' => 0,
    'password' => 'RANDOMPASSWORD',
  ),

And this is the install script: https://github.com/nextcloud/vm/blob/master/static/redis-server-ubuntu.sh
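
For reference, a quick way to verify that Redis actually answers on that socket with the configured password would be (a sketch; RANDOMPASSWORD stands in for the real value from config.php):

# should reply with PONG
redis-cli -s /var/run/redis/redis-server.sock -a 'RANDOMPASSWORD' ping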

MichaIng commented 5 years ago

@enoch85 Ah, I didn't see that this is an official Nextcloud VM install. I couldn't find anything wrong in the install script/config.

If I remember right, memcache.locking fails silently (not causing any access issues) and automatically falls back to database locking, whereas memcache.local causes a 500 if it fails.

However, logs (the issue template info) are required to investigate. Additionally, since Redis seems to be involved:

journalctl -u redis-server
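
If Redis answers, one could also check whether lock keys actually end up there (a sketch; socket and password are taken from the config quoted above, and the key pattern is only an assumption about how the locking keys are named):

redis-cli -s /var/run/redis/redis-server.sock -a 'RANDOMPASSWORD' --scan --pattern '*lock*'
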
kaoschris commented 5 years ago

I am working on getting you the logs; I need some assistance from our customer to provide that information. Thank you in advance for your help. I will update this ASAP.

MichaIng commented 5 years ago

@kaoschris If you have no direct access to the affected system (only via the customer), then it would be best to collect the following logs all together (anonymized, of course):

This should cover the whole access chain, i.e. all parts that can play a role.

enoch85 commented 5 years ago

@kaoschris I just added APCu to the VM. If you want to use it, please create a file called whatever.sh and add this content:

. <(curl -sL https://raw.githubusercontent.com/nextcloud/vm/master/lib.sh)

# Enable igbinary for PHP
# https://github.com/igbinary/igbinary
if is_this_installed "php$PHPVER"-dev
then
    if ! yes no | pecl install -Z igbinary
    then
        msg_box "igbinary PHP module installation failed"
        exit
    else
        print_text_in_color "$IGreen" "igbinary PHP module installation OK!"
    fi
    {
        echo "# igbinary for PHP"
        echo "extension=igbinary.so"
        echo "session.serialize_handler=igbinary"
        echo "igbinary.compact_strings=On"
    } >> $PHP_INI
    restart_webserver
fi

# APCu (local cache)
if is_this_installed "php$PHPVER"-dev
then
    if ! yes no | pecl install -Z apcu
    then
        msg_box "APCu PHP module installation failed"
        exit
    else
        print_text_in_color "$IGreen" "APCu PHP module installation OK!"
    fi
    {
        echo "# APCu settings for Nextcloud"
        echo "extension=apcu.so"
        echo "apc.enabled=1"
        echo "apc.shm_segments=1"
        echo "apc.shm_size=32M"
        echo "apc.entries_hint=4096"
        echo "apc.ttl=0"
        echo "apc.gc_ttl=3600"
        echo "apc.mmap_file_mask=NULL"
        echo "apc.slam_defense=1"
        echo "apc.enable_cli=1"
        echo "apc.use_request_time=1"
        echo "apc.serializer=igbinary"
        echo "apc.coredump_unmap=0"
        echo "apc.preload_path"
    } >> $PHP_INI
    restart_webserver
fi

Run the file with sudo bash whatever.sh

Then change 'memcache.local' => '\\OC\\Memcache\\Redis', to 'memcache.local' => '\\OC\\Memcache\\APCu', in /var/www/nextcloud/config/config.php

MichaIng commented 5 years ago

@enoch85

Then change 'memcache.local' => '\\OC\\Memcache\\Redis', to 'memcache.local' => '\\OC\\Memcache\\Redis', in /var/www/nextcloud/config/config.php

I guess you mean:

to 'memcache.local' => '\\OC\\Memcache\\APCu',

😉

enoch85 commented 5 years ago

Yes ofc :D

kaoschris commented 5 years ago

Hello, I got some logs and uploaded them to https://fileshare.kelobit.de/s/WLfJWbq2B246aNa under the folder "logs". @enoch85 What does this user cache do? Could it help in this case? And does it come with your regular update script as well? Thanks in advance.

kesselb commented 5 years ago

@kaoschris please remove sensitive information from the logs (e.g. account names, paths) ...

MichaIng commented 5 years ago

@kesselb Indeed. Wasn't there a way to print the log a bit formatted, with sensitive information replaced by ****, via an occ command?

However, I just had a quick look:

"Message":"Failed to connect to the database: An exception occurred in driver: SQLSTATE[08006] [7] FATAL:  the database system is shutting down
"message":"Could not find resource core\/vendor\/marked\/marked.min.js to load"
"message":"unlink(\/mnt\/ncdata\/appdata_*****\/css\/core\/*****-css-variables.css): No such file or directory at \/var\/www\/nextcloud\/lib\/private\/Files\/Storage\/Local.php#227"

Some days earlier (old nextcloud.log.1):

"args":["An exception occurred while executing 'INSERT INTO \"oc_file_locks\" (\"key\", \"lock\", \"ttl\") VALUES(?, ?, ?) ON CONFLICT DO NOTHING' with params [\"files\\\/7ea46b601bdb7f5120e1218085d108f8\", -1, 1564917292]:\n\nSQLSTATE[53100]: Disk full: 7 ERROR:  could not access status of transaction 0\nDETAIL:  Could not write to file \"pg_subtrans\/0052\" at offset 147456: No space left on device.",{"errorInfo":["53100",7,"ERROR:  could not access status of transaction 0\nDETAIL:  Could not write to file \"pg_subtrans\/0052\" at offset 147456: No space left on device."]

And in the database logs, the current entries:

2019-08-26 15:54:54.558 CEST [22776] ncadmin@nextcloud_db ERROR:  duplicate key value violates unique constraint "oc_credentials_pkey"
2019-08-26 15:54:54.558 CEST [22776] ncadmin@nextcloud_db DETAIL:  Key ("user", identifier)=(B557DE07-CB7A-4F36-BF72-C44429C6F3D6, password::logincredentials/credentials) already exists.
2019-08-26 15:54:54.558 CEST [22776] ncadmin@nextcloud_db STATEMENT:  INSERT INTO "oc_credentials" ("user", "identifier", "credentials") VALUES($1, $2, $3)
2019-08-26 20:06:37.394 CEST [20573] ncadmin@nextcloud_db ERROR:  deadlock detected
2019-08-26 20:06:37.394 CEST [20573] ncadmin@nextcloud_db DETAIL:  Process 20573 waits for ShareLock on transaction 12499777; blocked by process 33429.
    Process 33429 waits for ShareLock on transaction 12499711; blocked by process 20573.
    Process 20573: INSERT INTO "oc_file_locks" ("key", "lock", "ttl") VALUES($1, $2, $3) ON CONFLICT DO NOTHING
    Process 33429: DELETE FROM "oc_file_locks" WHERE "ttl" < $1
2019-08-26 20:06:37.394 CEST [20573] ncadmin@nextcloud_db HINT:  See server log for query details.
2019-08-26 20:06:37.394 CEST [20573] ncadmin@nextcloud_db CONTEXT:  while inserting index tuple (0,2) in relation "oc_file_locks"
2019-08-26 20:06:37.394 CEST [20573] ncadmin@nextcloud_db STATEMENT:  INSERT INTO "oc_file_locks" ("key", "lock", "ttl") VALUES($1, $2, $3) ON CONFLICT DO NOTHING

Okay, so there is clearly a lot messed up already:

If the above is done and local tests work as expected, I would run occ maintenance:repair to clean up the database and force some rescans before opening the instance for general use again. occ files:scan --all also makes sense, since most likely not all files have been added to the database successfully.
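
On a standard VM install the occ calls would look roughly like this (a sketch; it assumes the default /var/www/nextcloud path and the www-data web server user):

sudo -u www-data php /var/www/nextcloud/occ maintenance:repair
sudo -u www-data php /var/www/nextcloud/occ files:scan --all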

enoch85 commented 5 years ago

run some occ maintenance:repair

@kaoschris In the VM that's occ_command maintenance:repair

MichaIng commented 5 years ago

@enoch85 Should this issue perhaps be moved to the VM repo? I don't think it is a general VM bug, more a particular edge case/support request, but it might provide ideas for general VM improvements/failsafe setup steps, like using APCu to split the load.

kaoschris commented 5 years ago

@kaoschris please remove sensitive information from the logs (e.g. account names, paths) ...

Yes, I didn't look over the info carefully; it's true that email addresses were in the logs, I didn't think they would be. Thanks for pointing it out.

The customer uses Windows Server 2016 with the SMB file service, and Nextcloud connects to those shares with the External Storage app. The users connect to Nextcloud with their LDAP / Active Directory credentials, and their permissions apply as they should: CEOs have access to all folders and employees don't. Maybe there is a conflict between handling massive amounts of data on SMB/CIFS shares and the filecache? The customer has about 1.2 million files and 70k folders; it is a media company which produces media (web, print, etc.).

enoch85 commented 5 years ago

The customer uses Windows Server 2016 with the SMB file service, and Nextcloud connects to those shares with the External Storage app; the users connect to Nextcloud with their LDAP / Active Directory credentials, and their permissions apply as they should

I can't find the issue now, but I've seen this before: customers having issues with the combination of LDAP, SMB and PostgreSQL in Nextcloud, especially when connecting LDAP while using externally mounted SMB. I had one case about a year ago where PostgreSQL created tons of files in its data directory.

I will try to find the issue and link it here.

enoch85 commented 5 years ago

Maybe not related after all, but here it is anyway: https://github.com/nextcloud/server/issues/11743

kesselb commented 5 years ago

I suspect that files/folders that are shared by the file server and configured as a user mount are added to oc_filecache for every user.

Example: a file share with 100,000 files. User A and user B can access the share with their personal credentials. Technically these are two different storages, so 200,000 rows are added to oc_filecache.

https://github.com/nextcloud/server/blob/adc2ab4bb2b6ab4f860ca1b5030994c6ad55dd59/apps/files_external/lib/Lib/Storage/SMB.php#L135-L140

I'm not 100% sure about that, but the code above creates the storage id, which contains the username, so the same file share accessed with different credentials leads to two storages in Nextcloud.

This could explain why your oc_filecache table contains a huge amount of data, given the number of users and the size of the file share.
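
A rough way to check this on the affected instance is to count oc_filecache rows per storage (a sketch; it assumes the default table layout and the nextcloud_db database seen in the logs):

sudo -u postgres psql -d nextcloud_db -c "SELECT s.id, count(f.fileid) AS files FROM oc_storages s LEFT JOIN oc_filecache f ON f.storage = s.numeric_id GROUP BY s.id ORDER BY files DESC LIMIT 20;"

If the same SMB share shows up once per user (the id string contains the username), that would confirm the duplication.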

Nevertheless, for such a big instance, Redis locking will decrease your database load.

"message":"Could not find resource core\/vendor\/marked\/marked.min.js to load"

This is fixed by some app or Nextcloud update.

High cpu load / long running selects

https://github.com/nextcloud/server/pull/16432 is part of 16.0.4 and helps reduce the SQL load.

enoch85 commented 5 years ago

@kaoschris

16432 is part of 16.0.4 and helps to reduce the sql load.

16.0.4 was just released, VM-wise. You could use the built-in script to update: sudo bash /var/scripts/update.sh

MichaIng commented 5 years ago

@kesselb

I suspect that files / folders are shared by the fileserver and configured as user mount are added to oc_filecache for every user.

That totally makes sense. How should Nextcloud know that it is the same share if it is added by multiple users separately?

This actually means: it would be better if one admin (or group admin) would add those network shares only once, and re-shares them to the other users or a group, Nextcloud internally. Am I right?

enoch85 commented 5 years ago

@MichaIng wrote

it would be better if one admin (or group admin) would add those network shares only once, and re-shares them to the other users or a group, Nextcloud internally. Am I right?

That's how I usually recommend doing things.

Makes sense to me at least.

kaoschris commented 5 years ago

Okay, I'll update to 16.0.4 and do some maintenance on the table itself to free some storage, and then we'll move the access rights to the Nextcloud level instead of the Windows level. I'll let you know if this helped.

Thanks for helping out, I'll catch up on this.

ghost commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity and seems to be missing some essential information. It will be closed if no further activity occurs. Thank you for your contributions.