Closed davehayes closed 1 year ago
Hi @davehayes - This probably something more suited for the Nextcloud Help Forum, at least unless it starts pointing at a bug.
Then:
Lastly:
php -v
against known php bugs/updates availableend_to_end_encryption
and it's active; are you certain it's not being used?Hi @davehayes - This probably something more suited for the Nextcloud Help Forum, at least unless it starts pointing at a bug.
Well, I was pretty sure this was a bug. :) It's just not clear in which software (PHP, nextcloud,etc.)
There's a couple of reasons for me to post here; however please let me know if this is inappropriate and I will go away.
One reason is that those help forums do not have very technical people, so I am mostly reluctant to post there as it appears to be to be a waste of effort when things get this technical.
The other reason is that I did search that site for this issue and I got a number of conflicting reports and different issues; none of these applied to my use case.
- Do the the processes eventually finish on their own after saturating the CPU?
Never. That is how I noticed it, watching the load average go up over time.
- Is it occurring every 10 minutes (i.e. every run) or more like once a day?
I've found absolutely no data for this query. Sometimes it starts right away, other times it takes 8-10 hours before I see one.
- Any warnings under the Administration page?
Only the "last background job execution ran yesterday" one. This does not occur when I let the cronjobs run, as some actually succeed.
- What does your db server load look like during these events?
Looking at my prometheus stats, there is zero database load during the spin lock, which I believe should be relatively obvious from the yield idea. There are some brief spikes of 100op/sec activity but nothing correlatable as those spikes exist without the cron jobs running at all (or any spin locked processes)
- What does your disk I/O load look like during these events?
Very quiet. Note that I am running nextcloud on a ZFS server with an SSD ZIL, so the performance may obscure any relevant cluse.
- Any recurring errors or warnings or anything in the logs in general (i.e. not just during the events though anything during obviously of interest too)?
The only remotely interesting thing is this:
Aug 22 20:33:17 postgres[24582]: [6-1] 2023-08-22 20:33:17.258 PDT [95055] ERROR: duplicate key value violates unique constraint "oc_filecache_extended_pkey"
Aug 22 20:33:17 postgres[24582]: [6-2] 2023-08-22 20:33:17.258 PDT [95055] DETAIL: Key (fileid)=(42197) already exists.
Aug 22 20:33:17 postgres[24582]: [6-3] 2023-08-22 20:33:17.258 PDT [95055] STATEMENT: INSERT INTO "oc_filecache_extended" ("fileid", "upload_time", "creation_time") VALUES($1, $2, $3)
- Check the output of
php -v
against known php bugs/updates available
We are running 8.1.20 here. None of the changes since then look relevant
- Can you provide a screenshot from something like htop or similar to get an idea of the bigger picture during these events?
I cannot as I do not have htop on FreeBSD. I do have extensive prometheus monitoring; what part of the bigger picture would you like?
- Looks like you installed
end_to_end_encryption
and it's active; are you certain it's not being used?
This is turned off for my account. I have that turned off because it seemed finicky, but I would like to use it if possible. First things first. ;)
At the current moment I am the only one using nextcloud because I will eventually have regular users to support on it, so I can guarantee it's not on for me and my use cases. I have nextcloud hard restricted to my LAN, which should rule out any external mischief. Since I do extremely intense monitoring on most of the servers I maintain, now is a great time to ask for any in-depth monitoring that I could provide in order to help track this issue down.
The one blocker to releasing to users is that this is not the first time I've encountered this issue. It's been going on for some months, ever since nextcloud 25 and the first install. While I can write a reaper to detect and kill these processes, I would prefer to solve this issue correctly and/or get to the bottom of it.
Thanks in advance for any assistance you can provide.
You can get more details about what background jobs cron.php
is running within NC if you set your NC logging to debug (i.e. 'loglevel' => 0
in your config.php
), It'll include each job's start/finish (and in-job debugging too in some cases).
Maybe this will help intersect which job(s) seem to be triggering this condition more than others. (Well and, I suppose, if it's even limited to just one). Note not all configured background jobs run each time cron.php
calls, so it may take up to a 24 hour cycle to see the system run through all (well, most) potential jobs.
Just beware the usual caveats inherent in having debug level logging on - particularly if there's any other activity in NC at the time. :smile:
If 24 hours of debug logging isn't practical... the main alternative I can think of which most closely would replicate it would be to... toggle your cron.php
crontab (or equivalent) entry off and then attempt an ./occ background-job:list
followed by ./occ background-job:execute X
(where X
is one of your listed background jobs - so that you run them each manually one-by-one until something... maybe happens). This won't require debug logging to be left on for 24 hours, but it may not catch some potential scenarios (i.e. any possible side effects of parallel cron.php
executions). It'll also require you to execute each of the jobs numerous times over a 24 hour period in order to get a reasonable sample size.
Let's also try to eliminate some of the easy stuff first:
--define apc.enable_cli=1
to your cron.php
command-line: e.g. /usr/local/bin/php --define apc.enable_cli=1 /nextcloud/cron.php
(or make sure it's in your php.ini
- just make sure it's the CLI-mode one, not apache/fpm) opcache.enable_cli=0
in your CLI-mode php.ini
[probably default, but just in case its enabled for some reason]Is PHP compiled with any weird stuff/options or have any non-standard/non-NC required PHP extensions active? (e.g. php -i | grep Configure
, php -m
, or similar)
P.S. This looks interesting: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=266873
P.P.S. Are you sure you're not getting at least a The database is used for transactional file locking [...]
message on your Admin page? With your provided config I'd expect to see it due to your lack of Redis use (https://docs.nextcloud.com/server/latest/admin_manual/configuration_files/files_locking_transactional.html)
Just beware the usual caveats inherent in having debug level logging on - particularly if there's any other activity in NC at the time. 😄
Can you be more specific here?
- add
--define apc.enable_cli=1
to yourcron.php
command-line: e.g./usr/local/bin/php --define apc.enable_cli=1 /nextcloud/cron.php
(or make sure it's in yourphp.ini
- just make sure it's the CLI-mode one, not apache/fpm)
I believe I was required to add this by occ
? Nevertheless, my php.ini has this on.
- Disable PHP's opcache explicitly for the CLI: The easiest way is probably to set
opcache.enable_cli=0
in your CLI-modephp.ini
[probably default, but just in case its enabled for some reason]
Now I do not have a CLI mode php.ini. I presume the --define
can let me do that just for the sake of a command?
It makes me a bit nervous to do testing with one config and live with another. Is this a recommendation for all of nextcloud's CLI ideas (e.g. occ
), or just a testing idea?
Is PHP compiled with any weird stuff/options or have any non-standard/non-NC required PHP extensions active? (e.g.
php -i | grep Configure
,php -m
, or similar)
Probably. :)
I build many FreeBSD packages myself for many different sites, almost on a daily basis. The PHP build I use is live for several different instances (e.g. laravel, wordpress) and has requirements I have adhered to. We have never seen an issue like this over a fair number of live servers and many hits per hour. This does not mean there isn't one either, I'm just saying that these builds are fairly well battle tested.
Now having said this, I went and looked here:
P.S. This looks interesting: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=266873
Thank you for this! This looks like the low hanging fruit to handle first, as this is actually the easiest thing for me to try. I've seen issues with OpenMP before. I will try a new build.
If you want feedback about that then please keep this issue open (then others who google here can have a solution).
P.P.S. Are you sure you're not getting at least a
The database is used for transactional file locking [...]
message on your Admin page? With your provided config I'd expect to see it due to your lack of Redis use (https://docs.nextcloud.com/server/latest/admin_manual/configuration_files/files_locking_transactional.html)
No messages like that ever. I do like to keep it simple so I did not install redis, and with the handful of users I have here at the test site I am under the impression that redis will be overkill.
Note that I will consider this if I deploy to a larger site.
Can you be more specific here?
Debug logging, since it's global, will cause all sorts of things to log. It'll just be really noisy. It's not debug mode though. Shouldn't be too bad on any reasonably modern system used by one person. Just beware of the noise.
I believe I was required to add this by
occ
? Nevertheless, my php.ini has this on.
Oh yeah I forgot we do trigger an exception at the command-line nowadays if APCu is not already enabled.
- Disable PHP's opcache explicitly for the CLI: The easiest way is probably to set
opcache.enable_cli=0
in your CLI-modephp.ini
[probably default, but just in case its enabled for some reason]Now I do not have a CLI mode php.ini. I presume the
--define
can let me do that just for the sake of a command?
Yes,--define
should work for that parameter as well. Probably should have suggested that instead. :smile:
It makes me a bit nervous to do testing with one config and live with another. Is this a recommendation for all of nextcloud's CLI ideas (e.g.
occ
), or just a testing idea?
The opcache one is just a testing idea.
Is PHP compiled with any weird stuff/options or have any non-standard/non-NC required PHP extensions active? (e.g.
php -i | grep Configure
,php -m
, or similar)Probably. :)
That could be a factor for why this isn't readily reproducible (maybe).
Now having said this, I went and looked here:
P.S. This looks interesting: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=266873
Thank you for this! This looks like the low hanging fruit to handle first, as this is actually the easiest thing for me to try. I've seen issues with OpenMP before. I will try a new build.
:+1:
Btw, I have no idea where pmdk
/ libmemkind
stuff you're seeing is coming from. I don't see it searched for here in my Debian 12 environments.
Oh, and a note on the previously suggested occ background-job:list
to look at all your jobs: append -l 100
to the command-line. Otherwise it limits to the first 10 jobs found so it'll be an incomplete list.
Well, it appears that was it. Over 242 cron jobs in the last day and absolutely zero hangs. :D
The issue was related to the ports system. It seems that it wasn't respecting my options due to some confusion over the flavor. Bottom line is, the OpenMP option to ImageMagick7 should be unticked in FreeBSD to prevent this from occurring. When I did that and reinstalled the library, the problem has gone away.
The pmdk / libmemkind looks to be from OpenMP so far. I haven't traced completely.
I will close this as I consider it resolved. Thank you very much for your assistance.
⚠️ This issue respects the following points: ⚠️
Bug description
I run nextcloud cron jobs out of crontab. After a while, random cron jobs exhibit 100% cpu behavior.
Naturally I wanted to get to the bottom of this, so I ran the following script wrapper:
This eventually produced kernel trace files that shows, at a low level, where the problem is, however without a detailed knowledge of your code I cannot proceed. Thus I'm filing this report so I can hopefully reach someone who might have more of a clue.
The commonality for all the files is, each one seems to first search for a shared library named
libmemkind
. Being on FreeBSD, we don't have this library; a cursory investigation reveals that this might be part ofpmdk
which deprecates FreeBSD. This searches a lot of unrelated places for this shared library, which is unusual.Next it tries to open shared memory like this:
I'm not sure where this is coming from. Very soon after this, we get to the symptom:
This appears to be waiting for something, but does so in a manner that eats 100% cpu.
Thanks in advance for a cogent reply to this issue.
Steps to reproduce
cron.php
periodicallyAlternatively
Expected behavior
I expect to never find a cron process that doesn't exit and spins the cpu up to 100%
Installation method
Community Manual installation with Archive
Nextcloud Server version
26
Operating system
Other
PHP engine version
PHP 8.1
Web server
Nginx
Database engine version
PostgreSQL
Is this bug present after an update or on a fresh install?
Upgraded to a MAJOR version (ex. 22 to 23)
Are you using the Nextcloud Server Encryption module?
Encryption is Disabled
What user-backends are you using?
Configuration report
List of activated Apps
Nextcloud Signing status
Nextcloud Logs
Additional info
OS: FreeBSD 12.4-STABLE