nextcloud / server

☁️ Nextcloud server, a safe home for all your data
https://nextcloud.com
GNU Affero General Public License v3.0
27.39k stars 4.07k forks source link

[Bug]: Cronjob infinite yield (100% CPU) #40022

Closed davehayes closed 1 year ago

davehayes commented 1 year ago

⚠️ This issue respects the following points: ⚠️

Bug description

I run nextcloud cron jobs out of crontab. After a while, random cron jobs exhibit 100% cpu behavior.

Naturally I wanted to get to the bottom of this, so I ran the following script wrapper:

#!/bin/sh
exec >>/var/log/nextcloud/nextcrontrace.log 2>&1
KTFILE="/var/tmp/kt/nkt.`date +%Y%m%d%H%M%S`.$$"
echo "### >>> Cron started ... trace file is $KTFILE"
ktrace -f $KTFILE -i /usr/local/bin/php /nextcloud/cron.php
echo "### <<< Cron finished ($KTFILE)"

This eventually produced kernel trace files that shows, at a low level, where the problem is, however without a detailed knowledge of your code I cannot proceed. Thus I'm filing this report so I can hopefully reach someone who might have more of a clue.

The commonality for all the files is, each one seems to first search for a shared library named libmemkind. Being on FreeBSD, we don't have this library; a cursory investigation reveals that this might be part of pmdk which deprecates FreeBSD. This searches a lot of unrelated places for this shared library, which is unusual.

Next it tries to open shared memory like this:

  2760 php      CALL  shm_open(0x80665ac00,0xa02<O_RDWR|O_CREAT|O_EXCL>,0666<S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP|S_IROTH|S_IWOTH>)
  2760 php      NAMI  "/__KMP_REGISTERED_LIB_2760_666"
  2760 php      RET   shm_open -1 errno 17 File exists
  2760 php      CALL  shm_open(0x80665ac00,0x2<O_RDWR>,0666<S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP|S_IROTH|S_IWOTH>)
  2760 php      NAMI  "/__KMP_REGISTERED_LIB_2760_666"
  2760 php      RET   shm_open 3

I'm not sure where this is coming from. Very soon after this, we get to the symptom:

  2760 php      CALL  sched_yield
  2760 php      RET   sched_yield 0
  2760 php      CALL  sched_yield
  2760 php      RET   sched_yield 0

This appears to be waiting for something, but does so in a manner that eats 100% cpu.

Thanks in advance for a cogent reply to this issue.

Steps to reproduce

  1. Run cron.php periodically
  2. Watch your process table with your favorite tool (I use top)
  3. Wait until you see 100% cpu processes

Alternatively

  1. Create a script similar to the above script
  2. Run that out of crontab periodically
  3. Wait until you see 100% cpu processes

Expected behavior

I expect to never find a cron process that doesn't exit and spins the cpu up to 100%

Installation method

Community Manual installation with Archive

Nextcloud Server version

26

Operating system

Other

PHP engine version

PHP 8.1

Web server

Nginx

Database engine version

PostgreSQL

Is this bug present after an update or on a fresh install?

Upgraded to a MAJOR version (ex. 22 to 23)

Are you using the Nextcloud Server Encryption module?

Encryption is Disabled

What user-backends are you using?

Configuration report

# occ config:list system
{
    "system": {
        "apps_paths": [
            {
                "path": "\/usr\/local\/www\/nextcloud\/apps",
                "url": "\/apps",
                "writable": true
            }
        ],
        "logfile": "\/var\/log\/nextcloud\/nextcloud.log",
        "memcache.local": "\\OC\\Memcache\\APCu",
        "instanceid": "***REMOVED SENSITIVE VALUE***",
        "passwordsalt": "***REMOVED SENSITIVE VALUE***",
        "secret": "***REMOVED SENSITIVE VALUE***",
        "trusted_domains": [
            "cloud.dream-tech.com"
        ],
        "datadirectory": "***REMOVED SENSITIVE VALUE***",
        "dbtype": "pgsql",
        "version": "26.0.3.2",
        "overwrite.cli.url": "https:\/\/cloud.dream-tech.com",
        "dbname": "***REMOVED SENSITIVE VALUE***",
        "dbhost": "***REMOVED SENSITIVE VALUE***",
        "dbport": "",
        "dbtableprefix": "oc_",
        "dbuser": "***REMOVED SENSITIVE VALUE***",
        "dbpassword": "***REMOVED SENSITIVE VALUE***",
        "installed": true,
        "mail_from_address": "***REMOVED SENSITIVE VALUE***",
        "mail_smtpmode": "smtp",
        "mail_sendmailmode": "smtp",
        "mail_domain": "***REMOVED SENSITIVE VALUE***",
        "mail_smtphost": "***REMOVED SENSITIVE VALUE***",
        "mail_smtpport": "25",
        "default_phone_region": "US",
        "maintenance": false,
        "theme": "",
        "loglevel": 2,
        "updater.release.channel": "stable",
        "has_rebuilt_cache": true
    }
}

List of activated Apps

Enabled:
  - activity: 2.18.0
  - calendar: 4.4.2
  - carnet: 0.24.7
  - circles: 26.0.0
  - cloud_federation_api: 1.9.0
  - comments: 1.16.0
  - contacts: 5.3.2
  - contactsinteraction: 1.7.0
  - cookbook: 0.10.2
  - dashboard: 7.6.0
  - dav: 1.25.0
  - deck: 1.9.2
  - drawio: 2.1.1
  - end_to_end_encryption: 1.12.5
  - external: 5.1.0
  - federatedfilesharing: 1.16.0
  - federation: 1.16.0
  - files: 1.21.1
  - files_external: 1.18.0
  - files_mindmap: 0.0.28
  - files_pdfviewer: 2.7.0
  - files_rightclick: 1.5.0
  - files_sharing: 1.18.0
  - files_trashbin: 1.16.0
  - files_versions: 1.19.1
  - firstrunwizard: 2.15.0
  - logreader: 2.11.0
  - lookup_server_connector: 1.14.0
  - mail: 3.2.2
  - nextcloud_announcements: 1.15.0
  - notes: 4.8.0
  - notifications: 2.14.0
  - oauth2: 1.14.0
  - password_policy: 1.16.0
  - photos: 2.2.0
  - privacy: 1.10.0
  - provisioning_api: 1.16.0
  - recommendations: 1.5.0
  - related_resources: 1.1.0-alpha1
  - richdocuments: 8.0.2
  - serverinfo: 1.16.0
  - settings: 1.8.0
  - sharebymail: 1.16.0
  - spreed: 16.0.4
  - support: 1.9.0
  - survey_client: 1.14.0
  - systemtags: 1.16.0
  - tasks: 0.15.0
  - text: 3.7.2
  - theming: 2.1.1
  - transfer: 0.6.1
  - twofactor_backupcodes: 1.15.0
  - updatenotification: 1.16.0
  - user_status: 1.6.0
  - viewer: 1.10.0
  - weather_status: 1.6.0
  - workflowengine: 2.8.0
Disabled:
  - admin_audit: 1.16.0
  - analytics: 4.9.4 (installed 4.9.4)
  - bruteforcesettings: 2.6.0
  - encryption: 2.14.0
  - externalportal: 1.2.0 (installed 1.2.0)
  - groupfolders: 14.0.2 (installed 14.0.2)
  - suspicious_login: 4.4.0
  - twofactor_totp: 8.0

Nextcloud Signing status

No errors have been found.

Nextcloud Logs

No relevant logs were found

Additional info

OS: FreeBSD 12.4-STABLE

joshtrichards commented 1 year ago

Hi @davehayes - This probably something more suited for the Nextcloud Help Forum, at least unless it starts pointing at a bug.

Then:

Lastly:

davehayes commented 1 year ago

Hi @davehayes - This probably something more suited for the Nextcloud Help Forum, at least unless it starts pointing at a bug.

Well, I was pretty sure this was a bug. :) It's just not clear in which software (PHP, nextcloud,etc.)

There's a couple of reasons for me to post here; however please let me know if this is inappropriate and I will go away.

One reason is that those help forums do not have very technical people, so I am mostly reluctant to post there as it appears to be to be a waste of effort when things get this technical.

The other reason is that I did search that site for this issue and I got a number of conflicting reports and different issues; none of these applied to my use case.

  • Do the the processes eventually finish on their own after saturating the CPU?

Never. That is how I noticed it, watching the load average go up over time.

  • Is it occurring every 10 minutes (i.e. every run) or more like once a day?

I've found absolutely no data for this query. Sometimes it starts right away, other times it takes 8-10 hours before I see one.

  • Any warnings under the Administration page?

Only the "last background job execution ran yesterday" one. This does not occur when I let the cronjobs run, as some actually succeed.

  • What does your db server load look like during these events?

Looking at my prometheus stats, there is zero database load during the spin lock, which I believe should be relatively obvious from the yield idea. There are some brief spikes of 100op/sec activity but nothing correlatable as those spikes exist without the cron jobs running at all (or any spin locked processes)

  • What does your disk I/O load look like during these events?

Very quiet. Note that I am running nextcloud on a ZFS server with an SSD ZIL, so the performance may obscure any relevant cluse.

  • Any recurring errors or warnings or anything in the logs in general (i.e. not just during the events though anything during obviously of interest too)?

The only remotely interesting thing is this:

Aug 22 20:33:17 postgres[24582]: [6-1] 2023-08-22 20:33:17.258 PDT [95055] ERROR:  duplicate key value violates unique constraint "oc_filecache_extended_pkey"
Aug 22 20:33:17 postgres[24582]: [6-2] 2023-08-22 20:33:17.258 PDT [95055] DETAIL:  Key (fileid)=(42197) already exists.
Aug 22 20:33:17  postgres[24582]: [6-3] 2023-08-22 20:33:17.258 PDT [95055] STATEMENT:  INSERT INTO "oc_filecache_extended" ("fileid", "upload_time", "creation_time") VALUES($1, $2, $3)
  • Check the output of php -v against known php bugs/updates available

We are running 8.1.20 here. None of the changes since then look relevant

  • Can you provide a screenshot from something like htop or similar to get an idea of the bigger picture during these events?

I cannot as I do not have htop on FreeBSD. I do have extensive prometheus monitoring; what part of the bigger picture would you like?

  • Looks like you installed end_to_end_encryption and it's active; are you certain it's not being used?

This is turned off for my account. I have that turned off because it seemed finicky, but I would like to use it if possible. First things first. ;)

At the current moment I am the only one using nextcloud because I will eventually have regular users to support on it, so I can guarantee it's not on for me and my use cases. I have nextcloud hard restricted to my LAN, which should rule out any external mischief. Since I do extremely intense monitoring on most of the servers I maintain, now is a great time to ask for any in-depth monitoring that I could provide in order to help track this issue down.

The one blocker to releasing to users is that this is not the first time I've encountered this issue. It's been going on for some months, ever since nextcloud 25 and the first install. While I can write a reaper to detect and kill these processes, I would prefer to solve this issue correctly and/or get to the bottom of it.

Thanks in advance for any assistance you can provide.

joshtrichards commented 1 year ago

You can get more details about what background jobs cron.php is running within NC if you set your NC logging to debug (i.e. 'loglevel' => 0 in your config.php), It'll include each job's start/finish (and in-job debugging too in some cases).

Maybe this will help intersect which job(s) seem to be triggering this condition more than others. (Well and, I suppose, if it's even limited to just one). Note not all configured background jobs run each time cron.php calls, so it may take up to a 24 hour cycle to see the system run through all (well, most) potential jobs.

Just beware the usual caveats inherent in having debug level logging on - particularly if there's any other activity in NC at the time. :smile:

If 24 hours of debug logging isn't practical... the main alternative I can think of which most closely would replicate it would be to... toggle your cron.php crontab (or equivalent) entry off and then attempt an ./occ background-job:list followed by ./occ background-job:execute X (where X is one of your listed background jobs - so that you run them each manually one-by-one until something... maybe happens). This won't require debug logging to be left on for 24 hours, but it may not catch some potential scenarios (i.e. any possible side effects of parallel cron.php executions). It'll also require you to execute each of the jobs numerous times over a 24 hour period in order to get a reasonable sample size.

Let's also try to eliminate some of the easy stuff first:

Is PHP compiled with any weird stuff/options or have any non-standard/non-NC required PHP extensions active? (e.g. php -i | grep Configure, php -m, or similar)

P.S. This looks interesting: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=266873

P.P.S. Are you sure you're not getting at least a The database is used for transactional file locking [...] message on your Admin page? With your provided config I'd expect to see it due to your lack of Redis use (https://docs.nextcloud.com/server/latest/admin_manual/configuration_files/files_locking_transactional.html)

davehayes commented 1 year ago

Just beware the usual caveats inherent in having debug level logging on - particularly if there's any other activity in NC at the time. 😄

Can you be more specific here?

  • add --define apc.enable_cli=1 to your cron.php command-line: e.g. /usr/local/bin/php --define apc.enable_cli=1 /nextcloud/cron.php (or make sure it's in your php.ini - just make sure it's the CLI-mode one, not apache/fpm)

I believe I was required to add this by occ? Nevertheless, my php.ini has this on.

  • Disable PHP's opcache explicitly for the CLI: The easiest way is probably to set opcache.enable_cli=0 in your CLI-mode php.ini [probably default, but just in case its enabled for some reason]

Now I do not have a CLI mode php.ini. I presume the --define can let me do that just for the sake of a command?

It makes me a bit nervous to do testing with one config and live with another. Is this a recommendation for all of nextcloud's CLI ideas (e.g. occ), or just a testing idea?

Is PHP compiled with any weird stuff/options or have any non-standard/non-NC required PHP extensions active? (e.g. php -i | grep Configure, php -m, or similar)

Probably. :)

I build many FreeBSD packages myself for many different sites, almost on a daily basis. The PHP build I use is live for several different instances (e.g. laravel, wordpress) and has requirements I have adhered to. We have never seen an issue like this over a fair number of live servers and many hits per hour. This does not mean there isn't one either, I'm just saying that these builds are fairly well battle tested.

Now having said this, I went and looked here:

P.S. This looks interesting: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=266873

Thank you for this! This looks like the low hanging fruit to handle first, as this is actually the easiest thing for me to try. I've seen issues with OpenMP before. I will try a new build.

If you want feedback about that then please keep this issue open (then others who google here can have a solution).

P.P.S. Are you sure you're not getting at least a The database is used for transactional file locking [...] message on your Admin page? With your provided config I'd expect to see it due to your lack of Redis use (https://docs.nextcloud.com/server/latest/admin_manual/configuration_files/files_locking_transactional.html)

No messages like that ever. I do like to keep it simple so I did not install redis, and with the handful of users I have here at the test site I am under the impression that redis will be overkill.

Note that I will consider this if I deploy to a larger site.

joshtrichards commented 1 year ago

Can you be more specific here?

Debug logging, since it's global, will cause all sorts of things to log. It'll just be really noisy. It's not debug mode though. Shouldn't be too bad on any reasonably modern system used by one person. Just beware of the noise.

I believe I was required to add this by occ? Nevertheless, my php.ini has this on.

Oh yeah I forgot we do trigger an exception at the command-line nowadays if APCu is not already enabled.

  • Disable PHP's opcache explicitly for the CLI: The easiest way is probably to set opcache.enable_cli=0 in your CLI-mode php.ini [probably default, but just in case its enabled for some reason]

Now I do not have a CLI mode php.ini. I presume the --define can let me do that just for the sake of a command?

Yes,--define should work for that parameter as well. Probably should have suggested that instead. :smile:

It makes me a bit nervous to do testing with one config and live with another. Is this a recommendation for all of nextcloud's CLI ideas (e.g. occ), or just a testing idea?

The opcache one is just a testing idea.

Is PHP compiled with any weird stuff/options or have any non-standard/non-NC required PHP extensions active? (e.g. php -i | grep Configure, php -m, or similar)

Probably. :)

That could be a factor for why this isn't readily reproducible (maybe).

Now having said this, I went and looked here:

P.S. This looks interesting: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=266873

Thank you for this! This looks like the low hanging fruit to handle first, as this is actually the easiest thing for me to try. I've seen issues with OpenMP before. I will try a new build.

:+1:

Btw, I have no idea where pmdk / libmemkind stuff you're seeing is coming from. I don't see it searched for here in my Debian 12 environments.

Oh, and a note on the previously suggested occ background-job:list to look at all your jobs: append -l 100 to the command-line. Otherwise it limits to the first 10 jobs found so it'll be an incomplete list.

davehayes commented 1 year ago

Well, it appears that was it. Over 242 cron jobs in the last day and absolutely zero hangs. :D

The issue was related to the ports system. It seems that it wasn't respecting my options due to some confusion over the flavor. Bottom line is, the OpenMP option to ImageMagick7 should be unticked in FreeBSD to prevent this from occurring. When I did that and reinstalled the library, the problem has gone away.

The pmdk / libmemkind looks to be from OpenMP so far. I haven't traced completely.

I will close this as I consider it resolved. Thank you very much for your assistance.