dale-c-anderson closed this issue 2 months ago
Note that 8.1 is no longer supported. Is the Ondrej repository being used?
@devnexen yes, PHP packages are coming from ondrej's repo.
Would it be possible for you to check whether the issue still occurs with PHP 8.2/8.3?
It is possible that the signal is sent by the OOM killer. Did you inspect kernel logs? If that's the cause you should see it in these logs.
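A minimal sketch of that kernel-log check, in case it helps anyone landing here later. It assumes the usual wording of Linux OOM messages ("invoked oom-killer", "Out of memory: Killed process N (name)"), which varies somewhat between kernel versions; `find_oom_events` is a hypothetical helper name, not part of any tool.

```python
import re

# Patterns based on common Linux kernel OOM messages.
# Assumption: exact wording differs between kernel versions.
OOM_PATTERNS = [
    re.compile(r"invoked oom-killer"),
    re.compile(r"Out of memory: Kill(?:ed)? process (?P<pid>\d+) \((?P<name>[^)]+)\)"),
]


def find_oom_events(lines):
    """Return (line_number, pid, name) for each line that looks like an OOM event.

    pid and name are None for lines that only announce the OOM killer.
    """
    events = []
    for number, line in enumerate(lines, start=1):
        for pattern in OOM_PATTERNS:
            match = pattern.search(line)
            if match:
                groups = match.groupdict()
                events.append((number, groups.get("pid"), groups.get("name")))
                break  # one event per line is enough
    return events
```

To use it, pass the lines of the kernel log (on Ubuntu this is typically `/var/log/kern.log`, or the output of `dmesg`). An empty result suggests the OOM killer is not the source of the signal.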
@devnexen , yes we'll also look at that. It might take a while before we'll be able to put that into an environment to test.
@arnaud-lb , Nothing's getting oom killed according to syslog or kernel.log. The compute instances have loads of free disk (both space and IO), cpu, and mem resources.
Just an update here. We've been able to partially reproduce this in our staging environment by hitting Drupal's cron URL, and then not waiting for the cron call to finish.
Doing so hangs a PHP FPM child process. The child process stays active and never finishes. It doesn't crash, it just never ends, even though Drupal reports all the work done by the cron job as complete. Whatever is holding that child process won't let it be used again, and we eventually have to restart or reload the PHP FPM service to free up child processes again.
By contrast, triggering Drupal cron through the PHP CLI (namely Drupal's CLI tool, Drush) doesn't have the same effect. When using the CLI instead of going through PHP FPM, Drupal's cron finishes reliably, and no resources are left hanging.
Now that we have a better idea where to look, we'll try to reproduce this locally so we can set breakpoints and figure out where it's hanging.
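One way to spot hung children like these without attaching a debugger is the FPM status page in full JSON mode. The sketch below is an illustration only: it assumes `pm.status_path` is enabled for the pool (it does not appear in the config posted here) and that the JSON contains a `processes` list with `state`, `request duration` (microseconds) and `request uri` fields, which is my understanding of the FPM status output. `find_stuck_workers` is a hypothetical helper name.

```python
import json


def find_stuck_workers(status_json, max_seconds):
    """Return (pid, uri, seconds) for workers busy longer than max_seconds.

    status_json is the body returned by the FPM status page when requested
    with the "json" and "full" query parameters.
    """
    status = json.loads(status_json)
    stuck = []
    for proc in status.get("processes", []):
        # Assumption: FPM reports "request duration" in microseconds.
        seconds = proc.get("request duration", 0) / 1_000_000
        if proc.get("state") == "Running" and seconds > max_seconds:
            stuck.append((proc.get("pid"), proc.get("request uri"), seconds))
    return stuck
```

Polling this every minute or so and logging any worker that has been running for far longer than a cron run should take would at least record which request URI the stuck child was serving.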
@dale-c-anderson you can also try enabling the FPM slowlog by configuring request_slowlog_timeout, and possibly the slowlog option as well if the default path is not what you want.
I see that you have slowlog configured through php_admin_value, but that's for INI settings, so you should set it as an FPM config option instead.
And you should probably first try removing pm.max_requests = 250, as this restarts each child after 250 requests, which seems too low. This option should be used only if you are getting memory leaks (e.g. due to bugs in extensions); otherwise it should not be needed, or it should at least be set to a much higher value.
It might show more if you set log_level to debug (global config), which should give us more info. It's interesting that it gets killed exactly after 12s. I'm not sure it's pm.max_requests, because that's not handled by the master and it just finishes the connection. This looks more like you had pm.process_idle_timeout configured, but I don't see it in your config. Definitely enable the debug log, as it should give us more info.
@bukka Slowlog never revealed anything at all, and switching to debug level logging in php fpm didn't reveal anything we didn't already know.
FYI the exact same results were seen with PHP 8.2.
At some point we switched pm mode from dynamic to static, which stopped the sigkill warnings, but we still ended up with hanging child processes that never exited.
In any case, TL;DR: this looks like a new bug in Dynatrace OneAgent. Uninstalling OneAgent completely, or downgrading it to version 1.293 or older, made this issue unreproducible.
For posterity, the steps to reproduce were to trigger a Drupal 10 cron run with curl (handled by Nginx & PHP FPM) with Dynatrace OneAgent installed (>= v1.295), and a cron job sending out mail in batches. After a seemingly random number of successful sends (sometimes 100, sometimes 1000 or more), a PHP FPM child process would just hang. Debugging showed it stopped at a call to /usr/sbin/sendmail -bs from proc_open(), but never went any further. In those hung instances, it doesn't look like it even executed sendmail, since there was no corresponding connect from 127.0.0.1 in the mail log, which is what usually happens when sendmail -bs is executed.
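As a general illustration of bounding a spawned helper such as sendmail, here is a minimal sketch that runs a child with a deadline and kills it if it overruns. It is not a fix for the hang described above, which appeared to occur inside the spawn itself, before sendmail ran. It is also written in Python rather than PHP, and `run_with_deadline` is a hypothetical helper name.

```python
import subprocess
import sys


def run_with_deadline(argv, input_bytes=b"", timeout=10.0):
    """Run argv, feed it input_bytes, and kill it after timeout seconds.

    Returns (returncode, stdout) on completion, or None if the child was killed.
    """
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    try:
        stdout, _ = proc.communicate(input=input_bytes, timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()  # reap the child so it does not linger as a zombie
        return None
    return proc.returncode, stdout


if __name__ == "__main__":
    # Demo: a child that sleeps longer than the deadline gets killed.
    sleeper = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_deadline(sleeper, timeout=0.5))
```

A deadline like this only helps when the parent gets control back after spawning. If the parent blocks inside the spawn call, an external watchdog is needed instead.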
I'll close this and report the issue to Dynatrace instead.
Thanks for all your helpful suggestions.
Description
We've got a lovely production-only bug here, running PHP FPM & NGINX.
I'm looking for guidance on how to determine what's causing this. After several days of trying to reproduce or get any more dirt on what's happening, the team is starting to suspect a bug in PHP.
Nothing had changed in our deployed PHP (Drupal 10) application for at least a week, and then, out of nowhere, the following sigkill messages started flooding the log files of both our load-balanced EC2 worker nodes, at a rate of about one every other second:
Problem behavior
What we've tried and checked
turned up nothing. None of the sigkills have appeared in the audit log.
Current workaround
Reloading or restarting the PHP FPM 8.1 service temporarily resolves the issue once it has started. So when there's no one on deck to step in (or to catch this thing in the act), we have cron automatically reloading the service on each EC2 instance every 20 minutes, which is short enough to keep the sigkill messages from even starting.
Environment
Let me know what else might be relevant here:
file: /etc/php/8.1/fpm/php-fpm.conf
file: /etc/php/8.1/fpm/pool.d/application.conf
file: /etc/php/8.1/fpm/pool.d/www.conf # not actually used for anything, but still technically part of the config
PHP Version
PHP 8.1.29
Operating System
Ubuntu 20.04.6 LTS