Seeing randomly frequent SIGKILL messages in logs, followed by PHP FPM service failure

dale-c-anderson commented 2 months ago

Description

We've got a lovely production-only bug here running PHP FPM & NGINX

I'm looking for guidance on how to determine what's causing this. After several days of trying to reproduce or get any more dirt on what's happening, the team is starting to suspect a bug in PHP.

Nothing had changed in our deployed PHP (Drupal 10) application for at least a week , and then from out of nowhere, the following sigkill messages start flooding the log files of both our load balanced ec2 worker nodes, at the rate of about one every other second:

[25-Jul-2024 23:45:38] WARNING: [pool xxxxxx] child 33944 exited on signal 9 (SIGKILL) after 12.011399 seconds from start
[25-Jul-2024 23:45:40] WARNING: [pool xxxxxx] child 33953 exited on signal 9 (SIGKILL) after 12.010867 seconds from start
[25-Jul-2024 23:45:42] WARNING: [pool xxxxxx] child 33968 exited on signal 9 (SIGKILL) after 12.009866 seconds from start
[25-Jul-2024 23:45:44] WARNING: [pool xxxxxx] child 33979 exited on signal 9 (SIGKILL) after 12.009435 seconds from start
[25-Jul-2024 23:45:46] WARNING: [pool xxxxxx] child 33988 exited on signal 9 (SIGKILL) after 12.007592 seconds from start
[25-Jul-2024 23:45:48] WARNING: [pool xxxxxx] child 33997 exited on signal 9 (SIGKILL) after 12.009038 seconds from start
[25-Jul-2024 23:45:50] WARNING: [pool xxxxxx] child 34011 exited on signal 9 (SIGKILL) after 12.009187 seconds from start
[25-Jul-2024 23:45:52] WARNING: [pool xxxxxx] child 34027 exited on signal 9 (SIGKILL) after 12.009532 seconds from start

Problem behavior

About 30-60 minutes after these sigkill messages start happening, PHP FPM starts to fail, and Nginx starts returning 502 Bad Gateway errors.
Sometimes these messages only appear on one ec2, sometimes they're on both.
The warnings started last thursday AM and persisted all day, then almost didn't appear at all on friday, even though friday was almost identical traffic load.
There was one burst of them this morning (monday) on only one node, and then nothing since.
The sigkills return randomly and without warning.
It does appear to happen more often during busier periods. The site's traffic is mostly used by a corporate audience, so M-F 9-5 is the traffic pattern.

What we've tried and checked

There's no shortage of resources. At least 13 GB unused memory on each node, CPU load of 10-20%, almost no disk activity
There's no corresponding errors in our application-level PHP FPM logs.
The number of active child PHP FPM process typically sits between 0 and 3, with 7 total running, and in the busiest part of the day, total child processes will go as high as 15 , but nowhere near maxing out.
Adding auditd to the system, and trying to suss out the origin of the sigkill with
```
auditctl -a always,exit -F arch=b64 -F a1=9 -S kill -k log_kill
```
turned up nothing. None of the sigkills have appeared in the audit log.

Current workaround

Reloading or restarting the PHP FPM 8.1 temporarily resolves the issue after it's started, so when there's no one on deck to step in (or catch this thing in the act), we have cron automatically reloading the service on each ec2 every 20 minutes, which is short enough to keep the sigkill messages from even starting.

Environment

Let me know what else might be relevant here:

$ php8.1 --version
PHP 8.1.29 (cli) (built: Jun  6 2024 16:52:57) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.1.29, Copyright (c) Zend Technologies
    with Zend OPcache v8.1.29, Copyright (c), by Zend Technologies

$ uname -a
Linux xxxxx 5.15.0-1065-aws #71~20.04.1-Ubuntu SMP Fri Jun 28 19:58:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.6 LTS"
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

$ nginx -V
nginx version: nginx/1.18.0 (Ubuntu)
built with OpenSSL 1.1.1f  31 Mar 2020
TLS SNI support enabled
configure arguments: --with-cc-opt='-g -O2 -fdebug-prefix-map=/build/nginx-BUo7Uw/nginx-1.18.0=. -fstack-protector-strong -Wformat -Werror=format-security -fPIC -Wdate-time -D_FORTIFY_SOURCE=2' --with-ld-opt='-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -fPIC' --prefix=/usr/share/nginx --conf-path=/etc/nginx/nginx.conf --http-log-path=/var/log/nginx/access.log --error-log-path=/var/log/nginx/error.log --lock-path=/var/lock/nginx.lock --pid-path=/run/nginx.pid --modules-path=/usr/lib/nginx/modules --http-client-body-temp-path=/var/lib/nginx/body --http-fastcgi-temp-path=/var/lib/nginx/fastcgi --http-proxy-temp-path=/var/lib/nginx/proxy --http-scgi-temp-path=/var/lib/nginx/scgi --http-uwsgi-temp-path=/var/lib/nginx/uwsgi --with-compat --with-debug --with-pcre-jit --with-http_ssl_module --with-http_stub_status_module --with-http_realip_module --with-http_auth_request_module --with-http_v2_module --with-http_dav_module --with-http_slice_module --with-threads --with-http_addition_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_sub_module

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       1.6Gi        10Gi        65Mi       3.5Gi        13Gi
Swap:            0B          0B          0B

file: /etc/php/8.1/fpm/php-fpm.conf

[global]
pid = /run/php/php8.1-fpm.pid
error_log = /var/log/php8.1-fpm.log
include=/etc/php/8.1/fpm/pool.d/*.conf

file: /etc/php/8.1/fpm/pool.d/application.conf

[myapp]
user = myserviceuser
group = myservicegroup
listen = /var/run/php8.1-fpm-my-app.sock
listen.owner = www-data
listen.group = www-data
pm = dynamic
pm.max_children = 45
pm.start_servers = 2
pm.min_spare_servers = 1
pm.max_spare_servers = 5
pm.max_requests = 250
pm.status_path = /status
catch_workers_output = no
chdir = /
php_admin_flag[log_errors] = on
php_admin_value[error_log] = /var/log/vhosts/myuser/myapp/php-fpm-error.log
php_admin_value[max_execution_time] = 300
php_admin_value[default_socket_timeout] = 300
security.limit_extensions = .php
php_admin_value[open_basedir] = "/home/myuser/www/myapp/:/tmp/"
php_value[newrelic.appname] = "my-app (production @ aws)"
php_flag[expose_php] = off
php_admin_value[upload_max_filesize] = 100M
php_admin_value[post_max_size] = 100M
php_admin_value[memory_limit] = 256M
php_admin_value[max_input_vars] = 3001
php_admin_value[short_open_tag] = Off
php_admin_value[request_slowlog_timeout] = 5s
php_admin_value[slowlog] = /var/log/vhosts/myuser/myapp/php-fpm-slow.log

file: /etc/php/8.1/fpm/pool.d/www.conf # not actually used for anything, but still technically part of the config

[www]
user = www-data
group = www-data
listen = /run/php/php8.1-fpm.sock
listen.owner = www-data
listen.group = www-data
pm = dynamic
pm.max_children = 5
pm.start_servers = 2
pm.min_spare_servers = 1
pm.max_spare_servers = 3

PHP Version

PHP 8.1.29

Operating System

Ubuntu 20.04.6 LTS

devnexen commented 2 months ago

Note that 8.1 is no longer supported. Is Ondrej repository used ?

dale-c-anderson commented 2 months ago

@devnexen yes, PHP packages are coming from ondrej's repo.

devnexen commented 2 months ago

Would it be a possibility for you to try if the issue occurs still with php 8.2/8.3 ?

arnaud-lb commented 2 months ago

It is possible that the signal is sent by the OOM killer. Did you inspect kernel logs? If that's the cause you should see it in these logs.

dale-c-anderson commented 2 months ago

@devnexen , yes we'll also look at that. It might take a while before we'll be able to put that into an environment to test.

@arnaud-lb , Nothing's getting oom killed according to syslog or kernel.log. The compute instances have loads of free disk (both space and IO), cpu, and mem resources.

dale-c-anderson commented 2 months ago

Just an update here. We've been able to partially reproduce this in our staging environment by hitting Drupal's cron URL, and then not waiting for the cron call to finish.

Doing so hangs a PHP FPM child process. The child process stays active and never finishes. It doesn't crash, it just never ends, even though Drupal reports all the work done by the cron job as complete. Whatever is using that child process won't let it be used again, and we eventually have to restart or reload the PHP FPM service to free up child threads again.

By contrast, when using PHP CLI (namely drupal's CLI tool, Drush) to trigger Drupal cron, it doesn't have the same effect. When using cli instead of going through PHP FPM, Drupal's cron finishes reliably, and no resources are left hanging.

Now that we have a better idea where to look, will try and reproduce locally so we can set breakpoints and figure out where it's hanging.

bukka commented 2 months ago

@dale-c-anderson you can try to also enable FPM slowlog by configuring request_slowlog_timeout and possibly slowlog option as well if the default path is not what you want.

bukka commented 2 months ago

I see that you have slowlog configured through php_admin_value but that's for INI so you should use it as an FPM config.

And you should probably first try to remove pm.max_requests = 250 as this will restarts child after 250 requests which seems like too low. This option should be used only if you are getting some mem leaks (e.g. due bugs in extensions) otherwise it should not be needed or it should be at least set to much higher value.

bukka commented 2 months ago

It might show more if you set debug log_level (global config) which should give us more info. Interesting that it gets killed exactly after 12s. I'm not sure pm.max_requests because that's not handled by master and it just finishes connection. This looks more like you had pm.process_idle_timeout configured but I don't see it in your config. Definitely enable the debug log as it should give us more info.

dale-c-anderson commented 2 months ago

@bukka Slowlog never revealed anything at all, and switching to debug level logging in php fpm didn't reveal anything we didn't already know.

FYI the exact same results were seen with PHP 8.2.

At some point we switched pm mode from dynamic to static, which stopped the sigkill warnings, but we still ended up with hanging child processes that never exited.

In any case TL;DR, this looks like it's a new bug in Dynatrace OneAgent: uninstalling OneAgent completely, or downgrading it to a version 1.293 or older made this issue unreproducible.

For posterity, steps to reproduce were to trigger a Drupal 10 cron run with curl (handled by Nginx & PHP FPM) with dynatrace oneagent installed (>= v1.295), and a cron job sending out mail in batches. After a seemingly random number of successful sends (sometimes 100, sometimes 1000 or more), a PHP FPM child process would just hang. Debugging showed it stopped with a call to /usr/sbin/sendmail -bs from proc_open(), but never went any further. In those hung instances, it doesn't look like it even executed sendmail, since there was no corresponding connect from 127.0.0.1 in the mail log, which is what usually happens when sendmail -bs is executed.

I'll close this report an issue to Dynatrace instead.

Thanks for all your helpful suggestions.

php / php-src