php / php-src

The PHP Interpreter
https://www.php.net
Other
37.57k stars 7.69k forks source link

opcache + MPM ITK might cause apache2 to hang #9910

Open diab303 opened 1 year ago

diab303 commented 1 year ago

Description

This is referring to the bug that is present since at least 5.6. It still happens in 7.4 (just confirmed) and the same code is present in 8.x so I assume the behaviour is the same there, too.

Apache + PHP + MPM_ITK hangs at random intervals spamming "Cannot kill process" in the apache error log.

The original bug report is here: https://bugs.php.net/bug.php?id=76482

The fix looks pretty trivial (break the loop when kill fails) but I'm not sure if it would cause other problems.

Quoting the bug history:

It was reported in 5.6.36 but the bug is in 7.1 (reproduced here) and by looking into source code also in current 7.3 branch.

Not only mpm itk is affected but any MPM which switches child processes to different UIDs (via setuid()). Other example is mod_ruid2 - https://github.com/mind04/mod-ruid2/issues/17

kill_all_lockers() has a loop which never ends if killing fails as kill(mem_usage_check->l_pid, SIGKILL) returns EPERM in errno.

There is

if (!success) {
/* errno is not ESRCH or we ran out of tries to kill the locker */
ZCSG(force_restart_time) = time(NULL); /* restore forced restart request */
/* cannot kill the locker, bail out with error */
zend_accel_error(ACCEL_LOG_ERROR, "Cannot kill process %d: %s!", mem_usage_check->l_pid, strerror(errno));
}

but it doesn't break the loop. Is brake missing here?

PHP Version

PHP 7.4

Operating System

Debian 11

cmb69 commented 1 year ago

The fix looks pretty trivial (break the loop when kill fails) but I'm not sure if it would cause other problems.

I don't think that either of the two patches attached to https://bugs.php.net/76482 would help, because on the previous line the process is supposed to be terminated, so the break would never be executed.

diab303 commented 1 year ago

I've finally found a way to reliably reproduce this, that should help with debugging.

Sat Dec 10 18:26:19 2022 (508078): Message Cached script '/var/www/test1/wait.php' Sat Dec 10 18:26:21 2022 (508080): Message Cached script '/var/www/test2/kill.php' Sat Dec 10 18:26:21 2022 (508080): Debug Restart Scheduled! Reason: user Sat Dec 10 18:26:32 2022 (508083): Warning Forced restart at 1670693192 (after 10 seconds), locked by 508078 Sat Dec 10 18:26:32 2022 (508083): Warning Attempting to kill locker 508078 Sat Dec 10 18:26:32 2022 (508083): Error Cannot kill process 508078: Operation not permitted!

[edit]

Tried patching /ext/opcache/ZendAccelerator.c (this is Debian's 7.4 src) to fix this behaviour, based on this patch: https://bugs.php.net/patch-display.php?bug_id=76482&patch=dont-while-loop-for-ever-php-7.3-only-for-noperm&revision=latest

--- /ext/opcache/ZendAccelerator.c  2022-06-07 10:38:20.000000000 +0200
+++ ZendAccelerator.c   2022-12-10 20:22:56.763530921 +0100
@@ -780,21 +780,25 @@
        success = 0;
        tries = 10;

        while (tries--) {
            zend_accel_error(ACCEL_LOG_WARNING, "Attempting to kill locker %d", mem_usage_check->l_pid);
            if (kill(mem_usage_check->l_pid, signal)) {
                if (errno == ESRCH) {
                    /* Process died before the signal was sent */
                    success = 1;
                    zend_accel_error(ACCEL_LOG_WARNING, "Process %d died before SIGKILL was sent", mem_usage_check->l_pid);
-               }
+               } else if (errno == EPERM) {
+                   /* We have no permission to kill the other process, so no point in trying */
+                   success = 1;
+                   zend_accel_error(ACCEL_LOG_WARNING, "No permission to kill Process %d", mem_usage_check->l_pid);
+               }
                break;
            }
            /* give it a chance to die */
            usleep(20000);
            if (kill(mem_usage_check->l_pid, 0)) {
                if (errno == ESRCH) {
                    /* successfully killed locker, process no longer exists  */
                    success = 1;
                    zend_accel_error(ACCEL_LOG_WARNING, "Killed locker %d", mem_usage_check->l_pid);
                }

It improved it (web requests didn't hang instantly) then I eventually ended up with an endless loop like this, so there is definitely room for improvement..

Sat Dec 10 20:31:22 2022 (1108306): Warning Attempting to kill locker 1108241 Sat Dec 10 20:31:22 2022 (1108306): Warning No permission to kill Process 1108241 Sat Dec 10 20:31:22 2022 (1108306): Warning Attempting to kill locker 1108241 Sat Dec 10 20:31:22 2022 (1108306): Warning No permission to kill Process 1108241 Sat Dec 10 20:31:22 2022 (1108306): Warning Attempting to kill locker 1108241 Sat Dec 10 20:31:22 2022 (1108306): Warning No permission to kill Process 1108241

arekm commented 3 months ago

I guess the same problem will happen if there is an apparmor policy preventing kill signal from being delivered to other httpd processes.

Code should handle EPERM in saner way than delay loop.

rahulthackkar commented 5 days ago

I've finally found a way to reliably reproduce this, that should help with debugging.

  • Install Apache2 + MPM ITK + mod php 7.4
  • create two virtual hosts running as two different users (assignuserid in apache conf)
  • set opcache.force_restart_timeout=10
  • call vhost1/wait.php: <?php sleep(1000);
  • call vhost2/kill.php: <?php opcache_reset(); After 10 seconds calling any php script will trigger the forced restart (I kept reloading kill.php a few times) This reliably results in the following in apache log, then all PHP scripts return an empty page until apache is restarted

Sat Dec 10 18:26:19 2022 (508078): Message Cached script '/var/www/test1/wait.php' Sat Dec 10 18:26:21 2022 (508080): Message Cached script '/var/www/test2/kill.php' Sat Dec 10 18:26:21 2022 (508080): Debug Restart Scheduled! Reason: user Sat Dec 10 18:26:32 2022 (508083): Warning Forced restart at 1670693192 (after 10 seconds), locked by 508078 Sat Dec 10 18:26:32 2022 (508083): Warning Attempting to kill locker 508078 Sat Dec 10 18:26:32 2022 (508083): Error Cannot kill process 508078: Operation not permitted!

[edit]

Tried patching /ext/opcache/ZendAccelerator.c (this is Debian's 7.4 src) to fix this behaviour, based on this patch: https://bugs.php.net/patch-display.php?bug_id=76482&patch=dont-while-loop-for-ever-php-7.3-only-for-noperm&revision=latest

--- /ext/opcache/ZendAccelerator.c    2022-06-07 10:38:20.000000000 +0200
+++ ZendAccelerator.c 2022-12-10 20:22:56.763530921 +0100
@@ -780,21 +780,25 @@
      success = 0;
      tries = 10;

      while (tries--) {
          zend_accel_error(ACCEL_LOG_WARNING, "Attempting to kill locker %d", mem_usage_check->l_pid);
          if (kill(mem_usage_check->l_pid, signal)) {
              if (errno == ESRCH) {
                  /* Process died before the signal was sent */
                  success = 1;
                  zend_accel_error(ACCEL_LOG_WARNING, "Process %d died before SIGKILL was sent", mem_usage_check->l_pid);
-             }
+             } else if (errno == EPERM) {
+                 /* We have no permission to kill the other process, so no point in trying */
+                 success = 1;
+                 zend_accel_error(ACCEL_LOG_WARNING, "No permission to kill Process %d", mem_usage_check->l_pid);
+                 }
              break;
          }
          /* give it a chance to die */
          usleep(20000);
          if (kill(mem_usage_check->l_pid, 0)) {
              if (errno == ESRCH) {
                  /* successfully killed locker, process no longer exists  */
                  success = 1;
                  zend_accel_error(ACCEL_LOG_WARNING, "Killed locker %d", mem_usage_check->l_pid);
              }

It improved it (web requests didn't hang instantly) then I eventually ended up with an endless loop like this, so there is definitely room for improvement..

Sat Dec 10 20:31:22 2022 (1108306): Warning Attempting to kill locker 1108241 Sat Dec 10 20:31:22 2022 (1108306): Warning No permission to kill Process 1108241 Sat Dec 10 20:31:22 2022 (1108306): Warning Attempting to kill locker 1108241 Sat Dec 10 20:31:22 2022 (1108306): Warning No permission to kill Process 1108241 Sat Dec 10 20:31:22 2022 (1108306): Warning Attempting to kill locker 1108241 Sat Dec 10 20:31:22 2022 (1108306): Warning No permission to kill Process 1108241

I am facing similar issue in production with multi user environment for php8.2 fpm + opcache each having different pool running with different ownership, also able to reproduce it in local. Related Bug reported : https://github.com/php/php-src/issues/14605