php / php-src

The PHP Interpreter
https://www.php.net

FPM: keepalive connection with fastcgi_finish_request causes deadlock #10335

Open · Appla opened this issue 1 year ago

Appla commented 1 year ago

Description

The following code:

<?php

$pid = getmypid();
echo 'pid:', $pid, ', start@', $_SERVER['REQUEST_TIME_FLOAT'], ', elapsed: ', microtime(true) - $_SERVER['REQUEST_TIME_FLOAT'], PHP_EOL;

if (isset($_GET['later'])) {
    if ($_GET['later'] === '1') {
        // using PR's version
        // $c = fastcgi_finish_request(true) ? 'modified:true' : 'modified:false';
        $c = fastcgi_finish_request() ? 'raw:true' : 'raw:false';
    } else {
        $c = fastcgi_finish_request() ? 'raw:true' : 'raw:false';
    }
    $ret = req(['pid' => $pid, 'sleep' => 5, 'c' => $c]);
    var_dump($ret);
} elseif (isset($_GET['sleep'])) {
    sleep((int)$_GET['sleep']);
} else {
    echo 'nothing', PHP_EOL;
}
$now = microtime(true);

echo 'pid:', $pid, ', end@', $now, ', elapsed: ', $now - $_SERVER['REQUEST_TIME_FLOAT'], PHP_EOL;

function req(array $args) {
    $tFmt = (new DateTime())->format('Y.m.d.H-i-s.u');
    $path = basename(__FILE__);
    $url = "{$_SERVER['REQUEST_SCHEME']}://{$_SERVER['HTTP_HOST']}/{$path}?reqAt={$tFmt}&" . http_build_query($args);
    return file_get_contents($url, false, stream_context_create([
        'http' => [
            'method'  => 'POST',
            'timeout' => 10,
            'header'  => 'Content-Type: application/x-www-form-urlencoded',
            'content' => http_build_query([
                'a' => 1,
                'b' => 2,
            ]),
        ],
    ]));
}

with FPM settings:

pm = static
pm.max_children = 2

; uncomment to see access info via the FPM access log
; access.format = "%R - %u %t - %T \"%m %r%Q%q\" %s %f %{kilo}M %C%% %p-%P %{mili}d"
; access.log = /path/to/php/fpm-$pool.access.log

and NGINX settings:

## PHP-FPM upstream
upstream PHP_FCGI {
    ## path to fpm listen address
    server 127.0.0.1:9007;

    keepalive 2;
    keepalive_requests 1000;
    keepalive_timeout 60s;
}

## enable fastcgi keepalive
fastcgi_keep_conn on;

server {
    ....
    location = /fpm.php {
        root /path/to/php-file-root/;

        fastcgi_pass PHP_FCGI;
        fastcgi_split_path_info  ^((?U).+\.php)((?s)/.*)$;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_param SCRIPT_NAME $fastcgi_script_name;
        fastcgi_param  PATH_INFO  $fastcgi_path_info;
        include        fastcgi_params;
    }
    ....
}

Resulted in this log:

time curl -i 'http://t.fcgi.net/t_fpm.php?later=0'
HTTP/1.1 200 OK
Date: Mon, 16 Jan 2023 08:23:13 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive

pid:680637, start@1673857393.355, elapsed: 6.8187713623047E-5
curl -i 'http://t.fcgi.net/t_fpm.php?later=0'  0.00s user 0.00s system 77% cpu 0.006 total

==> /path/to/nginx/logs/access.log <==
127.0.0.1 - - [16/Jan/2023:16:23:13 +0800] http://t.fcgi.net:80 "GET /t_fpm.php?later=0 HTTP/1.1" 200 73 83 "-" "-" - - "curl/7.68.0" "-" 683465 231 - 127.0.0.1:9007 0.001 0.001
127.0.0.1 - - [16/Jan/2023:16:23:23 +0800] http://t.fcgi.net:80 "POST /t_fpm.php?reqAt=2023.01.16.16-23-13.355151&pid=680637&sleep=5&c=raw%3Atrue HTTP/1.0" 499 0 197 "-" "-" - - "-" "-" 683465 233 - 127.0.0.1:9007 10.008 10.007

==> /path/to/php/fpm-www.access.log <==
127.0.0.1 -  16/Jan/2023:16:23:13 +0800 - 16/Jan/2023:16:23:23 +0800 "GET /t_fpm.php?later=0" 200 /var/www/t_fpm.php 2048 0.00% 680637-680635 10008.038

==> /path/to/php/error/log <==
Jan 16 16:23:23 KB-111 ool www[680637]: PHP Warning:  file_get_contents(http://t.fcgi.net/t_fpm.php?reqAt=2023.01.16.16-23-13.355151&pid=680637&sleep=5&c=raw%3Atrue): failed to open stream: HTTP request failed! in /var/www/t_fpm.php on line 35

==> /path/to/php/fpm-www.access.log <==
127.0.0.1 -  16/Jan/2023:16:23:23 +0800 - 16/Jan/2023:16:23:28 +0800 "POST /t_fpm.php?reqAt=2023.01.16.16-23-13.355151&pid=680637&sleep=5&c=raw%3Atrue" 200 /var/www/fpm.php 2048 0.00% 680637-680635 5000.215

But I expected this output instead:

time curl -i 'http://t.fcgi.net/t_fpm.php?later=1'
HTTP/1.1 200 OK
Date: Mon, 16 Jan 2023 08:27:39 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive

pid:680638, start@1673857659.7505, elapsed: 0.00011897087097168
time curl -i 'http://t.fcgi.net/t_fpm.php?later=1'  0.01s user 0.00s system 73% cpu 0.007 total

==> /path/to/nginx/logs/access.log <==
127.0.0.1 - - [16/Jan/2023:16:27:39 +0800] http://t.fcgi.net:80 "GET /t_fpm.php?later=1 HTTP/1.1" 200 75 83 "-" "-" - - "curl/7.68.0" "-" 683465 234 - 127.0.0.1:9007 0.001 0.000

==> /path/to/php/fpm-www.access.log <==
127.0.0.1 -  16/Jan/2023:16:27:39 +0800 - 16/Jan/2023:16:27:44 +0800 "POST /t_fpm.php?reqAt=2023.01.16.16-27-39.750866&pid=680638&sleep=5&c=raw%3Atrue" 200 /var/www/t_fpm.php 2048 0.00% 680637-680635 5000.297
127.0.0.1 -  16/Jan/2023:16:27:39 +0800 - 16/Jan/2023:16:27:44 +0800 "GET /t_fpm.php?later=1" 200 /var/www/t_fpm.php 2048 0.00% 680638-680635 5001.523

==> /path/to/nginx/logs/access.log <==
127.0.0.1 - - [16/Jan/2023:16:27:44 +0800] http://t.fcgi.net:80 "POST /t_fpm.php?reqAt=2023.01.16.16-27-39.750866&pid=680638&sleep=5&c=true%7Ctrue HTTP/1.0" 200 121 198 "-" "-" - - "-" "-" 683465 236 - 127.0.0.1:9007 5.001 5.000

This is easy to reproduce with the setup above, and IMHO the logs clearly show that PHP's current behaviour is wrong. The reason, I think, is that after process X calls fastcgi_finish_request, the connection is marked reusable by the proxy; if the same process then makes a request to the same host, the proxy may pick that very connection, and a deadlock occurs.
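
For reference, the problematic pattern boils down to roughly the following minimal sketch (host name, sleep value and timeout are taken from the reproduction above, not a prescribed setup):

<?php
// Flush the response and finish the FastCGI request. With
// fastcgi_keep_conn on, nginx treats this connection as idle again
// and may queue the next request on it, even though this worker is
// still executing the code below.
fastcgi_finish_request();

// Call back into the same vhost. If nginx picks the kept-alive
// connection to this very worker, the sub-request queues behind us
// while we block waiting for its response, and file_get_contents()
// only gives up once the stream timeout expires.
$ctx = stream_context_create(['http' => ['timeout' => 10]]);
var_dump(file_get_contents('http://t.fcgi.net/t_fpm.php?sleep=5', false, $ctx));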

This can also be observed with GDB or tcpdump (for example, by watching the traffic on the address FPM listens on, 127.0.0.1:9007 in the configuration above).

So I opened https://github.com/php/php-src/pull/10273.

PHP Version

7.4-8.2

Operating System

Ubuntu 20.04

bukka commented 6 months ago

I have been doing some debugging and this really is an issue. Basically, fastcgi_finish_request is pretty much useless when used with keepalive on busy servers. It returns the response sooner, but it introduces a start delay for the next request because nginx thinks it can send another request immediately afterwards. And, as you show here, it introduces a deadlock when the script calls back into itself, for the same reason: nginx thinks it can send another request, so it reuses the connection.

I have been thinking about possible solutions and I can't see a way to address this easily without closing the connection. It would either require some sort of forking, which is most likely more expensive than closing the connection, or it would require ZTS, which is not usual for FPM and probably not a good idea either due to the complexity.

In terms of closing the connection, I'm not sure it should be done in the function itself, as it all depends on the server / FPM configuration. Maybe we should instead add an option that controls whether the connection is always closed after fastcgi_finish_request; it might even make sense to enable it by default, as the current behaviour is not what users expect. I don't think anyone expects fastcgi_finish_request to introduce a delay for the next request on busy servers. We should also update the docs.
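
For illustration only, such a switch could be a pool-level option along these lines; the name below is purely hypothetical and does not exist in any FPM release:

; hypothetical pool option: always close the client connection once
; fastcgi_finish_request() has flushed the response, so the web server
; cannot queue a follow-up request on a worker that is still running code
;fastcgi_finish_request.force_close = yes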

Appla commented 6 months ago

In terms of closing the connection, I'm not sure it should be done in the function itself, as it all depends on the server / FPM configuration. Maybe we should instead add an option that controls whether the connection is always closed after fastcgi_finish_request; it might even make sense to enable it by default, as the current behaviour is not what users expect. I don't think anyone expects fastcgi_finish_request to introduce a delay for the next request on busy servers. We should also update the docs.

The issue arises when keepalive is used alongside fastcgi_finish_request. To address this, I propose adding a new optional parameter to fastcgi_finish_request that defaults to closing the connection. This approach not only resolves the problem but also gives users the flexibility to modify the default behavior if needed. Introducing a new configuration option to control this default behavior seems reasonable.
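
With the signature proposed in the PR (the same one hinted at in the reproduction script above), usage would look roughly like this; the boolean parameter exists only in the proposed patch, not in any released PHP version:

<?php
// Proposed in PR #10273: passing true closes the FastCGI connection
// once the response has been flushed, so the web server has to open a
// fresh connection for any follow-up request instead of reusing this one.
fastcgi_finish_request(true);

// Calling back into the same server is then safe: the new connection can
// be accepted by a free worker instead of being queued on this busy one.
file_get_contents('http://t.fcgi.net/t_fpm.php?sleep=5');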

bukka commented 5 months ago

The problem is that it requires code changes for something that is more of an operational concern. To me it makes sense to give the administrator control over this, which essentially means a configuration option that affects every fastcgi_finish_request call.

Appla commented 5 months ago

The problem is that it requires code changes for something that is more of an operational concern. To me it makes sense to give the administrator control over this, which essentially means a configuration option that affects every fastcgi_finish_request call.

Sure, I have now added a new option that supplies the default value for the fastcgi_finish_request parameter controlling connection closure.