swoole / swoole-src

🚀 Coroutine-based concurrency library for PHP
https://www.swoole.com
Apache License 2.0
18.33k stars 3.16k forks source link

swoole:5.1.2-php8.3 - Maximum call stack size of 8339456 bytes (zend.max_allowed_stack_size - zend.reserved_stack_size) reached. #5347

Closed ValiDrv closed 1 month ago

ValiDrv commented 1 month ago

Short version

swoole:5.1.2 and php8.3 gives randomly Maximum call stack ... reached. Infinite recursion warnings and sometimes even hangs when using dispatch_func.

It might be hard to replicate, sometimes I get it 50 times per day, sometimes once in 4 hours... but I included a stripped down version of the code that I could replicate the issue with.

Note: It's not a hardware capacity issue, since Swoole can serve 70k RPS on our hardware and this issue happens with 20rps, 5k rps or 50k rps. Servers are Ubuntu 22.04.3 LTS, and can handle more than 1mill connections, multiple IPs, ip_local_port_range, ulimit, somaxconn, etc.

The only difference is PHP 8.3 vs PHP 8.2 and the Swoole version.

Long version

Please answer these questions before submitting your issue.

  1. What did you do? If possible, provide a simple script for reproducing the error.

There are some really strange/weird and intermittent issues with Swoole 5.1.2 and PHP 8.3 when using dispatch_func, as explained after the code.

cat docker-compose.yml

version: '3.8'
services:
  swoole:
    image: phpswoole/swoole:5.1.2-php8.3
      #network_mode: host
    ports:
    - '13091:13091'
    environment:
      AUTORELOAD_PROGRAMS: "swoole"
      AUTORELOAD_ANY_FILES: 1
      BOOT_MODE: TASK
    restart: unless-stopped
    volumes:
      - .:/var/www

cat server.php

<?php

declare(strict_types=1);

use Swoole\Http\Server as Server;
use Swoole\Http\Request as Request;
use Swoole\Http\Response as Response;

class myServer
{
    public Server $server;

    public function __construct(string $listen_ip = '0.0.0.0', $listen_port = 13091, int $worker_num = 4) {
        Swoole\Runtime::enableCoroutine(SWOOLE_HOOK_ALL | SWOOLE_HOOK_NATIVE_CURL);
        $http = new Server($listen_ip, $listen_port, SWOOLE_PROCESS);

        $http->on('request', fn(Request $request, Response $response) => $this->onRequest($request, $response, $http));

    # Setting
        $http->set(
            [
                # LOG
                'daemonize' => false,    # Must be FALSE, else can't kill workers in DEV...
                'log_file' => null,      # Only works if daemonize is true, so instead we use the STDOUT/STDERR
                'log_level' => SWOOLE_LOG_ERROR,
                'log_rotation' => SWOOLE_LOG_ROTATION_SINGLE,

                'tcp_fastopen' => true,     # This feature can improve the response speed of TCP short connections by sending data with the SYN packet when the client completes the third step of the handshake.
                'tcp_user_timeout' => 10 * 1000, # The TCP_USER_TIMEOUT option is a socket option at the TCP layer, with a value indicating the maximum time in milliseconds to wait for an ACK confirmation after a data packet is sent. Please refer to the man page for details.

                # GZIP 1-9, higher uses more CPU
                'http_gzip_level' => 7,
                'http_compression' => true,
                # Options
                'max_connection' => 1000000,                # Accept connections
                'buffer_output_size' => 32 * 1024 * 1024,   # 32M (byte) can't send files larger than this.
                'socket_buffer_size' => 32 * 1024 * 1024,   # Set the socket buffer size
                 'max_queued_bytes' => 32 * 1024 * 1024,   # Set the maximum queue length of the received buffer. If exceeded, stop receiving.
                'max_conn' => 32768,                        # the max tcp connection number of the server
                'backlog' => 327680,                            # How many connections waiting for accept can be handled simultaneously (net.core.somaxconn=32768 & net.ipv4.tcp_max_syn_backlog=32768). Note: `netstat -s | grep 'times the listen queue of a socket overflowed'`
                # 'max_request' => 10000,                     # Restart workers after X requests
                'open_cpu_affinity' => true,                # Stick to CPU core (helps with cpu cache)
                # 'discard_timeout_request' => true,          # Discard worker received data after closing connection. NOTE: might affect web workers.

                # Max coroutines
                "max_coro_num" => 1000000,                  # Total coroutines, default 3000
                "max_coroutine" => 1000000,                  # Total coroutines, default 3000

                # Threads
                'worker_num' => $worker_num,
                'task_worker_num' => 0,
                'task_enable_coroutine' => true,    # Add coroutines for task workers, so they don't block
                'enable_coroutine' => true,           # Automatically add coroutine to onRequest
                'reactor_num' => 4,  # Number of events IO threads, normally number of CPU cores, but at one we don't lock between threads

                'enable_reuse_port' => true,    # Enabling port reuse allows for starting multiple Server programs listening on the same port.
                # Keep-Alive connections.
                'open_tcp_nodelay' => true,
                'tcp_keepidle' => 600,
                'tcp_keepinterval' => 60,
        'tcp_keepcount' => 1000,
        # Custom dispatch function
        'dispatch_func' => $this->dispatch_func(...),
            ]);

        # Need to access
        $this->server = $http;

        # Last this is to start it
    echo $this->server->start() ? "Server started" : "Could not start server";
    }

    public function onRequest(Request $swoole_request, Response $swoole_response, Server $swoole_server): void {
    if($swoole_request->server['request_uri'] !== '/ping') {
        echo getmypid() . " onRequest({$swoole_server->getWorkerId()}): {$swoole_request->server['request_uri']}\n";
        Co::sleep(2);
        }
        $swoole_response->status(200);
        $swoole_response->end('OK');
    }

   # Used to return different worker ids based on the URL and whatnot.
    public function dispatch_func(Server $swoole_server, int $fd, int $type, string $data = null): int {

#       if ($type !== 0 || is_null($data)) {
#           echo "drop: $fd";
            # Use worker ID#0 to open/close the connection (does not matter what worker we get, but we don't have the data to select a worker...)
            # - 3: indicating that the client connection is established
            # - 4: indicating that the client connection is closed
#            return 0;
#   }

        preg_match("~\A(\S+)\s(\S+)\s~", $data ?? '', $m);
        $method = $m[1] ?? 'GET';
        $url = $m[2] ?? '';

        return $fd % $swoole_server->setting['worker_num'];
    }
}

echo "\n----------------------\n";
echo `php --ri swoole`;
echo "\n----------------------\n";
echo `php --version`;
echo "\n----------------------\n";
echo `gcc --version`;
echo "\n----------------------\n";

echo "\n\n -== SERVER RESTART ==- \n\n";
$r = new myServer();

cat makefile

# Hide Entering/Leaving directory
MAKEFLAGS += --no-print-directory
# If a file exists with the make target, still run the target
MAKEFLAGS += --always-make

RUN_DOCKER_COMPOSE := CURDIR=$(CURDIR) HOSTNAME=$(shell hostname) docker-compose --compatibility -p swoole-test -f ./docker-compose.yml

logs:
    @$(RUN_DOCKER_COMPOSE) logs -t -f --tail=100

##
# Stop and remove passed in ${SERVICES} exited containers
down:
    # Stopping and removing service containers...
    $(RUN_DOCKER_COMPOSE) down -v --remove-orphans || true

##
# Build the passed in ${SERVICES}
build: down
    # Get latest container versions
    @$(RUN_DOCKER_COMPOSE) pull
    # Building containers
    @$(RUN_DOCKER_COMPOSE) build --parallel
    # Starting containers and waiting for healthy status...
    @$(RUN_DOCKER_COMPOSE) up -d --force-recreate --wait
    # Containers started [OK]
    @$(RUN_DOCKER_COMPOSE) ps

ps:
    $(RUN_DOCKER_COMPOSE) ps
  1. What did you expect to see?

Most the time the server works great, no errors, fast response and so on.

time ab -k -c 1024 -n 10000 http://x.x.x.x:13091/ping
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking x.x.x.x (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests

Server Software:        swoole-http-server
Server Hostname:        x.x.x.x
Server Port:            13091

Document Path:          /ping
Document Length:        2 bytes

Concurrency Level:      1024
Time taken for tests:   0.375 seconds
Complete requests:      10000
Failed requests:        0
Keep-Alive requests:    10000
Total transferred:      1540000 bytes
HTML transferred:       20000 bytes
Requests per second:    26692.22 [#/sec] (mean)
Time per request:       188.220 [ms] (mean)
Time per request:       0.037 [ms] (mean, across all concurrent requests)
Transfer rate:          4014.26 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   45  48.5     40     138
Processing:    28   88  23.7     91     137
Waiting:        1   88  23.7     91     128
Total:         28  132  40.0    132     211

Percentage of the requests served within a certain time (ms)
  50%    132
  66%    163
  75%    170
  80%    173
  90%    182
  95%    190
  98%    198
  99%    202
 100%    211 (longest request)

real    0m0.437s
user    0m0.071s
sys 0m0.252s
  1. What did you see instead?

BUT, every now and then, usually fixed by or after a server restart, we get this error for some request.

 Fatal error: Uncaught Error: Maximum call stack size of 8339456 bytes (zend.max_allowed_stack_size - zend.reserved_stack_size) reached. Infinite recursion? in /var/www/server.php:109
 Stack trace:
 #0 [internal function]: myServer->dispatch_func(Object(Swoole\Http\Server), 5056, 4)
 #1 /var/www/server.php(98): Swoole\Http\Response->end('OK')
 #2 /var/www/server.php(21): myServer->onRequest(Object(Swoole\Http\Request), Object(Swoole\Http\Response), Object(Swoole\Http\Server))
 #3 [internal function]: myServer->{closure}(Object(Swoole\Http\Request), Object(Swoole\Http\Response))
 #4 {main}

The weird some requests part:

Usually, AFTER this happens, we start to get 504 timeouts (2min) in our proxy logs, and just before, but not always accompanied by Maximum call stack size errors, latency starts to go up and up.

time ab -k -c 1024 -n 10000 http://x.x.x.x:13091/ping
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking x.x.x.x (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests

Server Software:        swoole-http-server
Server Hostname:        10.9.9.113
Server Port:            13091

Document Path:          /ping
Document Length:        2 bytes

Concurrency Level:      1024
Time taken for tests:   12.545 seconds
Complete requests:      10000
Failed requests:        0
Keep-Alive requests:    10000
Total transferred:      1540000 bytes
HTML transferred:       20000 bytes
Requests per second:    797.13 [#/sec] (mean)
Time per request:       6302.607 [ms] (mean)
Time per request:       1.254 [ms] (mean, across all concurrent requests)
Transfer rate:          119.88 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0 3672 4614.4    120   12272
Processing:    38  104 124.5    112   12270
Waiting:        1  103  26.3    112     147
Total:         38 3777 4607.8    259   12384

Percentage of the requests served within a certain time (ms)
  50%    259
  66%   5036
  75%   7728
  80%   9229
  90%  12232
  95%  12314
  98%  12321
  99%  12359
 100%  12384 (longest request)

real    0m12.614s
user    0m0.076s
sys 0m12.393s
  1. What version of Swoole are you using (show your php --ri swoole)?

    swoole
    
    Swoole => enabled
    Author => Swoole Team <team@swoole.com>
    Version => 5.1.2
    Built => Apr 29 2024 21:13:40
    coroutine => enabled with boost asm context
    epoll => enabled
    eventfd => enabled
    signalfd => enabled
    cpu_affinity => enabled
    spinlock => enabled
    rwlock => enabled
    sockets => enabled
    openssl => OpenSSL 3.0.11 19 Sep 2023
    dtls => enabled
    http2 => enabled
    json => enabled
    curl-native => enabled
    zlib => 1.2.13
    mutex_timedlock => enabled
    pthread_barrier => enabled
    futex => enabled
    mysqlnd => enabled
    async_redis => enabled
    coroutine_pgsql => enabled
    
    Directive => Local Value => Master Value
    swoole.enable_coroutine => On => On
    swoole.enable_library => On => On
    swoole.enable_fiber_mock => Off => Off
    swoole.enable_preemptive_scheduler => Off => Off
    swoole.display_errors => On => On
    swoole.use_shortname => On => On
    swoole.unixsock_buffer_size => 8388608 => 8388608
  2. What is your machine environment used (show your uname -a & php -v & gcc -v) ?

 PHP 8.3.2 (cli) (built: Feb 13 2024 05:36:23) (NTS)
 Copyright (c) The PHP Group
 Zend Engine v4.3.2, Copyright (c) Zend Technologies

 ----------------------
 gcc (Debian 12.2.0-14) 12.2.0
 Copyright (C) 2022 Free Software Foundation, Inc.
 This is free software; see the source for copying conditions.  There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
NathanFreeman commented 1 month ago

I am trying to reproduce the issue by stress-testing the code for a long time with wrk.

ValiDrv commented 1 month ago

For the 504 http code errors, wrk will not work, since it counts responses and not the response code. So if one takes 2 min it still counts it as GOOD. Use ab, and you see the non 200 response codes. (basically I could replicate it about 20 times more often with ab compared to wrk)

Also, it seems unrelated to the number of requests/load past a certain amount. As in, I got 12 failures in 1000 requests, and 16 in 10k requests.

For the Maximum call stack ... reached. Infinite recursion it seems more related to how/when the server is restarted. Sometimes I don't get it for days/weeks, other times I get it on every request. (so we just deploy a few times until it doesn't happen) And once running, it just "randomly" happens for a few minutes, and then stops by itself.

NathanFreeman commented 1 month ago

@ValiDrv We have released Swoole v5.1.3. You can try Swoole v5.1.3 to see if this issue still exists.

ValiDrv commented 1 month ago

We moved away from using dispatch_func in production, but will keep an eye on it.