swoole / swoole-src

🚀 Coroutine-based concurrency library for PHP
https://www.swoole.com
Apache License 2.0

Swoole leaking connections #5144

Closed mrAndersen closed 10 months ago

mrAndersen commented 10 months ago

I am getting these stats in production: [image]

bff_swoole_connection_num is the connection_num metric from the HTTP server's stats() method; the other metrics are derived the same way. Swoole version: 5.0.3. I am also getting this error: [FATAL ERROR]: all coroutines (count: 1) are asleep - deadlock!
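For readers unfamiliar with how such metrics are derived, here is a minimal sketch of mapping a stats() array to prefixed metric names. The helper `prefixStats`, the key list, and the sample values are assumptions for illustration only; this is not the reporter's actual exporter code.

```php
<?php
declare(strict_types=1);

/**
 * Prefix selected keys of a stats array so they can be exported as metrics.
 * Hypothetical helper: the name and signature are illustrative only.
 */
function prefixStats(array $stats, string $prefix, array $keys): array
{
    $metrics = [];
    foreach ($keys as $key) {
        // Skip keys that this Swoole version does not report.
        if (array_key_exists($key, $stats)) {
            $metrics[$prefix . $key] = $stats[$key];
        }
    }
    return $metrics;
}

// In a running server this array would come from $server->stats();
// the values below are made up for the example.
$sample = ['connection_num' => 12, 'accept_count' => 40, 'close_count' => 28];

$metrics = prefixStats(
    $sample,
    'bff_swoole_',
    ['connection_num', 'accept_count', 'close_count']
);
```

With this mapping, `bff_swoole_connection_num` simply mirrors whatever `connection_num` the server reports, so any inaccuracy in the server's counter shows up directly in the metric.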

The server is started with the following parameters:

$this->swooleServer->set([
    Constant::OPTION_MAX_REQUEST => 50000,
    Constant::OPTION_MAX_REQUEST_GRACE => 5000,
    Constant::OPTION_LOG_LEVEL => SWOOLE_LOG_ERROR,
    Constant::OPTION_PID_FILE => '/tmp/bff.swoole.pid',
    Constant::OPTION_USER => $this->user,
    Constant::OPTION_GROUP => $this->group,
    Constant::OPTION_DAEMONIZE => false,
    Constant::OPTION_ENABLE_COROUTINE => true,
    Constant::OPTION_DISPATCH_MODE => SWOOLE_DISPATCH_FDMOD,
    Constant::OPTION_OPEN_CPU_AFFINITY => true,
    Constant::OPTION_WORKER_NUM => $this->threads,
    Constant::OPTION_HOOK_FLAGS => SWOOLE_HOOK_ALL,
    Constant::OPTION_DNS_SERVER => $this->dnsServer,
    Constant::OPTION_ENABLE_STATIC_HANDLER => false,
    Constant::OPTION_INPUT_BUFFER_SIZE => 8 * 1024 * 1024,
    Constant::OPTION_TCP_FASTOPEN => true,
    Constant::OPTION_MAX_CONN => 4096,
    Constant::OPTION_MAX_CONNECTION => 4096,
    Constant::OPTION_OPEN_TCP_NODELAY => true,
    Constant::OPTION_BACKLOG => 512,
]);

matyhtf commented 10 months ago

@mrAndersen The server is running in SWOOLE_BASE mode. When a worker process hits a fatal error and crashes, the connection count becomes inaccurate. This has been fixed in the following commit:

https://github.com/swoole/swoole-src/commit/a71a03455cdefd803c8fc40b8688998c5edfb6f0

XDRiVE888 commented 10 months ago

@matyhtf Could this be a real connection leak (zombie connections) rather than a cosmetic defect?

In my case, this is exactly what happened, and I also confirmed it on Swoole 5.1.0-dev.

The only thing I was able to find is this: in base mode, if a message in the server's send buffer (for example, a WebSocket message) has not been completely sent to the client, and at that moment the connection is broken by something other than the client or the server, the server invokes the close callback but does not close the socket. Instead, it leaves the unsent data in a queue that waits indefinitely to be sent. This happens at this point in the code: https://github.com/swoole/swoole-src/blob/2edc4d99dc35f083751f669fb12d335ec9e3b301/src/server/base.cc#L155-L163

While tracing the path to that point, I also noticed that this function is called along the way: https://github.com/swoole/swoole-src/blob/2edc4d99dc35f083751f669fb12d335ec9e3b301/src/server/worker.cc#L158

It is the function defined at: https://github.com/swoole/swoole-src/blob/2edc4d99dc35f083751f669fb12d335ec9e3b301/src/server/base.cc#L87

I don't know whether it is a bug that false is passed for the int flags parameter, but passing a specific value instead could disable the indefinite wait to send after an unexpected connection break, and thereby eliminate the socket leak. More generally, I still don't understand why a send buffer is needed for a connection whose close has been deferred. Perhaps it exists to finish serving HTTP requests on connections that are being closed, but by then control over the socket is already lost.

In my case, the server setting 'heartbeat_idle_time' => 60 helped eliminate the connection leak, because it times out the zombie connections after they are closed :) However, I don't think this is an ideal way to deal with these zombie sockets, since a legitimately inactive connection will also be disconnected after a while.
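For anyone who wants to try the workaround above, here is a minimal sketch of a base-mode server with the heartbeat enabled. The host, port, request handler, and the `heartbeat_check_interval` value are assumptions for illustration; only `'heartbeat_idle_time' => 60` comes from this thread.

```php
<?php
use Swoole\Http\Server;

// Host and port are placeholders; SWOOLE_BASE matches the mode discussed here.
$server = new Server('0.0.0.0', 9501, SWOOLE_BASE);

$server->set([
    // Close connections that have sent no data for 60 seconds.
    'heartbeat_idle_time' => 60,
    // How often idle connections are checked, in seconds (assumed value).
    'heartbeat_check_interval' => 30,
]);

// Trivial handler so the sketch is a complete, startable server.
$server->on('request', function ($request, $response) {
    $response->end('ok');
});

$server->start();
```

As noted, this also disconnects healthy clients that stay silent for longer than the idle time, so long-lived clients would need to send periodic traffic to stay connected.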

NathanFreeman commented 10 months ago

@mrAndersen @XDRiVE888 Hi. I have opened a PR for this; see https://github.com/swoole/swoole-src/pull/5149

XDRiVE888 commented 10 months ago

@NathanFreeman Thank you. I tested it myself and the problem has indeed disappeared, at least for me.