ratchetphp / Ratchet

Asynchronous WebSocket server
http://socketo.me
MIT License

Ratchet on 100% CPU load #939

Open ollm opened 2 years ago

ollm commented 2 years ago

Hi.

For the last month I have been having a problem running a Ratchet server in a production environment. Suddenly the process goes from normally using 0-1% of a CPU core to 98-100%. The websocket server keeps working but takes longer to respond. The time until this happens varies: sometimes it takes a few hours, other times days or weeks.

When it enters this state it does not exit again until I kill the process and supervisor starts a new one.

I am also checking the logic of my application so that it is not a problem on my part.

Apache version

Server Version: Apache/2.4.51 (cPanel) OpenSSL/1.1.1g mod_bwlimited/1.4
Server MPM: event
Server Built: Nov 7 2021 17:32:31

PHP version

PHP 8.0.12 (cli) (built: Nov 10 2021 01:18:26) ( NTS )
Copyright (c) The PHP Group
Zend Engine v4.0.12, Copyright (c) Zend Technologies
    with Zend OPcache v8.0.12, Copyright (c), by Zend Technologies

Ratchet version being used

cboden/ratchet v0.4.4 PHP WebSocket library
├──guzzlehttp/psr7 v1.8.3
├──ratchet/rfc6455 v0.3.1
├──react/event-loop v1.2.0
├──react/socket v1.10.0
├──symfony/http-foundation v6.0.1
└──symfony/routing v4.4.34

Event loop

ExtEvLoop
ev (1.1.4)

supervisor.conf

[unix_http_server]
file = /tmp/supervisor.sock

[supervisord]
logfile          = ./logs/supervisord.log
logfile_maxbytes = 50MB
logfile_backups  = 10
loglevel         = info
pidfile          = /tmp/supervisord.pid
nodaemon         = false
minfds           = 4096
minprocs         = 200

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[supervisorctl]
serverurl = unix:///tmp/supervisor.sock

[program:ratchet]
directory               = /home/username/public_html
command                 = bash -c "ulimit -n 10000; exec php chat.php"
process_name            = Chat
numprocs                = 1
autostart               = true
autorestart             = true
user                    = root
stdout_logfile          = ./logs/chat_info.log
stdout_logfile_maxbytes = 1MB
stderr_logfile          = ./logs/chat_error.log
stderr_logfile_maxbytes = 1MB

I have this code in the app, but there is no record in chat_info.log or chat_error.log

public function onError(ConnectionInterface $conn, \Exception $e)
{
    echo "An error has occurred: {$e->getMessage()}\n";

    $conn->close();
}

This is what appears if I run strace on the PID

strace

``` --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10400) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10400) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10400) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10400) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10400) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10399) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, 
{u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10399) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10399) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10399) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10399) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10398) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, 
si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10398) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10398) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10398) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10398) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10397) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10397) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, 
"\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10397) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10397) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) ```

Thank you for your replies.

WyriHaximus commented 2 years ago

What happens at the time of the sudden increase in CPU usage, like more users/messages/anything? Are you using any blocking code such as PDO or sleep or file_get_contents?

ollm commented 2 years ago

I still haven't been able to see how many users there are when it happens, since I only notice hours later, but there are currently 200 users and probably about 300-400 at peak. Currently the onMessage function receives about 10 messages per minute.

Are you using any blocking code such as PDO or sleep or file_get_contents?

My code has SQL queries and also uses the filemtime function.
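One way to check whether such calls hold up the event loop is to time them. The following is only a minimal sketch under stated assumptions: `timeBlocking`, its label argument and the threshold are hypothetical names invented for illustration, and `usleep` stands in for a real SQL query or `filemtime` call.

```php
<?php
// Hypothetical helper (not part of Ratchet or ReactPHP): runs a callable
// and reports how long it blocked the process.
function timeBlocking($label, callable $work, $thresholdMs = 1.0)
{
    $start = microtime(true);
    $result = $work();
    $elapsedMs = (microtime(true) - $start) * 1000.0;

    // STDERR is available in the CLI SAPI, which is how the server is started here.
    if ($elapsedMs >= $thresholdMs) {
        fwrite(STDERR, sprintf("[slow] %s took %.1f ms\n", $label, $elapsedMs));
    }

    return array($result, $elapsedMs);
}

// Stand-in for a blocking call such as a SQL query.
list($result, $ms) = timeBlocking('fake query', function () {
    usleep(2000); // block for roughly 2 ms
    return 'done';
});
```

While the callable runs, no other connection is served, so any call that regularly shows up in this log is a candidate for moving out of the message handler.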

I don't know if this has something to do with it, but I just noticed that the code I use in onOpen and onClose to save the client is different from the one in the documentation.

I have something like this:

public function onOpen(ConnectionInterface $conn)
{
    $this->clients[$conn->resourceId] = $conn;
}

public function onClose(ConnectionInterface $conn)
{
    // The connection is closed, remove it, as we can no longer send it messages
    unset($this->clients[$conn->resourceId]);
}
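For comparison, the chat example in the Ratchet documentation keeps connections in an `SplObjectStorage` rather than in an array keyed by `resourceId`. A minimal, self-contained sketch of that bookkeeping follows. `ClientRegistry` is a hypothetical name, and plain `stdClass` objects stand in for `ConnectionInterface` instances so the snippet runs without Ratchet installed.

```php
<?php
// Hypothetical registry mirroring the SplObjectStorage pattern; in a real
// application these methods would live in the MessageComponentInterface class.
final class ClientRegistry
{
    /** @var \SplObjectStorage */
    private $clients;

    public function __construct()
    {
        $this->clients = new \SplObjectStorage();
    }

    public function onOpen($conn)
    {
        // Store the connection object itself; no resourceId key is needed.
        $this->clients->attach($conn);
    }

    public function onClose($conn)
    {
        // Remove the connection, as we can no longer send it messages.
        $this->clients->detach($conn);
    }

    public function count()
    {
        return count($this->clients);
    }
}

$registry = new ClientRegistry();
$a = new \stdClass(); // stand-in for a connection
$b = new \stdClass(); // stand-in for a connection
$registry->onOpen($a);
$registry->onOpen($b);
$registry->onClose($a);
echo $registry->count(), "\n"; // prints 1
```

Both forms remove the connection in onClose; the array version additionally relies on `resourceId` being unique per open connection.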

Thanks for your reply.

ahmeteminkocal commented 2 years ago

You can make use of xdebug to find out exactly where your app is struggling.

zhiyong-ft commented 2 years ago

@ollm Two things come to mind: 1) You are running PHP 8.0. Is this a good idea for a production server? I read in some of the issues that ZMQ has trouble with PHP 7.2+. I had trouble with PHP 7.2+ on Windows, but so far not on Ubuntu, though that is not a production server yet. 2) Run garbage collection upon connection close? One of the issues from years ago also mentioned that we need to force garbage collection once in a while.

Just some observations, hope it helps.

zhiyong-ft commented 2 years ago

The other thing to try is to use either artillery or autobahn to stress test your application or server, and see if this happens again.

ollm commented 2 years ago

I have been stress-testing the server with wsstat and artillery. From 1000-1200 connections onward, some of the new connections start to fail (wsstat shows error 429 and artillery shows ECONNRESET); I tested up to 2200 connections. I don't know whether this is due to the websocket going through Cloudflare or to some problem with my internet connection. I had seen in other issues that people have problems when exceeding 1024 connections, but this does not seem to be the same problem. During the stress test the process reaches 100% CPU, but once the test finishes it returns to normal values.

Now I am testing with PHP 7.2, which is the oldest version that I currently have installed on the server.

Thanks for all replies.

ollm commented 2 years ago

Hi.

With PHP 7.2 I still have the same problem. I have been using xdebug to see where it happens, and the Broken pipe error appears between lines 129 and 139 of https://github.com/reactphp/stream/blob/70d6e15d5f90730651558852c74fbb767fd9215b/src/WritableResourceStream.php although it seems to be more related to the fwrite on line 124: https://github.com/reactphp/stream/blob/70d6e15d5f90730651558852c74fbb767fd9215b/src/WritableResourceStream.php#L124

strace

```
strace: Process 1707488 attached
recvfrom(133, "step_into -i 2797\0", 128, 0, NULL, NULL) = 18
getpid() = 1707488
write(4, "[1707488] [Step Debug] <- step_i"..., 44) = 44
getpid() = 1707488
write(4, "[1707488] [Step Debug] ->
```

xdebug

```
(cmd) step_into 2797 | step_into > break/ok 2797 | file:///home/username/public_html/includes/chat.php:3636
(cmd) 2798 | step_into > break/ok 2798 | file:///home/username/public_html/vendor/cboden/ratchet/src/Ratchet/WebSocket/WsServer.php:76
(cmd) 2799 | step_into > break/ok 2799 | file:///home/username/public_html/vendor/cboden/ratchet/src/Ratchet/WebSocket/WsServer.php:132
(cmd) 2800 | step_into > break/ok 2800 | file:///home/username/public_html/vendor/ratchet/rfc6455/src/Messaging/MessageBuffer.php:251
(cmd) 2801 | step_into > break/ok 2801 | file:///home/username/public_html/vendor/ratchet/rfc6455/src/Messaging/MessageBuffer.php:252
(cmd) 2802 | step_into > break/ok 2802 | file:///home/username/public_html/vendor/ratchet/rfc6455/src/Messaging/MessageBuffer.php:254
(cmd) 2803 | step_into > break/ok 2803 | file:///home/username/public_html/vendor/ratchet/rfc6455/src/Handshake/PermessageDeflateOptions.php:167
(cmd) 2804 | step_into > break/ok 2804 | file:///home/username/public_html/vendor/ratchet/rfc6455/src/Messaging/MessageBuffer.php:258
(cmd) 2805 | step_into > break/ok 2805 | file:///home/username/public_html/vendor/ratchet/rfc6455/src/Messaging/MessageBuffer.php:195
(cmd) 2806 | step_into > break/ok 2806 | file:///home/username/public_html/vendor/ratchet/rfc6455/src/Messaging/MessageBuffer.php:198
(cmd) 2807 | step_into > break/ok 2807 | file:///home/username/public_html/vendor/ratchet/rfc6455/src/Messaging/MessageBuffer.php:199
(cmd) 2808 | step_into > break/ok 2808 | file:///home/username/public_html/vendor/cboden/ratchet/src/Ratchet/WebSocket/WsServer.php:154
(cmd) 2809 | step_into > break/ok 2809 | file:///home/username/public_html/vendor/cboden/ratchet/src/Ratchet/Http/HttpServer.php:55
(cmd) 2810 | step_into > break/ok 2810 | file:///home/username/public_html/vendor/cboden/ratchet/src/Ratchet/Server/IoServer.php:116
(cmd) 2811 | step_into > break/ok 2811 | file:///home/username/public_html/vendor/cboden/ratchet/src/Ratchet/Server/IoServer.php:96
(cmd) 2812 | step_into > break/ok 2812 | file:///home/username/public_html/vendor/evenement/evenement/src/Evenement/EventEmitterTrait.php:127
(cmd) 2813 | step_into > break/ok 2813 | file:///home/username/public_html/vendor/evenement/evenement/src/Evenement/EventEmitterTrait.php:134
(cmd) 2814 | step_into > break/ok 2814 | file:///home/username/public_html/vendor/react/stream/src/Util.php:72
(cmd) 2815 | step_into > break/ok 2815 | file:///home/username/public_html/vendor/evenement/evenement/src/Evenement/EventEmitterTrait.php:127
(cmd) 2816 | step_into > break/ok 2816 | file:///home/username/public_html/vendor/evenement/evenement/src/Evenement/EventEmitterTrait.php:134
(cmd) 2817 | step_into > break/ok 2817 | file:///home/username/public_html/vendor/react/stream/src/DuplexResourceStream.php:202
(cmd) 2818 | step_into > break/ok 2818 | file:///home/username/public_html/vendor/react/event-loop/src/ExtEvLoop.php:97
(cmd) 2819 | step_into > break/ok 2819 | file:///home/username/public_html/vendor/react/event-loop/src/ExtEvLoop.php:96
(cmd) 2820 | step_into > break/ok 2820 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:118
(cmd) 2821 | step_into > break/ok 2821 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:119
(cmd) 2822 | step_into > break/ok 2822 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:123
(cmd) 2823 | step_into > break/ok 2823 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:124
(cmd) 2824 | step_into > break/ok 2824 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:129
(cmd) 2825 | step_into > break/ok 2825 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:139
(cmd) 2826 | step_into > break/ok 2826 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:146
(cmd) 2827 | step_into > break/ok 2827 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:147
(cmd) 2828 | step_into > break/ok 2828 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:150
(cmd) 2829 | step_into > break/ok 2829 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:155
(cmd) 2830 | step_into > break/ok 2830 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:167
(cmd) 2831 | step_into > break/ok 2831 | file:///home/username/public_html/vendor/react/event-loop/src/ExtEvLoop.php:97
(cmd) 2832 | step_into > break/ok 2832 | file:///home/username/public_html/vendor/react/event-loop/src/ExtEvLoop.php:192
(cmd) 2833 | step_into > break/ok 2833 | file:///home/username/public_html/vendor/react/event-loop/src/Tick/FutureTickQueue.php:42
(cmd) 2834 | step_into > break/ok 2834 | file:///home/username/public_html/vendor/react/event-loop/src/Tick/FutureTickQueue.php:44
(cmd) 2835 | step_into > break/ok 2835 | file:///home/username/public_html/vendor/react/event-loop/src/Tick/FutureTickQueue.php:49
(cmd) 2836 | step_into > break/ok 2836 | file:///home/username/public_html/vendor/react/event-loop/src/ExtEvLoop.php:194
(cmd) 2837 | step_into > break/ok 2837 | file:///home/username/public_html/vendor/react/event-loop/src/Tick/FutureTickQueue.php:58
(cmd) 2838 | step_into > break/ok 2838 | file:///home/username/public_html/vendor/react/event-loop/src/ExtEvLoop.php:195
(cmd) 2839 | step_into > break/ok 2839 | file:///home/username/public_html/vendor/react/event-loop/src/ExtEvLoop.php:196
(cmd) 2840 | step_into > break/ok 2840 | file:///home/username/public_html/vendor/react/event-loop/src/ExtEvLoop.php:201
(cmd) 2841 | step_into > break/ok 2841 | file:///home/username/public_html/vendor/react/event-loop/src/ExtEvLoop.php:202
(cmd) 2842 | step_into > break/ok 2842 | file:///home/username/public_html/vendor/react/event-loop/src/ExtEvLoop.php:208
(cmd) 2843 | step_into > break/ok 2843 | file:///home/username/public_html/vendor/react/event-loop/src/ExtEvLoop.php:162
(cmd) 2844 | step_into > break/ok 2844 | file:///home/username/public_html/vendor/react/event-loop/src/Timer/Timer.php:48
(cmd) 2845 | step_into > break/ok 2845 | file:///home/username/public_html/vendor/cboden/ratchet/src/Ratchet/WebSocket/WsServer.php:210
(cmd) 2846 | step_into > break/ok 2846 | file:///home/username/public_html/vendor/cboden/ratchet/src/Ratchet/WebSocket/WsServer.php:211
(cmd) 2847 | step_into > break/ok 2847 | file:///home/username/public_html/vendor/cboden/ratchet/src/Ratchet/WebSocket/WsConnection.php:31
(cmd) 2848 | step_into > break/ok 2848 | file:///home/username/public_html/vendor/cboden/ratchet/src/Ratchet/AbstractConnectionDecorator.php:31
(cmd) 2849 | step_into > break/ok 2849 | file:///home/username/public_html/vendor/cboden/ratchet/src/Ratchet/WebSocket/WsConnection.php:32
(cmd) 2850 | step_into > break/ok 2850 | file:///home/username/public_html/vendor/cboden/ratchet/src/Ratchet/WebSocket/WsServer.php:213
(cmd) 2851 | step_into > break/ok 2851 | file:///home/username/public_html/vendor/cboden/ratchet/src/Ratchet/WebSocket/WsServer.php:215
```

xdebug and strace with these breakpoints:

breakpoint_set -t line -f file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php -n 129
breakpoint_set -t line -f file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php -n 139

strace

```
write(123, "\27\3\3\0\25\17y\225\267n\303=\271\340\225K\220\354H\317\3624 \2\252\314\2341\301\204v\177\341\324!\23\320\34>", 24) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=1707488, si_uid=0} ---
close(123) = 0
write(122, "\27\3\3\0\25\270\313\315\34\\j\216\30\260)\1\16\357{\335bs\2\230\233\"", 26) = 26
getpid() = 1707488
write(4, "[1707488] [Step Debug] ->
```

xdebug

```
(cmd) 2781 | run > break/ok 2781 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:139
(cmd) 2782 | run > break/ok 2782 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:129
(cmd) 2783 | run > break/ok 2783 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:139
(cmd) 2784 | run > break/ok 2784 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:129
(cmd) 2785 | run > break/ok 2785 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:139
(cmd) 2786 | run > break/ok 2786 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:129
(cmd) 2787 | run > break/ok 2787 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:139
(cmd) 2788 | run > break/ok 2788 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:129
(cmd) 2789 | run > break/ok 2789 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:139
(cmd) 2790 | run > break/ok 2790 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:129
(cmd) 2791 | run > break/ok 2791 | file:///home/username/public_html/vendor/react/stream/src/WritableResourceStream.php:139
```

I also have this in strace, but I don't know if it is important.

strace

```
write(4, "[1707488] [Step Debug] ->
```

xdebug

```
(cmd) 2340 | step_into > break/ok 2340 | file:///home/username/public_html/vendor/react/stream/src/DuplexResourceStream.php:185
(cmd) 2341 | step_into > break/ok 2341 | file:///home/username/public_html/vendor/react/stream/src/DuplexResourceStream.php:187
(cmd) 2342 | step_into > break/ok 2342 | file:///home/username/public_html/vendor/react/stream/src/DuplexResourceStream.php:189
(cmd) 2343 | step_into > break/ok 2343 | file:///home/username/public_html/vendor/react/stream/src/DuplexResourceStream.php:195
(cmd) 2344 | step_into > break/ok 2344 | file:///home/username/public_html/vendor/react/stream/src/DuplexResourceStream.php:196
```

For now I will proceed to install PHP 7.1 to see if the problem persists with this version.

Thank you for your replies.

zhiyong-ft commented 2 years ago

My hunch is this has something to do with resources. Are you running it on a shared host or a VPS? I saw cPanel in your Apache info. Also, did you try to run garbage collection on connection close?

ollm commented 2 years ago

I am running Ratchet on a VPS (AlmaLinux v8.4.0) with KVM (QEMU). On this same VPS I also have my website running with Apache.

uname -a

Linux xxxx-xxxxx.server.name 4.18.0-305.19.1.el8_4.x86_64 #1 SMP Wed Sep 15 11:28:53 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

hostnamectl

   Static hostname: xxxx-xxxxx.server.name
         Icon name: computer-vm
           Chassis: vm
        Machine ID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
           Boot ID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    Virtualization: kvm
  Operating System: AlmaLinux 8.4 (Electric Cheetah)
       CPE OS Name: cpe:/o:almalinux:almalinux:8.4:GA
            Kernel: Linux 4.18.0-305.19.1.el8_4.x86_64
      Architecture: x86-64

Yes, I added gc_collect_cycles(); at the end of the onClose function.

I also leave a couple of screenshots of Munin showing sudden CPU usage:

(Screenshots: cpu-day, cpu-week)

Thanks for your reply.

zhiyong-ft commented 2 years ago

This sort of thing is really hard to track down. If I were you, I would just set up the server(s) on a dedicated bare-metal box; most likely your headache will just go away. Those from OVH are rather inexpensive.

ollm commented 2 years ago

Thanks for the suggestion. For now I'm going to wait; chat is a fairly secondary function in my application and at the moment it is not a big problem.

Thanks for all replies.

acadjsr commented 2 years ago

Hi ollm,

Have you resolved this? I'm also having this issue with 100% CPU load.

clue commented 2 years ago

@ollm Thanks for looking into this, this definitely shouldn't have happened! I did take a look at your strace and xdebug logs and it looks like the server may indeed be "stuck" on a "dead" connection, continuously trying to write some data on each loop tick and always getting an EPIPE/SIGPIPE in return.

epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10400) = 1
epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0
write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} ---
epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10400) = 1
epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0
write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} ---
epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10400) = 1
epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0
write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} ---

(Note to self: The underlying epoll calls seem to confirm you're correctly using ExtEvLoop on a Linux-based system.)

Translating the above snippet to ReactPHP's logic means the event loop (correctly) identified this connection as writable and notified the outgoing stream to (try to) write some data which reports an error message on the syscall level (all correct for dead connections). As rightfully identified, the stream component is responsible for detecting this error condition and properly closing the socket connection in this case. For some reason, this logic doesn't seem to trigger, so the stream accepts an empty write and retries in the next loop iteration ad infinitum, thus causing 100% CPU usage.

It looks like this issue might be related to my changes introduced in https://github.com/reactphp/stream/pull/150 in response to https://github.com/reactphp/stream/pull/149. In this case, it would be helpful for further debugging if you could dump the contents of $sent and $error before this line: https://github.com/reactphp/stream/blob/70d6e15d5f90730651558852c74fbb767fd9215b/src/WritableResourceStream.php#L139

We have reproducible test cases for the existing behavior and I would love to come up with a reproducible test case for your specific error. The test suite contains a number of tests for broken pipes on various platforms and before seeing this very ticket here, I would have expected this to be a non-issue. In either case, thanks for reporting this high quality ticket!

zhiyong-ft commented 2 years ago

@clue This sounds very serious. Should we revert to a certain commit for production servers while you are working on this issue?

clue commented 2 years ago

@zhiyong-ft No reason to panic. To the best of our knowledge this is a very rare error condition. The PRs linked above were merged more than a year ago and we are not currently aware of any other reports running into this, despite ReactPHP being used in production in countless applications.

If this condition occurs, you will see load spikes, but the server should stay responsive otherwise (there's no crash or known DoS afaict). If you manage to run into this, please help us identify the root cause (see previous message) and I'm happy to look into this. Reverting to an old version doesn't seem sensible at this point as the previous version had other known issues (again linked in previous message) that are in turn known to be fixed in the current version.

ollm commented 2 years ago

It looks like this issue might be related to my changes introduced in https://github.com/reactphp/stream/pull/150 in response to https://github.com/reactphp/stream/pull/149. In this case, it would be helpful for further debugging if you could dump the contents of $sent and $error before this line: https://github.com/reactphp/stream/blob/70d6e15d5f90730651558852c74fbb767fd9215b/src/WritableResourceStream.php#L139

@clue I've prepared the code so that I can dump $sent and $error when the overload occurs again, but on my server it can take days or weeks until it happens (the last time was 3 days ago), so I don't know when I'll have the results.

acadjsr commented 2 years ago

I'm having the same issue.

This is my strace:

--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2439225, si_uid=1000} ---
epoll_wait(4, [{EPOLLOUT|EPOLLHUP, {u32=18, u64=43499428773906}}], 64, 4650) = 1
epoll_ctl(4, EPOLL_CTL_MOD, 18, {EPOLLOUT, {u32=18, u64=43499428773906}}) = 0
write(18, "\27\3\3\0\25\334\316\177\302O\325\327\322)\2\204\5\227\234\336\366\354\355\342E\5", 26) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2439225, si_uid=1000} ---
epoll_wait(4, [{EPOLLOUT|EPOLLHUP, {u32=18, u64=43499428773906}}], 64, 4650) = 1
epoll_ctl(4, EPOLL_CTL_MOD, 18, {EPOLLOUT, {u32=18, u64=43499428773906}}) = 0
write(18, "\27\3\3\0\25\334\316\177\302O\325\327\322)\2\204\5\227\234\336\366\354\355\342E\5", 26) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2439225, si_uid=1000} ---
epoll_wait(4, [{EPOLLOUT|EPOLLHUP, {u32=18, u64=43499428773906}}], 64, 4650) = 1
epoll_ctl(4, EPOLL_CTL_MOD, 18, {EPOLLOUT, {u32=18, u64=43499428773906}}) = 0
write(18, "\27\3\3\0\25\334\316\177\302O\325\327\322)\2\204\5\227\234\336\366\354\355\342E\5", 26) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2439225, si_uid=1000} ---
epoll_wait(4, [{EPOLLOUT|EPOLLHUP, {u32=18, u64=43499428773906}}], 64, 4650) = 1
epoll_ctl(4, EPOLL_CTL_MOD, 18, {EPOLLOUT, {u32=18, u64=43499428773906}}) = 0
write(18, "\27\3\3\0\25\334\316\177\302O\325\327\322)\2\204\5\227\234\336\366\354\355\342E\5", 26) = -1 EPIPE (Broken pipe)

Linux: RedHat 8.5
Event loop: ExtEvLoop (This was happening with ExtEventLoop and StreamSelectLoop too.)

composer show -i

cboden/ratchet                 v0.4.4  PHP WebSocket library
evenement/evenement            v3.0.1  Événement is a very simple event dispatching library for PHP
guzzlehttp/psr7                2.2.1   PSR-7 message implementation that also provides common utility methods
psr/http-factory               1.0.1   Common interfaces for PSR-7 HTTP message factories
psr/http-message               1.0.1   Common interface for HTTP messages
ralouphie/getallheaders        3.0.3   A polyfill for getallheaders.
ratchet/rfc6455                v0.3.1  RFC6455 WebSocket protocol handler
react/cache                    v1.1.1  Async, Promise-based cache interface for ReactPHP
react/dns                      v1.9.0  Async DNS resolver for ReactPHP
react/event-loop               v1.3.0  ReactPHP's core reactor event loop that libraries can use for evented I/O.
react/promise                  v2.9.0  A lightweight implementation of CommonJS Promises/A for PHP
react/promise-timer            v1.8.0  A trivial implementation of timeouts for Promises, built on top of ReactPHP.
react/socket                   v1.11.0 Async, streaming plaintext TCP/IP and secure TLS socket server and client connections for ReactPHP
react/stream                   v1.2.0  Event-driven readable and writable streams for non-blocking I/O in ReactPHP
react/zmq                      v0.4.0  ZeroMQ bindings for React.
symfony/deprecation-contracts  v2.5.1  A generic function and convention to trigger deprecation notices
symfony/http-foundation        v5.4.6  Defines an object-oriented layer for the HTTP specification
symfony/polyfill-mbstring      v1.25.0 Symfony polyfill for the Mbstring extension
symfony/polyfill-php80         v1.25.0 Symfony polyfill backporting some PHP 8.0+ features to lower PHP versions
symfony/routing                v5.4.3  Maps an HTTP request to a set of configuration variables

PHP version

PHP 7.2.24 (cli) (built: Oct 22 2019 08:28:36) ( NTS )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.2.0, Copyright (c) 1998-2018 Zend Technologies
    with Zend OPcache v7.2.24, Copyright (c) 1999-2018, by Zend Technologies

acadjsr commented 2 years ago

could dump the contents of $sent and $error before this line: https://github.com/reactphp/stream/blob/70d6e15d5f90730651558852c74fbb767fd9215b/src/WritableResourceStream.php#L139

I added this line: echo "sent >".json_encode($sent)."< error >".json_encode($error)."<\n";

Before this line: if (($sent === 0 || $sent === false) && $error !== null) {

The output when CPU usage is 100%:

sent >0< error >null<
sent >0< error >null<
sent >0< error >null<
....
sent >0< error >null<

clue commented 2 years ago

@acadjsr Thanks for providing these details, this definitely helps us pinpoint this issue!

I added this line: echo "sent >".json_encode($sent)."< error >".json_encode($error)."<\n";

Before this line: if (($sent === 0 || $sent === false) && $error !== null) {

The output when CPU usage is 100%: sent >0< error >null<

This is interesting and confirms my suspicion that the underlying write() syscall fails and reports an error, but PHP's fwrite() function doesn't actually report an error in this case. Because PHP doesn't report an error to us, our logic assumes the stream is still valid and keeps retrying the write until the stream is closed some other way (timeouts or process restarts).

I've looked into this and have some ideas how this could be addressed, but still have a hard time trying to come up with a reproducible test case for this scenario. It looks like you've been able to reproduce this quite easily, so perhaps you can help us figure out how to repeat this process?

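From my understanding this logic triggers only in this specific case:

  • trying to write data to a connection that is already closed on the remote side (see EPIPE error) and

  • reading from the connection is paused (event loop only reports stream as writable and doesn't check for readability, possibly due to end() call?) and

  • encrypted TLS connections (see binary data header and syscall write() vs sendto()) and

  • some other unknown condition?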

We do in fact have tests for this specific scenario in place and so far I'm unable to reproduce this locally. So that leaves me wondering what's the missing piece here?

acadjsr commented 2 years ago

I can not reproduce it. It is occurring in my production server every 2 days or sooner.

zhiyong-ft commented 2 years ago

trying to write data to a connection that is already closed on the remote side (see EPIPE error) and

@clue do you have a way to make a connection close on the client side but not on the server? In my experience this is not easy to accomplish. WS test clients like the plugin for Chrome will always gracefully close open connections before shutting down. One way would be to either kill the test client process or unplug the power of the computer the test client is running on.

If you are familiar with embedded systems, I can also provide an embedded ws client. With that it is a lot easier and faster to ungracefully shut down a client and then start it again.

WyriHaximus commented 2 years ago

a kill -9 could do the trick

zhiyong-ft commented 2 years ago

@WyriHaximus is this command for Linux? End task in Task Manager on Windows 11 doesn't work: the connection on the server will still be closed (presumably requested by the OS). So far the only reliable way I have is to power off my embedded system.

WyriHaximus commented 2 years ago

@zhiyong-ft That's for Linux, that will kill the process hard without giving it time to clean up. Which hopefully has the same effect as powering off your embedded system by pulling the plug

ollm commented 2 years ago

These are the values of $sent and $error when the process is at 100% CPU; they are the same values as @acadjsr's.

$sent: int(0)
$error: NULL

$sent: int(0)
$error: NULL

$sent: int(0)
$error: NULL

I've tried to kill the local websocket created with wsstat or wscat, but I can't reproduce the error in the server, I've also tried to disconnect the network connection while the websocket in the browser is connected (Also wsstat and wscat), but it doesn't work either.

zhiyong-ft commented 2 years ago

@WyriHaximus @ollm I just tried kill -9 <PID> on Ubuntu 20.04. Once node/wscat is killed, the connection is also closed on the server side. I track the open/close times and a bunch of other stuff for all ws connections. So you will have to try something else to test this, like ripping the battery out of a laptop and then unplugging the power cord. Or send me the URL and I can help test with an embedded client.

WyriHaximus commented 2 years ago

Bummer, looking for a way to reproduce this with software, without the need to hard pull a device.

zhiyong-ft commented 2 years ago

I think sockets are most likely managed by the OS itself. Once a process that uses a socket is killed, the OS will just close the underlying socket by default.

@WyriHaximus see if you can get some embedded system, such as one of those STM32 Nucleo boards. I will be happy to help you guys set up a test client. You push a button to reset the embedded system, and now you have an orphaned connection on your server.

Or you can send me your ws/wss URL and I can connect to your server and reset.

Bottom line, I am more than happy to help you guys test and fix this. An issue like this wakes me up in the middle of the night.

clue commented 2 years ago

trying to write data to a connection that is already closed on the remote side (see EPIPE error) and

@clue do you have a way to make a connection close on the client side but not on the server? In my experience this is not easy to accomplish. WS test clients like the plugin for Chrome will always gracefully close open connections before shutting down. One way would be to either kill the test client process or unplug the power of the computer the test client is running on.

@zhiyong-ft Just to be clear, are you saying you can reproduce this error condition consistently by forcefully closing the client connection (i.e. disconnecting without a proper socket shutdown / sending FIN)?

zhiyong-ft commented 2 years ago

@clue I can reliably turn off ws client while keeping socket open on server side for several hours. I didn't try to reproduce the issue itself. Don't have the brain bandwidth at the moment. Have been pulling all-nighters on some other stuff :)

clue commented 2 years ago

@clue I can reliably turn off ws client while keeping socket open on server side for several hours. I didn't try to reproduce the issue itself.

@zhiyong-ft I'm not sure I follow exactly what you're trying to achieve, but it sounds like this would be unrelated to this ticket. If you're seeing any other error or have any other questions, perhaps it makes sense to file this as a separate ticket?

zhiyong-ft commented 2 years ago

@clue actually I think it is relevant. You mentioned that having a connection/socket open on the server side but closed on the client side is a requirement for this issue to happen. From my experience, this (socket open on server side but closed on client side) is not easy to achieve on most common setups, hence the issue doesn't happen that often. @WyriHaximus thought he could do it by forcefully closing the client process, but that doesn't actually achieve this. Essentially, what I am trying to offer is a way to help reproduce the issue, not an issue in itself, so it doesn't warrant a separate ticket.

acadjsr commented 2 years ago

I added this code to capture what is being written in this case. Will let you know when this occurs.

        if (($sent == 0) && ($error == null))
        {
            $this->emit('error', array(new \RuntimeException("DEBUG: Unable to write to stream with data = '" . json_encode($this->data) . "' and writeChunkSize = '" . json_encode($this->writeChunkSize) . "'")));
        }

clue commented 2 years ago

@clue actually I think it is relevant. You mentioned that having a connection/socket open on the server side but closed on the client side is a requirement for this issue to happen. From my experience, this (socket open on server side but closed on client side) is not easy to achieve on most common setups, hence the issue doesn't happen that often. […]

@zhiyong-ft I think you might have misread me and I don't currently see how this would be related to the issue. A client that closes the connection is a pretty normal thing and the fact that the server detects this some time later is part of normal operation. This is also covered by our automated test suite and covered by years of production use of ReactPHP, so we're not aware of any issues in this regard.

From my understanding this logic triggers only in this specific case:

  • trying to write data to a connection that is already closed on the remote side (see EPIPE error) and

  • reading from the connection is paused (event loop only reports stream as writable and doesn't check for readability, possibly due to end() call?) and

  • encrypted TLS connections (see binary data header and syscall write() vs sendto()) and

  • some other unknown condition?

Self-quoting here to emphasize the first point is not problematic, but also not too common. Usually, a client connection would be readable at all times, so the server would detect a disconnect as an empty read and then trigger a close event automatically. The above debugging output suggests the client is not in a readable state however. This is not problematic at all, but still an interesting insight because it would only be triggered by a previous pause() or end() call. This means the server has no way to detect the disconnected client until it executes the next socket operation. In this instance, the outgoing write should fail (as expected) and then trigger a close event. We have a number of test cases that verify this behavior, so again this is known to work.

However, after analyzing the above debugging output, we can see that the underlying write() reports an error (expected), but PHP doesn't report this error to us (unexpected). As a result, our logic keeps retrying and thus causes high CPU load. We have yet to figure out what is causing this behavior. Once we have a better understanding of this situation, a permanent fix should be rather straightforward.

If you can see unreasonably high CPU usage and have a way to reproduce this (even if it doesn't happen 100% and no matter how complex this setup might be), by all means, please report back and I'm happy to help track down the underlying issue.

zhiyong-ft commented 2 years ago

@clue thanks for taking the time to explain, I don't fully understand the issue so I will take your word. Hopefully my comments were not distracting.

Still I will further comment on your statement below.

Self-quoting here to emphasize the first point is not problematic, but also not too common.

In my setup, this is actually fairly common. When those small IoT devices connect to a WS server via unreliable networks, this happens on a daily basis. Compound this with tens of thousands of devices concurrently connected to the server and you get the idea. This is usually a job offloaded to an AWS/Azure IoT gateway or MQTT server, which is more or less the same as WS. But in specific circumstances, Ratchet is preferred.

I will try to help somehow once I get out of my current bind.

acadjsr commented 2 years ago

I added this code to capture what is being written in this case. Will let you know when this occurs.

        if (($sent == 0) && ($error == null))
        {
            $this->emit('error', array(new \RuntimeException("DEBUG: Unable to write to stream with data = '" . json_encode($this->data) . "' and writeChunkSize = '" . json_encode($this->writeChunkSize) . "'")));
        }

This has occurred. Here is the output: 2022/04/22 05:58:31 - An error has occurred: DEBUG: Unable to write to stream with data = '' and writeChunkSize = '-1'

So it seems that $this->data is empty string so $sent is zero. No error has occurred.

acadjsr commented 2 years ago

After the above occurred, I changed the code to check whether data is ''. Like this:

        if (($sent == 0) && ($error == null))
        {
            $this->emit('error', array(new \RuntimeException("DEBUG: Unable to write to stream with data = '" . json_encode($this->data) . "' and writeChunkSize = '" . json_encode($this->writeChunkSize) . "' where data === '' is ".(($this->data === '') ? "true" : "false")." and data == '' is ".(($this->data == '') ? "true" : "false"))));
        }

Here are the results, (note the bold):

2022/04/23 22:47:37 - An error has occurred: DEBUG: Unable to write to stream with data = '' and writeChunkSize = '-1' where data === '' is false and data == '' is false

So the issue is that the data is either not a string or not an empty string, so the code below never runs.

if ($this->data === '') {
...
}

Edit: I just put var_dump($this->data); to capture more info about it. Will let you know the results.

clue commented 2 years ago

@clue thanks for taking the time to explain, I don't fully understand the issue so I will take your word. Hopefully my comments were not distracting.

@zhiyong-ft Don't worry. Please report back if this turns out to be related to this issue and I'm happy to look into this again!

So it seems that $this->data is empty string so $sent is zero. No error has occurred. […] 2022/04/23 22:47:37 - An error has occurred: DEBUG: Unable to write to stream with data = '' and writeChunkSize = '-1' where data === '' is false and data == '' is false

@acadjsr It looks like this might be related to how you're using json_encode(), which might return a (bool) false which would then cast to a (string) "". This happens if the data cannot be encoded (in particular for binary data). Can you update this to use JSON_INVALID_UTF8_SUBSTITUTE or bin2hex() or base64_encode()? The output of stream_get_meta_data() and feof() might also be interesting.
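To illustrate the json_encode() pitfall described above, here is a minimal sketch. It is not code from react/stream: the variable names are made up, and the four bytes are borrowed from the bin2hex output reported later in this thread as a stand-in for the buffered data. JSON_INVALID_UTF8_SUBSTITUTE needs PHP 7.2 or newer.

```php
<?php
// Hypothetical stand-in for the stream's buffered data: four raw bytes
// that do not form valid UTF-8.
$data = "\x88\x02\x03\xe8";

// json_encode() returns false for invalid UTF-8 input, and false turns
// into an empty string when concatenated into a log message.
$plain = json_encode($data);
$message = "data = '" . $plain . "'";
var_dump($plain); // bool(false)
echo $message, "\n"; // data = ''

// Encodings that survive binary input and are safe to log:
$substituted = json_encode($data, JSON_INVALID_UTF8_SUBSTITUTE);
$hex = bin2hex($data);
$base64 = base64_encode($data);
echo $substituted, "\n", $hex, "\n", $base64, "\n";
```

With the default flags the log line looks as if the buffer were empty even though strlen($data) is 4, which would explain the misleading "data = ''" output.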

TechOverflow commented 2 years ago

I am also experiencing this problem, but CPU load is fine. Not sure if related. Instead, I have set up a very simple reply on the server. When client sends a message, the server looks into the database and replies. The problem is that once there are multiple clients (50+), the reply takes anywhere from 0 to 50 seconds. If I send multiple messages from a test client, there is no reply at first, but then all of a sudden the replies to all messages come as one burst. It is very odd. The delay is not happening due to the database. It's as if the message buffer gets stuck. I fairly often get Unable to write to stream with data = '' and writeChunkSize = '-1' using the code above.

TechOverflow commented 2 years ago

And with JSON_INVALID_UTF8_SUBSTITUTE I get Error Exception: Unable to write to stream with data = '"\ufffd\u0002\u0003\ufffd"' and writeChunkSize = '-1'

clue commented 2 years ago

I am also experiencing this problem, but CPU load is fine. Not sure if related.

@TechOverflow I don't think this is related, as this specific instance here deals with high CPU load due to failing writes on the socket layer. Can you file your problem as a separate ticket? Thank you.

TechOverflow commented 2 years ago

Hi @clue, Yes it was a mistake in my code. Sorry for the clutter.

acadjsr commented 2 years ago

So it seems that $this->data is empty string so $sent is zero. No error has occurred. […] 2022/04/23 22:47:37 - An error has occurred: DEBUG: Unable to write to stream with data = '' and writeChunkSize = '-1' where data === '' is false and data == '' is false

@acadjsr It looks like this might be related to how you're using json_encode(), which might return a (bool) false which would then cast to a (string) "". This happens if the data cannot be encoded (in particular for binary data). Can you update this to use JSON_INVALID_UTF8_SUBSTITUTE or bin2hex() or base64_encode()? The output of stream_get_meta_data() and feof() might also be interesting.

I have JavaScript code (which runs in the user's browser) that checks for a disconnected websocket and tries to reconnect if it is disconnected. A week ago I changed it to do this only when the page is visible, and I did not have this error for a week.

if (document.visibilityState === 'visible')

I figured out that something is wrong, I never had this long period without an error. So last night I reverted it back like it was and sure enough this morning I had 3 errors "Unable to write to stream" waiting for me. All 3 are the same. So here is the error output:

==========================
gettype: string
mb_detect_encoding: 
strlen: 4

Warning: mb_strlen(): Unknown encoding "" in ***************************/vendor/react/stream/src/WritableResourceStream.php on line 145
mb_strlen: 
bin2hex: 880203e8
base64_encode: iAID6A==
stream_get_meta_data: {"crypto":{"protocol":"UNKNOWN","cipher_name":"TLS_AES_256_GCM_SHA384","cipher_bits":256,"cipher_version":"TLSv1.3"},"timed_out":false,"blocked":false,"eof":true,"stream_type":"tcp_socket\/ssl","mode":"r+","unread_bytes":0,"seekable":false}
feof: true
2022/05/02 02:04:35 - An error has occurred: DEBUG: Unable to write to stream with data = '"\ufffd\u0002\u0003\ufffd"' and writeChunkSize = '-1' where data === '' is false and data == '' is false

==========================

Here is the code:

        if (($sent == 0) && ($error == null))
        {
            echo "\n\n\n==========================\n";
            echo "gettype: ".gettype($this->data)."\n";
            if (gettype($this->data) == "string")
            {
                echo "mb_detect_encoding: ".mb_detect_encoding($this->data)."\n";
                echo "strlen: ".strlen($this->data)."\n";
                echo "mb_strlen: ".mb_strlen($this->data, mb_detect_encoding($this->data))."\n";
            }
            echo "bin2hex: ".bin2hex($this->data)."\n";
            echo "base64_encode: ".base64_encode($this->data)."\n";
            $meta = \stream_get_meta_data($this->stream);
            echo "stream_get_meta_data: ".json_encode($meta, JSON_INVALID_UTF8_SUBSTITUTE)."\n";
            echo "feof: ".json_encode(\feof($this->stream))."\n";
            $this->emit('error', array(new \RuntimeException("DEBUG: Unable to write to stream with data = '" . json_encode($this->data, JSON_INVALID_UTF8_SUBSTITUTE) . "' and writeChunkSize = '" . json_encode($this->writeChunkSize) . "' where data === '' is ".(($this->data === '') ? "true" : "false")." and data == '' is ".(($this->data == '') ? "true" : "false"))));
            echo "\n==========================\n";
        }

clue commented 2 years ago

[…] So here is the error output:

@acadjsr Thank you very much, this definitely helps tracking this down!

feof: true

This is good news because this suggests we should be able to detect this error condition quite easily. While the underlying syscall reports an error when writing to a closed stream (expected), PHP doesn't report this error to us (unexpected), but PHP at least reports the stream as closed. This means that inside ReactPHP we should be able to raise an error manually in this case and gracefully close the local side of the connection as well.
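As a rough sketch of the idea (an assumption about what such a check could look like, not the actual react/stream patch), the decision after a write attempt can be expressed as a small pure function. The function name classifyWrite() and its return labels are hypothetical. The first condition mirrors the existing check quoted earlier in this thread, and the second adds the feof() result as an extra signal.

```php
<?php
/**
 * Hypothetical helper, not part of react/stream.
 *
 * @param int|false   $sent  return value of fwrite()
 * @param string|null $error PHP error message captured during the write, if any
 * @param bool        $eof   result of feof() on the stream after the write
 */
function classifyWrite($sent, ?string $error, bool $eof): string
{
    $nothingWritten = ($sent === 0 || $sent === false);

    // Existing case: nothing was written and PHP reported an error.
    if ($nothingWritten && $error !== null) {
        return 'reported-error';
    }

    // Case from this thread: nothing written, no PHP error, but the stream
    // is already at EOF. Treat it as a failed write instead of retrying.
    if ($nothingWritten && $eof) {
        return 'silent-eof';
    }

    // Nothing written but the stream still looks alive: wait for the next
    // writable event and try again.
    if ($nothingWritten) {
        return 'retry';
    }

    return 'written';
}

// Values reported in this thread: $sent = 0, $error = null, feof = true.
echo classifyWrite(0, null, true), "\n"; // silent-eof
```

Treating 'silent-eof' the same way as 'reported-error' (emit an error and close the stream) would break the retry loop. Whether that is safe for every stream type is exactly what a reproducible test case would need to confirm.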

bin2hex: 880203e8

This is interesting because it suggests the server is trying to send a normal close message to the client (OP_CLOSE = 8 with CLOSE_NORMAL = 1000). This also seems to be in line with everything that's been reported as part of this ticket so far.
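For reference, the four bytes can be decoded by hand following the RFC 6455 frame layout. This is a minimal sketch that only handles a short, unmasked frame like the one above, and the variable names are illustrative.

```php
<?php
// Bytes taken from the bin2hex output quoted above.
$frame = hex2bin('880203e8');

$first = ord($frame[0]);
$second = ord($frame[1]);

$fin = ($first & 0x80) !== 0;     // FIN bit: this is the final fragment
$opcode = $first & 0x0F;          // 0x8 means connection close
$masked = ($second & 0x80) !== 0; // server-to-client frames are not masked
$length = $second & 0x7F;         // payload length (values below 126 are literal)

// The payload of a close frame starts with a 2-byte big-endian status code.
$status = unpack('n', substr($frame, 2, 2))[1];

printf(
    "fin=%s opcode=0x%X masked=%s length=%d status=%d\n",
    var_export($fin, true),
    $opcode,
    var_export($masked, true),
    $length,
    $status
);
// prints: fin=true opcode=0x8 masked=false length=2 status=1000
```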

I figured out that something is wrong, I never had this long period without an error. So last night I reverted it back like it was and sure enough this morning I had 3 errors "Unable to write to stream" waiting for me

Interesting! With the above debugging, it looks like this might indeed be related to connections that are about to close. It's currently my understanding that the client is closing the connection by informing the server, which in turn wants to send a close frame to the client, which is no longer possible at this point. The server should detect this situation, abort the outgoing write and just consider the connection dead at this point, but for some reason ends up in a loop because PHP doesn't report the write as failing. The above debugging output also suggests this only happens for TLS connections and probably works just fine for plaintext connections.

I'm still trying to find an easy way to reproduce this locally to make sure a possible patch fixes this once and for all. Does the above help reproduce the problem?

acadjsr commented 2 years ago

I can not reproduce this.

fx1234 commented 2 years ago

We are experiencing the same problem

(Attached image: load-day)

There is nothing else that stands out: no errors or anything like that. The server continues to accept connections.

Operating System: Debian GNU/Linux 11 (bullseye)
PHP: PHP 7.4.28 (cli) (built: Feb 17 2022 16:17:19)
Loop: object(React\EventLoop\ExtEvLoop)

acadjsr commented 2 years ago

I fixed this by reverting this commit: https://github.com/reactphp/stream/pull/150/commits/3eb342d87ca89e0c4c7c428505f36051c172f677

fx1234 commented 2 years ago

Can anybody confirm this? I don't understand ReactPHP well enough for this. We would like to use Ratchet in production, but at the moment we have to restart the server 2-3 times a day.

What do I have to do to test what @acadjsr wrote? Replace the green lines (right) with the red lines (left)?

clue commented 2 years ago

I fixed this by reverting this commit: reactphp/stream@3eb342d

@acadjsr Reverting to an old version doesn't seem sensible at this point as the previous version had other known issues that are in turn known to be fixed in the current version (see above https://github.com/ratchetphp/Ratchet/issues/939#issuecomment-1098798274).

Can anybody confirm this? […]

@fx1234 This may fix this specific error but may cause other issues (see above), so I do not recommend this for production use. Perhaps you can help us pinpoint this issue instead.

I'm still trying to find an easy way to reproduce this locally to make sure a possible patch fixes this once and for all. Can anybody reliably reproduce the problem so far with the above instructions?

fx1234 commented 2 years ago

Unfortunately I don't think I can be of any help here.

I don't see any pattern. Sometimes it happens after 12 hours, sometimes after 30 minutes. Sometimes 1000+ clients are connected, sometimes it's only 100