ratchetphp / Ratchet

Asynchronous WebSocket server
http://socketo.me
MIT License
6.28k stars 743 forks source link

Ratchet on 100% CPU load #939

Open ollm opened 2 years ago

ollm commented 2 years ago

Hi.

During the last month, I am having a problem with running a Ratchet server in a production environment, suddenly the process goes from normally using 0-1% of a CPU core to 98-100%, the websocket server still working but it takes longer to respond, this happens in a variety of ways, sometimes it takes a few hours to happen and others it takes days / weeks.

When it enters this state it does not exit again until I kill the process and supervisor starts a new one.

I am also checking the logic of my application so that it is not a problem on my part.

Apache version

Server Version: Apache/2.4.51 (cPanel) OpenSSL/1.1.1g mod_bwlimited/1.4
Server MPM: event
Server Built: Nov 7 2021 17:32:31

PHP version

PHP 8.0.12 (cli) (built: Nov 10 2021 01:18:26) ( NTS )
Copyright (c) The PHP Group
Zend Engine v4.0.12, Copyright (c) Zend Technologies
    with Zend OPcache v8.0.12, Copyright (c), by Zend Technologies

Ratchet version being used

cboden/ratchet v0.4.4 PHP WebSocket library
├──guzzlehttp/psr7 v1.8.3
├──ratchet/rfc6455 v0.3.1
├──react/event-loop v1.2.0
├──react/socket v1.10.0
├──symfony/http-foundation v6.0.1
└──symfony/routing v4.4.34

Event loop

ExtEvLoop
ev (1.1.4)

supervisor.conf

[unix_http_server]
file = /tmp/supervisor.sock

[supervisord]
logfile          = ./logs/supervisord.log
logfile_maxbytes = 50MB
logfile_backups  = 10
loglevel         = info
pidfile          = /tmp/supervisord.pid
nodaemon         = false
minfds           = 4096
minprocs         = 200

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[supervisorctl]
serverurl = unix:///tmp/supervisor.sock

[program:ratchet]
directory               = /home/username/public_html
command                 = bash -c "ulimit -n 10000; exec php chat.php"
process_name            = Chat
numprocs                = 1
autostart               = true
autorestart             = true
user                    = root
stdout_logfile          = ./logs/chat_info.log
stdout_logfile_maxbytes = 1MB
stderr_logfile          = ./logs/chat_error.log
stderr_logfile_maxbytes = 1MB

I have this code in the app, but there is no record in chat_info.log or chat_error.log

public function onError(ConnectionInterface $conn, \Exception $e)
{
    echo "An error has occurred: {$e->getMessage()}\n";

    $conn->close();
}

This is what appears if I run strace in the pid

strace

``` --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10400) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10400) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10400) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10400) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10400) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10399) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10399) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10399) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10399) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10399) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10398) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10398) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10398) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10398) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10398) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10397) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10397) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10397) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3517427, si_uid=0} --- epoll_wait(6, [{EPOLLOUT|EPOLLHUP, {u32=144, u64=3702261809296}}], 680, 10397) = 1 epoll_ctl(6, EPOLL_CTL_MOD, 144, {EPOLLOUT, {u32=144, u64=3702261809296}}) = 0 write(144, "\27\3\3\0\25\261\3\360_\3138\365>\260H\207\321X\226\361\206\240v\326:\351", 26) = -1 EPIPE (Broken pipe) ```

Thank you for your replies.

acadjsr commented 2 years ago

I fixed this by reverting this commit: reactphp/stream@3eb342d

@acadjsr Reverting to an old version doesn't seem sensible at this point as the previous version had other known issues that are in turn known to be fixed in the current version (see above #939 (comment)).

I missed to mention that I reverted only this file: WritableResourceStream.php So I'm not having this issue for 4 weeks.

fx1234 commented 2 years ago

We made the changes as suggested by @acadjsr and this is the first time we haven't had to restart the server in over 24 hours. So the problem is definitely to be found there

zhiyong-ft commented 2 years ago

@fx1234 @acadjsr what do you see in system logs when high CPU usage happens?

One of my dev server that runs rachet somehow started to have high CPU usage (50%, 1 core out of two completely used), and high ram usage (near 100%), not responding to remote SSH connection request. In kern.log I see an entry with many NULL with timestamp cooresponding to high usage of CPU/RAM.

At this point I am not sure this is caused by rachet, but rachet is the only service on this server. Is there any way to rule out?

Thanks.

acadjsr commented 2 years ago

I don't know. It did not happen even once in last 5 months in production after the solution: https://github.com/ratchetphp/Ratchet/issues/939#issuecomment-1132159140

zhiyong-ft commented 2 years ago

@acadjsr Thanks for the information. I think I too will apply the fix before putting it into production.

When high CPU issue happened, were you able to log in via SSH? Were other service such as Apache still responsive?

fx1234 commented 2 years ago

The same for me as for @acadjsr. I think the PHP process was at 100%, but I can't tell exactly

acadjsr commented 2 years ago

@acadjsr Thanks for the information. I think I too will apply the fix before putting it into production.

When high CPU issue happened, were you able to log in via SSH? Were other service such as Apache still responsive?

I was able to login in via SSH. Other services are still responsive.

clint0s commented 1 year ago

I'm really struggling with this issue too. Randomly (it seems) every 3 days to a month PHP consumes an entire CPU core (running Ratchet/ReactPHP). strace shows:

--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=995813, si_uid=1000} ---
pselect6(91, [5 7 15 36 45], [90], [], {tv_sec=29, tv_nsec=604598000}, NULL) = 1 (out [90], left {tv_sec=29, tv_nsec=604596023})
write(90, "\313\301\20$\226RuKn\263\241\336\30\234\2271\177\220K\21\v\340\245<r\7\355\261\237\26\rG"..., 3190) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=995813, si_uid=1000} ---
pselect6(91, [5 7 15 36 45], [90], [], {tv_sec=29, tv_nsec=604355000}, NULL) = 1 (out [90], left {tv_sec=29, tv_nsec=604353252})
write(90, "\313\301\20$\226RuKn\263\241\336\30\234\2271\177\220K\21\v\340\245<r\7\355\261\237\26\rG"..., 3190) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=995813, si_uid=1000} ---
pselect6(91, [5 7 15 36 45], [90], [], {tv_sec=29, tv_nsec=604117000}, NULL) = 1 (out [90], left {tv_sec=29, tv_nsec=604115286})
write(90, "\313\301\20$\226RuKn\263\241\336\30\234\2271\177\220K\21\v\340\245<r\7\355\261\237\26\rG"..., 3190) = -1 EPIPE (Broken pipe)

I can never tie this down to any event in my application's log. The websocket server application is still working while this happens but it is very slow to respond. The only solution is a restart of the Ratchet process, which is highly disruptive.

Edit: Under normal circumstances the process consumes < 5% CPU, even at busy times. This condition started happening around about a year ago in June 2022 (probably when I ran a composer update) - previous to that it never happened once.

acadjsr commented 1 year ago

Have you tried to revert this commit: https://github.com/reactphp/stream/commit/3eb342d87ca89e0c4c7c428505f36051c172f677

My Ratchet is rock solid and never failed (for more than a year) after the revert.

clint0s commented 1 year ago

Thanks @acadjsr I'll give it a go

SimonFrings commented 1 year ago

@clint0s Thanks for the additional input on this :+1:

Still working together with @clue on a solution for this, already spent multiple hours and we have yet to figure out a way to reproduce this issue.

Also note @clue's comment about reverting to an old version above:

Reverting to an old version doesn't seem sensible at this point as the previous version had other known issues that are in turn known to be fixed in the current version (see above https://github.com/ratchetphp/Ratchet/issues/939#issuecomment-1098798274).