swoole / swoole-src

🚀 Coroutine-based concurrency library for PHP
https://www.swoole.com
Apache License 2.0

swManager_check_exit_status signal=9 error #2018

Closed ozgurhangisi closed 5 years ago

ozgurhangisi commented 5 years ago

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a simple script for reproducing the error.

The Swoole server works well for a while, then it suddenly stops responding and I don't know why. I checked CPU usage, memory, open file descriptor count, request count, etc., and nothing seems to be wrong. I get the errors below:

[2018-10-02 01:03:38 $1972.0] WARNING swManager_check_exit_status: worker#0 abnormal exit, status=0, signal=9

After this error I get many "WARNING swManager_check_exit_status: worker#0 abnormal exit, status=0, signal=11" errors for around 3 hours. The last message is "WARNING swServer_signal_hanlder: Fatal Error: manager process exit. status=0, signal=9".

When I restart the server, everything is normal for 3-4 days and then it happens again.

  2. What did you expect to see?

I expect the Swoole server to keep running properly.

  3. What did you see instead?

Error messages in the log file.

  4. What version of Swoole are you using (show your php --ri swoole)?

swoole support => enabled
Version => 4.2.1
Author => Swoole Group[email: team@swoole.com]
coroutine => enabled
trace-log => enabled
epoll => enabled
eventfd => enabled
signalfd => enabled
cpu affinity => enabled
spinlock => enabled
rwlock => enabled
sockets => enabled
openssl => enabled
http2 => enabled
pcre => enabled
zlib => enabled
mutex_timedlock => enabled
pthread_barrier => enabled
futex => enabled
mysqlnd => enabled
redis client => enabled

Directive => Local Value => Master Value
swoole.enable_coroutine => On => On
swoole.aio_thread_num => 2 => 2
swoole.display_errors => On => On
swoole.use_namespace => On => On
swoole.use_shortname => On => On
swoole.fast_serialize => Off => Off
swoole.unixsock_buffer_size => 8388608 => 8388608

  5. What is your machine environment (including the versions of kernel, php, and gcc)?

AWS Linux t2.micro with 1 GB RAM.
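
Editor's sketch: the report mentions one signal=9 line followed by hours of signal=11 lines. A quick way to see that mix is to tally the exit warnings by signal name. The script below is a minimal sketch, not part of Swoole. It assumes a POSIX host (so that signal 9 and 11 map to SIGKILL and SIGSEGV), and the helper names `signal_name` and `tally_exit_signals` are made up for illustration. The sample lines are the three messages quoted in the report.

```python
import re
import signal
from collections import Counter

# Matches both message shapes quoted in the report:
#   "... worker#0 abnormal exit, status=0, signal=9"
#   "... manager process exit. status=0, signal=9"
EXIT_RE = re.compile(r"exit[.,] status=(\d+), signal=(\d+)")


def signal_name(number):
    """Return the symbolic name of a signal number, or the number as text."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return str(number)


def tally_exit_signals(lines):
    """Count process-exit warnings per signal name."""
    counts = Counter()
    for line in lines:
        match = EXIT_RE.search(line)
        if match:
            counts[signal_name(int(match.group(2)))] += 1
    return counts


# The three messages quoted in the report above.
sample = [
    "[2018-10-02 01:03:38 $1972.0] WARNING swManager_check_exit_status: "
    "worker#0 abnormal exit, status=0, signal=9",
    "WARNING swManager_check_exit_status: worker#0 abnormal exit, status=0, signal=11",
    "WARNING swServer_signal_hanlder: Fatal Error: manager process exit. status=0, signal=9.",
]

for name, count in tally_exit_signals(sample).most_common():
    print(name, count)
```

To run it against a real log, pass an open file object instead of `sample`, since the function only needs an iterable of lines.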

ghost commented 5 years ago

Hmm. My Swoole server has been running for about as long, precisely to test such things, but there were no problems.

Thanks for the report, I will do a longer run :+1:

By the way: https://wiki.swoole.com/wiki/page/p-LinuxSignal.html https://wiki.swoole.com/wiki/page/172.html

ozgurhangisi commented 5 years ago

Thanks for the link. According to the link you sent, the error seems to be related to file descriptors. It generally happens at night, between 03:00 and 06:00, when we don't have much traffic. I checked the open file count on the Swoole server with the lsof | wc -l command. At peak time it's around 2000, but when we had the problem it was 1413. I have 3 Swoole servers. All 3 get (almost) equal traffic from the load balancer, but some of them stop after 2 days and some after 1 week. The error is not consistent, and everything else looks normal on the server.

We have ~60 nginx php-fpm servers and we are trying to convert them to Swoole, but we haven't been able to solve this problem or find its cause.
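
Editor's sketch: `lsof | wc -l` counts rows for the whole system, including entries such as working directories and memory-mapped files, so it is not directly comparable to a per-process descriptor limit. A per-process check on Linux can read `/proc` instead. The following is a minimal sketch under that assumption. The function names are hypothetical, and `os.getpid()` is only a stand-in for the PID of the Swoole master, manager, or worker process.

```python
import os


def count_open_fds(pid):
    """Return the number of open file descriptors of `pid`, or None if unavailable."""
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:  # no /proc (non-Linux), no such pid, or no permission
        return None


def soft_nofile_limit(pid):
    """Return the soft 'Max open files' limit of `pid`, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/limits") as handle:
            for line in handle:
                if line.startswith("Max open files"):
                    # Columns: name (3 words), soft limit, hard limit, units
                    soft = line.split()[3]
                    return None if soft == "unlimited" else int(soft)
    except OSError:
        pass
    return None


pid = os.getpid()  # stand-in: replace with the PID of the Swoole process
print(pid, count_open_fds(pid), soft_nofile_limit(pid))
```

If the first number stays well below the second while the server is failing, descriptor exhaustion becomes a less likely explanation.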

ozgurhangisi commented 5 years ago

By the way, this error always happens on the Swoole servers. If you want me to check something on the broken server, I can send you the information you need.

ghost commented 5 years ago

If it is possible for you, the issue template at https://github.com/swoole/swoole-src/issues/2000 is meant for such cases. It helps you provide the necessary information and get faster support from the Swoole team :)

twose commented 5 years ago

Signal 9 is SIGKILL. It's not a bug; it's probably a mistake on your side. Is some other process killing your Swoole manager process? Check that; it may be an error in your code.

ozgurhangisi commented 5 years ago

Hi,

The code doesn't send any Linux signals. It just connects to memcached, Redis, the database, etc., and sends the results.

It's a standard AWS server; I only installed php72 and some extensions. I will check whether any other process sends SIGKILL to Swoole.

Thanks.

ghost commented 5 years ago

Can we catch it? Maybe this would help?

https://wiki.swoole.com/wiki/page/362.html

twose commented 5 years ago

@flddr

There are two signals which cannot be intercepted and handled: SIGKILL and SIGSTOP. https://en.wikipedia.org/wiki/Signal_(IPC)#SIGKILL

If you could catch it and ignore it, you might never be able to kill the process.
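
Editor's sketch: the point about SIGKILL and SIGSTOP can be checked from any language that exposes signal handlers. The snippet below uses Python's standard `signal` module on a POSIX system; it is an illustration and not Swoole code. The helper name `can_install_handler` is made up, and the check has to run in the main thread because Python only allows handler changes there.

```python
import signal


def can_install_handler(signum):
    """Return True if the OS lets this process install a handler for signum."""
    try:
        previous = signal.signal(signum, lambda number, frame: None)
    except OSError:
        return False  # the kernel rejects handlers for SIGKILL and SIGSTOP
    if previous is not None:
        signal.signal(signum, previous)  # restore the earlier handler
    return True


for sig in (signal.SIGTERM, signal.SIGKILL, signal.SIGSTOP):
    print(sig.name, can_install_handler(sig))
```

SIGTERM accepts a handler, while the attempts for SIGKILL and SIGSTOP are refused by the operating system, which is why a process cannot log or veto its own SIGKILL.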

ozgurhangisi commented 5 years ago

I checked my code again and it doesn't send any signals. Here is how I start the Swoole server. Maybe you can see something that I should fix:

$this->wisObjects['webserver']['obj'] = new swoole_http_server('127.0.0.1', 80);
$this->wisObjects['webserver']['obj']->set(['log_file'=>'/var/log/wisswoole']);
$this->wisObjects['webserver']['obj']->set(['worker_num'=>1]);
$this->wisObjects['webserver']['obj']->set(['open_tcp_nodelay'=>true]);
$this->wisObjects['webserver']['obj']->set(['daemonize'=>2]);
swoole_async_set([
    "enable_reuse_port" => true,
]);
$this->wisObjects['webserver']['obj']->on('request',[$this,'httpRequest']);
$this->wisObjects['webserver']['obj']->on('workerstart',[$this,'httpWorkerStart']);
$this->wisObjects['webserver']['obj']->on('workerstop',[$this,'httpWorkerStop']);
$this->wisObjects['webserver']['obj']->start();

ozgurhangisi commented 5 years ago

I will install the same code on a machine with 2 GB RAM and 2 CPUs. It currently runs on a machine with 1 GB RAM and 1 CPU. Our system gets a lot of traffic, so maybe the RAM or CPU is not enough for the processes.

I will let you know in a week whether it's related to the server's RAM or CPU.

Swoole is a lifesaver for companies that work with PHP and get thousands of requests per second. Our performance test results on the Swoole server are very good and I really want to use it. Sorry for bothering you so much, and thanks for the fast responses. We love Swoole :)

twose commented 5 years ago

@ozgurhangisi @flddr Sorry, I hadn't noticed that there is also signal 11. Are you on the latest version of Swoole? As @flddr says, read https://github.com/swoole/swoole-src/issues/2000#issuecomment-423807053 and trace your core file, or try valgrind. Because the English documentation is insufficient, I know many of you are using the asynchronous APIs, but that is not the most advanced approach: the asynchronous APIs are slightly less stable than coroutines. I just wanted to let you know, since I guess this is a possible cause.

ozgurhangisi commented 5 years ago

The Swoole version is 4.2.1. Yes, there are many signal=11 errors in the log file, but it starts with signal=9 and continues with signal=11 errors. I see the same pattern in the log files on all 3 Swoole servers.

I installed Swoole from the Remi yum package, but I will try the steps in the link.

ghost commented 5 years ago

@ozgurhangisi I guess signal 9 is sent by the OS because of resources, while signal 11 is a segfault, which is what twose is interested in handling correctly. So if you follow #2000 to get a core dump, they can look into this special case :+1:
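
Editor's sketch: if the operating system is sending SIGKILL because memory ran out, the Linux kernel normally logs an OOM-killer message that contains "Killed process <pid> (<name>)". The exact wording around it varies between kernel versions. The scan below is a minimal sketch on that assumption. The function name `find_oom_kills` is hypothetical, and the sample lines are made up purely to exercise the pattern; they are not from the reporter's machine. Real input would come from `dmesg` output or the system log.

```python
import re

# Kernel OOM-killer messages include "Killed process <pid> (<name>)".
OOM_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log_lines):
    """Return (pid, process_name) for every OOM-killer victim found."""
    victims = []
    for line in kernel_log_lines:
        match = OOM_RE.search(line)
        if match:
            victims.append((int(match.group(1)), match.group(2)))
    return victims


# Illustrative input only: hypothetical lines, not real logs from this issue.
sample = [
    "Out of memory: Kill process 4242 (php) score 912 or sacrifice child",
    "Killed process 4242 (php) total-vm:1048576kB, anon-rss:786432kB, file-rss:0kB",
    "eth0: link up",
]

print(find_oom_kills(sample))
```

A match whose PID or timestamp lines up with a signal=9 entry in the Swoole log would support the resource explanation. No match would point back toward another process sending the signal.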

ozgurhangisi commented 5 years ago

Hi,

Good news: it's been 6 days and the Swoole servers are working well on the 2 GB RAM machines. As flddr said, it may be related to server resources. I want to wait 1 more week, and then I will let you know whether the Swoole servers are still running correctly or have stopped working.

Thanks, Ozgur.

ghost commented 5 years ago

@ozgurhangisi That's great :+1: Only if it's possible and OK for you: when you have time, you could produce the core dump on the smaller machine to help the Swoole team find the bug behind

WARNING swManager_check_exit_status: worker#0 abnormal exit, status=0, signal=11

because this is a segfault :)

Have a look here if you want to help: https://github.com/swoole/swoole-src/issues/2002#issuecomment-423932192

ozgurhangisi commented 5 years ago

I just wanted to add some extra information for people who have the same problem, because it took me 4 months to solve it. :( I just updated to PHP 7.2.12 and the problem was solved. It's probably not related to Swoole. I believe the problem may be related to https://bugs.php.net/bug.php?id=76846

Thanks for your help, guys.