shinken-solutions / shinken

Flexible and scalable monitoring framework
http://www.shinken-monitoring.org
GNU Affero General Public License v3.0

Shinken core services are crashing with no visible health issues on service state #2008

Open maltesh opened 3 years ago

maltesh commented 3 years ago

Hardware:

CPU: 24 cores
RAM: 24 GB
Shinken version: 2.0.3
Python version: 2.6.6
OS: CentOS 6.10

Hosts monitored: 409
Total services: 14600

About 60% of the service checks are health checks (WMI or WinRM) with a check interval of 5 to 15 minutes. About 3 to 5% of the service checks are HTTP health checks for RabbitMQ with a check interval of 1 minute and a notification interval of 1 minute.

It is a standalone machine and has not been scaled out. We are running:
a) a poller with min_worker set to 6 and max_worker set to 16
b) a reactionner with min_worker set to 4 and max_worker set to 12

Commonly seen in logs:

Reactionner Log:

File "/usr/lib/python2.6/site-packages/shinken/action.py", line 125, in execute
    return self.execute() ## OS specific part
File "/usr/lib/python2.6/site-packages/shinken/action.py", line 311, in execute
    preexec_fn=os.setsid)
File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
    errread, errwrite)
File "/usr/lib64/python2.6/subprocess.py", line 1238, in _execute_child
    raise child_exception
TypeError: execve() arg 2 must contain only strings
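For context: in Python 2 the second argument of os.execve() is the argument list of the command, so this TypeError usually means that one element of the command line handed to subprocess was not a plain string (as far as I know, an embedded NUL byte is reported with the same message). Below is a minimal sketch of a pre-flight check, written for modern Python. `find_bad_args` and the sample command are hypothetical and not part of Shinken.

```python
import shlex


def find_bad_args(argv):
    """Return (index, value) pairs for arguments an exec-style call would reject.

    Hypothetical helper: flags anything that is not a str, and any str
    that contains a NUL byte.
    """
    bad = []
    for index, arg in enumerate(argv):
        if not isinstance(arg, str) or "\x00" in arg:
            bad.append((index, arg))
    return bad


if __name__ == "__main__":
    # Made-up command line, for illustration only.
    argv = shlex.split("check_example --host localhost --port 15672")
    argv.append(None)  # simulate a macro that resolved to nothing
    for index, value in find_bad_args(argv):
        print("argument %d is not a usable string: %r" % (index, value))
```

Running a check like this on the resolved notification or event handler command could help narrow down which command definition produces the bad argument.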

Broker Log:

Error :   Back trace of this error: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/shinken/daemon.py", line 864, in http_daemon_thread
    self.http_daemon.run()
  File "/usr/lib/python2.6/site-packages/shinken/http_daemon.py", line 283, in run
    self.srv.run()
  File "/usr/lib/python2.6/site-packages/shinken/http_daemon.py", line 123, in run
    raise PortNotFree(msg)
PortNotFree: Error: Sorry, the port 7772 is not free: No socket could be created
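To see whether a port such as 7772 can be bound at all before restarting a daemon, a small probe like the one below may be useful. It is a sketch only: `port_is_free` is a hypothetical helper, not the check Shinken itself performs. It does not set SO_REUSEADDR, so lingering connections on the port may also make it report the port as not free.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True
    except OSError:
        # Typically "address already in use" or a permission error.
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    # 7772 is the port named in the broker log above.
    print("port 7772 free:", port_is_free(7772))
```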

Poller Log:

[1606292549] Error : [Livestatus Query] Error: 'Hosts' object has no attribute 'itersorted'
[1606292744] Error : [broker-master] The external module livestatus goes down unexpectedly!
[1606292744] Error : [broker-master] The external module npcdmod goes down unexpectedly!
[1606292744] Warning : [broker-master] Connection problem to the scheduler scheduler-master: Connexion error to http://localhost:7768/ : couldn't connect to host
[1606292747] Warning : [broker-master] Connection problem to the poller poller-master: Connexion error to http://localhost:7771/ : Operation timed out after 3000

Dmesg:

TCP: too many of orphaned sockets
__ratelimit: 192 callbacks suppressed
TCP: too many of orphaned sockets
TCP: too many of orphaned sockets
TCP: too many of orphaned sockets
TCP: too many of orphaned sockets

Netstat:

netstat -anp | grep 7772 shows the connections in either FIN_WAIT1 or FIN_WAIT2 state.
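To track how many sockets pile up in each state over time, the netstat output can be summarised per port. The sketch below assumes the usual Linux column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State). `count_states` is a hypothetical helper and the sample text is made up for illustration, not taken from the affected machine.

```python
from collections import Counter


def count_states(netstat_output, port):
    """Count TCP states for lines whose local address ends in :port."""
    counts = Counter()
    suffix = ":%d" % port
    for line in netstat_output.splitlines():
        fields = line.split()
        # Columns: Proto Recv-Q Send-Q Local-Address Foreign-Address State ...
        if (len(fields) >= 6
                and fields[0].startswith("tcp")
                and fields[3].endswith(suffix)):
            counts[fields[5]] += 1
    return counts


# Made-up sample in the shape of `netstat -anp` output.
SAMPLE = """\
tcp        0      0 127.0.0.1:7772   127.0.0.1:51234  FIN_WAIT2   -
tcp        0      1 127.0.0.1:7772   127.0.0.1:51240  FIN_WAIT1   -
tcp        0      0 127.0.0.1:7768   127.0.0.1:40022  ESTABLISHED 1234/python
tcp        0      0 127.0.0.1:7772   127.0.0.1:51301  FIN_WAIT2   -
"""

if __name__ == "__main__":
    for state, number in sorted(count_states(SAMPLE, 7772).items()):
        print(state, number)
```

Fed with real output at regular intervals, this would show whether the FIN_WAIT counts grow steadily before each crash.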

Currently we run sysctl -w net.ipv4.tcp_max_orphans=0, then kill and restart all Shinken services to get everything up and running again. This happens 2 or 3 times a day.

Please help us overcome this problem. Will upgrading to Shinken 2.4.3 fix it? Or would tuning kernel parameters such as net.ipv4.tcp_mem, net.ipv4.tcp_fin_timeout, etc. help further?
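Before changing any of these parameters, it can help to record their current values so that any change can be compared or reverted. The sketch below relies on sysctl names mapping to files under /proc/sys, with dots replaced by slashes. `read_sysctl` is a hypothetical helper and the `root` parameter exists only to make it testable. It makes no claim about which values are appropriate.

```python
import os

# Parameters mentioned in this report.
PARAMS = [
    "net.ipv4.tcp_max_orphans",
    "net.ipv4.tcp_fin_timeout",
    "net.ipv4.tcp_mem",
]


def read_sysctl(name, root="/proc/sys"):
    """Return the current value of a sysctl as a string, or None if unreadable."""
    path = os.path.join(root, *name.split("."))
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    for name in PARAMS:
        print(name, "=", read_sysctl(name))
```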

geektophe commented 3 years ago

Hello, the issue you're facing is strange. I'm running a Shinken platform with more than 2k hosts and more than 45k services, and I have never had such problems.

You are running a fairly old Shinken release. It would be a good idea to try upgrading anyway. I doubt the latest release will run on Python 2.6, though.