Open torkam opened 8 years ago
I don't clearly understand what you describe.
You have a Shinken infrastructure with two pollers, and when adding a third, it constantly reboots?
Hello Geektophe. First of all, thank you for your reply.
It constantly reboots when I add the second poller :/ (I didn't even try with three)
Could you be clearer about your setup? How many machines do you have, and how are the Shinken services spread?
And do you mean you're running the poller service twice on a single host?
So I have only one master Shinken server (with all the services running on it). With that server, I was able to monitor multiple hosts (more than 200) without any problem. But now I have an issue with the remaining hosts I have to monitor: I can't reach them easily. So I decided on the following to solve my issue:
On the remote site, I deploy another Shinken server, with only the poller daemon running. They communicate through OpenVPN.
This was okay when I deployed one poller (the communication through OpenVPN was fine, and the checks were working). But when I tried to deploy another one, using the same method, the master was constantly rebooting. I thought it was because of my OpenVPN configuration, so I tried to deploy two pollers again (two different hosts with one poller service running on each) but without using OpenVPN (I deployed them on the same network as the master). But the problem was the same: after that, the server was constantly rebooting.
I hope this is more clear now :)
So it's the master that reboots?
I can't imagine it's related to Shinken. It should be running under an unprivileged user, and it does not run low-level code that could cause a server reboot or crash. The only crash cause I can see would be memory exhaustion, but with 200 hosts, I'd be surprised.
I highly suspect a hardware issue.
Could you try to swap your machines' roles? The new machine becomes the master, and the old one is only a poller. Does the new one reboot as well?
Exactly, it's the master that is constantly rebooting. The pollers seem to keep working while the master isn't. I also checked the memory, everything was fine.
Ohhh, but I think I didn't explain one part clearly. The Debian host isn't rebooting, only the Shinken services are. Sorry for that. Even if the server itself sometimes runs a little slower because of the constant restarts of the Shinken services, it stays up.
Ah, it's quite different.
A reason why your services are restarting could be a communication issue between the arbiter and the other services. If the arbiter detects a service as dead, it may try to send it a new configuration. But the fact that it can send a new configuration shows that it can communicate, at least intermittently.
Are you sure you don't have an IP conflict?
If you're sure, could you please post an anonymized arbiter log sample?
Yeah sorry for the mistake :/. I wasn't accurate.
If you are talking about an IP conflict on the master host, there is none; I checked that. I also checked that the pollers (through OpenVPN) were fully able to communicate with the services on the master. I used nmap to verify the ports were open. There was no problem at all.
As I don't have access to my Shinken right now, I will post the arbiter logs here tomorrow :). Thank you
Could you also post your pollers log sample ?
Hello Geektophe,
Please find below the two log samples requested: Arbiter
[1458291742] INFO: [Shinken] Waiting for initial configuration
[1458291043] INFO: [Shinken] I correctly loaded the modules: []
[1458291043] INFO: [Shinken] [poller-master] Allocating new fork Worker: 0
[1458291044] INFO: [Shinken] [poller-master] Allocating new fork Worker: 1
[1458291044] INFO: [Shinken] [poller-master] Allocating new fork Worker: 2
[1458291045] INFO: [Shinken] [poller-master] Allocating new fork Worker: 3
[1458291397] INFO: [Shinken] Waiting for initial configuration
[1458291401] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (10s,120s)
[1458291401] INFO: [Shinken] [poller-master] Connection OK with scheduler scheduler-master
[1458291401] INFO: [Shinken] [poller-master] Using max workers: 4
[1458291401] INFO: [Shinken] [poller-master] Using min workers: 4
[1458291401] INFO: [Shinken] We have our schedulers: {0: {'wait_homerun': {}, 'data_timeout': 120, 'name': u'scheduler-master', 'hard_ssl_name_check': False, 'uri': u'http://10.8.0.1:7768/', 'actions': {}, 'instance_id': 0, 'running_id': 1458291031.5759714, 'timeout': 10, 'address': u'10.8.0.1', 'active': True, 'use_ssl': False, 'push_flavor': 945809, 'port': 7768, 'con': <shinken.http_client.HTTPClient object at 0x7fefd5b88050>}}
[1458291462] INFO: [Shinken] Waiting for initial configuration
[1458291586] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (10s,120s)
[1458291586] INFO: [Shinken] [poller-master] Connection OK with scheduler scheduler-master
[1458291586] INFO: [Shinken] [poller-master] Using max workers: 4
[1458291586] INFO: [Shinken] [poller-master] Using min workers: 4
[1458291586] INFO: [Shinken] We have our schedulers: {0: {'wait_homerun': {}, 'data_timeout': 120, 'name': u'scheduler-master', 'hard_ssl_name_check': False, 'uri': u'http://10.8.0.1:7768/', 'actions': {}, 'instance_id': 0, 'running_id': 1458291031.5759714, 'timeout': 10, 'address': u'10.8.0.1', 'active': True, 'use_ssl': False, 'push_flavor': 920950, 'port': 7768, 'con': <shinken.http_client.HTTPClient object at 0x7fefd5b7ba90>}}
[1458291677] INFO: [Shinken] Waiting for initial configuration
[1458291680] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (10s,120s)
[1458291680] INFO: [Shinken] [poller-master] Connection OK with scheduler scheduler-master
[1458291680] INFO: [Shinken] [poller-master] Using max workers: 4
[1458291680] INFO: [Shinken] [poller-master] Using min workers: 4
[1458291680] INFO: [Shinken] We have our schedulers: {0: {'wait_homerun': {}, 'data_timeout': 120, 'name': u'scheduler-master', 'hard_ssl_name_check': False, 'uri': u'http://10.8.0.1:7768/', 'actions': {}, 'instance_id': 0, 'running_id': 1458291031.5759714, 'timeout': 10, 'address': u'10.8.0.1', 'active': True, 'use_ssl': False, 'push_flavor': 219573, 'port': 7768, 'con': <shinken.http_client.HTTPClient object at 0x7fefd5b887d0>}}
[1458291742] INFO: [Shinken] Waiting for initial configuration
Poller
[1458292226] WARNING: [Shinken] [All] The receiver receiver-master manage a unmanaged configuration
[1458292226] INFO: [Shinken] Dispatching Realm All
[1458292226] INFO: [Shinken] [All] Dispatching 1/1 configurations
[1458292226] INFO: [Shinken] [All] Schedulers order: scheduler-master
[1458292226] INFO: [Shinken] [All] Dispatching configuration 0
[1458292226] INFO: [Shinken] [All] Trying to send conf 0 to scheduler scheduler-master
[1458292236] ERROR: [Shinken] Failed sending configuration for scheduler-master: Connection error to http://10.8.0.1:7768/ : Operation timed out after 10000 milliseconds with 0 bytes received
[1458292236] WARNING: [Shinken] [All] configuration dispatching error for scheduler scheduler-master
[1458292236] WARNING: [Shinken] All schedulers configurations are not dispatched, 1 are missing
[1458292236] INFO: [Shinken] I ask reactionner-master to wait a new conf
[1458292236] INFO: [Shinken] I ask poller-simedit to wait a new conf
[1458292236] INFO: [Shinken] I ask poller-master to wait a new conf
[1458292236] INFO: [Shinken] I ask poller-tetrarc to wait a new conf
[1458292236] INFO: [Shinken] I ask broker-master to wait a new conf
[1458292236] INFO: [Shinken] I ask receiver-master to wait a new conf
[1458292238] INFO: [Shinken] Scheduler configuration 0 is unmanaged!!
[1458292238] WARNING: [Shinken] Missing satellite reactionner for configuration 0:
[1458292238] WARNING: [Shinken] Missing satellite poller for configuration 0:
[1458292238] WARNING: [Shinken] Missing satellite broker for configuration 0:
[1458292238] WARNING: [Shinken] Missing satellite receiver for configuration 0:
[1458292238] INFO: [Shinken] Dispatching Realm All
[1458292238] INFO: [Shinken] [All] Dispatching 1/1 configurations
[1458292238] INFO: [Shinken] [All] Schedulers order: scheduler-master
[1458292238] INFO: [Shinken] [All] Dispatching configuration 0
[1458292238] INFO: [Shinken] [All] Trying to send conf 0 to scheduler scheduler-mast er
[1458292238] INFO: [Shinken] [All] Dispatch OK of conf in scheduler scheduler-master
[1458292238] INFO: [Shinken] OK, all schedulers configurations are dispatched :)
[1458292238] INFO: [Shinken] [All] Dispatching reactionner satellite with order: reactionner-master (spare:False),
[1458292238] INFO: [Shinken] [All] Trying to send configuration to reactionner reactionner-master
[1458292238] INFO: [Shinken] [All] Dispatch OK of configuration 0 to reactionner reactionner-master
[1458292238] INFO: [Shinken] [All] OK, no more reactionner sent need
[1458292238] INFO: [Shinken] [All] Dispatching poller satellite with order: poller-tetrarc (spare:False), poller-master (spare:False), poller-simedit (spare:False),
[1458292238] INFO: [Shinken] [All] Trying to send configuration to poller poller-tetrarc
[1458292239] INFO: [Shinken] [All] Dispatch OK of configuration 0 to poller poller-tetrarc
[1458292239] INFO: [Shinken] [All] Trying to send configuration to poller poller-master
[1458292239] INFO: [Shinken] [All] Dispatch OK of configuration 0 to poller poller-master
[1458292239] INFO: [Shinken] [All] Trying to send configuration to poller poller-simedit
[1458292239] INFO: [Shinken] [All] Dispatch OK of configuration 0 to poller poller-simedit
[1458292239] INFO: [Shinken] [All] OK, no more poller sent need
[1458292239] INFO: [Shinken] [All] Dispatching broker satellite with order: broker-master (spare:False),
[1458292239] INFO: [Shinken] [All] Trying to send configuration to broker broker-master
[1458292242] INFO: [Shinken] [All] Dispatch OK of configuration 0 to broker broker-master
[1458292242] INFO: [Shinken] [All] OK, no more broker sent need
[1458292242] INFO: [Shinken] [All] Dispatching receiver satellite with order: receiver-master (spare:False),
[1458292242] INFO: [Shinken] [All] Trying to send configuration to receiver receiver-master
[1458292242] INFO: [Shinken] [All] Dispatch OK of configuration 0 to receiver receiver-master
[1458292242] INFO: [Shinken] (http://localhost:7773/)
[1458292242] INFO: [Shinken] [All] OK, no more receiver sent need
[1458292242] INFO: [Shinken] [All] Trying to send configuration to receiver receiver-master
[1458292242] INFO: [Shinken] [All] Dispatch OK of configuration to receiver receiver-master
Please feel free to ask if you need more information.
Are the services configured with localhost as address (for instance http://localhost:7773/) the result of your log anonymization, or is it your real setup?
Yes, they are all configured with the localhost address, except for the scheduler daemon (using 10.8.0.1, the address of the server on the VPN). I followed an online tutorial saying that only the scheduler had to be accessible to the pollers :/ Maybe that was a mistake? (this tutorial -> http://shinkenlab.io/online-course-6-lan-scalability/)
In fact, I think your poller issue comes from that.
If I've correctly understood your setup, you have two pollers, each on a different machine, but both with localhost as address.
The addresses are not only important for the scheduler, but for all the services, because the arbiter uses them to send each service its configuration. Using localhost works if you have a single machine. As soon as there's more than one, and you have more than one instance of a service, it's mandatory to use the real address (DNS name or IP address, it's up to you).
Could you try to replace localhost with each host's real address and tell me if the problem remains?
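For illustration, a remote poller declared with a real, arbiter-reachable address might look like this (the name and VPN address below are hypothetical, not taken from this setup):

```
# Hypothetical example of a remote poller definition.
# The arbiter uses "address" to push the configuration to this daemon,
# so it must be an address the arbiter can actually reach -- not localhost.
define poller {
    poller_name     poller-remote1      ; hypothetical name
    address         10.8.0.6            ; this poller's own VPN address (hypothetical)
    port            7771
    realm           All
}
```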
I want to add that on the two pollers, only the poller daemon is started.
Sure, I'll do that: in every service configuration file (arbiter, scheduler, poller-master, reactionner and broker) on the master (I have nothing to do on the pollers, right?), I replace "localhost" with "10.8.0.1".
Correct ?
Yes, except for your second poller, which should be configured with its own address :)
Okay, so I made the requested modification. The IP (10.8.0.1) is configured in the following six files: arbiter-master.cfg, broker-master.cfg, poller-master.cfg, reactionner-master.cfg, receiver-master.cfg and scheduler-master.cfg. On the two pollers, poller-master.cfg is still configured with the 'localhost' address.
As confirmation, please find the ifconfig output of the master (you will see that the address I am using is the one on the tun0 interface):
eth0 Link encap:Ethernet HWaddr 00:50:56:a7:7b:5d
inet addr:172.16.0.209 Bcast:172.16.0.255 Mask:255.255.255.0
inet6 addr: fe80::250:56ff:fea7:7b5d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:106926592 errors:0 dropped:960226 overruns:0 frame:0
TX packets:73598994 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:103117172720 (96.0 GiB) TX bytes:13616107920 (12.6 GiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:211875754 errors:0 dropped:0 overruns:0 frame:0
TX packets:211875754 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:114477153077 (106.6 GiB) TX bytes:114477153077 (106.6 GiB)
tun0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:10.8.0.1 P-t-P:10.8.0.2 Mask:255.255.255.255
UP POINTOPOINT RUNNING NOARP MULTICAST MTU:1500 Metric:1
RX packets:47271 errors:0 dropped:0 overruns:0 frame:0
TX packets:44660 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:6957146 (6.6 MiB) TX bytes:5307558 (5.0 MiB)
But there are still problems in the logs :/
Arbiterd.log
[1458312949] TIMEPERIOD TRANSITION: NotSundayandMonday;-1;1
[1458312949] TIMEPERIOD TRANSITION: none;-1;0
[1458312949] TIMEPERIOD TRANSITION: 24x7;-1;1
[1458312949] TIMEPERIOD TRANSITION: workhours;-1;1
[1458313011] WARNING: [Shinken] Add failed attempt to scheduler-master (1/3) Connection error to http://10.8.0.1:7768/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1458313072] WARNING: [Shinken] Add failed attempt to scheduler-master (2/3) Connection error to http://10.8.0.1:7768/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1458313195] WARNING: [Shinken] Add failed attempt to scheduler-master (1/3) Connection error to http://10.8.0.1:7768/ : Operation timed out after 3000 milliseconds with 0 bytes received
[1458313318] WARNING: [Shinken] Add failed attempt to scheduler-master (1/3) Connection error to http://10.8.0.1:7768/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1458313322] WARNING: [Shinken] Scheduler scheduler-master did not managed its configuration 0, I am not happy.
[1458313322] WARNING: [Shinken] [All] The reactionner reactionner-master manage a unmanaged configuration
[1458313322] WARNING: [Shinken] [All] The poller poller-tetrarc manage a unmanaged configuration
[1458313322] WARNING: [Shinken] [All] The poller poller-master manage a unmanaged configuration
[1458313322] WARNING: [Shinken] [All] The poller poller-simedit manage a unmanaged configuration
[1458313322] WARNING: [Shinken] [All] The broker broker-master manage a unmanaged configuration
[1458313322] WARNING: [Shinken] [All] The receiver receiver-master manage a unmanaged configuration
[1458313322] INFO: [Shinken] Dispatching Realm All
[1458313322] INFO: [Shinken] [All] Dispatching 1/1 configurations
[1458313322] INFO: [Shinken] [All] Schedulers order: scheduler-master
[1458313322] INFO: [Shinken] [All] Dispatching configuration 0
[1458313322] INFO: [Shinken] [All] Trying to send conf 0 to scheduler scheduler-master
[1458313326] ERROR: [Shinken] Failed sending configuration for scheduler-master: Connection error to http://10.8.0.1:7768/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1458313326] WARNING: [Shinken] [All] configuration dispatching error for scheduler scheduler-master
[1458313326] WARNING: [Shinken] All schedulers configurations are not dispatched, 1 are missing
[1458313326] INFO: [Shinken] I ask reactionner-master to wait a new conf
[1458313328] INFO: [Shinken] I ask poller-simedit to wait a new conf
[1458313329] INFO: [Shinken] I ask poller-master to wait a new conf
[1458313330] INFO: [Shinken] I ask poller-tetrarc to wait a new conf
[1458313330] INFO: [Shinken] I ask broker-master to wait a new conf
[1458313340] INFO: [Shinken] I ask receiver-master to wait a new conf
[1458313354] INFO: [Shinken] Scheduler configuration 0 is unmanaged!!
[1458313354] WARNING: [Shinken] Missing satellite reactionner for configuration 0:
[1458313354] WARNING: [Shinken] Missing satellite poller for configuration 0:
[1458313354] WARNING: [Shinken] Missing satellite broker for configuration 0:
[1458313354] WARNING: [Shinken] Missing satellite receiver for configuration 0:
[1458313354] INFO: [Shinken] Dispatching Realm All
[1458313354] INFO: [Shinken] [All] Dispatching 1/1 configurations
[1458313354] INFO: [Shinken] [All] Schedulers order: scheduler-master
[1458313354] INFO: [Shinken] [All] Dispatching configuration 0
[1458313354] INFO: [Shinken] [All] Trying to send conf 0 to scheduler scheduler-master
[1458313355] INFO: [Shinken] [All] Dispatch OK of conf in scheduler scheduler-master
[1458313355] INFO: [Shinken] OK, all schedulers configurations are dispatched :)
[1458313355] INFO: [Shinken] [All] Dispatching reactionner satellite with order: reactionner-master (spare:False),
[1458313355] INFO: [Shinken] [All] Trying to send configuration to reactionner reactionner-master
[1458313356] INFO: [Shinken] [All] Dispatch OK of configuration 0 to reactionner reactionner-master
[1458313356] INFO: [Shinken] [All] OK, no more reactionner sent need
[1458313356] INFO: [Shinken] [All] Dispatching poller satellite with order: poller-tetrarc (spare:False), poller-master (spare:False), poller-simedit (spare:False),
[1458313356] INFO: [Shinken] [All] Trying to send configuration to poller poller-tetrarc
[1458313357] INFO: [Shinken] [All] Dispatch OK of configuration 0 to poller poller-tetrarc
[1458313357] INFO: [Shinken] [All] Trying to send configuration to poller poller-master
[1458313358] INFO: [Shinken] [All] Dispatch OK of configuration 0 to poller poller-master
[1458313358] INFO: [Shinken] [All] Trying to send configuration to poller poller-simedit
Pollerd.log
[1458312949] INFO: [Shinken] I correctly loaded the modules: []
[1458312950] INFO: [Shinken] [poller-master] Allocating new fork Worker: 0
[1458312950] INFO: [Shinken] [poller-master] Allocating new fork Worker: 1
[1458312950] INFO: [Shinken] [poller-master] Allocating new fork Worker: 2
[1458312951] INFO: [Shinken] [poller-master] Allocating new fork Worker: 3
[1458312954] ERROR: [Shinken] manage_returns exception:: <class 'shinken.http_client.HTTPException'>,Connection error to http://10.8.0.1:7768/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1458312954] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (3s,120s)
[1458312957] WARNING: [Shinken] [poller-master] Scheduler scheduler-master is not initialized or has network problem: Connection error to http://10.8.0.1:7768/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1458312958] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (3s,120s)
[1458312961] INFO: [Shinken] [poller-master] Connection OK with scheduler scheduler-master
[1458313228] ERROR: [Shinken] manage_returns exception:: <class 'shinken.http_client.HTTPException'>,Connection error to http://10.8.0.1:7768/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1458313228] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (3s,120s)
[1458313228] INFO: [Shinken] [poller-master] Connection OK with scheduler scheduler-master
[1458313233] ERROR: [Shinken] manage_returns exception:: <class 'shinken.http_client.HTTPException'>,Connection error to http://10.8.0.1:7768/ : Operation timed out after 3000 milliseconds with 0 bytes received
[1458313233] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (3s,120s)
[1458313236] WARNING: [Shinken] [poller-master] Scheduler scheduler-master is not initialized or has network problem: Connection error to http://10.8.0.1:7768/ : Operation timed out after 3000 milliseconds with 0 bytes received
[1458313237] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (3s,120s)
[1458313237] INFO: [Shinken] [poller-master] Connection OK with scheduler scheduler-master
[1458313241] ERROR: [Shinken] manage_returns exception:: <class 'shinken.http_client.HTTPException'>,Connection error to http://10.8.0.1:7768/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1458313241] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (3s,120s)
[1458313244] WARNING: [Shinken] [poller-master] Scheduler scheduler-master is not initialized or has network problem: Connection error to http://10.8.0.1:7768/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1458313245] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (3s,120s)
[1458313247] INFO: [Shinken] [poller-master] Connection OK with scheduler scheduler-master
[1458313250] ERROR: [Shinken] manage_returns exception:: <class 'shinken.http_client.HTTPException'>,Connection error to http://10.8.0.1:7768/ : Operation timed out after 3000 milliseconds with 0 bytes received
[1458313250] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (3s,120s)
What I see is that there is a communication issue with the scheduler on 10.8.0.1 (timeout).
Are you sure your services correctly listen on this address? (check with netstat -lntp)
Neither the arbiter nor the pollers manage to connect to the scheduler, but the arbiter does seem to manage to send the configuration to the pollers.
There's clearly something odd happening because of your tun0 device.
Could you try your setup with OpenVPN turned off and with addresses on the same subnet?
If it still fails, could you paste all your master-*.cfg files, plus the ifconfig output of each of your machines? (with their names in the configuration)
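As a side note, the timeout the arbiter and pollers keep hitting can be reproduced outside Shinken with a small TCP probe (a minimal sketch, not Shinken code; the host/port values in the comment are this setup's scheduler address):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable networks.
        return False

# Example: probe the scheduler port the arbiter keeps timing out on.
# can_connect("10.8.0.1", 7768)
```

Running this from each poller host (and from the master itself) would show whether the intermittent timeouts are specific to Shinken or visible at the raw TCP level.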
Hey Geektophe. Hope your weekend was good :)
Here is the result of netstat:
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:27017 0.0.0.0:* LISTEN 1999/mongod
tcp 0 0 127.0.0.1:40910 0.0.0.0:* LISTEN 13501/python2.7
tcp 0 0 127.0.0.1:57519 0.0.0.0:* LISTEN 13543/python2.7
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 1690/rpcbind
tcp 0 0 0.0.0.0:50000 0.0.0.0:* LISTEN 13630/python2.7
tcp 0 0 0.0.0.0:8080 0.0.0.0:* LISTEN 3156/nginx
tcp 0 0 127.0.0.1:28017 0.0.0.0:* LISTEN 1999/mongod
tcp 0 0 127.0.0.1:42003 0.0.0.0:* LISTEN 13366/python2.7
tcp 0 0 0.0.0.0:2003 0.0.0.0:* LISTEN 2264/python
tcp 0 0 0.0.0.0:2004 0.0.0.0:* LISTEN 2264/python
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 3246/sshd
tcp 0 0 0.0.0.0:7767 0.0.0.0:* LISTEN 13601/python2.7
tcp 0 0 0.0.0.0:87 0.0.0.0:* LISTEN 3156/nginx
tcp 0 0 127.0.0.1:3031 0.0.0.0:* LISTEN 2222/uwsgi
tcp 0 0 0.0.0.0:7768 0.0.0.0:* LISTEN 13321/python2.7
tcp 0 0 0.0.0.0:7769 0.0.0.0:* LISTEN 13410/python2.7
tcp 0 0 127.0.0.1:50905 0.0.0.0:* LISTEN 13323/python2.7
tcp 0 0 0.0.0.0:9465 0.0.0.0:* LISTEN 17353/openvpn
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN 3103/exim4
tcp 0 0 10.8.0.1:7770 0.0.0.0:* LISTEN 13542/python2.7
tcp 0 0 0.0.0.0:7002 0.0.0.0:* LISTEN 2264/python
tcp 0 0 0.0.0.0:53978 0.0.0.0:* LISTEN 1721/rpc.statd
tcp 0 0 127.0.0.1:52411 0.0.0.0:* LISTEN 13458/python2.7
tcp 0 0 0.0.0.0:7771 0.0.0.0:* LISTEN 13364/python2.7
tcp 0 0 0.0.0.0:7772 0.0.0.0:* LISTEN 13456/python2.7
tcp 0 0 0.0.0.0:7773 0.0.0.0:* LISTEN 13499/python2.7
tcp 0 0 127.0.0.1:38209 0.0.0.0:* LISTEN 13412/python2.7
tcp6 0 0 :::111 :::* LISTEN 1690/rpcbind
tcp6 0 0 :::80 :::* LISTEN 2572/apache2
tcp6 0 0 :::22 :::* LISTEN 3246/sshd
tcp6 0 0 :::87 :::* LISTEN 3156/nginx
tcp6 0 0 ::1:25 :::* LISTEN 3103/exim4
tcp6 0 0 :::33638 :::* LISTEN 1721/rpc.statd
When I tried with OpenVPN turned off, using pollers I was able to reach without it, the behavior was the same :/ :(
Here are the master files: Arbiter-master.cfg
define arbiter {
arbiter_name arbiter-master
#host_name node1 ; CHANGE THIS if you have several Arbiters (like with a spare)
address 10.8.0.1 ; DNS name or IP
port 7770
spare 0 ; 1 = is a spare, 0 = is not a spare
## Interesting modules:
# - named-pipe = Open the named pipe nagios.cmd
# - mongodb = Load hosts from a mongodb database
# - pickle-retention-arbiter = Save data before exiting
# - nsca = NSCA server
# - vmware-auto-linking = Lookup at Vphere server for dependencies
# - import-glpi = Import configuration from GLPI (need plugin monitoring for GLPI in server side)
# - tsca = TSCA server
# - mysql-mport = Load configuration from a MySQL database
# - ws-arbiter = WebService for pushing results to the arbiter
# - collectd = Receive collectd perfdata
# - snmp-booster = Snmp bulk polling module, configuration linker
# - import-landscape = Import hosts from Landscape (Ubuntu/Canonical management tool)
# - aws = Import hosts from Amazon AWS (here EC2)
# - ip-tag = Tag a host based on it's IP range
# - file-tag = Tag a host if it's on a flat file
# - csv-tag = Tag a host from the content of a CSV file
modules
# Enable https or not
use_ssl 0
# enable certificate/hostname check, will avoid man in the middle attacks
hard_ssl_name_check 0
## Uncomment these lines in a HA architecture so the master and slaves know
## how long they may wait for each other.
#timeout 3 ; Ping timeout
#data_timeout 120 ; Data send timeout
#max_check_attempts 3 ; If ping fails N or more, then the node is dead
#check_interval 60 ; Ping node every N seconds
}
Broker-master.cfg
define broker {
broker_name broker-master
address 10.8.0.1
port 7772
spare 0
## Optional
manage_arbiters 1 ; Take data from Arbiter. There should be only one
; broker for the arbiter.
manage_sub_realms 1 ; Does it take jobs from schedulers of sub-Realms?
timeout 3 ; Ping timeout
data_timeout 120 ; Data send timeout
max_check_attempts 3 ; If ping fails N or more, then the node is dead
check_interval 60 ; Ping node every N seconds
## Modules
# Default: None
# Interesting modules that can be used:
# - simple-log = just all logs into one file
# - livestatus = livestatus listener
# - tondodb-mysql = NDO DB support (deprecated)
# - npcdmod = Use the PNP addon
# - graphite = Use a Graphite time series DB for perfdata
# - webui = Shinken Web interface
# - glpidb = Save data in GLPI MySQL database
modules webui2, graphite, livestatus, mongo-logs
# Enable https or not
use_ssl 0
# enable certificate/hostname check, will avoid man in the middle attacks
hard_ssl_name_check 0
## Advanced
realm All
}
Poller-master.cfg
define poller {
poller_name poller-master
address 10.8.0.1
port 7771
## Optional
spare 0 ; 1 = is a spare, 0 = is not a spare
manage_sub_realms 0 ; Does it take jobs from schedulers of sub-Realms?
min_workers 0 ; Starts with N processes (0 = 1 per CPU)
max_workers 0 ; No more than N processes (0 = 1 per CPU)
processes_by_worker 256 ; Each worker manages N checks
polling_interval 1 ; Get jobs from schedulers each N seconds
timeout 3 ; Ping timeout
data_timeout 120 ; Data send timeout
max_check_attempts 3 ; If ping fails N or more, then the node is dead
check_interval 60 ; Ping node every N seconds
## Interesting modules that can be used:
# - booster-nrpe = Replaces the check_nrpe binary. Therefore it
# enhances performances when there are lot of NRPE
# calls.
# - named-pipe = Allow the poller to read a nagios.cmd named pipe.
# This permits the use of distributed check_mk checks
# should you desire it.
# - snmp-booster = Snmp bulk polling module
modules
## Advanced Features
#passive 0 ; For DMZ monitoring, set to 1 so the connections
; will be from scheduler -> poller.
# Poller tags are the tag that the poller will manage. Use None as tag name to manage
# untaggued checks
#poller_tags None
# Enable https or not
use_ssl 0
# enable certificate/hostname check, will avoid man in the middle attacks
hard_ssl_name_check 0
realm All
}
Reactionner-master.cfg
define reactionner {
reactionner_name reactionner-master
address 10.8.0.1
port 7769
spare 0
## Optionnal
manage_sub_realms 0 ; Does it take jobs from schedulers of sub-Realms?
min_workers 1 ; Starts with N processes (0 = 1 per CPU)
max_workers 15 ; No more than N processes (0 = 1 per CPU)
polling_interval 1 ; Get jobs from schedulers each 1 second
timeout 3 ; Ping timeout
data_timeout 120 ; Data send timeout
max_check_attempts 3 ; If ping fails N or more, then the node is dead
check_interval 60 ; Ping node every N seconds
## Modules
modules
# Reactionner tags are the tag that the reactionner will manage. Use None as tag name to manage
# untaggued notification/event handlers
#reactionner_tags None
# Enable https or not
use_ssl 0
# enable certificate/hostname check, will avoid man in the middle attacks
hard_ssl_name_check 0
## Advanced
realm All
}
Receiver-master.cfg
define receiver {
receiver_name receiver-master
address 10.8.0.1
port 7773
spare 0
## Optional parameters
timeout 3 ; Ping timeout
data_timeout 120 ; Data send timeout
max_check_attempts 3 ; If ping fails N or more, then the node is dead
check_interval 60 ; Ping node every N seconds
## Modules for Receiver
# - named-pipe = Open the named pipe nagios.cmd
# - nsca = NSCA server
# - tsca = TSCA server
# - ws-arbiter = WebService for pushing results to the arbiter
# - collectd = Receive collectd perfdata
modules
# Enable https or not
use_ssl 0
# enable certificate/hostname check, will avoid man in the middle attacks
hard_ssl_name_check 0
## Advanced Feature
direct_routing 0 ; If enabled, it will directly send commands to the
; schedulers if it knows about the hostname in the
; command.
realm All
}
Scheduler-master.cfg
define scheduler {
scheduler_name scheduler-master ; Just the name
address 10.8.0.1 ; IP or DNS address of the daemon
port 7768 ; TCP port of the daemon
## Optional
spare 0 ; 1 = is a spare, 0 = is not a spare
weight 1 ; Some schedulers can manage more hosts than others
timeout 10 ; Ping timeout
data_timeout 120 ; Data send timeout
max_check_attempts 3 ; If ping fails N or more, then the node is dead
check_interval 60 ; Ping node every N seconds
## Interesting modules that can be used:
# - pickle-retention-file = Save data before exiting in flat-file
# - mem-cache-retention = Same, but in a MemCache server
# - redis-retention = Same, but in a Redis server
# - retention-mongodb = Same, but in a MongoDB server
# - nagios-retention = Read retention info from a Nagios retention file
# (does not save, only read)
# - snmp-booster = Snmp bulk polling module
modules
## Advanced Features
# Realm is for multi-datacenters
realm All
# Skip initial broks creation. Boot fast, but some broker modules won't
# work with it! (like livestatus for example)
skip_initial_broks 0
# In NATted environments, you declare each satellite ip[:port] as seen by
# *this* scheduler (if port not set, the port declared by satellite itself
# is used)
#satellitemap poller-1=1.2.3.4:7771, reactionner-1=1.2.3.5:7769, ...
# Enable https or not
use_ssl 0
# enable certificate/hostname check, will avoid man in the middle attacks
hard_ssl_name_check 0
}
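Since this setup goes through an OpenVPN tunnel (effectively a NATted view between sites), the satellitemap option commented in the scheduler file above might be relevant; a hypothetical sketch using this setup's names (poller-tetrarc's tun0 address is taken from the ifconfig output below, while the address for poller-simedit is an assumed placeholder):

```
# Hypothetical sketch, inside define scheduler { ... }:
# declare each satellite's address as *this* scheduler sees it.
# 10.8.0.33 for poller-simedit is an assumed placeholder, not a known value.
#satellitemap    poller-tetrarc=10.8.0.31:7771, poller-simedit=10.8.0.33:7771
```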
And please find below the ifconfig output of each machine: Master! (name APS-SHINKEN)
eth0 Link encap:Ethernet HWaddr 00:50:56:a7:7b:5d
inet adr:172.16.0.209 Bcast:172.16.0.255 Masque:255.255.255.0
adr inet6: fe80::250:56ff:fea7:7b5d/64 Scope:Lien
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:149704792 errors:0 dropped:1197564 overruns:0 frame:0
TX packets:103857328 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 lg file transmission:1000
RX bytes:143378425822 (133.5 GiB) TX bytes:19075490868 (17.7 GiB)
lo Link encap:Boucle locale
inet adr:127.0.0.1 Masque:255.0.0.0
adr inet6: ::1/128 Scope:Hôte
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:300231463 errors:0 dropped:0 overruns:0 frame:0
TX packets:300231463 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 lg file transmission:0
RX bytes:139969776767 (130.3 GiB) TX bytes:139969776767 (130.3 GiB)
tun0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet adr:10.8.0.1 P-t-P:10.8.0.2 Masque:255.255.255.255
UP POINTOPOINT RUNNING NOARP MULTICAST MTU:1500 Metric:1
RX packets:2315062 errors:0 dropped:0 overruns:0 frame:0
TX packets:1878751 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 lg file transmission:100
RX bytes:280240229 (267.2 MiB) TX bytes:185767322 (177.1 MiB)
Poller-Tetrarc (name APS-SHINKEN-POLLER):
eth0 Link encap:Ethernet HWaddr 00:0c:29:db:99:a9
inet addr:192.168.5.221 Bcast:192.168.5.255 Mask:255.255.255.0
inet6 addr: fe80::20c:29ff:fedb:99a9/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:16972058 errors:0 dropped:32 overruns:0 frame:0
TX packets:12788698 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4984429859 (4.6 GiB) TX bytes:2460385992 (2.2 GiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:11240530 errors:0 dropped:0 overruns:0 frame:0
TX packets:11240530 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:1131973704 (1.0 GiB) TX bytes:1131973704 (1.0 GiB)
tun0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:10.8.0.31 P-t-P:10.8.0.29 Mask:255.255.255.255
UP POINTOPOINT RUNNING NOARP MULTICAST MTU:1500 Metric:1
RX packets:2766907 errors:0 dropped:0 overruns:0 frame:0
TX packets:2269159 errors:0 dropped:179 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:273196445 (260.5 MiB) TX bytes:273976207 (261.2 MiB)
Poller-Simedit (name APS-SHINKEN-POLLER):
eth0 Link encap:Ethernet HWaddr 00:50:56:8a:7d:5e
inet addr:192.168.51.212 Bcast:192.168.51.255 Mask:255.255.255.0
inet6 addr: fe80::250:56ff:fe8a:7d5e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:11440625 errors:0 dropped:14788 overruns:0 frame:0
TX packets:5790634 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1403722971 (1.3 GiB) TX bytes:1108638282 (1.0 GiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:8610438 errors:0 dropped:0 overruns:0 frame:0
TX packets:8610438 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:683497358 (651.8 MiB) TX bytes:683497358 (651.8 MiB)
tun0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:10.8.0.30 P-t-P:10.8.0.29 Mask:255.255.255.255
UP POINTOPOINT RUNNING NOARP MULTICAST MTU:1500 Metric:1
RX packets:3105146 errors:0 dropped:0 overruns:0 frame:0
TX packets:3859160 errors:0 dropped:619727 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:330039673 (314.7 MiB) TX bytes:526414428 (502.0 MiB)
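One detail stands out in the output above: the Poller-Simedit tun0 interface reports 619727 dropped TX packets, which may point at a saturated or misconfigured tunnel rather than at Shinken itself. A small sketch to read an interface's drop counters on Linux (the `/proc/net/dev` field layout is standard; the choice of `tun0` as the default interface is an assumption about this setup):

```python
def drop_counters(iface="tun0"):
    """Return (rx_drops, tx_drops) for `iface`, read from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            if ":" not in line:
                continue  # skip the two header lines
            name, stats = line.split(":", 1)
            if name.strip() == iface:
                fields = stats.split()
                # Field layout: rx bytes, packets, errs, drop, fifo, frame,
                # compressed, multicast, then the same eight counters for tx.
                return int(fields[3]), int(fields[11])
    raise ValueError("interface %r not found" % iface)

# Example: rx, tx = drop_counters("tun0")
```

Running it a few times while the pollers are active shows whether the counter keeps climbing; a steadily increasing tun0 drop count while checks time out would point the investigation at the VPN link rather than the Shinken daemons.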
Here is all the information requested :). Please ask if you need anything else.
Is the problem still open?
Hello Naparuba.
Yes, unfortunately, the problem is still open :/. I was waiting for an answer from geektophe, but maybe it's more complex than expected :/
Hello,
I have an issue with my Shinken configuration. I am using it to monitor multiple devices. For now, I successfully monitor around 230 hosts, representing 1100 services. Some of these hosts are not directly accessible, so I set up a poller connected to the master over a VPN (using OpenVPN). The VPN configuration was fine (the remote poller and the master were able to communicate), but the Shinken server was constantly rebooting (no access to the web interface, and so on).
I thought it was because of OpenVPN, so I tried to deploy pollers without using this software. I successfully deployed one poller, and it worked perfectly with two hosts. But when I added another poller, the behaviour was the same: the server kept rebooting all the time.
Please find below one remote poller configuration:
Tell me if you need me to attach a specific log file for this case.
Thank you in advance for your help. Sincerely, Jordan HENRY