shinken-solutions / shinken

Flexible and scalable monitoring framework
http://www.shinken-monitoring.org
GNU Affero General Public License v3.0

Shinken server reboot when multiple pollers configured #1823

Open · torkam opened this issue 8 years ago

torkam commented 8 years ago

Hello,

I have an issue with my Shinken configuration. I am using it to monitor multiple devices. For now, I successfully monitor around 230 hosts, representing 1100 services. Some of these hosts are not directly accessible, so I set up a poller that connects to the master over a VPN (OpenVPN). The VPN configuration was fine (the remote poller and the master were able to communicate). But the Shinken server was constantly rebooting ... (no access to the web interface and so on).

I thought it was because of OpenVPN, so I tried to deploy pollers without using that software. I successfully deployed one poller, and it worked perfectly with two hosts. But when I added another poller ... same behaviour. The server was still rebooting all the time.

Please find below the configuration of one remote poller:

define poller {
    poller_name     poller-ccp
    address         172.16.17.103
    port            7771
    spare           0

    ## Optional
    manage_sub_realms   0
    min_workers         0
    max_workers         0
    processes_by_worker 256
    polling_interval    1
    poller_tags         ccp

    realm   All
}
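
For context, checks only reach this poller if hosts or services carry the matching tag; a host pinned to it would be declared along these lines (the host name and address here are hypothetical):

define host {
    use         generic-host    ; assumes the stock generic-host template
    host_name   remote-host-1   ; hypothetical
    address     10.0.0.50       ; hypothetical
    poller_tag  ccp             ; matches the poller_tags above
}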

Tell me if you need me to attach a specific log file for this case.

Thank you in advance for your help. Sincerely, Jordan HENRY

geektophe commented 8 years ago

I don't clearly understand what you're describing.

You have a Shinken infrastructure with two pollers, and when you add a third, it constantly reboots?

torkam commented 8 years ago

Hello Geektophe. First of all, thank you for your reply.

It constantly reboots when I add the second poller :/ (I haven't even tried with three)

geektophe commented 8 years ago

Could you be clearer about your setup? How many machines do you have, and how are the Shinken services spread across them?

And do you mean you're running the poller service twice on a single host?

torkam commented 8 years ago

So I have only one Shinken master (with all the services running on it). With that server, I was able to monitor multiple hosts (more than 200) without any problem. But now I have an issue with the remaining hosts I have to monitor: I can't reach them easily. So I decided on the following approach to solve my issue:

On the remote site, I deployed another Shinken server, with only the poller daemon running, and the two communicate through OpenVPN.

This was okay when I deployed one poller (the communication through OpenVPN was fine, and the checks were working). But when I tried to deploy another one, using the same method, the master was constantly rebooting. I thought it was because of my OpenVPN configuration, so I tried to deploy two pollers again (two different hosts with one poller service running on each) but without using OpenVPN (I deployed them on the same network as the master). But the problem was the same ... after that, the server was constantly rebooting.

I hope this is more clear now :)

geektophe commented 8 years ago

So it's the master that reboots?

I can't imagine it's related to Shinken. It should be running under an unprivileged user, and it does not run low-level code that could cause a server reboot or crash. The only crash cause I can see would be memory exhaustion, but with 200 hosts, I'd be surprised.

I highly suspect a hardware issue.

Could you try to swap your machines' roles? The new machine becomes the master, and the old one is only a poller. Does the new one reboot as well?
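
In the meantime, a quick way to rule out memory exhaustion (assuming standard Debian tooling) would be:

free -m                          # current memory and swap usage
dmesg | grep -i -E 'oom|killed'  # kernel OOM-killer traces
grep -i oom /var/log/syslog      # same traces, from syslog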

torkam commented 8 years ago

Exactly, it's the master that is constantly rebooting. The pollers seem to keep working while the master doesn't. I also checked the memory; everything was fine.

Oh, but I think I didn't explain myself clearly on one point. The Debian system isn't rebooting, only the Shinken services are. Sorry about that. Even if the server itself sometimes runs a little slower because of the constant restarts of the Shinken services, it stays up.

geektophe commented 8 years ago

Ah, that's quite different.

A reason why your services are restarting could be a communication issue between the arbiter and the other services. If it detects a service as dead, it may try to send it a new configuration. But the fact that it can send a new configuration shows that it can communicate, at least intermittently.

Are you sure you don't have an IP conflict?

If you're sure, could you please post an anonymized arbiter log sample?
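
One way to probe for a duplicate IP on the LAN side is duplicate address detection with arping (from iputils); the interface and address here are placeholders:

arping -D -I eth0 -c 3 <master-lan-ip>   # exits non-zero if another host answers for that IP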

torkam commented 8 years ago

Yeah, sorry for the confusion :/. I wasn't being accurate.

If you are talking about an IP conflict on the master host, there is none. I checked that. I also checked whether the pollers (through OpenVPN) were fully able to communicate with the services on the master. I used the nmap command to verify the ports were open. There was no problem at all.

As I don't have access to my Shinken server right now, I will send the arbiter logs here tomorrow :). Thank you
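
For reference, the check I ran was along these lines (the master's VPN address is the one that appears in the logs below; the port range covers the daemon ports declared in the configuration files later in this thread):

nmap -p 7768-7773 10.8.0.1   # scheduler 7768, reactionner 7769, arbiter 7770, poller 7771, broker 7772, receiver 7773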

geektophe commented 8 years ago

Could you also post a sample of your poller logs?

torkam commented 8 years ago

Hello Geektophe,

Please find below the two log samples requested: Arbiter

[1458291742] INFO: [Shinken] Waiting for initial configuration
[1458291043] INFO: [Shinken] I correctly loaded the modules: []
[1458291043] INFO: [Shinken] [poller-master] Allocating new fork Worker: 0
[1458291044] INFO: [Shinken] [poller-master] Allocating new fork Worker: 1
[1458291044] INFO: [Shinken] [poller-master] Allocating new fork Worker: 2
[1458291045] INFO: [Shinken] [poller-master] Allocating new fork Worker: 3
[1458291397] INFO: [Shinken] Waiting for initial configuration
[1458291401] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (10s,120s)
[1458291401] INFO: [Shinken] [poller-master] Connection OK with scheduler scheduler-master
[1458291401] INFO: [Shinken] [poller-master] Using max workers: 4
[1458291401] INFO: [Shinken] [poller-master] Using min workers: 4
[1458291401] INFO: [Shinken] We have our schedulers: {0: {'wait_homerun': {}, 'data_timeout': 120, 'name': u'scheduler-master', 'hard_ssl_name_check': False, 'uri': u'http://10.8.0.1:7768/', 'actions': {}, 'instance_id': 0, 'running_id': 1458291031.5759714, 'timeout': 10, 'address': u'10.8.0.1', 'active': True, 'use_ssl': False, 'push_flavor': 945809, 'port': 7768, 'con': <shinken.http_client.HTTPClient object at 0x7fefd5b88050>}}
[1458291462] INFO: [Shinken] Waiting for initial configuration
[1458291586] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (10s,120s)
[1458291586] INFO: [Shinken] [poller-master] Connection OK with scheduler scheduler-master
[1458291586] INFO: [Shinken] [poller-master] Using max workers: 4
[1458291586] INFO: [Shinken] [poller-master] Using min workers: 4
[1458291586] INFO: [Shinken] We have our schedulers: {0: {'wait_homerun': {}, 'data_timeout': 120, 'name': u'scheduler-master', 'hard_ssl_name_check': False, 'uri': u'http://10.8.0.1:7768/', 'actions': {}, 'instance_id': 0, 'running_id': 1458291031.5759714, 'timeout': 10, 'address': u'10.8.0.1', 'active': True, 'use_ssl': False, 'push_flavor': 920950, 'port': 7768, 'con': <shinken.http_client.HTTPClient object at 0x7fefd5b7ba90>}}
[1458291677] INFO: [Shinken] Waiting for initial configuration
[1458291680] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (10s,120s)
[1458291680] INFO: [Shinken] [poller-master] Connection OK with scheduler scheduler-master
[1458291680] INFO: [Shinken] [poller-master] Using max workers: 4
[1458291680] INFO: [Shinken] [poller-master] Using min workers: 4
[1458291680] INFO: [Shinken] We have our schedulers: {0: {'wait_homerun': {}, 'data_timeout': 120, 'name': u'scheduler-master', 'hard_ssl_name_check': False, 'uri': u'http://10.8.0.1:7768/', 'actions': {}, 'instance_id': 0, 'running_id': 1458291031.5759714, 'timeout': 10, 'address': u'10.8.0.1', 'active': True, 'use_ssl': False, 'push_flavor': 219573, 'port': 7768, 'con': <shinken.http_client.HTTPClient object at 0x7fefd5b887d0>}}
[1458291742] INFO: [Shinken] Waiting for initial configuration

Poller

[1458292226] WARNING: [Shinken] [All] The receiver receiver-master manage a unmanaged configuration
[1458292226] INFO: [Shinken] Dispatching Realm All
[1458292226] INFO: [Shinken] [All] Dispatching 1/1 configurations
[1458292226] INFO: [Shinken] [All] Schedulers order: scheduler-master
[1458292226] INFO: [Shinken] [All] Dispatching configuration 0
[1458292226] INFO: [Shinken] [All] Trying to send conf 0 to scheduler scheduler-master
[1458292236] ERROR: [Shinken] Failed sending configuration for scheduler-master: Connection error to http://10.8.0.1:7768/ : Operation timed out after 10000 milliseconds with 0 bytes received
[1458292236] WARNING: [Shinken] [All] configuration dispatching error for scheduler scheduler-master
[1458292236] WARNING: [Shinken] All schedulers configurations are not dispatched, 1 are missing
[1458292236] INFO: [Shinken] I ask reactionner-master to wait a new conf
[1458292236] INFO: [Shinken] I ask poller-simedit to wait a new conf
[1458292236] INFO: [Shinken] I ask poller-master to wait a new conf
[1458292236] INFO: [Shinken] I ask poller-tetrarc to wait a new conf
[1458292236] INFO: [Shinken] I ask broker-master to wait a new conf
[1458292236] INFO: [Shinken] I ask receiver-master to wait a new conf
[1458292238] INFO: [Shinken] Scheduler configuration 0 is unmanaged!!
[1458292238] WARNING: [Shinken] Missing satellite reactionner for configuration 0:
[1458292238] WARNING: [Shinken] Missing satellite poller for configuration 0:
[1458292238] WARNING: [Shinken] Missing satellite broker for configuration 0:
[1458292238] WARNING: [Shinken] Missing satellite receiver for configuration 0:
[1458292238] INFO: [Shinken] Dispatching Realm All
[1458292238] INFO: [Shinken] [All] Dispatching 1/1 configurations
[1458292238] INFO: [Shinken] [All] Schedulers order: scheduler-master
[1458292238] INFO: [Shinken] [All] Dispatching configuration 0
[1458292238] INFO: [Shinken] [All] Trying to send conf 0 to scheduler scheduler-master
[1458292238] INFO: [Shinken] [All] Dispatch OK of conf in scheduler scheduler-master
[1458292238] INFO: [Shinken] OK, all schedulers configurations are dispatched :)
[1458292238] INFO: [Shinken] [All] Dispatching reactionner satellite with order: reactionner-master (spare:False),
[1458292238] INFO: [Shinken] [All] Trying to send configuration to reactionner reactionner-master
[1458292238] INFO: [Shinken] [All] Dispatch OK of configuration 0 to reactionner reactionner-master
[1458292238] INFO: [Shinken] [All] OK, no more reactionner sent need
[1458292238] INFO: [Shinken] [All] Dispatching poller satellite with order: poller-tetrarc (spare:False), poller-master (spare:False), poller-simedit (spare:False),
[1458292238] INFO: [Shinken] [All] Trying to send configuration to poller poller-tetrarc
[1458292239] INFO: [Shinken] [All] Dispatch OK of configuration 0 to poller poller-tetrarc
[1458292239] INFO: [Shinken] [All] Trying to send configuration to poller poller-master
[1458292239] INFO: [Shinken] [All] Dispatch OK of configuration 0 to poller poller-master
[1458292239] INFO: [Shinken] [All] Trying to send configuration to poller poller-simedit
[1458292239] INFO: [Shinken] [All] Dispatch OK of configuration 0 to poller poller-simedit
[1458292239] INFO: [Shinken] [All] OK, no more poller sent need
[1458292239] INFO: [Shinken] [All] Dispatching broker satellite with order: broker-master (spare:False),
[1458292239] INFO: [Shinken] [All] Trying to send configuration to broker broker-master
[1458292242] INFO: [Shinken] [All] Dispatch OK of configuration 0 to broker broker-master
[1458292242] INFO: [Shinken] [All] OK, no more broker sent need
[1458292242] INFO: [Shinken] [All] Dispatching receiver satellite with order: receiver-master (spare:False),
[1458292242] INFO: [Shinken] [All] Trying to send configuration to receiver receiver-master
[1458292242] INFO: [Shinken] [All] Dispatch OK of configuration 0 to receiver receiver-master
[1458292242] INFO: [Shinken]  (http://localhost:7773/)
[1458292242] INFO: [Shinken] [All] OK, no more receiver sent need
[1458292242] INFO: [Shinken] [All] Trying to send configuration to receiver receiver-master
[1458292242] INFO: [Shinken] [All] Dispatch OK of configuration to receiver receiver-master

Please feel free to ask if you need more information.

geektophe commented 8 years ago

Are the services configured with localhost as their address (for instance http://localhost:7773/) the result of your log anonymization, or is that your real setup?

torkam commented 8 years ago

Yes, they are all configured with the localhost address, except for the scheduler daemon (using 10.8.0.1 -> the address of the server on the VPN). I followed an online tutorial saying that only the scheduler had to be accessible to the pollers :/ Maybe that was a mistake? (this tutorial -> http://shinkenlab.io/online-course-6-lan-scalability/)

geektophe commented 8 years ago

In fact, I think your poller issue comes from that.

If I've correctly understood your setup, you have two pollers, each on a different machine, but both with localhost as their address.

The addresses are not only important for the scheduler, but for all the services, because they are used by the arbiter to send each service its configuration. Using localhost works if you have a single machine. As soon as there is more than one, and you have more than one instance of a service, it's mandatory to use the real address (DNS name or IP address, it's up to you).

Could you try to replace localhost with the hosts' real addresses and tell me if the problem remains?
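
For instance, on the arbiter side a remote poller would be declared with the address it is actually reachable at, along these lines (the name and address here are placeholders):

define poller {
    poller_name     poller-remote
    address         192.168.5.221   ; the poller machine's own reachable IP, not localhost
    port            7771
    realm           All
}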

torkam commented 8 years ago

I want to add that on the two pollers, only the poller daemon is started.

Sure, I'll do that: in every service configuration file (arbiter, scheduler, poller-master, reactionner and broker) on the master (I have nothing to do on the pollers, right?), I replace "localhost" with "10.8.0.1".

Correct?

geektophe commented 8 years ago

Yes, except for your second poller, which should be configured with its own address :)
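
A quick way to review every declared daemon address at once (the configuration path may differ between installs) would be something like:

grep -rn '^\s*address' /etc/shinken/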

torkam commented 8 years ago

Okay, so I made the requested modification. The IP (10.8.0.1) is configured in the following six files: arbiter-master.cfg, broker-master.cfg, poller-master.cfg, reactionner-master.cfg, receiver-master.cfg and scheduler-master.cfg. On the two pollers, the file poller-master.cfg is still configured with the 'localhost' address.

As a confirmation, please find the ifconfig output of the master (you will see that the address I am using is the one bound to the tun0 network interface):

eth0      Link encap:Ethernet  HWaddr 00:50:56:a7:7b:5d
          inet addr:172.16.0.209  Bcast:172.16.0.255  Mask:255.255.255.0
          inet6 addr: fe80::250:56ff:fea7:7b5d/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:106926592 errors:0 dropped:960226 overruns:0 frame:0
          TX packets:73598994 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:103117172720 (96.0 GiB)  TX bytes:13616107920 (12.6 GiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:211875754 errors:0 dropped:0 overruns:0 frame:0
          TX packets:211875754 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:114477153077 (106.6 GiB)  TX bytes:114477153077 (106.6 GiB)

tun0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:10.8.0.1  P-t-P:10.8.0.2  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:1500  Metric:1
          RX packets:47271 errors:0 dropped:0 overruns:0 frame:0
          TX packets:44660 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:6957146 (6.6 MiB)  TX bytes:5307558 (5.0 MiB)

But there are still problems in the logs :/

Arbiterd.log

[1458312949] TIMEPERIOD TRANSITION: NotSundayandMonday;-1;1
[1458312949] TIMEPERIOD TRANSITION: none;-1;0
[1458312949] TIMEPERIOD TRANSITION: 24x7;-1;1
[1458312949] TIMEPERIOD TRANSITION: workhours;-1;1
[1458313011] WARNING: [Shinken] Add failed attempt to scheduler-master (1/3) Connection error to http://10.8.0.1:7768/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1458313072] WARNING: [Shinken] Add failed attempt to scheduler-master (2/3) Connection error to http://10.8.0.1:7768/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1458313195] WARNING: [Shinken] Add failed attempt to scheduler-master (1/3) Connection error to http://10.8.0.1:7768/ : Operation timed out after 3000 milliseconds with 0 bytes received
[1458313318] WARNING: [Shinken] Add failed attempt to scheduler-master (1/3) Connection error to http://10.8.0.1:7768/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1458313322] WARNING: [Shinken] Scheduler scheduler-master did not managed its configuration 0, I am not happy.
[1458313322] WARNING: [Shinken] [All] The reactionner reactionner-master manage a unmanaged configuration
[1458313322] WARNING: [Shinken] [All] The poller poller-tetrarc manage a unmanaged configuration
[1458313322] WARNING: [Shinken] [All] The poller poller-master manage a unmanaged configuration
[1458313322] WARNING: [Shinken] [All] The poller poller-simedit manage a unmanaged configuration
[1458313322] WARNING: [Shinken] [All] The broker broker-master manage a unmanaged configuration
[1458313322] WARNING: [Shinken] [All] The receiver receiver-master manage a unmanaged configuration
[1458313322] INFO: [Shinken] Dispatching Realm All
[1458313322] INFO: [Shinken] [All] Dispatching 1/1 configurations
[1458313322] INFO: [Shinken] [All] Schedulers order: scheduler-master
[1458313322] INFO: [Shinken] [All] Dispatching configuration 0
[1458313322] INFO: [Shinken] [All] Trying to send conf 0 to scheduler scheduler-master
[1458313326] ERROR: [Shinken] Failed sending configuration for scheduler-master: Connection error to http://10.8.0.1:7768/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1458313326] WARNING: [Shinken] [All] configuration dispatching error for scheduler scheduler-master
[1458313326] WARNING: [Shinken] All schedulers configurations are not dispatched, 1 are missing
[1458313326] INFO: [Shinken] I ask reactionner-master to wait a new conf
[1458313328] INFO: [Shinken] I ask poller-simedit to wait a new conf
[1458313329] INFO: [Shinken] I ask poller-master to wait a new conf
[1458313330] INFO: [Shinken] I ask poller-tetrarc to wait a new conf
[1458313330] INFO: [Shinken] I ask broker-master to wait a new conf
[1458313340] INFO: [Shinken] I ask receiver-master to wait a new conf
[1458313354] INFO: [Shinken] Scheduler configuration 0 is unmanaged!!
[1458313354] WARNING: [Shinken] Missing satellite reactionner for configuration 0:
[1458313354] WARNING: [Shinken] Missing satellite poller for configuration 0:
[1458313354] WARNING: [Shinken] Missing satellite broker for configuration 0:
[1458313354] WARNING: [Shinken] Missing satellite receiver for configuration 0:
[1458313354] INFO: [Shinken] Dispatching Realm All
[1458313354] INFO: [Shinken] [All] Dispatching 1/1 configurations
[1458313354] INFO: [Shinken] [All] Schedulers order: scheduler-master
[1458313354] INFO: [Shinken] [All] Dispatching configuration 0
[1458313354] INFO: [Shinken] [All] Trying to send conf 0 to scheduler scheduler-master
[1458313355] INFO: [Shinken] [All] Dispatch OK of conf in scheduler scheduler-master
[1458313355] INFO: [Shinken] OK, all schedulers configurations are dispatched :)
[1458313355] INFO: [Shinken] [All] Dispatching reactionner satellite with order: reactionner-master (spare:False),
[1458313355] INFO: [Shinken] [All] Trying to send configuration to reactionner reactionner-master
[1458313356] INFO: [Shinken] [All] Dispatch OK of configuration 0 to reactionner reactionner-master
[1458313356] INFO: [Shinken] [All] OK, no more reactionner sent need
[1458313356] INFO: [Shinken] [All] Dispatching poller satellite with order: poller-tetrarc (spare:False), poller-master (spare:False), poller-simedit (spare:False),
[1458313356] INFO: [Shinken] [All] Trying to send configuration to poller poller-tetrarc
[1458313357] INFO: [Shinken] [All] Dispatch OK of configuration 0 to poller poller-tetrarc
[1458313357] INFO: [Shinken] [All] Trying to send configuration to poller poller-master
[1458313358] INFO: [Shinken] [All] Dispatch OK of configuration 0 to poller poller-master
[1458313358] INFO: [Shinken] [All] Trying to send configuration to poller poller-simedit

Pollerd.log

[1458312949] INFO: [Shinken] I correctly loaded the modules: []
[1458312950] INFO: [Shinken] [poller-master] Allocating new fork Worker: 0
[1458312950] INFO: [Shinken] [poller-master] Allocating new fork Worker: 1
[1458312950] INFO: [Shinken] [poller-master] Allocating new fork Worker: 2
[1458312951] INFO: [Shinken] [poller-master] Allocating new fork Worker: 3
[1458312954] ERROR: [Shinken] manage_returns exception:: <class 'shinken.http_client.HTTPException'>,Connection error to http://10.8.0.1:7768/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1458312954] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (3s,120s)
[1458312957] WARNING: [Shinken] [poller-master] Scheduler scheduler-master is not initialized or has network problem: Connection error to http://10.8.0.1:7768/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1458312958] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (3s,120s)
[1458312961] INFO: [Shinken] [poller-master] Connection OK with scheduler scheduler-master
[1458313228] ERROR: [Shinken] manage_returns exception:: <class 'shinken.http_client.HTTPException'>,Connection error to http://10.8.0.1:7768/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1458313228] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (3s,120s)
[1458313228] INFO: [Shinken] [poller-master] Connection OK with scheduler scheduler-master
[1458313233] ERROR: [Shinken] manage_returns exception:: <class 'shinken.http_client.HTTPException'>,Connection error to http://10.8.0.1:7768/ : Operation timed out after 3000 milliseconds with 0 bytes received
[1458313233] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (3s,120s)
[1458313236] WARNING: [Shinken] [poller-master] Scheduler scheduler-master is not initialized or has network problem: Connection error to http://10.8.0.1:7768/ : Operation timed out after 3000 milliseconds with 0 bytes received
[1458313237] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (3s,120s)
[1458313237] INFO: [Shinken] [poller-master] Connection OK with scheduler scheduler-master
[1458313241] ERROR: [Shinken] manage_returns exception:: <class 'shinken.http_client.HTTPException'>,Connection error to http://10.8.0.1:7768/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1458313241] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (3s,120s)
[1458313244] WARNING: [Shinken] [poller-master] Scheduler scheduler-master is not initialized or has network problem: Connection error to http://10.8.0.1:7768/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1458313245] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (3s,120s)
[1458313247] INFO: [Shinken] [poller-master] Connection OK with scheduler scheduler-master
[1458313250] ERROR: [Shinken] manage_returns exception:: <class 'shinken.http_client.HTTPException'>,Connection error to http://10.8.0.1:7768/ : Operation timed out after 3000 milliseconds with 0 bytes received
[1458313250] INFO: [Shinken] [poller-master] Init connection with scheduler-master at http://10.8.0.1:7768/ (3s,120s)
geektophe commented 8 years ago

What I see is that there is a communication issue with the scheduler on 10.8.0.1 (timeouts).

Are you sure your services correctly listen on this address? (check with netstat -lntp) Neither the arbiter nor the pollers manage to connect to the scheduler, yet the arbiter does manage to send the configuration to the pollers.
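
What to look for: the scheduler port must be bound to an address reachable through the VPN, i.e. 0.0.0.0 or 10.8.0.1, not 127.0.0.1. For example:

netstat -lntp | grep 7768   # scheduler port; expect 0.0.0.0:7768 or 10.8.0.1:7768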

There are clearly odd things happening because of your tun0 device.

Could you try your setup with OpenVPN turned off and with addresses on the same subnet?

geektophe commented 8 years ago

If it still fails, could you paste all your master-*.cfg files, plus the ifconfig output of each of your machines (with their names as used in the configuration)?

torkam commented 8 years ago

Hey Geektophe. Hope your weekend was good :)

Here is the result of netstat:

Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:27017         0.0.0.0:*               LISTEN      1999/mongod
tcp        0      0 127.0.0.1:40910         0.0.0.0:*               LISTEN      13501/python2.7
tcp        0      0 127.0.0.1:57519         0.0.0.0:*               LISTEN      13543/python2.7
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      1690/rpcbind
tcp        0      0 0.0.0.0:50000           0.0.0.0:*               LISTEN      13630/python2.7
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      3156/nginx
tcp        0      0 127.0.0.1:28017         0.0.0.0:*               LISTEN      1999/mongod
tcp        0      0 127.0.0.1:42003         0.0.0.0:*               LISTEN      13366/python2.7
tcp        0      0 0.0.0.0:2003            0.0.0.0:*               LISTEN      2264/python
tcp        0      0 0.0.0.0:2004            0.0.0.0:*               LISTEN      2264/python
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      3246/sshd
tcp        0      0 0.0.0.0:7767            0.0.0.0:*               LISTEN      13601/python2.7
tcp        0      0 0.0.0.0:87              0.0.0.0:*               LISTEN      3156/nginx
tcp        0      0 127.0.0.1:3031          0.0.0.0:*               LISTEN      2222/uwsgi
tcp        0      0 0.0.0.0:7768            0.0.0.0:*               LISTEN      13321/python2.7
tcp        0      0 0.0.0.0:7769            0.0.0.0:*               LISTEN      13410/python2.7
tcp        0      0 127.0.0.1:50905         0.0.0.0:*               LISTEN      13323/python2.7
tcp        0      0 0.0.0.0:9465            0.0.0.0:*               LISTEN      17353/openvpn
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN      3103/exim4
tcp        0      0 10.8.0.1:7770           0.0.0.0:*               LISTEN      13542/python2.7
tcp        0      0 0.0.0.0:7002            0.0.0.0:*               LISTEN      2264/python
tcp        0      0 0.0.0.0:53978           0.0.0.0:*               LISTEN      1721/rpc.statd
tcp        0      0 127.0.0.1:52411         0.0.0.0:*               LISTEN      13458/python2.7
tcp        0      0 0.0.0.0:7771            0.0.0.0:*               LISTEN      13364/python2.7
tcp        0      0 0.0.0.0:7772            0.0.0.0:*               LISTEN      13456/python2.7
tcp        0      0 0.0.0.0:7773            0.0.0.0:*               LISTEN      13499/python2.7
tcp        0      0 127.0.0.1:38209         0.0.0.0:*               LISTEN      13412/python2.7
tcp6       0      0 :::111                  :::*                    LISTEN      1690/rpcbind
tcp6       0      0 :::80                   :::*                    LISTEN      2572/apache2
tcp6       0      0 :::22                   :::*                    LISTEN      3246/sshd
tcp6       0      0 :::87                   :::*                    LISTEN      3156/nginx
tcp6       0      0 ::1:25                  :::*                    LISTEN      3103/exim4
tcp6       0      0 :::33638                :::*                    LISTEN      1721/rpc.statd

When I tried with OpenVPN turned off, using pollers I was able to reach without it, the behaviour was the same :/ :(

Here are the master files: Arbiter-master.cfg

define arbiter {
    arbiter_name    arbiter-master
    #host_name      node1               ; CHANGE THIS if you have several Arbiters (like with a spare)
    address         10.8.0.1    ; DNS name or IP
    port            7770
    spare           0           ; 1 = is a spare, 0 = is not a spare

    ## Interesting modules:
    # - named-pipe               = Open the named pipe nagios.cmd
    # - mongodb                  = Load hosts from a mongodb database
    # - pickle-retention-arbiter = Save data before exiting
    # - nsca                     = NSCA server
    # - vmware-auto-linking      = Look up the vSphere server for dependencies
    # - import-glpi              = Import configuration from GLPI (need plugin monitoring for GLPI in server side)
    # - tsca                     = TSCA server
    # - mysql-mport              = Load configuration from a MySQL database
    # - ws-arbiter               = WebService for pushing results to the arbiter
    # - collectd                 = Receive collectd perfdata
    # - snmp-booster             = Snmp bulk polling module, configuration linker
    # - import-landscape         = Import hosts from Landscape (Ubuntu/Canonical management tool)
    # - aws                     = Import hosts from Amazon AWS (here EC2)
    # - ip-tag                  = Tag a host based on its IP range
    # - file-tag                        = Tag a host if it's on a flat file
    # - csv-tag                 = Tag a host from the content of a CSV file

    modules

    # Enable https or not
    use_ssl               0
    # enable certificate/hostname check, will avoid man in the middle attacks
    hard_ssl_name_check   0

    ## Uncomment these lines in a HA architecture so the master and slaves know
    ## how long they may wait for each other.
    #timeout              3   ; Ping timeout
    #data_timeout         120 ; Data send timeout
    #max_check_attempts   3   ; If ping fails N or more, then the node is dead
    #check_interval       60  ; Ping node every N seconds
}

Broker-master.cfg

define broker {
    broker_name     broker-master
    address         10.8.0.1
    port            7772
    spare           0

    ## Optional
    manage_arbiters     1   ; Take data from Arbiter. There should be only one
                            ; broker for the arbiter.
    manage_sub_realms   1   ; Does it take jobs from schedulers of sub-Realms?
    timeout             3   ; Ping timeout
    data_timeout        120 ; Data send timeout
    max_check_attempts  3   ; If ping fails N or more, then the node is dead
    check_interval      60  ; Ping node every N seconds

    ## Modules
    # Default: None
    # Interesting modules that can be used:
    # - simple-log              = just all logs into one file
    # - livestatus              = livestatus listener
    # - tondodb-mysql           = NDO DB support (deprecated)
    # - npcdmod                 = Use the PNP addon
    # - graphite                = Use a Graphite time series DB for perfdata
    # - webui                   = Shinken Web interface
    # - glpidb                  = Save data in GLPI MySQL database
    modules webui2, graphite, livestatus, mongo-logs

    # Enable https or not
    use_ssl               0
    # enable certificate/hostname check, will avoid man in the middle attacks
    hard_ssl_name_check   0

    ## Advanced
    realm   All
}

Poller-master.cfg

define poller {
    poller_name     poller-master
    address         10.8.0.1
    port            7771

    ## Optional
    spare               0   ; 1 = is a spare, 0 = is not a spare
    manage_sub_realms   0   ; Does it take jobs from schedulers of sub-Realms?
    min_workers         0   ; Starts with N processes (0 = 1 per CPU)
    max_workers         0   ; No more than N processes (0 = 1 per CPU)
    processes_by_worker 256 ; Each worker manages N checks
    polling_interval    1   ; Get jobs from schedulers each N seconds
    timeout             3   ; Ping timeout
    data_timeout        120 ; Data send timeout
    max_check_attempts  3   ; If ping fails N or more, then the node is dead
    check_interval      60  ; Ping node every N seconds

    ## Interesting modules that can be used:
    # - booster-nrpe     = Replaces the check_nrpe binary. Therefore it
    #                     enhances performances when there are lot of NRPE
    #                     calls.
    # - named-pipe     = Allow the poller to read a nagios.cmd named pipe.
    #                     This permits the use of distributed check_mk checks
    #                     should you desire it.
    # - snmp-booster     = Snmp bulk polling module
    modules

    ## Advanced Features
    #passive         0       ; For DMZ monitoring, set to 1 so the connections
                             ; will be from scheduler -> poller.

    # Poller tags are the tags that the poller will manage. Use None as tag name to manage
    # untagged checks
    #poller_tags     None

    # Enable https or not
    use_ssl               0
    # enable certificate/hostname check, will avoid man in the middle attacks
    hard_ssl_name_check   0

    realm   All
}

Reactionner-master.cfg

define reactionner {
    reactionner_name    reactionner-master
    address             10.8.0.1
    port                7769
    spare               0

    ## Optional
    manage_sub_realms   0   ; Does it take jobs from schedulers of sub-Realms?
    min_workers         1   ; Starts with N processes (0 = 1 per CPU)
    max_workers         15  ; No more than N processes (0 = 1 per CPU)
    polling_interval    1   ; Get jobs from schedulers each 1 second
    timeout             3   ; Ping timeout
    data_timeout        120 ; Data send timeout
    max_check_attempts  3   ; If ping fails N or more, then the node is dead
    check_interval      60  ; Ping node every N seconds

    ## Modules
    modules

    # Reactionner tags are the tags that the reactionner will manage. Use None as tag name to manage
    # untagged notification/event handlers
    #reactionner_tags     None

    # Enable https or not
    use_ssl               0
    # enable certificate/hostname check, will avoid man in the middle attacks
    hard_ssl_name_check   0

    ## Advanced
    realm   All
}

Receiver-master.cfg

define receiver {
    receiver_name   receiver-master
    address         10.8.0.1
    port            7773
    spare           0

    ## Optional parameters
    timeout             3   ; Ping timeout
    data_timeout        120 ; Data send timeout
    max_check_attempts  3   ; If ping fails N or more, then the node is dead
    check_interval      60  ; Ping node every N seconds

    ## Modules for Receiver
    # - named-pipe             = Open the named pipe nagios.cmd
    # - nsca                    = NSCA server
    # - tsca                    = TSCA server
    # - ws-arbiter              = WebService for pushing results to the arbiter
    # - collectd                = Receive collectd perfdata
    modules

    # Enable https or not
    use_ssl               0
    # enable certificate/hostname check, will avoid man in the middle attacks
    hard_ssl_name_check  0

    ## Advanced Feature
    direct_routing      0   ; If enabled, it will directly send commands to the
                            ; schedulers if it knows about the hostname in the
                            ; command.
    realm   All
}

Scheduler-master.cfg

define scheduler {
    scheduler_name      scheduler-master ; Just the name
    address             10.8.0.1        ; IP or DNS address of the daemon
    port                7768            ; TCP port of the daemon

    ## Optional
    spare               0   ; 1 = is a spare, 0 = is not a spare
    weight              1   ; Some schedulers can manage more hosts than others
    timeout             10  ; Ping timeout
    data_timeout        120 ; Data send timeout
    max_check_attempts  3   ; If ping fails N or more, then the node is dead
    check_interval      60  ; Ping node every N seconds

    ## Interesting modules that can be used:
    # - pickle-retention-file     = Save data before exiting in flat-file
    # - mem-cache-retention   = Same, but in a MemCache server
    # - redis-retention      = Same, but in a Redis server
    # - retention-mongodb    = Same, but in a MongoDB server
    # - nagios-retention     = Read retention info from a Nagios retention file
    #                         (does not save, only read)
    # - snmp-booster             = Snmp bulk polling module
    modules

    ## Advanced Features
    # Realm is for multi-datacenters
    realm   All

    # Skip initial broks creation. Boot fast, but some broker modules won't
    # work with it! (like livestatus for example)
    skip_initial_broks  0

    # In NATted environments, you declare each satellite ip[:port] as seen by
    # *this* scheduler (if port not set, the port declared by satellite itself
    # is used)
    #satellitemap    poller-1=1.2.3.4:7771, reactionner-1=1.2.3.5:7769, ...

    # Enable https or not
    use_ssl               0
    # enable certificate/hostname check, will avoid man in the middle attacks
    hard_ssl_name_check   0
}
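
As an aside, the commented satellitemap directive above is the one intended for NATted topologies such as this VPN; filled in with the tun0 addresses shown below, it would read along these lines (purely illustrative):

satellitemap    poller-tetrarc=10.8.0.31:7771, poller-simedit=10.8.0.30:7771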

And please find below the ifconfig output of each machine: Master (name APS-SHINKEN)

eth0      Link encap:Ethernet  HWaddr 00:50:56:a7:7b:5d
          inet addr:172.16.0.209  Bcast:172.16.0.255  Mask:255.255.255.0
          inet6 addr: fe80::250:56ff:fea7:7b5d/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:149704792 errors:0 dropped:1197564 overruns:0 frame:0
          TX packets:103857328 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:143378425822 (133.5 GiB)  TX bytes:19075490868 (17.7 GiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:300231463 errors:0 dropped:0 overruns:0 frame:0
          TX packets:300231463 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:139969776767 (130.3 GiB)  TX bytes:139969776767 (130.3 GiB)

tun0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:10.8.0.1  P-t-P:10.8.0.2  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:1500  Metric:1
          RX packets:2315062 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1878751 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:280240229 (267.2 MiB)  TX bytes:185767322 (177.1 MiB)

Poller-Tetrarc (name APS-SHINKEN-POLLER)

eth0      Link encap:Ethernet  HWaddr 00:0c:29:db:99:a9
          inet addr:192.168.5.221  Bcast:192.168.5.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fedb:99a9/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16972058 errors:0 dropped:32 overruns:0 frame:0
          TX packets:12788698 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:4984429859 (4.6 GiB)  TX bytes:2460385992 (2.2 GiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:11240530 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11240530 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1131973704 (1.0 GiB)  TX bytes:1131973704 (1.0 GiB)

tun0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:10.8.0.31  P-t-P:10.8.0.29  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:1500  Metric:1
          RX packets:2766907 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2269159 errors:0 dropped:179 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:273196445 (260.5 MiB)  TX bytes:273976207 (261.2 MiB)

Poller-Simedit (name APS-SHINKEN-POLLER)

eth0      Link encap:Ethernet  HWaddr 00:50:56:8a:7d:5e
          inet addr:192.168.51.212  Bcast:192.168.51.255  Mask:255.255.255.0
          inet6 addr: fe80::250:56ff:fe8a:7d5e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:11440625 errors:0 dropped:14788 overruns:0 frame:0
          TX packets:5790634 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1403722971 (1.3 GiB)  TX bytes:1108638282 (1.0 GiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:8610438 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8610438 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:683497358 (651.8 MiB)  TX bytes:683497358 (651.8 MiB)

tun0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:10.8.0.30  P-t-P:10.8.0.29  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:1500  Metric:1
          RX packets:3105146 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3859160 errors:0 dropped:619727 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:330039673 (314.7 MiB)  TX bytes:526414428 (502.0 MiB)

Here is all the information requested :). Please ask if you need anything else.

naparuba commented 8 years ago

Is the problem still open?

torkam commented 8 years ago

Hello Naparuba.

Yes, unfortunately, the problem is still open :/. I was waiting for an answer from geektophe, but maybe it's more complex than expected :/