shinken-solutions / shinken

Flexible and scalable monitoring framework
http://www.shinken-monitoring.org
GNU Affero General Public License v3.0

Host cfg file deleted, shinken-arbiter restarted but host still in webui2 ... #1916

Open serge-marie opened 7 years ago

serge-marie commented 7 years ago

Hi,

I just deleted the cfg files of several hosts:

[shinken@arbiter-master ~]$ find /etc/shinken/hosts/ -name WP000INF0051.cfg
[shinken@arbiter-master ~]$ find /etc/shinken/hosts/ -name WP000INF0070.cfg
[shinken@arbiter-master ~]$ find /etc/shinken/hosts/ -name WP000INF0071.cfg
[shinken@arbiter-master ~]$ find /etc/shinken/hosts/ -name WP000INF0087.cfg
[shinken@arbiter-master ~]$ ls -al /etc/shinken/hosts/ | wc -l
3715

As you can see, the cfg files are no longer present in /etc/shinken/hosts.

I ran "service shinken-arbiter restart" on the arbiter master, but the hosts are still shown in webui2.

I'm using Shinken 2.4.3 and WebUI 2.4.2c.

I'm also using the mongodb-retention and mongodb-dt-ct-retention modules for data retention.

Thanks in advance.

Serge
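
The `find` commands above only show that no file named after each host remains. A host can also be declared in any other cfg file that the arbiter loads, so searching for the `host_name` directive itself may be a useful extra check. The following is a minimal sketch, assuming GNU grep and the usual `host_name <name>` object syntax; the function name `files_defining_host` is made up for illustration.

```shell
#!/bin/sh
# List the cfg files under DIR that still contain a host_name directive
# for HOST. Hypothetical helper, not part of Shinken.
# Assumes GNU grep (-r and --include) and a HOST without regex metacharacters.
files_defining_host() {
    host=$1
    dir=$2
    # Match "host_name <HOST>" followed by whitespace, a ';' comment or end of line,
    # so that WP000INF005 does not match WP000INF0051.
    grep -rlE --include='*.cfg' \
        "^[[:space:]]*host_name[[:space:]]+${host}([[:space:];]|\$)" "$dir"
}
```

For instance, `files_defining_host WP000INF0051 /etc/shinken` prints nothing and returns a non-zero status when no cfg file under that directory declares the host.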

geektophe commented 7 years ago

If you grep the broker and scheduler logs, do you see the message indicating that a new configuration has been loaded ?

grep "New configuration loaded" /var/lib/shinken/schedulerd.log
grep "We have our schedulers" /var/lib/shinken/brokerd.log

The timestamps should be later than the arbiter restart time.
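
To compare the last such log message with the arbiter restart time without reading timestamps by eye, something like the sketch below could help. It assumes the `[epoch] LEVEL: message` line layout seen in the logs in this thread; the function names and the idea of passing the restart time as an epoch value are assumptions for illustration.

```shell
#!/bin/sh
# Print the epoch timestamp of the last line in FILE that matches PATTERN.
# Assumes each log line starts with "[<epoch>]"; prints nothing otherwise.
last_match_epoch() {
    pattern=$1
    file=$2
    grep -- "$pattern" "$file" | tail -n 1 |
        sed -n 's/^\[\([0-9][0-9]*\)\].*/\1/p'
}

# Succeed if the last match of PATTERN in FILE was logged at or after
# RESTART_EPOCH (seconds since the epoch, supplied by the caller).
loaded_after_restart() {
    pattern=$1
    file=$2
    restart_epoch=$3
    last=$(last_match_epoch "$pattern" "$file")
    [ -n "$last" ] && [ "$last" -ge "$restart_epoch" ]
}

# Example (the log path depends on the installation):
#   loaded_after_restart "New configuration loaded" /var/log/shinken/schedulerd.log 1497959760
```

The restart time has to be recorded separately, for example by noting the output of `date +%s` just before restarting the arbiter.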

serge-marie commented 7 years ago

Hi,

I have the first one (New configuration loaded) on the 2 pollers :

[shinken@lp063inf8505 ~]$ ps -efd | grep -i "arbiter"
shinken 67040 1 38 13:56 ? 00:00:29 python2.7 /usr/bin/shinken-arbiter -d -c /etc/shinken/shinken.cfg
shinken 67041 67040 0 13:56 ? 00:00:00 python2.7 /usr/bin/shinken-arbiter -d -c /etc/shinken/shinken.cfg
shinken 67061 67040 0 13:56 ? 00:00:00 python2.7 /usr/bin/shinken-arbiter -d -c /etc/shinken/shinken.cfg
shinken 70792 126971 0 13:57 pts/11 00:00:00 grep --color=auto -i arbiter

On the pollers:

[shinken@lp063inf8507 ~]$ date -d @$(grep "New configuration" /var/log/shinken/schedulerd.log | tail -n 1 | cut -d" " -f 1 | sed -e 's/\[//g' -e 's/\]//g')
Tue Jun 20 13:57:02 CEST 2017
[shinken@lp063inf8508 ~]$ date -d @$(grep "New configuration" /var/log/shinken/schedulerd.log | tail -n 1 | cut -d" " -f 1 | sed -e 's/\[//g' -e 's/\]//g')
Tue Jun 20 13:56:49 CEST 2017

I have in arbiterd.log :

141285:[1497959781] INFO: [Shinken] Checking schedulers...
141286:[1497959781] INFO: [Shinken] Checked 4 schedulers
142111:[1497959801] INFO: [Shinken] [All] Schedulers order: scheduler-lp063inf8509,scheduler-lp063inf8510,scheduler-lp063inf8507,scheduler-lp063inf8508
142113:[1497959801] INFO: [Shinken] [All] Trying to send conf 0 to scheduler scheduler-lp063inf8508
142114:[1497959802] INFO: [Shinken] [All] Dispatch OK of conf in scheduler scheduler-lp063inf8508
142116:[1497959802] INFO: [Shinken] [All] Trying to send conf 1 to scheduler scheduler-lp063inf8507
142117:[1497959804] INFO: [Shinken] [All] Dispatch OK of conf in scheduler scheduler-lp063inf8507
142118:[1497959804] INFO: [Shinken] OK, all schedulers configurations are dispatched :)

Serge

geektophe commented 7 years ago

OK, so you should have a message displayed in your scheduler log indicating that the broker came and fetched the initial broks.

grep 'Created [0-9]\+ initial Broks for broker' /var/lib/shinken/schedulerd.log

If not, could you try restarting the broker service and check whether the hosts have disappeared?

serge-marie commented 7 years ago

Hi,

I have :

[shinken@lp063inf8505 ~]$ grep "Created [0-9]\+ initial Broks for broker" /var/log/shinken/schedulerd.log | head -n 1
[1497959844] INFO: [Shinken] [scheduler-lp063inf8508] Created 20717 initial Broks for broker broker-lp063inf8505
[shinken@lp063inf8505 ~]$ date -d @1497959844
Tue Jun 20 13:57:24 CEST 2017

Serge
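
Converting each epoch by hand with `date -d @...`, as above, gets tedious when several log lines have to be compared. A small filter can rewrite the leading `[epoch]` of every line instead. This is only a sketch: it assumes GNU date (for `-d @epoch`) and the `[epoch] LEVEL: message` layout shown in this thread, and `humanize_log` is a made-up name.

```shell
#!/bin/sh
# Read log lines on stdin and replace a leading "[<epoch>]" with a UTC
# date and time. Lines without such a prefix are passed through unchanged.
# Assumes GNU date, which accepts "-d @<epoch>".
humanize_log() {
    while IFS= read -r line; do
        case $line in
            \[[0-9]*\]*)
                epoch=${line#\[}        # drop the leading "["
                epoch=${epoch%%\]*}     # keep everything before the first "]"
                rest=${line#*\]}        # keep everything after the first "]"
                case $epoch in
                    *[!0-9]*)
                        # Not a pure number: leave the line alone.
                        printf '%s\n' "$line" ;;
                    *)
                        printf '[%s]%s\n' \
                            "$(date -u -d "@$epoch" '+%Y-%m-%d %H:%M:%S')" "$rest" ;;
                esac
                ;;
            *)
                printf '%s\n' "$line" ;;
        esac
    done
}

# Example: grep "initial Broks" /var/log/shinken/schedulerd.log | humanize_log
```

The output is in UTC, so it is two hours behind the CEST times quoted in this thread.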

serge-marie commented 7 years ago

I had to reboot the arbiters (master and spare), and the hosts are now gone from the webui ...

geektophe commented 7 years ago

When you say reboot, do you mean a physical reboot?

serge-marie commented 7 years ago

yes :(

geektophe commented 7 years ago

That's quite weird... Perhaps there is a problem with the startup script you use.

Which distribution are you using, and what's your deployment process ?

serge-marie commented 7 years ago

Hi,

Red Hat 7.2.

What do you mean by "what's your deployment process"?

We use systemctl or service commands

Serge

geektophe commented 7 years ago

By deployment process, I mean: how do you deploy your configuration, and which services do you restart? Do you synchronize the configuration between the arbiters and restart the shinken-arbiter service on both servers?

serge-marie commented 7 years ago

We create cfg files and manage the arbiter daemon on the active master only (reload when we add a cfg file, restart when we remove one).

The /etc/shinken folder is synchronized with a git repository on the 2 arbiters.

geektophe commented 7 years ago

I personally restart both arbiters (slave, then master) to ensure I always have the latest configuration, even if the master fails to start.

Would you mind running a similar test again (roll back your host deletion, then delete the hosts again) with your current process to see whether the problem remains? If it does, could you try restarting both arbiters as I do?

Many thanks !

serge-marie commented 7 years ago

OK, I will try the next time the problem occurs.