hilbert / hilbert-cli

Backend management tools: CLI
Apache License 2.0

Station state toggles on and off after power off #81

Closed: elondaits closed this issue 6 years ago

elondaits commented 6 years ago

After stopping bigfoot80, I intermittently received "station down" and "station up" via CheckMK; the state would change roughly every minute.

The state would change from:

{
    "id": "bigfoot80",
    "state": 1,
    "state_type": 1,
    "app_state": 0,
    "app_state_type": 1,
    "app_id": "HITS_IllustrisExplorer"
},

to:

{
    "id": "bigfoot80",
    "state": 0,
    "state_type": 1,
    "app_state": 0,
    "app_state_type": 1,
    "app_id": "HITS_IllustrisExplorer"
},
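
For reading the two records above, here is a minimal sketch that decodes the numeric codes. It assumes `state` is a host state and `app_state` a service state using the standard Nagios-style codes that CheckMK exposes (host: 0 = UP, 1 = DOWN, 2 = UNREACHABLE; service: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN; state type: 0 = SOFT, 1 = HARD). The helper name `describeRecord` is hypothetical and not part of hilbert-cli.

```javascript
// Standard Nagios-style codes (assumed to apply to the records above).
const HOST_STATES = ['UP', 'DOWN', 'UNREACHABLE'];
const SERVICE_STATES = ['OK', 'WARNING', 'CRITICAL', 'UNKNOWN'];
const STATE_TYPES = ['SOFT', 'HARD'];

// Hypothetical helper: turn the numeric fields of a merged record into labels.
function describeRecord(record) {
  return {
    id: record.id,
    host: `${HOST_STATES[record.state]} (${STATE_TYPES[record.state_type]})`,
    app: `${SERVICE_STATES[record.app_state]} (${STATE_TYPES[record.app_state_type]})`,
    app_id: record.app_id,
  };
}

// The two records reported in this issue.
const before = {
  id: 'bigfoot80', state: 1, state_type: 1,
  app_state: 0, app_state_type: 1, app_id: 'HITS_IllustrisExplorer',
};
const after = { ...before, state: 0 };

console.log(describeRecord(before).host); // DOWN (HARD)
console.log(describeRecord(after).host);  // UP (HARD)
```

Under these assumptions, the only field that toggles is the host state (DOWN to UP), while the application fields stay the same.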

malex984 commented 6 years ago

I really need to know where this data comes from (which query is used to get it). There are 2 checks/services that provide APP_ID: dockapp_top1 and dockapp_heartbeat. Where do state and state_type come from?

elondaits commented 6 years ago

What I copied above is the merge of two queries. First I do

.get('hosts')
      .columns(['name', 'state', 'state_type'])
      .asColumns(['id', 'state', 'state_type'])

and then

.get('services')
      .columns(['host_name', 'state', 'state_type', 'plugin_output'])
      .asColumns(['id', 'app_state', 'app_state_type', 'app_id'])
      .filter('description = dockapp_top1')

So state and state_type come from the hosts query. The syntax above is that of my wrapper functions, but it maps directly to the CheckMK query syntax: get('hosts') is translated to GET hosts, and so on.
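
To make the mapping concrete, here is a rough sketch of how the two wrapper calls above could translate into Livestatus query text (`GET`, `Columns:` and `Filter:` are real Livestatus headers) and how the two results could be merged by station id. The functions `buildQuery`, `renameRow` and `mergeById` are hypothetical illustrations, not the actual wrapper, and the sample rows are taken from the records in this issue rather than from a live query.

```javascript
// Hypothetical: render a Livestatus query as text.
// A query ends with an empty line.
function buildQuery(table, columns, filters = []) {
  const lines = [`GET ${table}`, `Columns: ${columns.join(' ')}`];
  for (const filter of filters) {
    lines.push(`Filter: ${filter}`);
  }
  return lines.join('\n') + '\n\n';
}

// Hypothetical: name the values of a result row (the role of asColumns above).
function renameRow(row, names) {
  return Object.fromEntries(names.map((name, i) => [name, row[i]]));
}

// Hypothetical: merge service fields into the host record with the same id.
function mergeById(hostRecords, serviceRecords) {
  const byId = new Map(hostRecords.map((r) => [r.id, { ...r }]));
  for (const service of serviceRecords) {
    const target = byId.get(service.id);
    if (target) Object.assign(target, service);
  }
  return [...byId.values()];
}

const hostsQuery = buildQuery('hosts', ['name', 'state', 'state_type']);
const servicesQuery = buildQuery(
  'services',
  ['host_name', 'state', 'state_type', 'plugin_output'],
  ['description = dockapp_top1'],
);

// Sample rows shaped like the data in this issue (not real query output).
const hosts = [renameRow(['bigfoot80', 1, 1], ['id', 'state', 'state_type'])];
const services = [renameRow(
  ['bigfoot80', 0, 1, 'HITS_IllustrisExplorer'],
  ['id', 'app_state', 'app_state_type', 'app_id'],
)];

const merged = mergeById(hosts, services);
console.log(servicesQuery);
console.log(JSON.stringify(merged, null, 4));
```

Keeping the host and service fields under different names (`state` vs. `app_state`) is what allows the merge to be a plain `Object.assign` without collisions.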

malex984 commented 6 years ago

@elondaits Thanks. I will try to recover the monitoring history. Please let me know ASAP if you see this effect again!

elondaits commented 6 years ago

I won't test on the HITS server again until I have something new to test, so I probably won't see it soon. I assume that if you try it yourself you'll see the same thing, since I didn't do anything special: I just stopped the server with the dashboard and waited.

malex984 commented 6 years ago

OK, the problem is that Intel AMT was configured to answer PINGs on those hosts, so they kept responding even while powered off, but AMT sometimes missed a few pings. OMD was therefore still able to PING them, with some lost PINGs. Apparently OMD/CheckMK determines host state by whether the host can be PINGed, so the hosts kept switching between UP and DOWN.
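
The effect described above can be illustrated with a deliberately simplified model, assuming each check maps one ping result directly to a host state (0 = UP, 1 = DOWN). Real CheckMK checks are more involved (several packets per check, retries before a HARD state change), so this is only a sketch of the idea; `pingsToStates` and `countStateChanges` are hypothetical names and the ping sequences are made up for illustration.

```javascript
// Simplified assumption: a successful ping means UP (0), a lost ping means DOWN (1).
function pingsToStates(pings) {
  return pings.map((answered) => (answered ? 0 : 1));
}

// Count how often consecutive samples differ, i.e. how often the state toggles.
function countStateChanges(states) {
  let changes = 0;
  for (let i = 1; i < states.length; i++) {
    if (states[i] !== states[i - 1]) changes++;
  }
  return changes;
}

// Powered-off station whose AMT answers most pings but drops some (made-up data).
const amtAnswering = [true, true, false, true, false, false, true];
// Powered-off station with AMT ping responses disabled: nothing answers.
const amtSilent = [false, false, false, false, false, false, false];

console.log(countStateChanges(pingsToStates(amtAnswering))); // 4
console.log(countStateChanges(pingsToStates(amtSilent)));    // 0
```

In this model the station with AMT answering toggles repeatedly, while the station that never answers stays DOWN, which matches the intended "station down" behaviour after power off.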

malex984 commented 6 years ago

Easy solution: disable AMT ping responses via http://HOST_NAME_OR_IP:16992/ip.htm (the user is usually admin, and a pre-configured password is required). NOTE: that is http, not https! PS: see https://www.symantec.com/connect/articles/who-responding-ping-intel-amt-or-os