opinkerfi / adagios

Adagios - Web Based Nagios Configuration
GNU Affero General Public License v3.0

okconfig->Add host. How to customise 'define host' and 'define service' in *-host.cfg? #568

Closed: spicysomtam closed this issue 9 years ago

spicysomtam commented 9 years ago

I wanted to override the defaults for this, primarily because we need longer ping timeouts than your default settings (held in /usr/share/okconfig) provide.

The default settings specify 'use okc-default-host' in the 'define host' and 'use okc-check_ping' for the host's Ping service check. Following these through, okc-default-host uses 'check_command check-host-alive'. check-host-alive is an object in /etc/nagios/objects/commands.cfg, so it can be customised. The service okc-check_ping uses 'check_command okc-check_ping' and specifies custom variables. The command and service okc-check_ping live under /usr/share/okconfig and should not be customised. However, the custom variables could be set at the highest level, in the host's *-host.cfg file, but how is that done?

So, in summary, I could customise the default host check by tweaking check-host-alive in /etc/nagios/objects/commands.cfg. I am OK with that.
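For reference, the stock Nagios sample commands.cfg defines check-host-alive as check_ping with '-w 3000.0,80% -c 5000.0,100% -p 5'. A minimal sketch of a loosened version is below, assuming the standard $USER1$ plugin path; the threshold and timeout values are only illustrative, not a recommendation. Edit the existing definition in place rather than adding a second one, since Nagios rejects duplicate command names.

```nagios
# /etc/nagios/objects/commands.cfg (sketch, illustrative values)
# Stock sample:  -w 3000.0,80% -c 5000.0,100% -p 5
# This version warns at 4s RTA or 80% loss, goes critical at 8s RTA or
# 100% loss, sends 5 packets, and allows the plugin 30s overall (-t).
define command {
    command_name    check-host-alive
    command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 4000.0,80% -c 8000.0,100% -p 5 -t 30
}
```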

I could also change the service okc-check_ping by changing the values of its custom variables, but that would mean editing /usr/share/okconfig/templates/misc/services.cfg, and the Adagios web GUI won't allow this. So is the answer to set the custom variables in the *-host.cfg file? If so, how is the host's service check customised?

Also I observe:

- The web page does not allow the 'use' in 'define host' to be specified, although it can be specified on the okconfig command line (--use).

- The 'use okc-check_ping' in the Ping service check cannot be overridden either (similar to above).

- What is --host_template in the okconfig command, and where would a custom one live if you created one?

hakong commented 9 years ago

I haven't tried this myself, but I expect you could copy /usr/share/okconfig/examples/host.cfg-example somewhere and edit:

# We are still using Ping as a service in addition to check-host-alive, all hosts
# deserve a ping
define service {
    use                         okc-check_ping
    host_name                   HOSTNAME
    service_description         Ping

    #__CRITICAL_PACKETLOSS      40%
    #__CRITICAL_ROUND_TRIP      500.0
    #__WARNING_PACKETLOSS       20%
    #__WARNING_ROUND_TRIP       100.0
}

Either change the ping template, or uncomment the macros (__CRITICAL_PACKETLOSS, etc.) and raise the round-trip and packet-loss thresholds. Then use okconfig on the command line to install it, specifying the new host template file.

spicysomtam commented 9 years ago

Wow, that works! Thanks for the reply.

I copied /usr/share/okconfig/examples/host.cfg-example to /etc/nagios/okconfig/examples/ and then edited it in situ: I uncommented the custom variables, set them to the values I want, and it works!

define host {
    use                         PARENTHOST
    host_name                   HOSTNAME
    address                     IPADDR
    alias                       ALIAS
    contact_groups              GROUP
    hostgroups                  GROUP
}

# This is a template service for HOSTNAME
# Services that belong to this host should use this as a template
define service {
    name                        HOSTNAME
    use                         GROUP-default_service
    host_name                   HOSTNAME
    contact_groups              GROUP
    service_groups              GROUP
    register                    0
}

# We are still using Ping as a service in addition to check-host-alive, all hosts
# deserve a ping
define service {
    use                         okc-check_ping
    host_name                   HOSTNAME
    service_description         Ping
    __CRITICAL_PACKETLOSS      100%
    __CRITICAL_ROUND_TRIP      5000.0
    __WARNING_PACKETLOSS       80%
    __WARNING_ROUND_TRIP       3000.0
}
hakong commented 9 years ago

Cool. I'm curious, what kind of hosts need such high RTA thresholds? A satellite connection?

spicysomtam commented 9 years ago

I took them from the old Nagios config that's been running for 6 years or so :) We are in the UK and have two data centers in the US (Midwest). I guess they need reviewing, as they are system-wide, but the standard values in your templates were raising alarms for the US hosts. I have to leave now, but I'll look at the plugin more closely tomorrow, try to tune it, and get back to you.

The old Nagios config is horrible (Nagios XI based). Every service definition has its own check interval, retry interval and retries defined. I am trying to get away from that and set those in a 'use' template instead, e.g. a single standard template used across all Linux services. Simpler and cleaner. Ping/host checks run every 5 minutes and service checks once an hour. Puppet should restart failed services in between, meaning fewer alerts.
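A minimal sketch of the shared-template approach described above, so that interval and retry settings live in one place. It assumes the stock 'generic-service' template and 'check_ssh' command from the Nagios sample configuration are loaded; the template name 'linux-service' is hypothetical. With the default interval_length of 60 seconds, check_interval 60 means once an hour.

```nagios
# Hypothetical shared template for all Linux service checks.
define service {
    name                    linux-service   ; hypothetical template name
    use                     generic-service ; stock Nagios sample template
    check_interval          60              ; once an hour (interval_length defaults to 60s)
    retry_interval          5               ; re-check every 5 minutes while in a soft state
    max_check_attempts      3               ; go HARD after 3 consecutive non-OK results
    register                0               ; template only, never a real service
}

# Individual services then carry only what is specific to them.
define service {
    use                     linux-service
    host_name               HOSTNAME
    service_description     SSH
    check_command           check_ssh
}
```

Changing the hourly interval later then means editing one template rather than every service definition.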

Your help is much appreciated, and hopefully this will help someone else. It fills in some of the documentation gaps.

spicysomtam commented 9 years ago

Regarding the long RTAs: I think they were put in place to stop network issues from generating lots of notifications. If there are any intermittent comms issues, it takes longer to alert. I will stick with the 3s/5s settings for now.

I decided to redo this. I removed those custom variables and created copies of okc-default-host and okc-check_ping (co-default-host and co-default-ping), so I can customise the check intervals via 'use'. Otherwise, if I want to change these later, I would have to bulk-edit all the *-host.cfg files. The new /etc/nagios/okconfig/examples/host.cfg-example:

# Use co-* templates copied from the defaults so the check intervals/values can be customised.
define host {
    use                         co-default-host
    host_name                   HOSTNAME
    address                     IPADDR
    alias                       ALIAS
    contact_groups              GROUP
    hostgroups                  GROUP
}

# This is a template service for HOSTNAME
# Services that belong to this host should use this as a template
define service {
    name                        HOSTNAME
    use                         GROUP-default_service
    host_name                   HOSTNAME
    contact_groups              GROUP
    service_groups              GROUP
    register                    0
}

# We are still using Ping as a service in addition to check-host-alive, all hosts
# deserve a ping.
define service {
    use                         co-default-ping
    host_name                   HOSTNAME
    service_description         Ping
}
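The co-* templates referenced above are not shown in the thread. One possible way to write them is sketched below. Note that it swaps copying for inheritance: each co-* template uses the matching okc-* template and overrides only the intervals and ping thresholds, so later okconfig updates to the okc-* templates still flow through. The interval values follow the 5-minute ping/host schedule mentioned earlier; the retry values and the file location (any .cfg file Nagios loads) are assumptions.

```nagios
# Hypothetical co-* templates, placed in any .cfg file Nagios loads.
define host {
    name                    co-default-host
    use                     okc-default-host   ; inherit everything else from okconfig
    check_interval          5                  ; host check every 5 minutes
    retry_interval          1                  ; assumed value
    max_check_attempts      3                  ; assumed value
    register                0                  ; template only
}

define service {
    name                    co-default-ping
    use                     okc-check_ping     ; inherit command and defaults from okconfig
    check_interval          5                  ; Ping service every 5 minutes
    retry_interval          1                  ; assumed value
    max_check_attempts      3                  ; assumed value
    __WARNING_ROUND_TRIP    3000.0             ; thresholds from earlier in this thread
    __CRITICAL_ROUND_TRIP   5000.0
    __WARNING_PACKETLOSS    80%
    __CRITICAL_PACKETLOSS   100%
    register                0                  ; template only
}
```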
hakong commented 9 years ago

Cool. If you want to minimize alerts during network issues, you could use network parents.
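A minimal sketch of network parents, using the Nagios 'parents' host directive. When a host fails its check and all of its parents are DOWN, Nagios marks it UNREACHABLE rather than DOWN, and leaving 'u' out of notification_options suppresses notifications for that state. The gateway name and address are hypothetical (192.0.2.1 is a documentation address), and the sketch assumes co-default-host supplies the remaining required host directives.

```nagios
# Hypothetical WAN gateway for one of the US data centers.
define host {
    use                     co-default-host
    host_name               us-dc1-gateway     ; hypothetical name
    address                 192.0.2.1          ; documentation address, replace
    alias                   US DC1 WAN gateway
    contact_groups          GROUP
}

# Hosts behind the gateway name it as their parent. If the gateway is
# DOWN when they fail, they become UNREACHABLE instead of DOWN.
define host {
    use                     co-default-host
    host_name               HOSTNAME
    address                 IPADDR
    alias                   ALIAS
    contact_groups          GROUP
    hostgroups              GROUP
    parents                 us-dc1-gateway
    notification_options    d,r                ; notify on DOWN and RECOVERY, not UNREACHABLE
}
```

A WAN outage then produces one alert for the gateway rather than one per host behind it.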