shinken-solutions / shinken

Flexible and scalable monitoring framework
http://www.shinken-monitoring.org
GNU Affero General Public License v3.0

[2.0.3 & 2.2] Service state worsening isn't (re)notifying contacts #1453

Open openglx opened 9 years ago

openglx commented 9 years ago

This issue is incredibly similar to what was reported in #1329: service notifications are not being sent when a service goes from warning to critical.

Our notification goals are similar: groups receive e-mails in every case (recovery, warning, critical), but on-call mobiles only receive critical notifications and recoveries (from a critical).
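As a rough illustration of that goal, here is a minimal Python sketch that maps each service state to the contacts whose notification options include it. It is not Shinken code: the contact names and option letters are taken from the configuration in this report, while `STATE_TO_OPTION` and `expected_recipients` are hypothetical names. The filter is stateless, so it cannot express "recovery only after a critical" for the on-call contact.

```python
# Hypothetical model of the two notification ways in this report.
# Only the contact names and option letters come from the posted config.
SERVICE_OPTIONS = {
    "EMAIL_group": {"w", "u", "c", "r"},
    "EMAIL_oncall_mobile": {"c", "r"},
}

# Assumed mapping from service state to its notification option letter.
STATE_TO_OPTION = {"WARNING": "w", "UNKNOWN": "u", "CRITICAL": "c", "OK": "r"}


def expected_recipients(state):
    """Return the contacts whose options include the given service state."""
    flag = STATE_TO_OPTION[state]
    return sorted(c for c, opts in SERVICE_OPTIONS.items() if flag in opts)


for state in ("WARNING", "CRITICAL", "OK"):
    print(state, expected_recipients(state))
```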

The tests were run on 2.0.3 with one host, one service, one notification command and two contacts in a group. Each contact has a different notification way:

# Just for test...
interval_length=1

retention_update_interval=60
max_service_check_spread=5
max_host_check_spread=5
service_check_timeout=60
timeout_exit_status=2
flap_history=20
max_plugins_output_length=65536
enable_problem_impacts_states_change=1
disable_old_nagios_parameters_whining=0
enable_environment_macros=0
log_initial_states=0
no_event_handlers_during_downtimes=1
pack_distribution_file=/var/lib/shinken/pack_distribution.dat
workdir=/var/lib/shinken/
lock_file=/var/run/shinken/arbiterd.pid
local_log=/var/log/shinken/arbiterd.log
shinken_user=nagios
shinken_group=nagios
modules_dir=/var/lib/shinken/modules
daemon_enabled=1
use_ssl=0
ca_cert=/etc/shinken/certs/ca.pem
server_cert=/etc/shinken/certs/server.cert
server_key=/etc/shinken/certs/server.key
hard_ssl_name_check=0
http_backend=auto

###############################################################################

define command {
    command_name check_host
    command_line /bin/true
}

define command {
    command_name check_service
    command_line /usr/local/bin/check_service.sh
}

define command {
    command_name notify
    command_line /usr/local/bin/notify.sh
}

###############################################################################

define timeperiod{
        timeperiod_name                 24x7
        alias                           Always
        sunday                          00:00-24:00
        monday                          00:00-24:00
        tuesday                         00:00-24:00
        wednesday                       00:00-24:00
        thursday                        00:00-24:00
        friday                          00:00-24:00
        saturday                        00:00-24:00
}

###############################################################################

define notificationway{
    notificationway_name            email_group
    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    w,u,c,r
    host_notification_options       d,u,r,s
    host_notification_commands      notify
    service_notification_commands   notify
    min_business_impact 1
}

define notificationway{
    notificationway_name            email_oncall
    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    c,r
    host_notification_options       d,u,r,s
    host_notification_commands      notify
    service_notification_commands   notify
    min_business_impact 2
}

###############################################################################

define contact {
    name                            contact-mailgroup
    register                        0
    host_notifications_enabled      1
    service_notifications_enabled   1
    notificationways                email_group
}
define contact {
    name                            contact-oncall
    register                        0
    host_notifications_enabled      1
    service_notifications_enabled   1
    notificationways                email_oncall
}

define contact {
    use contact-mailgroup
    contact_name EMAIL_group
    alias me
    email me@domain.tld
    password me
    address1 0166666666
    is_admin 1
    contactgroups Group1
}

define contact {
    use contact-oncall
    contact_name EMAIL_oncall_mobile
    alias myphone
    email phone@address
    password none
    contactgroups Group1
}

###############################################################################

define contactgroup{
  contactgroup_name     NOC
  alias                 Network Operations Centre
  contactgroup_members  Group1
}

define contactgroup{
  contactgroup_name     Group1
  alias                 subgroup of NOC
}

###############################################################################

define host{
    name                host-generic
    max_check_attempts        3  
    check_interval            1  
    active_checks_enabled        1
    check_period            24x7
    notification_interval        0 
    notification_period        24x7
    notification_options        d,u,r 
    notifications_enabled        1
    event_handler_enabled        0
    flap_detection_enabled        0 
    process_perf_data        1
    register 0
}

define host{
    name                host-group1
    use                 host-generic
    contacts            EMAIL_group,EMAIL_oncall_mobile
    contact_groups       NOC
    register            0
}

define host {
    use host-group1
    host_name somehost
    address somehost.tld
    check_command check_host
}

###############################################################################

define service {
    name             service-interval-1min
    check_interval   1
    retry_interval   1
    register         0
    use              service-generic
}

define service{
        name                            service-generic         
        active_checks_enabled           1                       
        passive_checks_enabled          1                               
        parallelize_check               1                       
        obsess_over_service             1                       
        check_freshness                 1                       
        freshness_threshold             3600
        notifications_enabled           1                       
        notification_interval           0     
        notification_period             24x7  
        event_handler_enabled           0                       
        flap_detection_enabled          0     
        failure_prediction_enabled      1                       
        process_perf_data               1                       
        retain_status_information       1                       
        retain_nonstatus_information    1                       
        is_volatile                     0                       
        check_period                    24x7                    
        max_check_attempts              3                       
        notification_options            w,u,c,r                 
        stalking_options                o,w,u,c
        register                        0                       
}

define service {
    use service-interval-1min
    service_description Service
    host_name somehost
    check_command check_service
}

After starting Shinken 2.0.3 (check_service returning 0):

2015-01-13 14:30:37,278 [1421159437] SERVICE ALERT: somehost;Service;OK;HARD;3;exit code is 0

Changing check_service to return 1 only notifies the group [this is the expected behaviour]:

2015-01-13 14:31:18,351 [1421159478] SERVICE ALERT: somehost;Service;WARNING;SOFT;1;exit code is 1
2015-01-13 14:31:22,359 [1421159482] SERVICE ALERT: somehost;Service;WARNING;SOFT;2;exit code is 1
2015-01-13 14:31:26,366 [1421159486] SERVICE ALERT: somehost;Service;WARNING;HARD;3;exit code is 1
2015-01-13 14:31:26,368 [1421159486] SERVICE NOTIFICATION: EMAIL_group;somehost;Service;WARNING;notify;exit code is 1

After changing it to return 2, we would expect both contacts to receive a notification, but only the on-call contact receives one:

2015-01-13 14:32:33,504 [1421159553] SERVICE ALERT: somehost;Service;CRITICAL;HARD;3;exit code is 2
2015-01-13 14:32:33,505 [1421159553] SERVICE NOTIFICATION: EMAIL_oncall_mobile;somehost;Service;CRITICAL;notify;exit code is 2

Clearing it to OK (check_service returning 0) notifies both contacts [as expected, since both should have been notified about the problem]:

2015-01-13 14:33:33,618 [1421159613] SERVICE ALERT: somehost;Service;OK;HARD;3;exit code is 0
2015-01-13 14:33:33,619 [1421159613] SERVICE NOTIFICATION: EMAIL_group;somehost;Service;OK;notify;exit code is 0
2015-01-13 14:33:33,619 [1421159613] SERVICE NOTIFICATION: EMAIL_oncall_mobile;somehost;Service;OK;notify;exit code is 0

For the record, warning->ok and ok->warning work as expected [only the EMAIL_group contact receives those]:

2015-01-13 14:34:05,676 [1421159645] SERVICE ALERT: somehost;Service;WARNING;SOFT;1;exit code is 1
2015-01-13 14:34:09,683 [1421159649] SERVICE ALERT: somehost;Service;WARNING;SOFT;2;exit code is 1
2015-01-13 14:34:14,692 [1421159654] SERVICE ALERT: somehost;Service;WARNING;HARD;3;exit code is 1
2015-01-13 14:34:14,695 [1421159654] SERVICE NOTIFICATION: EMAIL_group;somehost;Service;WARNING;notify;exit code is 1
2015-01-13 14:34:28,724 [1421159668] SERVICE ALERT: somehost;Service;OK;HARD;3;exit code is 0
2015-01-13 14:34:28,725 [1421159668] SERVICE NOTIFICATION: EMAIL_group;somehost;Service;OK;notify;exit code is 0

It is worth mentioning that when the service transitions directly from OK->CRITICAL, both contacts receive the notification and the eventual recovery:

[1421161278] INFO: [Shinken] Stalking Service: exit code is 0
[1421161278] SERVICE ALERT: somehost;Service;OK;HARD;3;exit code is 0
[1421161278] SERVICE NOTIFICATION: EMAIL_group;somehost;Service;OK;notify;exit code is 0
[1421161278] SERVICE NOTIFICATION: EMAIL_oncall_mobile;somehost;Service;OK;notify;exit code is 0

[1421161288] INFO: [Shinken] Stalking Service: exit code is 2
[1421161288] SERVICE ALERT: somehost;Service;CRITICAL;SOFT;1;exit code is 2
[1421161292] SERVICE ALERT: somehost;Service;CRITICAL;SOFT;2;exit code is 2
[1421161297] SERVICE ALERT: somehost;Service;CRITICAL;HARD;3;exit code is 2
[1421161297] SERVICE NOTIFICATION: EMAIL_oncall_mobile;somehost;Service;CRITICAL;notify;exit code is 2
[1421161297] SERVICE NOTIFICATION: EMAIL_group;somehost;Service;CRITICAL;notify;exit code is 2

I do not recall these issues happening on 1.4. I've tested the same config against the latest 2.2 from git and can confirm it still happens there.
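To compare runs like the ones above, the notification lines can be grouped by timestamp and state. This is a minimal Python sketch that assumes the `SERVICE NOTIFICATION` layout seen in these excerpts (contact;host;service;state;command;output). The function name `recipients_by_event` is made up for illustration and is not part of Shinken.

```python
import re
from collections import defaultdict

# Matches the notification lines quoted in this report, with or without
# the leading "YYYY-MM-DD HH:MM:SS,mmm" timestamp.
NOTIFICATION_RE = re.compile(
    r"\[(?P<epoch>\d+)\] SERVICE NOTIFICATION: "
    r"(?P<contact>[^;]+);(?P<host>[^;]+);(?P<service>[^;]+);"
    r"(?P<state>[^;]+);(?P<command>[^;]+);(?P<output>.*)"
)


def recipients_by_event(lines):
    """Group notified contacts by (epoch, state); other lines are ignored."""
    events = defaultdict(list)
    for line in lines:
        match = NOTIFICATION_RE.search(line)
        if match:
            key = (int(match["epoch"]), match["state"])
            events[key].append(match["contact"])
    return dict(events)


# Lines copied from the WARNING -> CRITICAL excerpts in this report.
sample = [
    "2015-01-13 14:31:26,368 [1421159486] SERVICE NOTIFICATION: EMAIL_group;somehost;Service;WARNING;notify;exit code is 1",
    "2015-01-13 14:32:33,504 [1421159553] SERVICE ALERT: somehost;Service;CRITICAL;HARD;3;exit code is 2",
    "2015-01-13 14:32:33,505 [1421159553] SERVICE NOTIFICATION: EMAIL_oncall_mobile;somehost;Service;CRITICAL;notify;exit code is 2",
]

for (epoch, state), contacts in sorted(recipients_by_event(sample).items()):
    print(epoch, state, contacts)
```

Running it on a full log makes it easy to spot events where a contact with a matching option is missing from the recipient list.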

naparuba commented 9 years ago

I'm taking a look at this.

naparuba commented 9 years ago

I think this is the expected behavior with the parameter: notification_interval 0

It can be dangerous to allow the notification logic to restart on a minor state change (you were in a problem state and you are still in a problem state, so why notify?).

I think allowing this would change the previously expected behavior and would increase the number of notifications for some users.

I'm not closing this ticket, as it could be an enhancement if I'm not wrong in my analysis, but we should talk about it before modifying this ^^
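For illustration only, the two behaviours under discussion can be written as a toy model in Python. This is not Shinken's scheduler code: it simply assumes that, with notification_interval 0, each contact is told once per problem and that recoveries go only to contacts that heard about the problem. The names `OPTIONS`, `simulate` and `renotify_on_state_change` are hypothetical.

```python
# Toy model, not Shinken's scheduler: each contact remembers whether it
# has already been notified about the ongoing problem.
OPTIONS = {
    "EMAIL_group": {"WARNING", "UNKNOWN", "CRITICAL"},
    "EMAIL_oncall_mobile": {"CRITICAL"},
}


def simulate(hard_states, renotify_on_state_change=False):
    """Return (state, notified contacts) for a sequence of HARD states."""
    notified = set()  # contacts already told about the ongoing problem
    previous = "OK"
    log = []
    for state in hard_states:
        if state == "OK":
            # Recovery goes only to contacts that heard about the problem.
            recipients = sorted(notified)
            notified = set()
        else:
            if renotify_on_state_change and state != previous:
                already_told = set()  # a state change restarts notification
            else:
                already_told = notified
            recipients = sorted(
                contact for contact, states in OPTIONS.items()
                if state in states and contact not in already_told
            )
            notified = notified | set(recipients)
        log.append((state, recipients))
        previous = state
    return log


print(simulate(["WARNING", "CRITICAL", "OK"]))
print(simulate(["WARNING", "CRITICAL", "OK"], renotify_on_state_change=True))
```

Under these assumptions, the default run gives the same recipients as the logs quoted in the report (only the on-call contact at CRITICAL), while `renotify_on_state_change=True` gives the behaviour the reporter expected (both contacts at CRITICAL).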

openglx commented 9 years ago

I disagree with your comment about a 'minor state change'. I don't think the notification logic should only distinguish between OK and non-OK states; it should consider all possible states and their transitions.

In the case above there is a legitimate reason for the on-call group to only receive transitions towards CRITICAL ('c' flag) or a RECOVERY from it ('r' flag combined with 'c'). The other group would receive any message and any change between states.

Again, this was something that previously worked in 1.4 with similar configuration (not using notificationways).

gst commented 9 years ago

I've read the ticket, and I agree that this is either a regression from 1.4 or at least a documentation issue or omission.

The use-case is quite clear and desirable imo : have some "oncall" contact(s) get only critical (& recoveries from it) notifications.

Seb-Solon commented 9 years ago

Same as #1329.

IMO this is a big fix to make. Even if I commit it, I won't merge it for 2.4 as it's in RC. Postponing.

efficks commented 6 years ago

Any news on this bug?