Naemon 1.0.3 ignores retry_interval?

mfrost8 commented 9 years ago

I'm using Naemon 1.0.3. My standard service configuration in my test environment is

check_interval 5
retry_interval 1
max_attempts 5

In the process of checking when an event handler triggers, I'm discovering that the service I'm checking is "retrying" every 5 minutes, not every 1 minute. I queried the running instance with naemon-unixcat to confirm that it was really using these values, and it was. Also, watching the "next scheduled check" for this service in Thruk, the next check after the initial failure was always scheduled 5 minutes into the future.
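For anyone who wants to repeat that check of the running instance, here is a minimal sketch of a Livestatus query over the Unix socket, which is what unixcat talks to. The socket path, host name, and service description in the usage comment are placeholders, not values from this setup, and the helper names `build_query` and `query_livestatus` are made up for illustration.

```python
import json
import socket


def build_query(host_name, description):
    """Build a Livestatus query for the scheduling columns of one service."""
    lines = [
        "GET services",
        "Columns: check_interval retry_interval max_check_attempts "
        "current_attempt state_type next_check",
        f"Filter: host_name = {host_name}",
        f"Filter: description = {description}",
        "OutputFormat: json",
    ]
    # A Livestatus query is terminated by an empty line.
    return "\n".join(lines) + "\n\n"


def query_livestatus(socket_path, query):
    """Send a query to the Livestatus Unix socket and return the parsed rows."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        # Signal end of request so the server sends its answer and closes.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    raw = b"".join(chunks).decode("utf-8")
    return json.loads(raw) if raw.strip() else []


# Usage (placeholder path and names, adjust to the local installation):
# rows = query_livestatus("/path/to/live", build_query("myhost", "myservice"))
```

Comparing `next_check` between two consecutive soft non-OK results shows which interval the scheduler actually applied.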

I'm using a configuration that is a slightly pared down copy of a config tree running properly under Nagios 3-something. This service does go through a number of templates that inherit from one another all the way down to the base "Default Service" definition, which is where these values are defined. I want to say the template inheritance is 4 or 5 levels deep.

I'm doing this on SLES 11.3 x86-64.

DanielGT1 commented 9 years ago

I can confirm this. It seems to always use the check_interval value, even in non-OK soft and hard states.

I tested with a simple host definition using a 5 minute check interval and a 10 minute retry interval. I see the same effect on the service level. The OS is RHEL 6.

Here are the settings from objects.cache:

define host {
    host_name interval_test
    alias interval_test
    address 127.0.0.1
    check_period 24x7
    check_command check_dummy!2!test_down
    notification_period 24x7
    initial_state o
    hourly_value 1
    check_interval 5.000000
    retry_interval 10.000000
    max_check_attempts 4
    active_checks_enabled 1
    passive_checks_enabled 1
    obsess 0
    event_handler_enabled 1
    low_flap_threshold 0.000000
    high_flap_threshold 0.000000
    flap_detection_enabled 0
    flap_detection_options a
    freshness_threshold 0
    check_freshness 0
    notification_options r,d,u,f
    notifications_enabled 1
    notification_interval 120.000000
    first_notification_delay 0.000000
    stalking_options n
    process_perf_data 0
    retain_status_information 1
    retain_nonstatus_information 1
}

bjornfro commented 8 years ago

I see this issue as well.

nook24 commented 8 years ago

Unfortunately, I have the same issue here with Naemon 1.0.3-source on Ubuntu 14.04. Syntax I have tested:

normal_check_interval 5
retry_check_interval     1
max_check_attempts   3

check_interval  5
retry_interval  1
max_attempts  3

catharsis commented 8 years ago

Phew.

I finally got around to fixing this. Reading through the comments here, the experienced behaviour differs somewhat. I've been able to find at least two bugs pertaining to the treatment of retry_interval. The first is that we would not respect the retry_interval for a service or host in a soft non-OK (or non-UP) state when the retry_interval is greater than the check_interval for the object. This seems to be the issue experienced by @DanielGT1 above.
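To make the expected behaviour concrete, here is a minimal sketch of the interval selection reporters in this thread were expecting: retry_interval while the object is in a soft non-OK state, check_interval otherwise. This is not the Naemon scheduler code; the names `CheckConfig` and `expected_interval` are made up for illustration, and intervals are in the configured units (minutes with the default interval length).

```python
from dataclasses import dataclass

STATE_OK = 0  # plugin return code for OK/UP


@dataclass
class CheckConfig:
    check_interval: float
    retry_interval: float
    max_check_attempts: int


def expected_interval(cfg, state, current_attempt):
    """Return the interval that should be used to schedule the next check.

    A non-OK result that has not yet used up max_check_attempts is a soft
    state and should be rechecked after retry_interval. OK results and hard
    non-OK states fall back to check_interval.
    """
    soft_problem = state != STATE_OK and current_attempt < cfg.max_check_attempts
    return cfg.retry_interval if soft_problem else cfg.check_interval


# Host definition from the objects.cache dump earlier in this thread.
host = CheckConfig(check_interval=5.0, retry_interval=10.0, max_check_attempts=4)
print(expected_interval(host, state=2, current_attempt=1))  # soft state: retry_interval
print(expected_interval(host, state=2, current_attempt=4))  # hard state: check_interval
```

With the first bug, the soft state case for this host was scheduled with check_interval (5) instead of retry_interval (10).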

The second issue is the one that the patch by @nook24 fixes, which concerns how checks are scheduled as a result of certain external commands. It is possible that this patch might have affected other scheduling bugs when it was written, but usage of the check_window macro is now limited to the scheduling of external commands.

I'm not sure if there are other occurrences of similar bugs (for what it's worth, I wasn't able to find any), so if anyone experiences such issues, please open a new report. I'm closing this one.

Specifically, commits cff979b and 787e901 fix these bugs.

naemon / naemon-core

Naemon 1.0.3 ignores retry_interval? #117