naemon / naemon-core

Networks, Applications and Event Monitor
http://www.naemon.io/
GNU General Public License v2.0

Question: Is there a recommendation for the smallest interval to submit the external command SAVE_STATE_INFORMATION? #435

Open ccztux opened 1 year ago

ccztux commented 1 year ago

We have a monitoring cluster in which important files such as retention.dat are synced. Currently, the cluster software on the active cluster node executes the external command SAVE_STATE_INFORMATION every 5 minutes.

The retention_update_interval in naemon.cfg is a value in minutes:

# RETENTION DATA UPDATE INTERVAL
# This setting determines how often (in minutes) that Naemon
# will automatically save retention data during normal operation.
# If you set this value to 0, Naemon will not save retention
# data at regular interval, but it will still save retention
# data before shutting down or restarting.  If you have disabled
# state retention, this option has no effect.

retention_update_interval=60

Now my question is whether there is a recommended minimum interval at which the external command SAVE_STATE_INFORMATION should be executed. We would like to decrease the current value of 5 minutes to something between 10 and 30 seconds.
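For readers unfamiliar with how such a command is submitted, here is a minimal sketch of what the cluster software might do on each tick. It assumes the usual external command line format, `[unix_timestamp] COMMAND_NAME`, written to the core's command file. The path `/var/lib/naemon/naemon.cmd` and both function names are illustrative assumptions; the real path is whatever `command_file` is set to in naemon.cfg.

```python
import time


def format_external_command(name, now=None):
    """Build one external command line: "[<unix timestamp>] <NAME>\\n"."""
    ts = int(time.time() if now is None else now)
    return f"[{ts}] {name}\n"


def submit_external_command(command_file, name):
    """Write a single command line to the command file.

    The command file is normally a named pipe, so open() blocks
    until the core has the other end open for reading.
    """
    with open(command_file, "w") as pipe:
        pipe.write(format_external_command(name))


# Example call (assumed path, check command_file in naemon.cfg):
# submit_external_command("/var/lib/naemon/naemon.cmd", "SAVE_STATE_INFORMATION")
```

A scheduler such as cron or the cluster manager would call `submit_external_command` at the chosen interval.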

sni commented 1 year ago

The file is also written on shutdown, so the interval only matters when the cluster is suddenly split or the active core crashes. As far as I know, without digging into the code, saving the retention information blocks the core and prevents scheduling new checks. And depending on the size of this installation (and the disk performance), it might take several seconds to complete the action. I usually would not set this value to less than 5 minutes, but one minute should be fine as long as the file is written in less than 5 seconds.
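To put numbers on that rule of thumb, the following sketch estimates what share of wall-clock time the core would spend writing retention data. It assumes, as suggested above but not verified against the code, that the core is blocked for the full duration of each write. The function name is illustrative.

```python
def blocked_fraction(write_seconds, interval_seconds):
    """Share of time spent writing retention data, assuming the core
    is fully blocked for the whole duration of each write."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # A write cannot block for more than the whole interval.
    return min(write_seconds / interval_seconds, 1.0)


# A 5-second write at different save intervals (in seconds):
for interval in (300, 60, 30, 10):
    print(f"{interval:>3}s interval: {blocked_fraction(5, interval):.1%} blocked")
# 300s interval: 1.7% blocked
#  60s interval: 8.3% blocked
#  30s interval: 16.7% blocked
#  10s interval: 50.0% blocked
```

Under this assumption, a 5-second write costs under 2% at the 5-minute interval but between roughly 17% and 50% at the 10 to 30 second range asked about.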

nook24 commented 1 year ago

> As far as I know, without digging into the code, saving the retention information blocks the core and prevents scheduling new checks. And depending on the size of this installation (and the disk performance), it might take several seconds to complete the action.

That matches my understanding of the retention update process. It is scheduled and executed by the main process and should therefore also block the core's main loop.

We measured an execution time of 30 to 40 seconds, so we decided to set the default update interval to 60 minutes on all of our systems. Note that these measurements were taken years ago, when most systems still used HDDs.
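Since such timings depend heavily on hardware, it may be worth re-measuring on a current system. One rough approach, sketched below, is to note the retention file's modification time, submit SAVE_STATE_INFORMATION, and poll until the modification time changes. This is only an approximation: the elapsed time also includes the delay before the core picks up the command, and the function name, path, and polling parameters are assumptions for illustration.

```python
import os
import time


def wait_for_mtime_change(path, previous_mtime_ns, timeout=120.0, poll=0.1):
    """Poll `path` until its mtime differs from `previous_mtime_ns`.

    Returns the elapsed seconds, or None if `timeout` expires first.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            if os.stat(path).st_mtime_ns != previous_mtime_ns:
                return time.monotonic() - start
        except FileNotFoundError:
            pass  # the file may be briefly absent while it is replaced
        time.sleep(poll)
    return None


# Usage outline (assumed path, check state_retention_file in naemon.cfg):
# before = os.stat("/var/lib/naemon/retention.dat").st_mtime_ns
# ... submit SAVE_STATE_INFORMATION to the command file here ...
# print(wait_for_mtime_change("/var/lib/naemon/retention.dat", before))
```

Running this a few times during normal load gives a better basis for choosing an interval than figures from older HDD-based systems.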