jinnatar opened 2 years ago
Good feature design 👍 I'd be happy to upstream this feature if someone has worked or plans to work on it!
Looking at the code responsible for this, it seems to me the keep-alive threshold should still define the check interval, but then the modification would be to:
- In `keep_alive_monitor`, keep and increment an event counter while the threshold is met, and reset it once it's no longer met.
- Add a `notify=False` flag which would cause the event to be logged but not sent to other notification channels. (This could be a useful feature for other things as well. It might also be less confusing for someone not aware of the backoff if the logs say "offline for 50000 seconds but not notifying until Y seconds to avoid flooding you.")

Doesn't seem too difficult to implement. If the above plan sounds OK, I should be able to eventually knock something out. My immediate issue is gone, though, via automatic service restarts based on monitoring TCP 8444. :-)
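To make the plan above concrete, here is a rough sketch of the counter-plus-flag idea. It is not the project's actual code: the class name `KeepAliveBackoff`, its fields, and the choice of doubling the gap between notifications are all assumptions for illustration. The returned boolean plays the role of the proposed `notify` flag: `False` on an unhealthy check would mean "log the event, but don't send it to notification channels".

```python
from dataclasses import dataclass


@dataclass
class KeepAliveBackoff:
    """Hypothetical helper: decides whether a failed keep-alive check should notify."""

    exponential_rate: int = 2  # assumed growth factor between notifications
    missed_checks: int = 0     # consecutive checks where the threshold was exceeded
    retry_number: int = 0      # notifications already sent during this outage
    next_notify_at: int = 1    # missed-check count at which the next notification fires

    def on_check(self, threshold_exceeded: bool) -> bool:
        """Return the notify flag for this check (False means log only, or healthy)."""
        if not threshold_exceeded:
            # Harvester is healthy again: reset all outage state.
            self.missed_checks = 0
            self.retry_number = 0
            self.next_notify_at = 1
            return False

        self.missed_checks += 1
        if self.missed_checks < self.next_notify_at:
            return False  # still inside the backoff window: log, don't notify

        # Notify now and push the next notification exponentially further out.
        self.next_notify_at = self.missed_checks + self.exponential_rate ** self.retry_number
        self.retry_number += 1
        return True


backoff = KeepAliveBackoff()
flags = [backoff.on_check(True) for _ in range(96)]  # 96 failed checks = 8h at 5 min
notified_on = [i + 1 for i, flag in enumerate(flags) if flag]
print(notified_on)  # [1, 2, 4, 8, 16, 32, 64]
```

With these assumed defaults, an 8h outage checked every 5 minutes would notify on 7 of the 96 failed checks, and the first failed check after a recovery notifies immediately again.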
Yes, sounds good. I like the idea of still logging all the events along with the reason a notification was skipped. I think it's OK to handle this entirely in the `keep_alive_monitor` component, because the `notify_manager` doesn't have any function besides notification at the moment (with regard to passing `notify=False`, if I understood that correctly).
That is unless we want to have a more general notification throttling that works across all event types. But as I understand it, this is currently the most offensive notification type.
Alright, so I'll try to implement a really targeted fix for the keep-alive but keep in mind that it could be refactored later into a more generic solution.
TL;DR: A prolonged harvester outage of, say, 8h during sleep will produce a notification every 5 minutes per harvester. This gets annoying to wake up to, and may also consume limited resources or credits of the notification service. Instead, an exponential backoff should be used, which can easily drop consumption by 90% while still providing much the same level of urgency to an operator.
Steps to produce suboptimal behavior:
Better behavior:
Solution summary:
T() = interval * exponential_rate ^ retryNumber
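As a quick sanity check of the formula, the sketch below computes the delay for each retry and counts how many notifications an 8h outage would produce. The function names are made up for illustration, and two things are assumptions rather than part of the proposal: an `exponential_rate` of 2, and the first notification firing one `interval` after the outage starts.

```python
def backoff_delay(interval: float, exponential_rate: float, retry_number: int) -> float:
    """Wait before the next notification: interval * exponential_rate ^ retryNumber."""
    return interval * exponential_rate ** retry_number


def notification_times(interval: float, exponential_rate: float, outage_duration: float) -> list:
    """Seconds since outage start at which a notification would be sent."""
    times = []
    elapsed = 0.0
    retry_number = 0
    while True:
        elapsed += backoff_delay(interval, exponential_rate, retry_number)
        if elapsed > outage_duration:
            break
        times.append(elapsed)
        retry_number += 1
    return times


EIGHT_HOURS = 8 * 3600
fixed = notification_times(300, 1, EIGHT_HOURS)    # rate 1 = current fixed 5 min behavior
backoff = notification_times(300, 2, EIGHT_HOURS)  # assumed rate of 2
print(len(fixed), len(backoff))  # 96 6
print(backoff)  # [300.0, 900.0, 2100.0, 4500.0, 9300.0, 18900.0]
```

Under those assumptions the same outage goes from 96 notifications to 6, which is in line with the "drop consumption by 90%" estimate, while the first few alerts still arrive within the first 15 minutes.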