feat: core queue and thread pool queue capacity (utilization) task monitor

I've broken down this requirement from https://github.com/kytos-ng/kytos/issues/423. It's desirable to have a feature that can monitor whether core queues are getting full and whether thread pool executor queues are growing too much.

The minimum functionality for this version is a configuration that logs warnings when a core queue or an executor's queue keeps going over a configurable threshold during a time window. In this version, it'll be configurable via kytos.conf. Configuration changes will require a restart, which shouldn't be that frequent, and restarts of kytosd under normal circumstances don't cause network service data plane disruptions.

Notes

In the future, we can keep expanding this and even expose more data suitable for time series, but for now a bare minimum monitor with logging is what's needed. The good thing about the warning logs is that they also provide a way to correlate, since we'll have adjacent events. Paired with APM, that should provide sufficient visibility to understand which handlers are producing too many events, and to further analyze whether the queue sizes indeed need to be increased or whether the NApp is misbehaving and needs to be fixed.

Here's an initial minimum viable proposal for this iteration (2023.2). Let me know what you think. I believe this covers the basic use cases we're looking for, and coupled with APM it should be a powerful combo for better visibility:

Requirements

A queue monitor should be configurable via kytos.conf
A queue monitor will sample each associated queue (whether it's a core queue KytosEventBuffer/janus or ThreadPoolExecutor/Queue) every second
In the configuration, users can set the minimum number of hits / seconds and a queue size threshold
If a monitored queue's utilization goes over the rate, then it should log at the end of the window. It's expected to use a fixed window to simplify logging, so if the window is sized properly, the volume of logs will be too.
When logging the records it should include the min, max and avg queue size in addition to the individual records with timestamps in UTC.
It should also provide a way to log only up to x records, if there are too many records in a given window and the user only wants a subsequence.
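The requirements above can be sketched roughly as follows. This is only an illustration, not the kytos implementation: `QueueRecord`, `summarize_window` and the `max_size` parameter are hypothetical names, the stats are computed over the records that hit the threshold, and treating `log_at_most_n == 0` as "no cap" is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from statistics import mean


@dataclass(frozen=True)
class QueueRecord:
    """One sample of a queue size (hypothetical name)."""

    size: int
    timestamp: datetime  # expected to be timezone-aware, in UTC


def summarize_window(records, max_size, min_queue_full_percent, min_hits,
                     log_at_most_n=0):
    """Return a log message for a finished fixed window, or None.

    A record counts as a hit when its size is at or above
    min_queue_full_percent of max_size (max_size is an assumed reference
    size for the queue).
    """
    threshold = max_size * min_queue_full_percent / 100
    hits = [rec for rec in records if rec.size >= threshold]
    if len(hits) < min_hits:
        return None
    sizes = [rec.size for rec in hits]
    # Assumption: 0 means "log every record"; a positive value caps the list.
    shown = hits[:log_at_most_n] if log_at_most_n > 0 else hits
    listed = ", ".join(f"{rec.timestamp.isoformat()}={rec.size}" for rec in shown)
    return (
        f"{len(hits)} hits, min/max/avg: "
        f"{min(sizes)}/{max(sizes)}/{mean(sizes):.1f}, records: {listed}"
    )


# Five one-second samples of a queue with an assumed max size of 100.
samples = [
    QueueRecord(size, datetime(2023, 1, 1, 0, 0, sec, tzinfo=timezone.utc))
    for sec, size in enumerate([90, 85, 10, 95, 100])
]
message = summarize_window(samples, max_size=100, min_queue_full_percent=80,
                           min_hits=4)
```

With `min_hits=5` the same samples would produce no log, since only four of them reach the 80% threshold.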

Proposal

Here's a config proposal that can be used in kytos.conf. Notice that you can define a list of monitors. In the typical use case you'll probably want just one configuration for all thread pools and another for the event buffers, but if you need different rates you can associate the queues accordingly:

(the min_hits/delta_secs values of 5/5 below mean that if, within 5 seconds, there are at least 5 hits at or above the min_queue_full_percent utilization, 80% for example, the records will be logged at the end of those 5 seconds)

# Queue monitors are for detecting and alerting certain queuing thresholds over a delta time.
# Each queue size will be sampled every second. min_hits / delta_secs needs to be <= 1
# Hits are counted over a fixed window of delta_secs seconds. If a window has at least min_hits hits, the records are logged at the end of that window

# The queue size is the size of the thread pool's internal queue, which is unbounded. If it's queueing too much, you might want to increase the number of thread pool workers
thread_pool_queue_monitors =
  [
    {
      "min_hits": 5,
      "delta_secs": 5,
      "min_queue_full_percent": 100,
      "log_at_most_n": 0,
      "queues": ["sb", "app", "db"]
    }
  ]

# The queue size is derived from each buffer queue
event_buffer_monitors =
  [
    {
      "min_hits": 5,
      "delta_secs": 5,
      "min_queue_full_percent": 100,
      "log_at_most_n": 0,
      "buffers": ["msg_in", "msg_out", "raw", "app"]
    }
  ]
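As a sketch of how such entries could be checked when they are loaded, assuming the option value is parsed as JSON: `validate_monitor` is a hypothetical helper, and the only rule taken from the proposal is that `min_hits / delta_secs` must be `<= 1`. The other checks (positive values, non-negative `log_at_most_n`, non-empty name list) are assumptions.

```python
import json


def validate_monitor(entry, names_key):
    """Validate one monitor entry (hypothetical helper, not kytos code).

    names_key is "queues" for thread pool monitors and "buffers" for
    event buffer monitors.
    """
    min_hits = entry["min_hits"]
    delta_secs = entry["delta_secs"]
    if min_hits <= 0 or delta_secs <= 0:
        raise ValueError("min_hits and delta_secs must be positive")
    # Queues are sampled once per second, so a window of delta_secs seconds
    # can hold at most delta_secs hits.
    if min_hits / delta_secs > 1:
        raise ValueError("min_hits / delta_secs needs to be <= 1")
    if entry["min_queue_full_percent"] <= 0:
        raise ValueError("min_queue_full_percent must be positive")
    if entry["log_at_most_n"] < 0:
        raise ValueError("log_at_most_n must be >= 0")
    if not entry[names_key]:
        raise ValueError(f"{names_key} must list at least one name")
    return entry


raw_value = """
[
  {
    "min_hits": 5,
    "delta_secs": 5,
    "min_queue_full_percent": 100,
    "log_at_most_n": 0,
    "queues": ["sb", "app", "db"]
  }
]
"""
monitors = [validate_monitor(entry, "queues") for entry in json.loads(raw_value)]
```

Failing early at startup seems reasonable here, since the configuration only changes on a kytosd restart.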

Discarded Ideas (so far)

I also considered leveraging limits, which @Ktmi is introducing on kytos core for rate limiting in PR 412. However, limits is really well designed for what it does, which is rate limiting "inline". Although it's also possible to use it not "inline" and just leverage its measured window for stats, we'd still need to keep track of individual records, so we wouldn't gain much and would end up introducing it in a non-standard way. In the future we might replace the fixed window (deque) with another type of window from limits, but we can reconsider that once we have a stronger reason or need more advanced types of measured windows.
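For reference, a deque-backed fixed window can be quite small. The sketch below is only an illustration of the idea, with a hypothetical `FixedWindow` class; it assumes one sample per second, so a window of `delta_secs` seconds holds `delta_secs` samples.

```python
from collections import deque


class FixedWindow:
    """Collect samples and hand them back once the window is complete."""

    def __init__(self, delta_secs):
        # One sample per second is assumed, so maxlen equals delta_secs.
        self._samples = deque(maxlen=delta_secs)

    def add(self, sample):
        """Store a sample; return the full window when it closes, else None."""
        self._samples.append(sample)
        if len(self._samples) < self._samples.maxlen:
            return None
        window = list(self._samples)
        # Fixed (not sliding) window: start over after each flush.
        self._samples.clear()
        return window


window = FixedWindow(delta_secs=3)
results = [window.add(size) for size in (10, 20, 30, 40)]
```

Here only the third call returns a window, `[10, 20, 30]`, and the fourth sample starts the next one.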

Future

In the future, we might provide a more dynamic configuration via an API endpoint, but that would also need storage to be beneficial. For now, since this configuration isn't expected to change that frequently, it's OK to only have it in kytos.conf, and a kytosd restart shouldn't cause any data plane outages.
