kytos-ng / kytos

Kytos SDN Platform. Kytos is designed to be easy to install, use, develop and share Network Apps (NApps).
https://kytos-ng.github.io/
MIT License

feat: core queue and thread pool queue capacity (utilization) task monitor #439

Closed viniarck closed 4 months ago

viniarck commented 5 months ago

I've broken down this requirement from https://github.com/kytos-ng/kytos/issues/423. It's desirable to have a feature that can monitor whether core queues are getting full and whether thread pool executor queues are growing too much.

The minimum functionality for this version is to provide a configuration that can log warnings over a period of time if a queue, or the queue of an executor, keeps going over a configurable threshold. In this version, it'll be configurable via kytos.conf (configuration changes will require a restart, which shouldn't be that frequent, and restarts of kytosd under normal circumstances don't cause network service data plane disruptions).

Notes

In the future, we can keep expanding this and even expose more data suitable for time series, but for now a bare minimum monitor with logging is what's needed. Another good thing about the warning logs is that they provide a way to correlate, since we'll have adjacent events. Paired with APM, that should also provide sufficient visibility to understand which handlers are producing too many events, so we can further analyze whether the queue sizes indeed need to be increased or whether the NApp is misbehaving and needs to be fixed.

viniarck commented 5 months ago

Here's an initial minimum viable proposal for this iteration (2023.2), let me know what you think. I believe this covers the basic use cases we're looking for; coupled with APM, it should be a powerful combo for better visibility:

Requirements

Proposal

Here's a config proposal that can be used in kytos.conf. Notice that you can define a list of monitors. In the typical use case you'll probably want just one configuration for all thread pools and another for the event buffers, but if you need different rates you can associate the queues accordingly:

(the min_hits/delta_secs of 5/5 below means that if there are at least 5 hits at or above the min_queue_full_percent utilization threshold, e.g. 80%, within 5 seconds, it'll start logging at the end of those 5 seconds; the sample config below sets the threshold to 100)

# Queue monitors are for detecting and alerting certain queuing thresholds over a delta time.
# Each queue size will be sampled every second. min_hits / delta_secs needs to be <= 1
# Hits are counted over a fixed window of delta_secs; if a window collects at least min_hits, its records are logged at the end of that window

# The queue size is that of the thread pool's internal queue, which is unbounded; if it's queueing too much you might want to increase the number of thread pool workers
thread_pool_queue_monitors =
  [
    {
      "min_hits": 5,
      "delta_secs": 5,
      "min_queue_full_percent": 100,
      "log_at_most_n": 0,
      "queues": ["sb", "app", "db"]
    }
  ]

# The queue size is derived from each buffer queue
event_buffer_monitors =
  [
    {
      "min_hits": 5,
      "delta_secs": 5,
      "min_queue_full_percent": 100,
      "log_at_most_n": 0,
      "buffers": ["msg_in", "msg_out", "raw", "app"]
    }
  ]
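
To make the fixed-window semantics described above concrete, here's a minimal Python sketch. It is not the actual Kytos implementation: `MonitorConfig` and `FixedWindowMonitor` are hypothetical names, each call to `sample()` stands in for the once-per-second sampling, a hit is assumed to be a sample at or above min_queue_full_percent, and log_at_most_n of 0 is assumed to mean "no cap". The caller supplies a capacity to compute the percentage; how that would be chosen for the unbounded thread pool queue is left open here.

```python
from dataclasses import dataclass, field


@dataclass
class MonitorConfig:
    """One monitor entry, mirroring the keys of the proposed config."""

    min_hits: int
    delta_secs: int
    min_queue_full_percent: float
    log_at_most_n: int = 0  # assumption: 0 means no cap on logged records

    def __post_init__(self) -> None:
        if self.min_hits <= 0 or self.delta_secs <= 0:
            raise ValueError("min_hits and delta_secs must be positive")
        # With one sample per second, a window of delta_secs seconds
        # can hold at most delta_secs hits.
        if self.min_hits > self.delta_secs:
            raise ValueError("min_hits / delta_secs needs to be <= 1")


@dataclass
class FixedWindowMonitor:
    """Counts threshold hits over consecutive fixed windows (hypothetical)."""

    config: MonitorConfig
    samples_seen: int = 0
    hits: list = field(default_factory=list)

    def sample(self, queue_size: int, capacity: int) -> list:
        """Record one sample; return the records to log if a window closed."""
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        percent = queue_size / capacity * 100
        if percent >= self.config.min_queue_full_percent:
            self.hits.append((queue_size, percent))
        self.samples_seen += 1
        if self.samples_seen < self.config.delta_secs:
            return []  # window still open, nothing to log yet
        # Window closed: reset state, then decide whether to log.
        hits, self.hits, self.samples_seen = self.hits, [], 0
        if len(hits) < self.config.min_hits:
            return []
        limit = self.config.log_at_most_n
        return hits[:limit] if limit > 0 else hits


# Usage: five consecutive samples of a full queue close one 5/5 window.
monitor = FixedWindowMonitor(MonitorConfig(5, 5, 100))
for _ in range(5):
    to_log = monitor.sample(queue_size=10, capacity=10)
```

In this sketch nothing is returned until the window closes, which matches the "log at the end of each window" behavior in the config comments. A window that misses min_hits is discarded silently and counting starts over.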

Discarded Ideas (so far)

Future