Closed · viniarck closed this 4 months ago
Here's an initial minimum viable proposal for this iteration (2023.2); let me know what you think. I believe this covers the basic use cases we're looking for, and coupled with APM it should be a powerful combo for better visibility:
Each queue (KytosEventBuffer/janus or ThreadPoolExecutor/Queue) will be sampled every second, tracking the min, max and avg queue size in addition to the individual records with timestamps in UTC.

Here's a config proposal that can be used on kytos.conf. Notice that you can define a list of monitors. In the typical use case you'll probably just want one configuration for all thread pools and another for the event buffers, but if you need different rates you can associate the queues accordingly:
(a min_hits/delta_secs of 5/5 below means that if there are at least 5 hits of 80% utilization, the min_queue_full_percent, within 5 seconds, it'll start logging at the end of those 5 seconds)
```
# Queue monitors are for detecting and alerting on certain queuing thresholds over a delta time.
# Each queue size will be sampled every second. min_hits / delta_secs needs to be <= 1.
# hits per second is measured over a fixed window: if the sampled rate is over min_hits/delta_secs,
# it'll log the records at the end of each window.
# The queue is the thread pool's internal queue; it's unbounded, so if it's queueing too much
# you might want to try increasing the number of thread pool workers.
thread_pool_queue_monitors =
[
  {
    "min_hits": 5,
    "delta_secs": 5,
    "min_queue_full_percent": 100,
    "log_at_most_n": 0,
    "queues": ["sb", "app", "db"]
  }
]

# The queue size is derived from each buffer queue
event_buffer_monitors =
[
  {
    "min_hits": 5,
    "delta_secs": 5,
    "min_queue_full_percent": 100,
    "log_at_most_n": 0,
    "buffers": ["msg_in", "msg_out", "raw", "app"]
  }
]
```
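For the per-second sampling of min/max/avg plus individual UTC records mentioned above, something along these lines could work (a sketch with hypothetical names; the real monitor would read the actual KytosEventBuffer or ThreadPoolExecutor queue sizes):

```python
from datetime import datetime, timezone


class QueueSampler:
    """Hypothetical sketch: keep individual (UTC timestamp, size) records
    for a queue and derive min/max/avg aggregates from them."""

    def __init__(self):
        self.records: list[tuple[str, int]] = []

    def sample(self, qsize: int) -> None:
        """Record one queue-size sample with a UTC timestamp."""
        ts = datetime.now(timezone.utc).isoformat()
        self.records.append((ts, qsize))

    def stats(self) -> dict:
        """Aggregate min, max and avg over the recorded samples."""
        sizes = [size for _, size in self.records]
        return {"min": min(sizes), "max": max(sizes),
                "avg": sum(sizes) / len(sizes)}


sampler = QueueSampler()
for size in (3, 7, 5):
    sampler.sample(size)
print(sampler.stats())  # {'min': 3, 'max': 7, 'avg': 5.0}
```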
An alternative considered was limits, which @Ktmi is introducing on kytos core for rate limiting on PR 412. However, limits is really well designed for what it does, which is rate limiting "inline". Although it's also possible to use it not "inline" and just leverage its measured window for stats, we'd also need to keep track of individual records, so we wouldn't gain much other than having to introduce it in a non-standard way. In the future we might replace a fixed window (deque) with another type of window from limits, but until we have a stronger reason, and if we ever need more advanced types of measured windows, we can reconsider and potentially introduce it.

Configuration will live in kytos.conf, and a kytosd restart shouldn't cause any data plane outages.
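The fixed window (deque) mentioned above can be a plain bounded collections.deque holding the last delta_secs records; this is a sketch of the idea, not the actual code:

```python
from collections import deque

DELTA_SECS = 5  # window length, in per-second samples

# A bounded deque keeps only the last DELTA_SECS (second, queue_size)
# records: appending beyond maxlen silently evicts the oldest sample.
window: deque[tuple[int, int]] = deque(maxlen=DELTA_SECS)

for second, qsize in enumerate([8, 9, 9, 10, 10, 10]):
    window.append((second, qsize))

# After 6 samples the oldest record (second 0) was evicted;
# the window holds the records for seconds 1..5.
print(list(window))  # [(1, 9), (2, 9), (3, 10), (4, 10), (5, 10)]
```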
I've broken down this requirement from https://github.com/kytos-ng/kytos/issues/423. It's desirable to have a feature that can monitor whether core queues are getting full and whether thread pool executor queues are growing too much.

The minimum functionality for this version is to provide a configuration that can log warnings over time if a queue, or a queue of an executor, keeps going over a configurable threshold. In this version it'll be configurable via kytos.conf (configuration changes will require a restart, which shouldn't be that frequent, and restarts of kytosd under normal circumstances don't cause network service data plane disruptions).

Notes
In the future we can keep expanding this and even expose more data suitable for time series, but for now a bare minimum monitor with logging is what's needed. Another good thing about the warning logs is that they provide a way to correlate with adjacent events; paired with APM, that should give sufficient visibility to understand which handlers are producing too many events, for further analyzing whether the queue sizes indeed need to be increased or whether a NApp is misbehaving and needs to be fixed.