stackrox / collector

Runtime data collection for the StackRox Kubernetes Security Platform using eBPF
Apache License 2.0

Metrics thread table size #1456

Closed erthalion closed 4 months ago

erthalion commented 7 months ago

While troubleshooting vanilla Falco and earlier memory-related issues, it proved useful to understand how the thread cache grows. To make this more visible, expose the current thread table size as a new Prometheus metric, e.g. rox_collector_events{type="threadCacheSize"}.

The numbers we're interested in could be obtained via the libsinsp inspector function get_thread_count(). Since this metric does not depend directly on the event stream, we need to decide when exactly to read the counter's current value: rarely enough to avoid unnecessary work if it changes slowly, but often enough to notice relevant spikes.

When troubleshooting vanilla Falco, the hacky solution I used was to log the thread count with throttled logging, and that provided enough information. Thus, the proposal is to update the thread count metric based on the number of processes received, but with some throttling, e.g. capture the current thread table size on every n'th process received.
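The count-based throttling proposed above could look roughly like the sketch below. It is only an illustration: `ThrottledSampler`, `OnProcess`, and `LastSample` are hypothetical names, not part of collector or libsinsp, and the `read_size` callback stands in for whatever actually returns the thread table size (for example a wrapper around get_thread_count()).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical helper (not collector code): reads a value once every
// `n` observed process events and remembers the last reading.
class ThrottledSampler {
 public:
  // `read_size` is an assumed callback returning the current thread table size.
  ThrottledSampler(uint64_t n, std::function<uint64_t()> read_size)
      : n_(n), read_size_(std::move(read_size)) {
    assert(n_ > 0);  // a period of zero would make the modulo below invalid
  }

  // Call once per received process event.
  // Returns true when this call triggered a new sample.
  bool OnProcess() {
    if (++seen_ % n_ != 0) {
      return false;  // throttled: skip the (possibly costly) read
    }
    last_sample_ = read_size_();
    return true;
  }

  // Most recent sampled value, 0 until the first sample is taken.
  uint64_t LastSample() const { return last_sample_; }

 private:
  uint64_t n_;
  std::function<uint64_t()> read_size_;
  uint64_t seen_ = 0;
  uint64_t last_sample_ = 0;
};
```

With this shape, the trade-off mentioned in the issue is controlled by `n`: a larger value means fewer reads, but a short spike between two samples can go unnoticed.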

Part of #1320

ovalenti commented 7 months ago

The CollectorStatsExporter runs a loop dedicated to publishing the current counters/timers every 5s. Maybe this is acceptable as a time basis?

erthalion commented 7 months ago

> The CollectorStatsExporter runs a loop dedicated to publishing the current counters/timers every 5s. Maybe this is acceptable as a time basis?

Yeah, sounds reasonable.
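
For comparison, a time-based variant in the spirit of the 5s publishing loop might be sketched as follows. This is not the CollectorStatsExporter code: `Gauge`, `RunSamplingLoop`, `read_size`, and `should_stop` are hypothetical names chosen for illustration, and the gauge is a plain atomic rather than a real Prometheus metric.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

// Hypothetical stand-in for a Prometheus gauge.
struct Gauge {
  std::atomic<uint64_t> value{0};
  void Set(uint64_t v) { value.store(v); }
};

// Refreshes `gauge` from `read_size` once per `interval` until
// `should_stop` returns true. Both callbacks are assumptions:
// `read_size` would wrap the thread table size lookup, and
// `should_stop` would reflect the exporter's shutdown signal.
inline void RunSamplingLoop(Gauge& gauge,
                            const std::function<uint64_t()>& read_size,
                            std::chrono::milliseconds interval,
                            const std::function<bool()>& should_stop) {
  assert(interval.count() >= 0);
  while (!should_stop()) {
    gauge.Set(read_size());                 // take one sample per tick
    std::this_thread::sleep_for(interval);  // e.g. 5000 ms for a 5s period
  }
}
```

Compared with sampling on every n'th process, this keeps the read rate independent of event volume, at the cost of missing spikes shorter than the interval.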