[Feat]: Add support for collecting metrics from debugfs

ktsaou commented 1 year ago

Problem

There are many useful metrics that are exposed in debugfs.

For example, /sys/kernel/debug/extfrag/extfrag_index provides information about memory fragmentation:

Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 
Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000

Similarly, ls -l /sys/kernel/debug/zswap/ lists the files that expose zswap statistics:

total 0
-r--r--r-- 1 root root 0 Apr 20 20:40 duplicate_entry
-r--r--r-- 1 root root 0 Apr 20 20:40 pool_limit_hit
-r--r--r-- 1 root root 0 Apr 20 20:40 pool_total_size
-r--r--r-- 1 root root 0 Apr 20 20:40 reject_alloc_fail
-r--r--r-- 1 root root 0 Apr 20 20:40 reject_compress_poor
-r--r--r-- 1 root root 0 Apr 20 20:40 reject_kmemcache_fail
-r--r--r-- 1 root root 0 Apr 20 20:40 reject_reclaim_fail
-r--r--r-- 1 root root 0 Apr 20 20:40 same_filled_pages
-r--r--r-- 1 root root 0 Apr 20 20:40 stored_pages
-r--r--r-- 1 root root 0 Apr 20 20:40 written_back_pages
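A minimal sketch of how a collector could read a directory like this, written in Python for brevity (the plugin itself is proposed in C). It assumes each file holds a single integer, which the issue does not state. The function name read_zswap_stats is hypothetical.

```python
from pathlib import Path


def read_zswap_stats(directory="/sys/kernel/debug/zswap"):
    """Return {file name: integer value} for the files in `directory`.

    Assumption: every statistics file contains one integer.
    Listing the directory raises PermissionError when the caller
    is not allowed to enter debugfs.
    """
    stats = {}
    for entry in sorted(Path(directory).iterdir()):
        if not entry.is_file():
            continue
        try:
            stats[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            # Unreadable file or unexpected content: skip this metric.
            continue
    return stats
```

Each key of the returned dictionary could map to one dimension of a zswap chart.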

The problem is that debugfs can only be accessed by root:

# ls -ld /sys/kernel/debug
drwx------ 60 root root 0 Apr 20 20:40 /sys/kernel/debug

So we need an external plugin with sufficient capabilities and permissions to collect this information.

Description

As above.

Importance

nice to have

Value proposition

debugfs contains useful information that we could expose to users.

Proposed implementation

An external C plugin with capabilities and permissions similar to those of apps.plugin.

Ferroin commented 1 year ago

As I mentioned in the related feature request, we probably just need CAP_DAC_READ_SEARCH on the external plugin to achieve this.
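As a rough illustration, a process can check whether it holds this capability by inspecting the CapEff mask in /proc/self/status. The sketch below is in Python for brevity and is not the plugin's code. The helper name has_dac_read_search is hypothetical. Bit number 2 is the value of CAP_DAC_READ_SEARCH in linux/capability.h.

```python
CAP_DAC_READ_SEARCH = 2  # bit number defined in linux/capability.h


def has_dac_read_search(status_text):
    """Return True if the CapEff mask in `status_text` has CAP_DAC_READ_SEARCH set.

    `status_text` is expected to be the content of /proc/<pid>/status.
    """
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)  # the mask is printed in hexadecimal
            return bool(mask & (1 << CAP_DAC_READ_SEARCH))
    # No CapEff line found: treat the capability as missing.
    return False

# Usage on Linux: has_dac_read_search(open("/proc/self/status").read())
```

A plugin could run a check like this at startup and exit early with a clear message when the capability is missing.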

thiagoftsm commented 1 year ago

Hey @shyamvalsan and @sashwathn,

The PR implementing what this issue requested is ready for review, but I consider it only a first step. Please take a look at other metrics we could add from /sys/kernel/debug, because the groundwork is already in place.

Best regards!

ilyam8 commented 1 year ago

@ktsaou @Ferroin a question about extfrag. We have a fragmentation index (the value) per Node and Zone:

# cat /sys/kernel/debug/extfrag/extfrag_index
Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 1, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000

The question is about appropriate aggregation: can these values be aggregated across different zones?

If so, we should name the metric (context) mem.fragmentation_index, with labels node and zone.
If not, we need to include the zone in the metric name: mem.fragmentation_index_zone_{zoneName}, with label node.

My understanding is that, in either case, only the min and max aggregation methods give reasonably meaningful results.
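To make the labeling question concrete, here is a small Python sketch that parses the extfrag_index output into one record per node and zone. The function name parse_extfrag_index is hypothetical. The columns are assumed to hold one value per allocation order, which is not stated in this thread.

```python
import re

# Matches lines such as "Node 0, zone   Normal -1.000 -1.000 ..."
LINE_RE = re.compile(r"^Node\s+(\d+),\s+zone\s+(\S+)\s+(.*)$")


def parse_extfrag_index(text):
    """Parse extfrag_index output into a list of labeled records."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # ignore blank or unexpected lines
        node, zone, values = match.groups()
        rows.append({
            "node": int(node),
            "zone": zone,
            # Assumption: one value per allocation order, lowest order first.
            "index": [float(v) for v in values.split()],
        })
    return rows
```

With records in this shape, either naming scheme is easy to produce: node and zone can both become labels, or the zone can be folded into the metric name.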

ilyam8 commented 1 year ago

From the Linux docs:

The kernel will not compact memory in a zone if the fragmentation index is <= extfrag_threshold.

Should we collect extfrag_threshold (/proc/sys/vm/extfrag_threshold) too?
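A sketch of how the two values could be compared, in Python for brevity. It assumes the sysctl holds an integer in the range 0 to 1000, while debugfs prints the same index divided by 1000 (so -1.000 to 1.000). This scaling should be verified against the kernel documentation before relying on it. Both function names are hypothetical.

```python
def read_extfrag_threshold(path="/proc/sys/vm/extfrag_threshold"):
    """Read the threshold as an integer (assumed range: 0..1000)."""
    with open(path) as f:
        return int(f.read().strip())


def above_threshold(index, threshold):
    """Return True if a debugfs fragmentation index exceeds the threshold.

    Assumption: `index` is the debugfs value (for example 0.750) and
    `threshold` is the raw sysctl value (for example 500).
    A negative index is treated as "not applicable".
    """
    if index < 0:
        return False
    return round(index * 1000) > threshold
```

Collecting the threshold next to the index would let a chart show which zones are above the point where the kernel considers compaction.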

Ferroin commented 1 year ago

Aggregation by zone independent of NUMA node makes some sense. Aggregation across zones does not make much sense though because the reasons (and solutions) for fragmentation in a given zone type are highly dependent on the zone type. Averages (as opposed to min or max) across NUMA nodes on the same system may make sense here depending on the exact system setup (for a NUMA system that is set up to auto-balance across NUMA nodes, average fragmentation is actually kind of useful, but for one where an entire node is isolated it doesn’t make much sense).

This is also very much a local metric. Aggregation across Netdata nodes makes little to no sense for it in most cases.
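The per-zone aggregation across NUMA nodes described above could look roughly like the following Python sketch. It takes one value per (node, zone) pair, for example the index for a single allocation order, and never mixes different zone types. The function name aggregate_by_zone and the input shape are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_zone(rows):
    """Aggregate values across NUMA nodes, separately for each zone.

    `rows` is an iterable of (node, zone, value) tuples.
    Returns {zone: {"min": ..., "max": ..., "avg": ...}}.
    """
    by_zone = defaultdict(list)
    for _node, zone, value in rows:
        by_zone[zone].append(value)
    return {
        zone: {"min": min(values), "max": max(values), "avg": mean(values)}
        for zone, values in by_zone.items()
    }
```

Whether the average is worth showing depends on the system setup, as noted above; min and max remain available in either case.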

Irrespective of aggregation, extfrag_threshold should probably be collected as a chart variable.
