prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0
10.64k stars 2.3k forks source link

feat!: new meminfo_procfs collector, deprecate meminfo collector #3043

Closed tjhop closed 3 weeks ago

tjhop commented 3 weeks ago

Part of #2957

Adds a new collector named meminfo_procfs that exposes memory metrics in a format that attemps to be more inline with upstream conventions -- memory metrics are now exposed under a single metric named node_memory_bytes that has a single label called field, corresponding to the name of the field in /proc/meminfo that value represents. Label values for the field label are named according to the struct field in the procfs.Meminfo struct, and the values always use the byte-normalized counterpart fields in the procfs.Meminfo struct, resulting in a transition such as the following: node_memory_Active_anon_bytes -> node_memory_bytes{field="ActiveAnon"}

Notes: currently linux only, as that is the focus of the procfs lib. once consensus has been reached here on new metric name/labels/format, I can expand coverage for darwin/openbsd/netbsd and forward-port the existing meminfo collector for those platforms to use the updated format.

tjhop commented 3 weeks ago

@discordianfish @SuperQ @rexagod would you mind checking this out when time permits? Feedback welcome :+1:

tjhop commented 3 weeks ago

Screenshot from 2024-06-08 15-26-10

Running the new build on 9101 and latest node_exporter on 9100 for some side-by-side, and thus far in my testing things seem to line up

tjhop commented 3 weeks ago

No worries, I also wasn't a huge fan of this approach but it felt like the most "direct" conversion to at least start the conversation based on some of the feedback in #2957.

Having to use on or ignoring for basic operations is a worse user experience.

I'll agree with that :+1:

Where labels are meant to be used is if operations like sum() are appropriate. It doesn't make sense to sum(node_memory_bytes{field=~"MemTotal|MemAvailable"}), so this is a clear indicator that these should be separate metric names.

Perhaps I was a bit too loose in my interpretation of "meaningful" when looking at the docs? :wink:

As a rule of thumb, either the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful).

Don't worry -- I'm aware the next line of the quote uses an example of a queue with capacity/number of elements which is a decent corollary to our memory free/total example here, but to your point, it does raise questions about how to appropriately divide/label the metrics that haven't been fully fleshed out yet. For instance, should we have a separate metric for memory vs swap memory? And the current memory metrics are unlabeled, what other labels might be useful here?

tjhop commented 3 weeks ago

Closing in favor of #3049