Feat: consider adopting procfs lib FS.Meminfo() for memory collector

tjhop commented 6 months ago

As of #2952, the node exporter has been bumped to use procfs lib v0.13.0, which has a fix for safer meminfo parsing from /proc/meminfo. This means it's possible to move away from the custom meminfo parsing the node exporter currently does and use the updated library's parsing instead.

Considerations: The node exporter memory collector's Update() func expects memory info to be returned as a map[string]float64 from the various platform implementations. Even if we adopt the library's updated memory info parsing, we would still need to convert the struct into the expected map type. This can be done with a quick JSON Marshal/Unmarshal dance, if we're willing to pull encoding/json in as a dependency. I'd really rather avoid manually/explicitly parsing out the struct fields, as it feels fragile and prone to breakage on procfs updates, so ideas welcome.
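For illustration, a minimal sketch of that Marshal/Unmarshal round trip using only the standard library. The `meminfo` struct below is a hypothetical stand-in for the library's struct (its field set and pointer-valued fields are assumptions), and `structToMap` is a made-up helper name, not something from the node exporter or procfs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// meminfo is a hypothetical stand-in for the library's parsed struct; the
// field set and the pointer-valued fields are assumptions for illustration.
type meminfo struct {
	MemTotal  *uint64
	MemFree   *uint64
	SwapTotal *uint64
}

// structToMap round-trips v through JSON to obtain a map[string]float64.
// Nil pointer fields are encoded as JSON null and are skipped here, so a
// value that was never parsed does not show up as a misleading zero.
func structToMap(v any) (map[string]float64, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	// Decode into *float64 so that JSON null can be told apart from 0.
	var decoded map[string]*float64
	if err := json.Unmarshal(raw, &decoded); err != nil {
		return nil, err
	}
	out := make(map[string]float64, len(decoded))
	for name, value := range decoded {
		if value != nil {
			out[name] = *value
		}
	}
	return out, nil
}

func main() {
	total, free := uint64(16384), uint64(4096)
	m, err := structToMap(meminfo{MemTotal: &total, MemFree: &free})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(m), m["MemTotal"], m["MemFree"])
}
```

Two things to keep in mind with this approach: the map keys are derived from the struct field names, so a renamed field silently changes the key, and uint64 values above 2^53 lose precision once they pass through float64.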

I'm willing to implement the changes if the concepts here are accepted :+1:

discordianfish commented 6 months ago

I don't think we should marshal them to a map though; we should finally make the meminfo metrics follow the best practices more closely, e.g. using labels for metrics that can be summed up. For that I'd suggest creating a new meminfo collector and deprecating the old one, then in the next major release enabling the new one by default and disabling the deprecated one.

@SuperQ wdyt?

rexagod commented 4 months ago

I can work on that if @SuperQ is +1, and @tjhop has no plans in the near future to take this up.

tjhop commented 4 months ago

Thanks @rexagod! I was mostly waiting on the green light to proceed, I'm still willing to take this on. However, I would be very happy/grateful if you would be willing to help review the PR once it's pushed and/or PR against my branch if you want to collaborate more.

Initial thoughts/questions for feedback:

do we want a feature flag to toggle the new collector? I would think so
prometheus/procfs is clearly pretty *nix oriented, do we also convert to the proposed new metrics format for darwin/netbsd/openbsd? I would think so, for at least consistency reasons
similar question above for the meminfo numa collector -- should it also get normalized to the new format?
"I'd suggest creating a new meminfo collector and deprecate the old one" -- should the metrics still stay in the memory_ subsystem namespace?
"E.g using labels for metrics that can be summed up" -- this is a great idea. The metric naming docs provide the following guidance:

As a rule of thumb, either the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful). If it is not meaningful, split the data up into multiple metrics. For example, having the capacity of various queues in one metric is good, while mixing the capacity of a queue with the current number of elements in the queue is not.

With this in mind, how many metrics/labels do we want to have? Some metrics in the darwin/netbsd/openbsd meminfo collectors are counters, should they remain counters (and thus a separate metric)?
There's lots of downstream repos that will likely need to be updated to account for these changes (monitoring mixin rules, etc), and likely not all of them under the purview of the prometheus project itself. How to best communicate intended changes?

(sorry for the stream of consciousness, like I said, initial thoughts :upside_down_face: )

discordianfish commented 4 months ago

do we want a feature flag to toggle the new collector? I would think so

If it's a new collector, it can be disabled/enabled - so no 'feature flag' specifically

prometheus/procfs is clearly pretty *nix oriented, do we also convert to the proposed new metrics format for darwin/netbsd/openbsd? I would think so, for at least consistency reasons

If we can support the other OSes with the new collector, cool - if not, we can add support for that later.

similar question above for the meminfo numa collector -- should it also get normalized to the new format?

If it fits the scope of the new collector, why not.

I'd suggest creating a new meminfo collector and deprecate the old one -- should the metrics stay in the memory_ subsystem namespace still?

Yes, I'd say we make the collectors mutually exclusive so you can use the same metric names where it makes sense

With this in mind, how many metrics/labels do we want to have? Some metrics in the darwin/netbsd/openbsd meminfo collectors are counters, should they remain counters (and thus a separate metric)?

The general best practices apply, so yeah we shouldn't mix counters and gauges. Only things where sum() makes sense should be labels in the same metric.
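As a rough sketch of that rule (not a proposal for actual metric names): the metric names, the `state` label, and the `sample` type below are all hypothetical, chosen only to show a gauge whose label values sum to something meaningful, with a counter kept in a separate metric.

```go
package main

import "fmt"

// sample is a hypothetical, simplified model of one exported time series.
type sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// sumByName adds up every series of the named metric across all label
// values, which is roughly what sum() over that metric would compute.
func sumByName(samples []sample, name string) float64 {
	var total float64
	for _, s := range samples {
		if s.Name == name {
			total += s.Value
		}
	}
	return total
}

func main() {
	samples := []sample{
		// Gauges in bytes that partition memory: summing them is meaningful.
		{Name: "example_memory_bytes", Labels: map[string]string{"state": "used"}, Value: 12288},
		{Name: "example_memory_bytes", Labels: map[string]string{"state": "free"}, Value: 4096},
		// A counter stays in its own metric instead of becoming another label value.
		{Name: "example_swapped_in_pages_total", Value: 42},
	}
	fmt.Println(sumByName(samples, "example_memory_bytes"))
}
```

Had the page counter been folded into `example_memory_bytes` as a third label value, the sum would mix bytes with a monotonically increasing page count and stop being meaningful.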

There's lots of downstream repos that will likely need to be updated to account for these changes (monitoring mixin rules, etc), and likely not all of them under the purview of the prometheus project itself. How to best communicate intended changes?

That's why I suggest a new collector (and marking the old one deprecated eventually): downstream projects can still use the old one but get warnings that it is deprecated.

rexagod commented 3 months ago

I'd be happy to review your PR, @tjhop! Feel free to tag me there once it's up! Godspeed! 👋🏼

SuperQ commented 3 months ago

I'd really rather avoid manually/explicitly parsing out the struct fields as it feels fragile and prone to breakage on procfs updates, so ideas welcome.

This is actually quite intentional, and the recommended way to do things in Go. Struct breakage is explicit at compile time, so it's quite stable.

Dynamic mapping, while common and convenient for the developer, is fragile. I much prefer explicit struct-to-metric mapping, as is done in other collectors. For example, take a look at the xfrm collector. It appears verbose, but it's explicit and compile-time safe.

I don't see a major need to create a new collector. Just convert the existing dynamic mapping to an explicit mapping.
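A minimal sketch of what explicit mapping can look like, assuming a simplified stand-in struct. The `meminfo` and `sample` types, the `explicitSamples` helper, and the metric names are hypothetical and not taken from the xfrm collector or the actual PR.

```go
package main

import "fmt"

// meminfo is a hypothetical stand-in for the library's parsed struct.
type meminfo struct {
	MemTotalBytes *uint64
	MemFreeBytes  *uint64
}

// sample pairs a metric name with its value.
type sample struct {
	Name  string
	Value float64
}

// explicitSamples maps each struct field to a metric by hand. If a field is
// renamed or removed upstream, this function stops compiling, so the
// breakage surfaces at build time instead of as a silently changed metric.
func explicitSamples(m meminfo) []sample {
	var out []sample
	if m.MemTotalBytes != nil {
		out = append(out, sample{Name: "MemTotal_bytes", Value: float64(*m.MemTotalBytes)})
	}
	if m.MemFreeBytes != nil {
		out = append(out, sample{Name: "MemFree_bytes", Value: float64(*m.MemFreeBytes)})
	}
	return out
}

func main() {
	total := uint64(16384)
	for _, s := range explicitSamples(meminfo{MemTotalBytes: &total}) {
		fmt.Printf("%s %g\n", s.Name, s.Value)
	}
}
```

The cost is one stanza per field, but the metric name lives next to the field it comes from and never depends on reflection or serialization behavior.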

discordianfish commented 3 months ago

I don't see a major need to create a new collector. Just convert the existing dynamic mapping to an explicit mapping.

Depends on whether we want to fix/change the metric names

tjhop commented 3 months ago

@SuperQ I've grown to agree with you since my last comment, re: explicit struct mapping and have taken that approach in the PR.

I'm happy to re-scope #3043 to just refactoring the existing meminfo collector while we further discuss whether or not to refactor the memory metrics and how to label them :+1:

SuperQ commented 3 months ago

Yea, let's just do the minimal migration and do any metric renaming as a separate task. Thanks!

tjhop commented 2 months ago

Circling back to this -- the node exporter has been updated to use procfs lib for the meminfo collector, so I believe the core of this issue is complete.

Are we ok with opening a new issue if/when it's time to discuss renaming the metrics?

prometheus / node_exporter

Feat: consider adopting procfs lib FS.Meminfo() for memory collector #2957