proposal: runtime/metrics: define a recommended set of metrics

mknyszek commented 2 weeks ago

Introduction

With each Go release the set of metrics exported by the runtime/metrics package grows in size. Not all metrics are applicable to all cases, and it can become difficult to identify which metrics are actually useful. This is especially true for projects like OpenTelemetry and Prometheus which want to export some broadly-applicable Go runtime metrics by default, but the full set is overwhelming and not particularly user-friendly.

Another problem with collecting all metrics is cost. The cardinality of the default metric set is closely watched by projects like Prometheus, because downstream users are often paying for the storage costs of these metrics when making use of hosted solutions.

This issue proposes defining a conservative subset of runtime metrics that are broadly applicable, and a simple mechanism for discovering them programmatically.

Proposal

There are two parts to this proposal: the categorization of some metrics as "recommended" by the Go toolchain, and the actual mechanism for that categorization.

To start with, I would like to propose documenting such a set of metrics as "recommended" at the top of the runtime/metrics documentation. Each metric is required to have a full rationale explaining its utility and use-cases. The "recommended" set is intended to hold a lot of weight. We need to make sure the reason why we promote a particular metric is well-documented. The "recommended" set of metrics generally follows the compatibility guarantees of the runtime/metrics package. That being said, a metric is unlikely to be promoted to "recommended" if it's not likely to just exist indefinitely. Still, we reserve the right to remove them.

Next, we'll add a Tags []string field to metrics.Description so that these metrics can be found programmatically. We could get by with a simple boolean field, but that's inflexible. In particular, what I'd like to avoid is having dedicated fields for future categorizations such that they end up non-orthogonal and confusing.

The tag indicating the default set will be the string "recommended".
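
For illustration only, here is a sketch of how a consumer might select metrics by tag. The Tags field does not exist in runtime/metrics today, so the taggedDescription type, the describeAll and withTag helpers, and the hard-coded name list are hypothetical stand-ins that merely simulate what the runtime would report.

```go
package main

import (
	"fmt"
	"runtime/metrics"
	"slices"
)

// taggedDescription is a hypothetical stand-in for metrics.Description
// extended with the proposed Tags field. It is not part of runtime/metrics.
type taggedDescription struct {
	metrics.Description
	Tags []string
}

// recommendedNames hard-codes the initial set proposed in this issue. It only
// simulates the information the runtime would expose through Tags.
var recommendedNames = []string{
	"/gc/gogc:percent",
	"/gc/gomemlimit:bytes",
	"/gc/heap/allocs:bytes",
	"/gc/heap/allocs:objects",
	"/gc/heap/goal:bytes",
	"/memory/classes/heap/released:bytes",
	"/memory/classes/heap/stacks:bytes",
	"/memory/classes/total:bytes",
	"/sched/gomaxprocs:threads",
	"/sched/goroutines:goroutines",
	"/sched/latencies:seconds",
}

// describeAll wraps metrics.All and attaches the simulated "recommended" tag
// to every metric that this Go version supports and that is in the list above.
func describeAll() []taggedDescription {
	var out []taggedDescription
	for _, d := range metrics.All() {
		td := taggedDescription{Description: d}
		if slices.Contains(recommendedNames, d.Name) {
			td.Tags = []string{"recommended"}
		}
		out = append(out, td)
	}
	return out
}

// withTag returns the names of all descriptions carrying the given tag.
func withTag(descs []taggedDescription, tag string) []string {
	var names []string
	for _, d := range descs {
		if slices.Contains(d.Tags, tag) {
			names = append(names, d.Name)
		}
	}
	return names
}

func main() {
	for _, name := range withTag(describeAll(), "recommended") {
		fmt.Println(name)
	}
}
```

A slice of tags keeps the lookup the same if further categories are added later, which is the flexibility argument made above.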

Proposed initial metrics

Below is an initial proposed set of metrics. This list is intended to be a conservative and uncontroversial set of metrics that have clear real-world use-cases.

/gc/gogc:percent - GOGC.
- Rationale: This metric describes the GOGC parameter to the runtime, which sets the CPU/memory trade-off of the GC.
/gc/gomemlimit:bytes - GOMEMLIMIT.
- Rationale: This metric describes the GOMEMLIMIT parameter to the runtime, which sets a soft memory limit for the runtime.
/gc/heap/allocs:bytes - Total bytes allocated.
- Rationale: This metric may be used to derive an allocation rate in bytes/second, which is useful in understanding GC resource cost impact. In particular, it's useful for diagnosing regressions in production.
/gc/heap/allocs:objects - Total individual allocations made.
- Rationale: This metric may be used to derive an allocation rate in objects/second, which is useful in understanding memory allocation resource cost impact. In particular, it's useful for diagnosing regressions in production.
/gc/heap/goal:bytes - GC heap goal.
- Rationale: This metric is useful for understanding GC behavior, especially when tuning GOGC and GOMEMLIMIT, and is a close approximation of heap memory footprint.
/memory/classes/heap/released:bytes - Current count of heap bytes that are released back to the OS but which remain mapped.
- Rationale: This metric is necessary for tuning GOMEMLIMIT. It is also necessary to understand what the runtime believes its own physical memory footprint is, as a subtraction from the total.
/memory/classes/heap/stacks:bytes - Current count of bytes allocated to goroutine stacks.
- Rationale: This metric is necessary for understanding application memory footprints, specifically those that users have some control over and may seek to optimize.
/memory/classes/total:bytes - Total Go runtime memory footprint.
- Rationale: This metric is necessary for tuning GOMEMLIMIT. It's also useful for identifying "other" memory, and together with /memory/classes/heap/released:bytes, what the runtime believes the physical memory footprint of the application is.
/sched/gomaxprocs:threads - GOMAXPROCS.
- Rationale: This metric is a core runtime parameter representing the available parallelism to the application.
/sched/goroutines:goroutines - Current count of live goroutines (blocked, running, etc.).
- Rationale: This metric is useful as a proxy for active work units in many circumstances, though it also includes leaks. Supplement with a goroutine profile for more detail, or app-specific concurrency counters (for example, to track the number of active http.Handlers).
/sched/latencies:seconds - Distribution of time goroutines spend runnable (that is, not blocked), but not running.
- Rationale: This metric is a measure of scheduling latency that is useful as a fine-grained proxy for overall system load. For example, diffing the distribution over short time windows can provide visibility into the latency impact of uneven load. This metric is battle-tested and has been found to be useful in a variety of scenarios.
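
Two of the rationales above describe derived quantities: an allocation rate from /gc/heap/allocs:bytes, and the runtime's view of its physical footprint as /memory/classes/total:bytes minus /memory/classes/heap/released:bytes. A minimal sketch of both follows. The helper names (readUint64, ratePerSecond, estimatedPhysical) and the 50 ms sampling window are illustrative choices, not part of the proposal.

```go
package main

import (
	"fmt"
	"runtime/metrics"
	"time"
)

// readUint64 reads one metric. It reports false if the running Go version
// does not export the metric as a uint64.
func readUint64(name string) (uint64, bool) {
	s := []metrics.Sample{{Name: name}}
	metrics.Read(s)
	if s[0].Value.Kind() != metrics.KindUint64 {
		return 0, false
	}
	return s[0].Value.Uint64(), true
}

// ratePerSecond converts the growth of a cumulative counter over the elapsed
// time into a per-second rate. It returns 0 for unusable inputs.
func ratePerSecond(prev, cur uint64, elapsed time.Duration) float64 {
	if elapsed <= 0 || cur < prev {
		return 0
	}
	return float64(cur-prev) / elapsed.Seconds()
}

// estimatedPhysical subtracts released heap memory from the total footprint.
func estimatedPhysical(total, released uint64) uint64 {
	if released > total {
		return 0
	}
	return total - released
}

// sink keeps the allocations below from being optimized away.
var sink []byte

func main() {
	before, ok := readUint64("/gc/heap/allocs:bytes")
	if !ok {
		fmt.Println("/gc/heap/allocs:bytes is not available")
		return
	}
	start := time.Now()
	for time.Since(start) < 50*time.Millisecond {
		sink = make([]byte, 4096) // generate some allocation traffic
	}
	after, _ := readUint64("/gc/heap/allocs:bytes")
	fmt.Printf("allocation rate: %.0f bytes/s\n", ratePerSecond(before, after, time.Since(start)))

	total, okTotal := readUint64("/memory/classes/total:bytes")
	released, okReleased := readUint64("/memory/classes/heap/released:bytes")
	if okTotal && okReleased {
		fmt.Printf("estimated physical footprint: %d bytes\n", estimatedPhysical(total, released))
	}
}
```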

This results in 10 uint64 metrics and 1 Float64Histogram metric in the default set, a significant reduction from the 81 metrics currently exported by the package.
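
The kind breakdown can be checked against a running program. The sketch below reads the eleven proposed names and tallies their value kinds; the countKinds helper and the recommended slice are illustrative names. On a Go version that lacks some of these metrics, the missing ones are reported as KindBad and land in the "other" bucket.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// recommended is the initial set proposed in this issue.
var recommended = []string{
	"/gc/gogc:percent",
	"/gc/gomemlimit:bytes",
	"/gc/heap/allocs:bytes",
	"/gc/heap/allocs:objects",
	"/gc/heap/goal:bytes",
	"/memory/classes/heap/released:bytes",
	"/memory/classes/heap/stacks:bytes",
	"/memory/classes/total:bytes",
	"/sched/gomaxprocs:threads",
	"/sched/goroutines:goroutines",
	"/sched/latencies:seconds",
}

// countKinds reads the named metrics and tallies their value kinds.
func countKinds(names []string) (uint64s, histograms, other int) {
	samples := make([]metrics.Sample, len(names))
	for i, name := range names {
		samples[i].Name = name
	}
	metrics.Read(samples)
	for _, s := range samples {
		switch s.Value.Kind() {
		case metrics.KindUint64:
			uint64s++
		case metrics.KindFloat64Histogram:
			histograms++
		default:
			other++ // includes KindBad for names this Go version does not know
		}
	}
	return uint64s, histograms, other
}

func main() {
	u, h, o := countKinds(recommended)
	fmt.Printf("uint64: %d, histogram: %d, other/unsupported: %d\n", u, h, o)
}
```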

Here are a few other metrics that were not included.

/memory/classes/heap/objects:bytes - Current count of bytes allocated.
- Rationale: It is already possible to derive this from total allocations and frees.
/memory/classes/metadata/other:bytes - Runtime metadata, mostly GC metadata.
- Rationale: We expect to break this category out as specific things that are useful to measure come up. This does not indicate good longevity.
/gc/heap/frees:bytes - Total bytes freed.
- Rationale: This metric may be used to compute the total amount of live+unswept heap memory, with /gc/heap/allocs:bytes. Not that useful on its own, and live+unswept heap memory isn't a terribly useful metric since it tends to be noisy and misleading, subject to sweep scheduling nuances. The heap goal is a much more reliable measure of total heap footprint.
/gc/heap/frees:objects - Total individual allocations freed.
- Rationale: This metric may be used to compute the total number of live objects, with /gc/heap/allocs:objects. Not that useful on its own, and the number of live objects also isn't that useful. Together with /gc/heap/allocs:objects, /gc/heap/allocs:bytes, and /gc/heap/frees:bytes it can be used to calculate average object size, but that's also not very useful by itself. The distribution of object sizes is more useful, but the metric is currently incomplete, as it buckets all objects >32 KiB in size together.
/godebug/non-default/* - Count of instances of a behavior change due to a GODEBUG setting.
- Rationale: While counting instances of non-default behavior is important, these particular metrics are intended to be used on a case-by-case basis. Consider a team upgrading their go.mod version. The default behavior of their programs may change due to the upgrade, but because they're using the new defaults, these metrics won't actually be updated. If something goes wrong due to the new defaults, these metrics aren't that helpful in identifying that it's due to the new default behavior. Instead, these metrics are helpful for eliminating remaining sources of non-default behavior once opted-in.
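
For anyone who does want the derived values mentioned in the allocs/frees rationales above, the arithmetic is a pair of subtractions and a division. This is a sketch under one interpretation of "average object size" (live+unswept bytes divided by live+unswept objects); the liveAverage helper is an illustrative name.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// liveAverage derives live+unswept bytes, live+unswept objects, and their
// ratio from the four cumulative allocs/frees counters.
func liveAverage(allocBytes, freeBytes, allocObjs, freeObjs uint64) (liveBytes, liveObjs uint64, avg float64) {
	if freeBytes > allocBytes || freeObjs > allocObjs {
		return 0, 0, 0 // inconsistent inputs
	}
	liveBytes = allocBytes - freeBytes
	liveObjs = allocObjs - freeObjs
	if liveObjs > 0 {
		avg = float64(liveBytes) / float64(liveObjs)
	}
	return liveBytes, liveObjs, avg
}

func main() {
	// Reading all four in one call keeps the counters close to each other in time.
	samples := []metrics.Sample{
		{Name: "/gc/heap/allocs:bytes"},
		{Name: "/gc/heap/frees:bytes"},
		{Name: "/gc/heap/allocs:objects"},
		{Name: "/gc/heap/frees:objects"},
	}
	metrics.Read(samples)
	for _, s := range samples {
		if s.Value.Kind() != metrics.KindUint64 {
			fmt.Println(s.Name, "is not available as a uint64")
			return
		}
	}
	b, o, avg := liveAverage(
		samples[0].Value.Uint64(), samples[1].Value.Uint64(),
		samples[2].Value.Uint64(), samples[3].Value.Uint64(),
	)
	fmt.Printf("live+unswept: %d bytes in %d objects (avg %.1f bytes)\n", b, o, avg)
}
```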

Alternatives

Only documenting the recommended set

One alternative is to only document the set of recommended metrics. This is fine, but it also runs counter to runtime/metrics' original goal of being able to discover metrics programmatically. Some mechanism here seems necessary to keep the package useful to both humans and computers.

A toolchain-versioned default metrics set

Originally, we had considered an API (for example, metrics.Recommended(...)) that accepted a Go toolchain version and would return the set of default metrics (specifically, a []metrics.Description) for that version. All the metrics within would always be valid to pass to metrics.Read.

You could also imagine this set being controlled via the language version set in the go.mod indirectly via GODEBUG flags. (That is, every time we would change this set, we'd add a valid value to GODEBUG. Specifically something like GODEBUG=runtimemetricsgo121=1.)

Unfortunately, there are already a lot of questions here about stability and versioning, not least of which is the fact that toolchain versions, at least those reported by the runtime/debug package, aren't very structured.

Furthermore, this is a type of categorization that doesn't really compose well. If we ever wanted new categories, we'd need to define a new API, or possibly dummy toolchain strings. It's also a much more complicated change.

mknyszek commented 2 weeks ago

@dashpole Out of curiosity, would OpenTelemetry use the Tags []string field at all, or is there a desire to tightly control the metrics exported by default? Same question for @bwplotka for Prometheus. The purpose of this field is to stay in line with the spirit of this package, which is to make as much information programmatically available as possible.

mknyszek commented 2 weeks ago

CC @rhysh @bboreham @felixge @prattmic

dashpole commented 2 weeks ago

Out of curiosity, would OpenTelemetry use the Tags []string field at all, or is there a desire to tightly control the metrics exported by default?

We would likely use it as part of the tests for the package as a way to verify that we are exposing all of the recommended metrics to users. We would probably not use it for programmatic generation of new metrics.

MikeMitchellWebDev commented 2 weeks ago

As @rsc said in 2021 (https://github.com/golang/go/issues/43555), Expvar is a bit left behind at this point. JSON is a very popular format for developers. Can any decision about runtime/metrics take Expvar and runtime.MemStats into consideration?

mknyszek commented 2 weeks ago

@MikeMitchellWebDev It's true that expvar is missing runtime/metrics data, but it's unclear which metrics should be added and how. Please reply on https://github.com/golang/go/issues/43555; this proposal is not the right place to discuss changing expvar. See also #61638 which isn't directly related, but maybe should be considered as well in any rethink of expvar.

Lastly, note that runtime.MemStats is generally a subset of runtime/metrics, and the latter should be preferred in general (for a number of reasons, including additional metrics as well as better performance). The only reason it is not officially deprecated is that it provides stronger guarantees than runtime/metrics. I admit I haven't been very good about updating the MemStats documentation to make this clear. I'll try to find some time this week to fix that.

EDIT: To be clear, I don't mean deprecating runtime.MemStats. That requires a proper proposal. Just documentation comparing and contrasting runtime.MemStats with the runtime/metrics package.
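
To make the overlap concrete, here is a sketch that reads the total memory footprint both ways. Pairing MemStats.Sys with /memory/classes/total:bytes follows the go_memstats_sys_bytes row of the comparison table later in this thread; treat that pairing as an assumption, and note that the two reads happen at slightly different times, so the values need not match exactly. The totalBothWays helper is an illustrative name.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/metrics"
)

// totalBothWays returns the runtime's total memory footprint as reported by
// runtime.MemStats and by runtime/metrics.
func totalBothWays() (fromMemStats, fromMetrics uint64) {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms) // briefly stops the world

	s := []metrics.Sample{{Name: "/memory/classes/total:bytes"}}
	metrics.Read(s)
	if s[0].Value.Kind() == metrics.KindUint64 {
		fromMetrics = s[0].Value.Uint64()
	}
	return ms.Sys, fromMetrics
}

func main() {
	a, b := totalBothWays()
	fmt.Printf("MemStats.Sys: %d bytes\n", a)
	fmt.Printf("/memory/classes/total:bytes: %d bytes\n", b)
}
```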

ArthurSens commented 2 weeks ago

Hello hello, I'm Arthur from prometheus/client_golang team 👋

We have a different set of default metrics and I believe we can't just change the default exposed metrics without a major version bump. One approach we could use is to have a configuration option in client_golang "ExposeRecommendedMetrics", but I predict we'll have questions like "why the recommended metrics aren't the default?".

With that said, I like the idea of the Go team providing instructions about what metrics are worth paying the price to collect and store. I also like the idea of those instructions being programmatically available. We just need to evaluate if there's a need for a major version bump in client_golang and if it's worth the effort.

bboreham commented 2 weeks ago

@ArthurSens for some historical perspective see https://github.com/prometheus/client_golang/pull/955 and https://github.com/prometheus/client_golang/pull/1033

My suggestion was that client_golang would offer three main options for what is exported: "exactly what it did in the previous version", "the Go recommended metrics" and "everything that Go exposes". I imagine we can cope with the questions.

ArthurSens commented 2 weeks ago

@ArthurSens for some historical perspective see prometheus/client_golang#955 and prometheus/client_golang#1033

My suggestion was that client_golang would offer three main options for what is exported: "exactly what it did in the previous version", "the Go recommended metrics" and "everything that Go exposes". I imagine we can cope with the questions.

Thanks for the extra context! Yeah, I agree we can offer the recommended metrics in some way :)

dashpole commented 1 week ago

@mknyszek for "Recommended" histogram metrics (currently just /sched/latencies:seconds), are the bucket boundaries on histograms guaranteed to remain stable (i.e. no buckets removed)?

arl commented 1 week ago

@mknyszek for "Recommended" histogram metrics (currently just /sched/latencies:seconds), are the bucket boundaries on histograms guaranteed to remain stable (i.e. no buckets removed)?

The proposal states that the 'Recommended' set follows the guarantees of the "runtime/metrics" package:

    // For a given metric name, the value of Buckets is guaranteed not to change
    // between calls until program exit.

from: https://cs.opensource.google/go/go/+/refs/tags/go1.22.3:src/runtime/metrics/histogram.go;l=26-27

dashpole commented 1 week ago

I'm asking about general stability (e.g. across Go versions, or across multiple instances of an application).

mknyszek commented 1 week ago

@dashpole No, they're not guaranteed to remain stable and have changed across Go versions. We've definitely removed buckets before.
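
Since bucket boundaries can differ between Go versions, code that diffs histogram snapshots may want to verify that both snapshots share a layout before subtracting counts. The sketch below does this for /sched/latencies:seconds. The histSnapshot type and the snapshot and diffCounts helpers are illustrative names. Within a single process the check never fails, because Buckets is fixed until program exit, so it matters mainly when snapshots from different builds are compared.

```go
package main

import (
	"fmt"
	"runtime/metrics"
	"slices"
	"sync"
)

// histSnapshot is a private copy of a histogram, so that a later
// metrics.Read cannot alias or overwrite it.
type histSnapshot struct {
	Buckets []float64
	Counts  []uint64
}

// snapshot reads a histogram metric and copies it. It reports false if the
// metric is not a Float64Histogram in the running Go version.
func snapshot(name string) (histSnapshot, bool) {
	s := []metrics.Sample{{Name: name}}
	metrics.Read(s)
	if s[0].Value.Kind() != metrics.KindFloat64Histogram {
		return histSnapshot{}, false
	}
	h := s[0].Value.Float64Histogram()
	return histSnapshot{Buckets: slices.Clone(h.Buckets), Counts: slices.Clone(h.Counts)}, true
}

// diffCounts returns per-bucket deltas between two snapshots. It reports
// false if the bucket layouts differ or a count went backwards.
func diffCounts(prev, cur histSnapshot) ([]uint64, bool) {
	if !slices.Equal(prev.Buckets, cur.Buckets) || len(prev.Counts) != len(cur.Counts) {
		return nil, false
	}
	delta := make([]uint64, len(cur.Counts))
	for i := range delta {
		if cur.Counts[i] < prev.Counts[i] {
			return nil, false
		}
		delta[i] = cur.Counts[i] - prev.Counts[i]
	}
	return delta, true
}

func main() {
	const name = "/sched/latencies:seconds"
	before, ok := snapshot(name)
	if !ok {
		fmt.Println(name, "is not available as a histogram")
		return
	}

	// Generate some scheduling activity between the two snapshots.
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() { defer wg.Done() }()
	}
	wg.Wait()

	after, _ := snapshot(name)
	delta, ok := diffCounts(before, after)
	if !ok {
		fmt.Println("bucket layout changed; refusing to diff")
		return
	}
	var events uint64
	for _, c := range delta {
		events += c
	}
	fmt.Println("scheduling latency samples in window:", events)
}
```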

bwplotka commented 1 week ago

Thanks for this, great work!

What's the end goal?

This is especially true for projects like OpenTelemetry (https://github.com/open-telemetry/semantic-conventions/issues/535) and Prometheus which want to export some broadly-applicable Go runtime metrics by default, but the full set is overwhelming and not particularly user-friendly.

I wonder what is the exact intention and the end-goal behind this proposal. Is it to:

A. Convince the common instrumentation SDKs to give the Go team control over the default published metrics for the Go runtime? So the largest amount of Go applications possible have those common metrics OOTB, and adopt potential metrics changes as soon as they are rebuilt with a new Go version?

or...

B. To support a certain number of users who want to stay with the Go runtime "default" metrics, which might change from one Go version to the next, and who are fine with that?

C. Suggest what SDKs should add manually to the default set of metrics?

Picking a healthy, limited "recommended/default" set from the Go team definitely helps with all of those. I love the recommendation mechanism too; it looks easy to use. As co-maintainer of the Prometheus client_golang, I fully support @ArthurSens's words around adding a programmatic option, e.g. WithGoCollectorRecommendedMetrics(), that uses this. However, that will only get you to B.

I wonder if A is realistic. If A is not possible at the moment, because e.g. OpenTelemetry and/or Prometheus client_golang (potentially popular metric SDKs) want to keep the influence on what's default (the current status quo), then is this proposal still viable?

I think to motivate SDKs to pursue A with the Go team, we need to learn more about the pros & cons here. What will users get out of it vs. an SDK manually adding some Go runtime metrics to its defaults based on user feedback and recent changes to the recommended set? One con would be potentially different stability guarantees across the Go team vs. Otel vs. Prometheus.

To sum up, is it A? Can we unpack pros & cons here for SDKs to assess those?

Recommended Metrics

TL;DR: Those make sense. /sched/latencies:seconds feels the most controversial for Prometheus, (usefulness vs cardinality) but only until we can put it in the new type (native histogram), then it should be fine.

Just to evaluate your proposed metrics and contribute to pros & cons of using Go recommended metrics as default, I diffed what client_golang has now vs recommended.

NOTE: All _memstats_ metrics come actually from the new Go runtime metrics, we just kept the name for stability (it's hard to rename metric from the user perspective).

| Default runtime metrics from client_golang | Recommended Go runtime |
| --- | --- |
| go_gc_duration_seconds | |
| go_goroutines | /sched/goroutines:goroutines |
| go_info | |
| go_memstats_last_gc_time_seconds | |
| go_threads | |
| go_memstats_alloc_bytes | |
| go_memstats_alloc_bytes_total | /gc/heap/allocs:bytes |
| go_memstats_sys_bytes | /memory/classes/total:bytes |
| go_memstats_lookups_total | |
| go_memstats_mallocs_total | kind of /gc/heap/allocs:objects but with tiny allocs |
| go_memstats_frees_total | |
| go_memstats_heap_alloc_bytes | |
| go_memstats_heap_sys_bytes | |
| go_memstats_heap_idle_bytes | |
| go_memstats_heap_inuse_bytes | |
| go_memstats_heap_released_bytes | /memory/classes/heap/released:bytes |
| go_memstats_heap_objects | |
| go_memstats_stack_inuse_bytes | /memory/classes/heap/stacks:bytes |
| go_memstats_stack_sys_bytes | |
| go_memstats_mspan_inuse_bytes | |
| go_memstats_mspan_sys_bytes | |
| go_memstats_mcache_inuse_bytes | |
| go_memstats_mcache_sys_bytes | |
| go_memstats_buck_hash_sys_bytes | |
| go_memstats_gc_sys_bytes | |
| go_memstats_other_sys_bytes | |
| go_memstats_next_gc_bytes | /gc/heap/goal:bytes |
| | /gc/gogc:percent |
| | /gc/gomemlimit:bytes |
| | /sched/gomaxprocs:threads |
| | /sched/latencies:seconds |

To sum up, I think Prometheus is really close to the recommended ones, plus I would propose adding /gc/gogc:percent, /gc/gomemlimit:bytes and /sched/gomaxprocs:threads to the Prometheus Go collector runtime default as those are important runtime variables to consider.

With that, it's only /sched/latencies:seconds left, so having the current default PLUS the (deduplicated) recommended set as our default might be a potential option to consider depending on:

the pros & cons discussion I proposed above
stability guarantees
other team members sentiments

mknyszek commented 19 hours ago

I wonder what is the exact intention and the end-goal behind this proposal. Is it to:

A. Convince the common instrumentation SDKs to give the Go team control over the default published metrics for the Go runtime? So the largest amount of Go applications possible have those common metrics OOTB, and adopt potential metrics changes as soon as they are rebuilt with a new Go version?

or...

B. To support a certain number of users who want to stay with the Go runtime "default" metrics, which might change from one Go version to the next, and who are fine with that?

C. Suggest what SDKs should add manually to the default set of metrics?

It's really C, in practice. B is nice for those that want it, but I don't think A is practical. Everyone is always going to be free to choose what metrics they collect and/or expose at any layer.

Really I think we're just trying to set a better foundation here than the existing, somewhat haphazard, "collect MemStats and a few other metrics from some of the other runtime and runtime/debug functions" that is fairly widespread at this point. It's a starting point for what I hope will be a slow-but-virtuous cycle of the ecosystem informing the recommended set, and the recommended set informing the ecosystem, so we get a high signal-to-noise ratio for observability.

[...] Those [recommended metrics] make sense. /sched/latencies:seconds feels the most controversial for Prometheus, (usefulness vs cardinality) but only until we can put it in the new type (native histogram), then it should be fine.

FWIW, my thought was that SDKs can just choose to skip inherently high cardinality types programmatically, like Float64Histogram.
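
As a sketch of what that could look like, the following filters metrics.All() down to the non-histogram metrics using the existing Kind field of metrics.Description. The scalarMetricNames helper is an illustrative name, and an SDK would presumably combine this with whatever other selection it applies.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// scalarMetricNames returns the names of all supported metrics whose value
// is not a Float64Histogram.
func scalarMetricNames() []string {
	var names []string
	for _, d := range metrics.All() {
		if d.Kind == metrics.KindFloat64Histogram {
			continue // skip inherently high-cardinality values
		}
		names = append(names, d.Name)
	}
	return names
}

func main() {
	for _, name := range scalarMetricNames() {
		fmt.Println(name)
	}
}
```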

To sum up, I think Prometheus is really close to the recommended ones, plus I would propose adding /gc/gogc:percent, /gc/gomemlimit:bytes and /sched/gomaxprocs:threads to the Prometheus Go collector runtime default as those are important runtime variables to consider.

That's a good sign IMO. I'm supportive of adding those. While they're likely to stay exactly the same over time, the fact is that they can be mutated at runtime, including automatically. As above, re: /sched/latencies:seconds, I think it's fine if SDKs want to leave out certain metrics because they pose problems for collection.
