mknyszek opened 2 weeks ago
@dashpole Out of curiosity, would OpenTelemetry use the `Tags []string` field at all, or is there a desire to tightly control the metrics exported by default? Same question for @bwplotka for Prometheus. The purpose of this field is to stay in line with the spirit of this package, which is to make as much information programmatically available as possible.
CC @rhysh @bboreham @felixge @prattmic
> Out of curiosity, would OpenTelemetry use the `Tags []string` field at all, or is there a desire to tightly control the metrics exported by default?
We would likely use it as part of the tests for the package as a way to verify that we are exposing all of the recommended metrics to users. We would probably not use it for programmatic generation of new metrics.
As @rsc said in 2021 (https://github.com/golang/go/issues/43555), Expvar is a bit left behind at this point. JSON is a very popular format for developers. Can any decision about runtime.Metrics take Expvar and runtime.Memstats into consideration?
@MikeMitchellWebDev It's true that `expvar` is missing `runtime/metrics` data, but it's unclear which metrics should be added and how. Please reply on https://github.com/golang/go/issues/43555; this proposal is not the right place to discuss changing `expvar`. See also #61638, which isn't directly related, but maybe should be considered as well in any rethink of `expvar`.
Lastly, note that `runtime.MemStats` is generally a subset of `runtime/metrics`, and the latter should be preferred in general (for a number of reasons, including additional metrics as well as better performance). The only reason it is not officially deprecated is because it provides stronger guarantees than `runtime/metrics`. I admit I haven't been very good about updating the `MemStats` documentation to make this clear. I'll try to find some time this week to fix that.

EDIT: To be clear, I don't mean deprecating `runtime.MemStats`. That requires a proper proposal. Just documentation comparing and contrasting `runtime.MemStats` with the `runtime/metrics` package.
Hello hello, I'm Arthur from prometheus/client_golang team 👋
We have a different set of default metrics, and I believe we can't just change the default exposed metrics without a major version bump. One approach we could use is to have a configuration option in client_golang, e.g. "ExposeRecommendedMetrics", but I predict we'll get questions like "why aren't the recommended metrics the default?".

With that said, I like the idea of the Go team providing instructions about which metrics are worth the price to collect and store. I also like the idea of those instructions being programmatically available; we just need to evaluate whether a major version bump in client_golang is needed and whether it's worth the effort.
@ArthurSens for some historical perspective see https://github.com/prometheus/client_golang/pull/955 and https://github.com/prometheus/client_golang/pull/1033
My suggestion was that `client_golang` would offer three main options for what is exported: "exactly what it did in the previous version", "the Go recommended metrics" and "everything that Go exposes". I imagine we can cope with the questions.
> @ArthurSens for some historical perspective see prometheus/client_golang#955 and prometheus/client_golang#1033
>
> My suggestion was that `client_golang` would offer three main options for what is exported: "exactly what it did in the previous version", "the Go recommended metrics" and "everything that Go exposes". I imagine we can cope with the questions.
Thanks for the extra context! Yeah, I agree we can offer the recommended metrics in some way :)
@mknyszek for "Recommended" histogram metrics (currently just `/sched/latencies:seconds`), are the bucket boundaries on histograms guaranteed to remain stable (i.e. no buckets removed)?
> @mknyszek for "Recommended" histogram metrics (currently just `/sched/latencies:seconds`), are the bucket boundaries on histograms guaranteed to remain stable (i.e. no buckets removed)?
The proposal states that the "Recommended" set follows the guarantees of the `runtime/metrics` package:

```go
// For a given metric name, the value of Buckets is guaranteed not to change
// between calls until program exit.
```

from: https://cs.opensource.google/go/go/+/refs/tags/go1.22.3:src/runtime/metrics/histogram.go;l=26-27
I'm asking about general stability (e.g. across Go versions, or across multiple instances of an application).
@dashpole No, they're not guaranteed to remain stable and have changed across Go versions. We've definitely removed buckets before.
Thanks for this, great work!
> This is especially true for projects like https://github.com/open-telemetry/semantic-conventions/issues/535 and Prometheus which want to export some broadly-applicable Go runtime metrics by default, but the full set is overwhelming and not particularly user-friendly.
I wonder what is the exact intention and the end-goal behind this proposal. Is it to:
A. Convince the common instrumentation SDKs to give the Go team control over the default published metrics for the Go runtime? So the largest amount of Go applications possible have those common metrics OOTB, and adopt potential metrics changes as soon as they are rebuilt with a new Go version?
or...
B. To support a certain number of users who want to stay with the Go runtime "default" metrics, which might change from Go version to Go version, and who are fine with that.
C. Suggest what SDKs should add manually to the default set of metrics.
Picking a healthy, limited "recommended/default" set from the Go team definitely helps with all of those. I love the recommendation mechanism too; it looks easy to use to me. As co-maintainer of the Prometheus client_golang I fully support @ArthurSens' words around adding a programmatic option, e.g. `WithGoCollectorRecommendedMetrics()`, that uses this. However, that will only get you to B.
I wonder if A is realistic. If A is not possible at the moment, because e.g. OpenTelemetry and/or Prometheus client_golang (potentially popular metric SDKs) want to keep influence over what's default (the current status quo), then is this proposal still viable?
I think that to motivate SDKs to pursue A with the Go team, we need to learn more about the pros & cons here. What will users get out of it, versus SDKs manually adding some Go runtime metrics to their defaults based on user feedback and recent changes to the recommended set? One con would be potentially different stability guarantees across the Go team vs OTel vs Prometheus.
To sum up, is it A? Can we unpack pros & cons here for SDKs to assess those?
TL;DR: Those make sense. `/sched/latencies:seconds` feels the most controversial for Prometheus (usefulness vs cardinality), but only until we can put it in the new type (native histogram); then it should be fine.
Just to evaluate your proposed metrics and contribute to pros & cons of using Go recommended metrics as default, I diffed what client_golang has now vs recommended.
NOTE: All _memstats_ metrics actually come from the new Go runtime metrics; we just kept the names for stability (it's hard to rename a metric from the user's perspective).
| Default runtime metrics from client_golang | Recommended Go runtime metrics |
|---|---|
| go_gc_duration_seconds | |
| go_goroutines | /sched/goroutines:goroutines |
| go_info | |
| go_memstats_last_gc_time_seconds | |
| go_threads | |
| go_memstats_alloc_bytes | |
| go_memstats_alloc_bytes_total | /gc/heap/allocs:bytes |
| go_memstats_sys_bytes | /memory/classes/total:bytes |
| go_memstats_lookups_total | |
| go_memstats_mallocs_total | kind of /gc/heap/allocs:objects but with tiny allocs |
| go_memstats_frees_total | |
| go_memstats_heap_alloc_bytes | |
| go_memstats_heap_sys_bytes | |
| go_memstats_heap_idle_bytes | |
| go_memstats_heap_inuse_bytes | |
| go_memstats_heap_released_bytes | /memory/classes/heap/released:bytes |
| go_memstats_heap_objects | |
| go_memstats_stack_inuse_bytes | /memory/classes/heap/stacks:bytes |
| go_memstats_stack_sys_bytes | |
| go_memstats_mspan_inuse_bytes | |
| go_memstats_mspan_sys_bytes | |
| go_memstats_mcache_inuse_bytes | |
| go_memstats_mcache_sys_bytes | |
| go_memstats_buck_hash_sys_bytes | |
| go_memstats_gc_sys_bytes | |
| go_memstats_other_sys_bytes | |
| go_memstats_next_gc_bytes | /gc/heap/goal:bytes |
| | /gc/gogc:percent |
| | /gc/gomemlimit:bytes |
| | /sched/gomaxprocs:threads |
| | /sched/latencies:seconds |
To sum up, I think Prometheus is really close to the recommended ones, plus I would propose adding `/gc/gogc:percent`, `/gc/gomemlimit:bytes` and `/sched/gomaxprocs:threads` to the Prometheus Go collector runtime defaults, as those are important runtime variables to consider.
With that, it's only `/sched/latencies:seconds` left, so having the current default PLUS the (deduplicated) recommended set as our default might be a potential option to consider, depending on:
> I wonder what is the exact intention and the end-goal behind this proposal. Is it to:
>
> A. Convince the common instrumentation SDKs to give the Go team control over the default published metrics for the Go runtime? So the largest amount of Go applications possible have those common metrics OOTB, and adopt potential metrics changes as soon as they are rebuilt with a new Go version?
>
> or...
>
> B. To support a certain number of users who want to stay with the Go runtime "default" metrics that might change from Go version to Go version, and who are fine with that.
> C. Suggest what SDKs should add manually to the default set of metrics.
It's really C, in practice. B is nice for those that want it, but I don't think A is practical. Everyone is always going to be free to choose what metrics they collect and/or expose at any layer.
Really I think we're just trying to set a better foundation here than the existing, somewhat haphazard, "collect `MemStats` and a few other metrics from some of the other `runtime` and `runtime/debug` functions" that is fairly widespread at this point. It's a starting point for what I hope will be a slow-but-virtuous cycle of the ecosystem informing the recommended set, and the recommended set informing the ecosystem, so we get a high signal-to-noise ratio for observability.
> [...] Those [recommended metrics] make sense. `/sched/latencies:seconds` feels the most controversial for Prometheus (usefulness vs cardinality), but only until we can put it in the new type (native histogram), then it should be fine.
FWIW, my thought was that SDKs can just choose to programmatically skip inherently high-cardinality types, like `Float64Histogram`.
> To sum up, I think Prometheus is really close to the recommended ones, plus I would propose adding `/gc/gogc:percent`, `/gc/gomemlimit:bytes` and `/sched/gomaxprocs:threads` to the Prometheus Go collector runtime defaults, as those are important runtime variables to consider.
That's a good sign IMO. I'm supportive of adding those. While they're likely to stay exactly the same over time, the fact is that they can change at runtime. As above, re: `/sched/latencies:seconds`, I think it's fine if SDKs want to leave out certain metrics because they pose problems for collection.
## Introduction

With each Go release, the set of metrics exported by the `runtime/metrics` package grows in size. Not all metrics are applicable to all cases, and it can become difficult to identify which metrics are actually useful. This is especially true for projects like OpenTelemetry and Prometheus, which want to export some broadly-applicable Go runtime metrics by default, but the full set is overwhelming and not particularly user-friendly.

Another problem with collecting all metrics is cost. The cardinality of the default metric set is closely watched by projects like Prometheus, because downstream users are often paying the storage costs of these metrics when making use of hosted solutions.

This issue proposes defining a conservative subset of runtime metrics that are broadly applicable, and a simple mechanism for discovering them programmatically.
## Proposal

There are two parts to this proposal: the categorization of some metrics as "recommended" by the Go toolchain, and the actual mechanism for that categorization.

To start with, I would like to propose documenting such a set of metrics as "recommended" at the top of the `runtime/metrics` documentation. Each metric is required to have a full rationale explaining its utility and use-cases. The "recommended" set is intended to hold a lot of weight, so we need to make sure the reason why we promote a particular metric is well-documented. The "recommended" set of metrics generally follows the compatibility guarantees of the `runtime/metrics` package. That being said, a metric is unlikely to be promoted to "recommended" if it's not likely to just exist indefinitely. Still, we reserve the right to remove them.

Next, we'll add a `Tags []string` field to `metric.Description` so that these metrics can be found programmatically. We could get by with a simple boolean field, but that's inflexible. In particular, what I'd like to avoid is having dedicated fields for future categorizations such that they end up non-orthogonal and confusing. The tag indicating the default set will be the string "recommended".
## Proposed initial metrics

Below is an initial proposed set of metrics. This list is intended to be a conservative and uncontroversial set of metrics that have clear real-world use-cases.

- `/gc/gogc:percent` - `GOGC`.
- `/gc/gomemlimit:bytes` - `GOMEMLIMIT`.
- `/gc/heap/allocs:bytes` - Total bytes allocated.
- `/gc/heap/allocs:objects` - Total individual allocations made.
- `/gc/heap/goal:bytes` - GC heap goal. Tracks the effect of `GOGC` and `GOMEMLIMIT`, and is a close approximation of heap memory footprint.
- `/memory/classes/heap/released:bytes` - Current count of heap bytes that are released back to the OS but which remain mapped. Useful in conjunction with `GOMEMLIMIT`. It is also necessary to understand what the runtime believes its own physical memory footprint is, as a subtraction from the total.
- `/memory/classes/heap/stacks:bytes` - Current count of bytes allocated to goroutine stacks.
- `/memory/classes/total:bytes` - Total Go runtime memory footprint. Useful in conjunction with `GOMEMLIMIT`. It's also useful for identifying "other" memory, and together with `/memory/classes/heap/released:bytes`, what the runtime believes the physical memory footprint of the application is.
- `/sched/gomaxprocs:threads` - `GOMAXPROCS`.
- `/sched/goroutines:goroutines` - Current count of live goroutines (blocked, running, etc.).
- `/sched/latencies:seconds` - Distribution of time goroutines spend runnable (that is, not blocked), but not running.

This results in 10 `uint64` metrics and 1 `Float64Histogram` metric in the default set, a significant reduction from the 81 metrics currently exported by the package.

Here are a few other metrics that were not included.
- `/memory/classes/heap/objects:bytes` - Current count of bytes allocated.
- `/memory/classes/metadata/other:bytes` - Runtime metadata, mostly GC metadata.
- `/gc/heap/frees:bytes` - Total bytes freed. The complement of `/gc/heap/allocs:bytes`. Not that useful on its own, and live+unswept heap memory isn't a terribly useful metric since it tends to be noisy and misleading, subject to sweep scheduling nuances. The heap goal is a much more reliable measure of total heap footprint.
- `/gc/heap/frees:objects` - Total individual allocations freed. The complement of `/gc/heap/allocs:objects`. Not that useful on its own, and the number of live objects on its own also isn't that useful. Together with `/gc/heap/frees:objects`, `/gc/heap/allocs:bytes`, and `/gc/heap/frees:bytes` it can be used to calculate average object size, but that's also not very useful on its own. The distribution of object sizes is more useful, but that metric is currently incomplete, as it currently buckets all objects >32 KiB in size together.
- `/godebug/non-default/*` - Count of instances of a behavior change due to a `GODEBUG` setting.

## Alternatives
### Only documenting the recommended set

One alternative is to only document the set of recommended metrics. This is fine, but it also runs counter to `runtime/metrics`' original goal of being able to discover metrics programmatically. Some mechanism here seems necessary to keep the package useful to both humans and computers.

### A toolchain-versioned default metrics set
Originally, we had considered an API (for example, `metrics.Recommended(...)`) that accepted a Go toolchain version and would return the set of default metrics (specifically, a `[]metrics.Description`) for that version. All the metrics within would always be valid to pass to `metrics.Read`.

You could also imagine this set being controlled via the language version set in the `go.mod`, indirectly via `GODEBUG` flags. (That is, every time we would change this set, we'd add a valid value to `GODEBUG`, specifically something like `GODEBUG=runtimemetricsgo121=1`.)

Unfortunately, there are already a lot of questions here about stability and versioning. Not least of which is the fact that toolchain versions, at least those reported by the `runtime/debug` package, aren't very structured.

Furthermore, this is a type of categorization that doesn't really compose well. If we ever wanted new categories, we'd need to define a new API, or possibly dummy toolchain strings. It's also a much more complicated change.