hashicorp / go-metrics

A Golang library for exporting performance and runtime metrics to external metrics systems (i.e. statsite, statsd)
MIT License
1.46k stars 177 forks source link

add NaNExpiry option and behavior to PrometheusSink #119

Closed mkcp closed 4 years ago

mkcp commented 4 years ago

Also adds doccomments to structs.

The reasoning for this change is quoted from the PrometheusOpts docs:

NanExpire is a flag that allows us to "zero" out gauges and summaries that we have not updated in the expiry interval. We observe that they do not have a number rather than claiming the number is zero. This is useful because Prometheus expects that metrics are always present. With NaNExpire we may initialize metrics before they're first observed and continue to offer some value (NaN) to Prometheus scrapers even if we have not observed a recent value. Counters are simply collected with the last known value because it is safe to do so. When this flag is combined with "initializing" each metric at zero or NaN when the process starts, it addresses the problem of metrics "disappearing" from Prometheus.

In short, this allows us to set a config flag that prevents metrics from being deleted when they expire. Instead they are set to NaN or ignored if we can safely assume their last value.

This new behavior and config allows us to resolve a long-requested pain point with how Consul expires metrics it has not observed within its prometheus_retention_time, which we set PrometheusSink's expiry value with. Workarounds for this have led to users configuring retention times on the order of days, weeks, and months leading to a pathological loss of precision in metrics which do not update often. Not all of these metrics remain valid even if we haven't updated them. With this change we can recommend much shorter retention times which will lead to more accurate measurements and fewer bogus stats.

mkcp commented 4 years ago

Please hold off on reviewing and/or merging this - I've put together an alternate solution that solves the initialization problem as well as the retention and NaN-expiry problem this one PR solves. I'm getting some more feedback before submitting it up as a PR.

mkcp commented 4 years ago

Closed in favor of the approach in #120