prometheus / client_golang

Prometheus instrumentation library for Go applications
https://pkg.go.dev/github.com/prometheus/client_golang
Apache License 2.0

RFE: ergonomic way to hide "unobserved" gauge (timestamp) values #749

Closed: squeed closed this issue 4 years ago

squeed commented 4 years ago

Sorry if this is noise, and feel free to close if so.

TL;DR: Add a GaugeOpt to hide a metric until first observation.

I'm trying to follow what I understand to be best practices around unknown / unobserved Gauge values. Right now, an unobserved Gauge has a value of 0, which is a definitively incorrect value in the case of timestamps. AIUI, the correct behavior in this case is to simply not expose the metric until the process has observed the value.

In my particular case, it's a timestamp of the last time a request was received. (I don't care about rates, only staleness.) If the process restarts, it can be hours or even days until the next request comes in, so that means a long time reading an incorrect value. This isn't just a blip of bad data on startup.

In this case I can filter out 0 on the query side, but I can think of cases where 0 is a legitimate value, so using 0 to mean null would be incorrect.

It doesn't look like there's an ergonomic way to express this in client_golang without writing a custom Collector. If this is indeed a best practice around timestamp values, the feature request is a GaugeOpt that hides unobserved metrics.
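To make the problem concrete, this is roughly what the naive version looks like today (the metric and function names below are only illustrative): the gauge is part of the exposition, with value 0, from the moment it is registered, long before the first request has been seen.

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// Registered eagerly at startup, so every scrape reports 0 until the first request.
var lastRequestTime = promauto.NewGauge(prometheus.GaugeOpts{
    Name: "demo_last_request_timestamp_seconds", // illustrative name
    Help: "Unix timestamp of the last request received.",
})

// Called from the request path; only now does the gauge carry a meaningful value.
func observeRequest() {
    lastRequestTime.Set(float64(time.Now().Unix()))
}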

brian-brazil commented 4 years ago

Can you explain what you ultimately want to use this value for?

squeed commented 4 years ago

Sure. The use-case is kube-proxy.

I have a stochastic stream of events. The rate can realistically be between 10/sec and 1/day. There is a process that applies these changes to the node's iptables. This can also stochastically or persistently fail, thanks to Kernel Fun Times. (For various reasons, a failure to apply is not retried. And, yes, that should be fixed, but it's not critical to this case.)

I'd like to alert on two scenarios:

  1. Iptables is persistently failing. In other words, the last_request_time is much greater than last_applied_time. In this case, I don't care about 0 values, because the inequality works out in my favor.

    • I can't use failure / success rate, because 0 is a legitimate success rate if there are no events
    • I can't compare last_applied_time with the rest of the cluster, because it is correctly bumped on kube-proxy process restart.
  2. For whatever reason, the loop generating the event stream has broken down. I want to find nodes whose last_request_time is much older than the rest of the cluster. Freshly restarted kube-proxy processes make this awkward, since they report a 0. In this case, I can filter out 0 values on the query side, since a Kubernetes cluster up since 1970 would be... surprising.

So, in my particular case, I can work around this by filtering on 0 in my queries, but that's awkward. It seems like we shouldn't be exposing incorrect data in the first place.
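Roughly, the queries end up looking like this (metric names and thresholds are illustrative only, not what kube-proxy actually exports):

# Scenario 1: applies are persistently failing; the 0-on-restart value doesn't
# matter here because the inequality still works out.
last_request_time - last_applied_time > 15 * 60

# Scenario 2: one node's event stream has stalled; the "> 0" filter is the
# workaround for freshly restarted processes that still report 0.
last_request_time > 0
  and last_request_time < scalar(max(last_request_time)) - 3600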

/cc @SuperQ - we were chatting about this elsewhere.

brian-brazil commented 4 years ago

So, in my particular case, I can work around this by filtering on 0 in my queries, but that's awkward.

What you would propose would leave you blind if there was a persistent failure since the start time, so I'm not sure you're gaining anything here - you need some PromQL logic one way or the other.

squeed commented 4 years ago

What you would propose would leave you blind if there was a persistent failure since the start time, so I'm not sure you're gaining anything here - you need some PromQL logic one way or the other.

Indeed, that is not queryable regardless of metric value. We can only solve it with a separate signal for "I have established a connection and populated my caches", which is the container going Ready in the world of Kubernetes.

beorn7 commented 4 years ago

Whatever the outcome of the discussion about whether this is sane at all, it is very easy to make this library behave as @squeed requested:

var gge = promauto.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "just_for_squeed",
        Help: "Perhaps you shouldn't do this…",
    },
    []string{}, // zero label dimensions
)

This gauge will not show up in your exposition until you call With or WithLabelValues. If you call those at the time you Set the gauge, it will show up from that point on:

gge.With(nil).Set(42)
// Or, if you like that more:
gge.WithLabelValues().Set(42)

If your gauge is already a vector anyway, then it's even more straightforward. It's essentially the inverse of the infamous CounterVec problem, where counters only spring into existence upon their first increment.
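For the last-request-timestamp case from above, the whole pattern fits in a few lines. A minimal sketch (metric name, handler, and port are made up for illustration): the gauge is absent from /metrics until the first request and carries a real timestamp from then on.

package main

import (
    "log"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// A GaugeVec with zero label dimensions: no child exists yet, so nothing is exported.
var lastRequestTime = promauto.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "demo_last_request_timestamp_seconds",
        Help: "Unix timestamp of the last request received.",
    },
    []string{},
)

func handleRequest(w http.ResponseWriter, r *http.Request) {
    // The first WithLabelValues call creates the single child; from that point
    // on the gauge shows up in the exposition with a real timestamp.
    lastRequestTime.WithLabelValues().Set(float64(time.Now().Unix()))
    w.WriteHeader(http.StatusOK)
}

func main() {
    http.HandleFunc("/work", handleRequest)
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}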

beorn7 commented 4 years ago

I assume @squeed can now do as he pleases. Please follow up here if I'm wrong. I'll close this issue for now.

squeed commented 4 years ago

Indeed, the vector trick is perfect, thanks!