Histogram instead of summary with quantile for latency metrics

AlexZzz commented 4 months ago

Version: any
Platform: server and agent
Subsystem: any

There're a lot of metrics with type summary in spire. These metric type calculates requests count, sum of all latencies and latency distribuition as quantiles. There are two problems with quantiles:

It's hard to compute. Comuptation is running in the software (in spire in this case), not in the database, so it costs some CPU on every request
It's impossible to aggregate. Any aggregation on quantile will make no sense

Aggregation is the biggest problem in my case. It's impossible to create useful dashboard with a large number of metrics represented as quantiles if there're dozens, hundreds or thousands of spire-agents.

Much better option is to use histogram. One can calculate quantiles from histograms on the database. It won't be as accurate as quantiles from service, but accurate enough for most uses.

It would be nice to have latency histograms in spire:)

evan2645 commented 4 months ago

SPIRE supports multiple telemetry backends: https://github.com/spiffe/spire/blob/v1.10.0/doc/telemetry_config.md

I think there may have been some work to massage telemetry data into a histogram like structure inside the M3 sink code ... perhaps it is possible to do the same thing for the Prometheus sink? I don't think statsd dogstatsd etc supports it though?

AlexZzz commented 3 months ago

Unfortunately I work only with prometheus, not statsd/m3db/etc. Can't help with them 😞

Here there's something about "bins" in statsd. Probably it's the same as prometheus histograms. I couldn't find something similar for dogstatsd.

Do you think it's possible to make histograms code global? Not backend-dependent as for now done for m3?

azdagron commented 3 months ago

I think we're open to emitting some of these metrics as histograms. We'll need someone to figure out the best way to support this generically with our telemetry package (go-metrics) and supported backends.

heymarcel commented 2 months ago

I dug into this a little bit. The Prometheus sink uses hashicorp/go-metrics, which uses Summaries by default for AddSample and AddSampleWithLabels. If we want to support histograms, we ultimately need to call Histogram.Observe instead of Summary.Observe.

I'm not familiar with all the backends, but the dogstatsd sink seems to be limited in a similar way, in that Datadog would support histograms if the metrics were submitted that way in the first place.

By way of comparison, the m3 sink depends on uber-go/tally instead, and adds additional methods to produce a histogramfrom within AddSample and AddSampleWithLabels.

So as it stands, I can think of a few options:

Add histogram data to SPIRE by modifying one backend at a time, and dropping the dependency on go-metrics for that backend (just like the m3 sink). We'd want to add a configuration option to avoid breaking changes.
Modify go-metrics to add histogram support.
Replacing the use of go-metrics entirely and going with something new (e.g. OpenTelemetry), more broadly supported, and more flexible. This seems like a breaking change.

Any other ideas?

spiffe / spire

Histogram instead of summary with quantile for latency metrics #5306