Open AlexZzz opened 4 months ago
SPIRE supports multiple telemetry backends: https://github.com/spiffe/spire/blob/v1.10.0/doc/telemetry_config.md
I think there may have been some work to massage telemetry data into a histogram-like structure inside the M3 sink code... Perhaps it is possible to do the same thing for the Prometheus sink? I don't think statsd, dogstatsd, etc. support it, though?
Unfortunately, I work only with Prometheus, not statsd/m3db/etc., so I can't help with them 😞
There's something here about "bins" in statsd, which are probably the same as Prometheus histograms. I couldn't find anything similar for dogstatsd.
Do you think it's possible to make the histogram code global, rather than backend-dependent as it currently is for m3?
I think we're open to emitting some of these metrics as histograms. We'll need someone to figure out the best way to support this generically with our telemetry package (go-metrics) and supported backends.
I dug into this a little bit. The Prometheus sink uses `hashicorp/go-metrics`, which uses Summaries by default for `AddSample` and `AddSampleWithLabels`. If we want to support histograms, we ultimately need to call `Histogram.Observe` instead of `Summary.Observe`.
I'm not familiar with all the backends, but the `dogstatsd` sink seems to be limited in a similar way, in that Datadog would support histograms if the metrics were submitted that way in the first place.
By way of comparison, the m3 sink depends on `uber-go/tally` instead, and adds additional methods to produce a histogram from within `AddSample` and `AddSampleWithLabels`.
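To make the idea concrete, here is a minimal standard-library sketch of a sink that buckets samples inside `AddSample` rather than feeding a summary. Only the `AddSample(key []string, val float32)` signature is taken from go-metrics; the `bucketSink` type, its fields, `Counts`, and the bucket bounds are all hypothetical and are not how the m3 or Prometheus sinks are actually written.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// bucketSink is a hypothetical, simplified sink. Only the AddSample
// signature follows go-metrics; the type and its fields are illustrative.
type bucketSink struct {
	mu     sync.Mutex
	bounds []float64           // sorted bucket upper bounds; the last bucket is the overflow (+Inf)
	counts map[string][]uint64 // per-metric bucket counts, each of length len(bounds)+1
}

func newBucketSink(bounds []float64) *bucketSink {
	sorted := append([]float64(nil), bounds...)
	sort.Float64s(sorted)
	return &bucketSink{bounds: sorted, counts: make(map[string][]uint64)}
}

// AddSample increments the first bucket whose upper bound is >= val,
// instead of feeding val into a summary.
func (s *bucketSink) AddSample(key []string, val float32) {
	name := strings.Join(key, "_")
	s.mu.Lock()
	defer s.mu.Unlock()
	c, ok := s.counts[name]
	if !ok {
		c = make([]uint64, len(s.bounds)+1)
		s.counts[name] = c
	}
	// SearchFloat64s returns the smallest index i with bounds[i] >= val,
	// or len(bounds) when val exceeds every bound (the overflow bucket).
	c[sort.SearchFloat64s(s.bounds, float64(val))]++
}

// Counts returns a copy of the bucket counts for key, or nil if unseen.
func (s *bucketSink) Counts(key []string) []uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	c, ok := s.counts[strings.Join(key, "_")]
	if !ok {
		return nil
	}
	return append([]uint64(nil), c...)
}

func main() {
	sink := newBucketSink([]float64{10, 100, 1000}) // milliseconds, chosen arbitrarily
	key := []string{"rpc", "latency"}
	for _, ms := range []float32{3, 10, 40, 700, 5000} {
		sink.AddSample(key, ms)
	}
	fmt.Println(sink.Counts(key)) // prints [2 1 1 1]
}
```

A real sink would also need labels, a sum and count per metric, and configurable bounds; the point here is only that the bucketing can live behind the existing `AddSample` call.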
So as it stands, I can think of a few options:
- Replace `go-metrics` for that backend (just like the m3 sink). We'd want to add a configuration option to avoid breaking changes.
- Extend `go-metrics` to add histogram support.

Any other ideas?
There are a lot of metrics of type `summary` in SPIRE. This metric type calculates the request count, the sum of all latencies, and the latency distribution as quantiles. There are two problems with quantiles; aggregation is the biggest one in my case. It's impossible to build a useful dashboard from a large number of metrics represented as quantiles if there are dozens, hundreds, or thousands of spire-agents.
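A small illustration of the aggregation problem, using made-up latencies for two hypothetical agents: averaging per-agent p99 values does not give the fleet-wide p99. The `quantile` helper is a plain nearest-rank implementation written for this example, not anything from SPIRE or Prometheus.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// quantile returns the nearest-rank q-quantile of samples (0 < q <= 1).
func quantile(q float64, samples []float64) float64 {
	if len(samples) == 0 {
		return math.NaN()
	}
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	idx := int(math.Ceil(q*float64(len(s)))) - 1
	if idx < 0 {
		idx = 0
	}
	if idx >= len(s) {
		idx = len(s) - 1
	}
	return s[idx]
}

// repeat returns n copies of v.
func repeat(v float64, n int) []float64 {
	out := make([]float64, n)
	for i := range out {
		out[i] = v
	}
	return out
}

func main() {
	// Made-up data: agent A has 10 slow requests out of 100, agent B has none.
	agentA := append(repeat(10, 90), repeat(1000, 10)...)
	agentB := repeat(10, 100)

	averaged := (quantile(0.99, agentA) + quantile(0.99, agentB)) / 2
	pooled := quantile(0.99, append(append([]float64(nil), agentA...), agentB...))

	fmt.Println("average of per-agent p99:", averaged)
	fmt.Println("p99 over all requests:   ", pooled)
}
```

With this data the averaged value is about half of the pooled one, and the raw samples needed to compute the pooled value are exactly what a summary does not export.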
A much better option is to use a histogram. One can calculate quantiles from histograms in the database. The result won't be as accurate as quantiles computed by the service, but it is accurate enough for most uses.
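As a rough sketch of why this works: bucket counts from many agents can simply be added, and a quantile can then be estimated from the merged counts by linear interpolation inside the bucket that contains the target rank. This is similar in spirit to what Prometheus does with `histogram_quantile`, but the code below is a simplified stand-in with hypothetical names (`mergeCounts`, `estimateQuantile`), made-up counts, and non-cumulative buckets.

```go
package main

import (
	"fmt"
	"math"
)

// mergeCounts sums per-bucket counts from several agents.
// It assumes every agent uses the same bucket layout.
func mergeCounts(agents ...[]uint64) []uint64 {
	if len(agents) == 0 {
		return nil
	}
	total := make([]uint64, len(agents[0]))
	for _, a := range agents {
		for i := range total {
			total[i] += a[i]
		}
	}
	return total
}

// estimateQuantile interpolates linearly within the bucket holding rank q*total.
// bounds are sorted upper bounds; counts has len(bounds)+1 entries, the last
// being the overflow bucket. The lowest bucket is assumed to start at 0.
func estimateQuantile(q float64, bounds []float64, counts []uint64) float64 {
	if len(bounds) == 0 || len(counts) != len(bounds)+1 || q < 0 || q > 1 {
		return math.NaN()
	}
	var total uint64
	for _, c := range counts {
		total += c
	}
	if total == 0 {
		return math.NaN()
	}
	rank := q * float64(total)
	var cum float64
	for i, c := range counts {
		prev := cum
		cum += float64(c)
		if c == 0 || cum < rank {
			continue
		}
		if i == len(bounds) {
			// Overflow bucket has no upper bound; report the highest finite one.
			return bounds[len(bounds)-1]
		}
		lower := 0.0
		if i > 0 {
			lower = bounds[i-1]
		}
		return lower + (bounds[i]-lower)*(rank-prev)/float64(c)
	}
	return bounds[len(bounds)-1]
}

func main() {
	bounds := []float64{10, 100, 1000} // milliseconds, chosen arbitrarily
	agentA := []uint64{90, 10, 0, 0}   // made-up counts
	agentB := []uint64{10, 80, 10, 0}

	merged := mergeCounts(agentA, agentB)
	fmt.Println("merged buckets:", merged)
	fmt.Println("estimated p99: ", estimateQuantile(0.99, bounds, merged))
}
```

The estimate is only as precise as the bucket boundaries allow, which is the accuracy trade-off mentioned above.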
It would be nice to have latency histograms in SPIRE :)