influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.91k stars 5.6k forks

Prometheus style histogram metrics from statsd input plugin #8572

Open lahsivjar opened 3 years ago

lahsivjar commented 3 years ago

Feature Request

A way to generate Prometheus-style histogram metrics from the statsd input plugin (something similar to https://github.com/atlassian/gostatsd#timer-histograms-experimental-feature).

Proposal:

The statsd input plugin receives raw data from clients, so it should be possible to maintain counters for a user-defined set of le buckets (similar to Prometheus). The le labels can be added as tags.
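A minimal sketch of the idea, not the PoC's implementation. The bucket bounds, the helper names (`new_histogram`, `observe`, `to_rows`) and the sample values are all assumptions made for illustration. Each raw timer value increments every bucket whose upper bound is at least that value, which gives cumulative counts in the Prometheus style, and the bound is emitted as an `le` tag.

```python
import math

# Hypothetical user-defined upper bounds (milliseconds); +Inf catches everything.
BUCKETS = [10.0, 50.0, 100.0, 500.0, math.inf]


def new_histogram():
    """Return an empty cumulative histogram keyed by upper bound."""
    return {"buckets": {le: 0 for le in BUCKETS}, "sum": 0.0, "count": 0}


def observe(hist, value):
    """Record one raw timer value in every bucket with le >= value."""
    for le in BUCKETS:
        if value <= le:
            hist["buckets"][le] += 1
    hist["sum"] += value
    hist["count"] += 1


def to_rows(name, hist):
    """Render the histogram as (metric name, tags, value) tuples, with le as a tag."""
    rows = []
    for le, n in hist["buckets"].items():
        label = "+Inf" if math.isinf(le) else f"{le:g}"
        rows.append((f"{name}_bucket", {"le": label}, n))
    rows.append((f"{name}_sum", {}, hist["sum"]))
    rows.append((f"{name}_count", {}, hist["count"]))
    return rows


hist = new_histogram()
for v in [3, 42, 42, 120, 900]:  # made-up timer values
    observe(hist, v)
rows = to_rows("api_response_time_ms", hist)
```

Because the counts are cumulative, the +Inf bucket always equals the total number of observations.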

Current behavior:

It is not possible to generate Prometheus-style histogram metrics from the statsd input plugin.

Desired behavior:

Generate Prometheus-style histogram metrics from the statsd input plugin.

Use case:

This would make converting statsd data to Prometheus-style data more flexible.

lahsivjar commented 3 years ago

https://github.com/lahsivjar/telegraf/commit/43ea032d6e29906e393557b08b8bfbac988c8c0b

A PoC for the proposal.

lahsivjar commented 3 years ago

@danielnelson Would be great if you can take a look at it and give some feedback

lahsivjar commented 3 years ago

For some other input plugins, this can already be achieved with aggregators. The statsd input plugin, however, does the parsing and aggregation itself, so the raw data is lost.

One way to fix this would be to overhaul the statsd input plugin and create a statsd parser that generates Telegraf metrics from the raw statsd data. For aggregations, the end user can define aggregators that produce the same cumulative effect as the current statsd input plugin.
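As a rough illustration of what such a parser would hand to aggregators, here is a sketch that turns one statsd line of the common `name:value|type|@rate` form into an unaggregated sample. The function name and the returned dict layout are hypothetical and unrelated to Telegraf's parser interface; it only handles numeric values and skips sets, gauge deltas and tag extensions.

```python
def parse_statsd_line(line):
    """Parse 'name:value|type[|@rate]' into a dict holding the raw sample."""
    name, _, rest = line.strip().partition(":")
    parts = rest.split("|")
    if not name or len(parts) < 2:
        raise ValueError(f"malformed statsd line: {line!r}")
    sample = {
        "name": name,
        "value": float(parts[0]),  # assumption: numeric values only
        "type": parts[1],          # e.g. "c", "g", "ms"
        "sample_rate": 1.0,
    }
    # Optional trailing sections; only the sample rate is recognised here.
    for extra in parts[2:]:
        if extra.startswith("@"):
            sample["sample_rate"] = float(extra[1:])
    return sample


sample = parse_statsd_line("api.latency:320|ms|@0.1")
```

Each sample keeps the individual value, so a downstream aggregator is free to build sums, percentiles or histogram buckets from it.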

@danielnelson WDYT?

philomory commented 2 years ago

@lahsivjar Would you be willing to turn your POC into an actual PR (a draft one if you think it isn't ready to be merged as-is)? I think with an actual PR it'd be much more likely to get feedback on it.

Personally, I'd love to see this, whether we get histogram support baked into the statsd plugin alongside all of its existing built-in aggregation, or we simply add a statsd parser that can be used with socket_listener to pass raw data to ad-hoc aggregators manually, or both.

lahsivjar commented 2 years ago

@philomory Thanks for the ping. I have been away from this project for quite some time. I have created a PR from the PoC commit to start the conversation; I hope the approach is not completely outdated by now 🤞

jacobstr commented 2 years ago

I've been thinking about this feature myself, because I'm not sure how to get accurate percentiles if we scale Telegraf out horizontally.

For example, we could run multiple Telegraf replicas, each maintaining its own p90 of, say, a latency measurement from statsd. But a p90 of the p90s across n replicas is a fairly meaningless value.

But if each replica exposes histogram buckets, then you can do statistically meaningful percentiles - specifically in the case of input: statsd, output: prometheus flows.
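A small sketch of why buckets combine cleanly while percentiles do not. The replica data, the bucket bounds and the nearest-rank `p90` helper are all made up for illustration. Summing per-replica bucket counts gives exactly the counts of the pooled data, whereas averaging per-replica p90s lands far from the p90 of the pooled data.

```python
import math

BOUNDS = [50, 100, 250, 500, math.inf]  # hypothetical le bounds in ms


def bucket_counts(values):
    """Cumulative count of values at or below each upper bound."""
    return [sum(1 for v in values if v <= le) for le in BOUNDS]


def p90(values):
    """Nearest-rank 90th percentile of raw values."""
    ordered = sorted(values)
    rank = math.ceil(len(ordered) * 9 / 10)
    return ordered[rank - 1]


# Made-up latencies seen by two replicas with very different traffic.
replica_a = [20, 25, 30, 35, 40, 45, 48, 49, 50, 60]
replica_b = [400, 450]

# Buckets are additive: summing per replica equals bucketing the pooled data.
merged = [a + b for a, b in zip(bucket_counts(replica_a), bucket_counts(replica_b))]
pooled = bucket_counts(replica_a + replica_b)

# Percentiles are not: the mean of the p90s differs from the pooled p90.
mean_of_p90s = (p90(replica_a) + p90(replica_b)) / 2
pooled_p90 = p90(replica_a + replica_b)
```

Here `merged` and `pooled` are identical, while `mean_of_p90s` is 250 ms against a pooled p90 of 400 ms.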

bbkfhq commented 2 months ago

I'd also love this feature. As far as I know, without it there is no way to "aggregate" percentile values from multiple series.

That is, each unique combination of metric field values produces a separate series in Prometheus. Right now you can get a percentile value for each series, but you can't combine them into an aggregate value (using the existing percentiles feature of the statsd input).

Example:

api_response_time_ms{endpoint="list_books", server="server1"}

api_response_time_ms{endpoint="add_book", server="server2"}

You'd need the "bucket" data to be able to use histogram_quantile() to get an overall percentile value for api_response_time_ms.
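To make that concrete, here is a sketch of the calculation with bucket data available. It sums the cumulative bucket counts of both series by upper bound and then estimates a quantile by linear interpolation inside the bucket that contains the target rank, which approximates what Prometheus does for classic histograms. The bucket bounds and counts are invented, and this simplified `histogram_quantile` is a stand-in written for this example, not Prometheus code.

```python
import math


def sum_by_le(all_series):
    """Add up cumulative bucket counts across series, keyed by upper bound."""
    totals = {}
    for buckets in all_series:
        for bound, count in buckets:
            totals[bound] = totals.get(bound, 0) + count
    return sorted(totals.items())


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative (upper_bound, count) pairs."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # assume the lowest bucket starts at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into +Inf
            share = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * share
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented cumulative buckets (le in ms, count) for the two series above.
series = {
    ("list_books", "server1"): [(100, 80), (250, 95), (500, 100), (math.inf, 100)],
    ("add_book", "server2"): [(100, 10), (250, 30), (500, 48), (math.inf, 50)],
}

overall = sum_by_le(series.values())
overall_p90 = histogram_quantile(0.9, overall)
```

With these numbers the overall p90 comes out at roughly 359 ms. In PromQL the same idea would typically be written as `histogram_quantile(0.9, sum by (le) (rate(api_response_time_ms_bucket[5m])))`, assuming the buckets were exported under that metric name.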