influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.63k stars 5.58k forks source link

Prometheus Processor Plugin #14006

Closed crflanigan closed 1 year ago

crflanigan commented 1 year ago

Use Case

For metrics that are being sent to a Prometheus sink, metrics should be typed similar to how opentelemetry.go handles it here.

It would be great if there was a processor plugin which could be optionally used to type the metrics.

Expected behavior

Metrics receive a type.

Actual behavior

This doesn't exist as far as I'm aware.

Additional info

It appears that Windows PerfMon metrics are untyped, while other metrics like inputs.cpu appear to be typed. In testing we found that when we send metrics through a Telegraf server via the opentelemetry inputs/outputs plugins it types them correctly and we're hoping for a processor plugin which can emulate this behavior when we aren't sending to a Telegraf server or intermediate proxy.

powersj commented 1 year ago

What output plugin are you using? Are you also using the prometheus serializer?

The prometheus serializers recently got the ability for the user to specify what fields should be counters or gauges.

crflanigan commented 1 year ago

Hi @powersj,

In this case we're using outputs.stackdriver to send the metrics. Something that we found recently was that most of the Windows PerfMon input plugins are showing that they are untyped. I think in the past you mentioned that all metrics from a Telegraf perspective are untyped. We didn't notice this though because we were sending everything from the agent to a Telegraf server proxying the metrics to Google, using the opentelemetry plugins as a go-between. When you send using those plugins they type the metrics, but if you send directly using outputs.stackdriver directly to Google the metrics are untyped, thus being sent both as counter and untyped, per Google's request. This is a problem because this means that some metrics will be in a different metrics path than others, as well as double ingesting the untyped metrics.

We discussed using a data_format serializer in the past, but found that it wasn't feasable with the current outputs.stackdriver design, and modifying the behavior in the plugin wasn't desirable because we don't want to impact legacy users of the plugin either. At the time I was just using the regex processor plugin to normalize metrics, but it doesn't account for type. I was thinking it would be cool to have a processor which attaches a type to all metrics before it's flushed to the output plugin. I'm also wondering if I could use starlark to do this, though I've never played with it before.

What do you think?

powersj commented 1 year ago

In this case we're using outputs.stackdriver to send the metrics.

I thought it would be this one, but I didn't want to assume in case you are using something else now ;)

I was thinking it would be cool to have a processor which attaches a type to all metrics before it's flushed to the output plugin.

Let me talk to Sven about this since we just had a discussion about this when we added https://github.com/influxdata/telegraf/pull/13874

The thought was that the only time someone is producing Prometheus style metrics is really when they use the prometheus outputs which also in turn use the Prometheus serializer. The Stackdriver out, being the special plugin that it is, breaks that assumption with how you are now using it.

crflanigan commented 1 year ago

Got it!

Thanks buddy!

powersj commented 1 year ago

After discussing, we are considering that this would be a new option to the stackdriver output that mirrors what we did in #13874:

  ## This overrides the metric-type of the Telegraf metric. Globbing is allowed.
  # [[outputs.stackdriver]]
  #   metric_counter = []
  #   metric_gauge = []

We only support gauge and counters today in the plugin so we would only add these two for now as well. The user would provide a metric name, with globs allowed, and metrics matching those would have their type set.

When sending via the metric name format of "official", we also send a counter today. By adding the above options it would essentially let you avoid the unknown metric, and allow you to specify gauges.

Thoughts?

crflanigan commented 1 year ago

I think that sounds great!

powersj commented 1 year ago

@crflanigan I have created #14017 as a draft with the new config options.

These config options take metric names, not field names, and can take wild cards as well. Could you give it a try and let me know if it resolves the cases where you have unknown metrics?

Thanks!

crflanigan commented 1 year ago

Testing now!

crflanigan commented 1 year ago

This looks good!

One thing we noticed though. When we send metrics to Google now from Stackdriver all of our metrics end in _gauge, which we have had some complaints about, but figured it was just the way it was. This new way you're sending doesn't include the _gauge which is desirable.

Example: win_cpu_Percent_Processor_Time vs win_cpu_Percent_Processor_Time_gauge

It looks like this code is responsible for this:

func (s *Stackdriver) generateMetricName(m telegraf.Metric, key string) string {
    // ...
    var kind string
    switch m.Type() {
    case telegraf.Gauge:
        kind = "gauge"
    case telegraf.Untyped:
        kind = "unknown"
    case telegraf.Counter:
        kind = "counter"
    case telegraf.Histogram:
        kind = "histogram"
    default:
        kind = ""
    }
    return path.Join(s.MetricTypePrefix, name, kind)
}

I realize this would be a separate issue, but would that be easy to change to drop the _gauge?

Anyway, I think this looks great, and after it's merged I could probably add this to the Telegraf server configuration so that those metrics match what's being sent here.

powersj commented 1 year ago

There are two naming schema options, I am assuming you are using "official" given we added that based on your issue in #13303. For the official naming I would expect something formed using the following format:

metricTypePrefix/namespace_metricname_fieldname/type

For example in my own test cases using the official naming method produces:

custom.googleapis.com/test_mem_g_value/gauge

I realize this would be a separate issue, but would that be easy to change to drop the _gauge?

Hmm the code you linked to is not appending with an underscore. It is joining values with paths, meaning a forward slash like the items above.

crflanigan commented 1 year ago

Hmm the code you linked to is not appending with an underscore. It is joining values with paths, meaning a forward slash like the items above.

OK, just a guess. I wonder if it's when we pass the metrics through OpenTelemetry plugin inputs/outputs that's causing it.

crflanigan commented 1 year ago

Either way, this is working as far as I can tell, though I needed to add the regex processor again to fix an issue with Windows PerfMon input plugin metrics having a / and a .

Example:

[[inputs.win_perf_counters]]
  alias = "pc_store_win_physicaldiskmetrics"
  interval = "1m"
    [[inputs.win_perf_counters.object]]
      ObjectName = "PhysicalDisk"
      Instances = ["_Total"]
      Counters = ["Avg. Disk sec/Read", "Avg. Disk sec/Transfer", "Avg. Disk sec/Write"] 

I guess something for people to keep in mind.

Thanks buddy!

powersj commented 1 year ago

I wonder if it's when we pass the metrics through OpenTelemetry plugin inputs/outputs that's causing it.

Ah yes, I forget you have this in your setup. I do believe the otel2influx library will use the types, like gauge, etc. as the field name, which is then how you get it in your metrics. See metrics notes as well.

Either way, this is working as far as I can tell,

Thanks for testing it out!

crflanigan commented 1 year ago

You bet!

Thanks for the quick fix!