influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

statsd to OpenTelemetry conversion issue #12507

Closed povilasv closed 1 year ago

povilasv commented 1 year ago

Relevant telegraf.conf

```toml
[global_tags]
  env = "stage"
[agent]
  interval = "5s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "5s"
  precision = ""
  debug = true
  quiet = false
  logfile = "/var/log/telegraf/telegraf.log"
  logfile_rotation_max_size = "5000MB"
  logfile_rotation_max_archives = 2
  log_with_timezone = "local"
  omit_hostname = true
[[inputs.statsd]]
   protocol = "udp"
   max_tcp_connections = 250
   tcp_keep_alive = false
   service_address = ":8080"
   delete_gauges = true
   delete_counters = true
   delete_sets = true
   delete_timings = true
   percentiles = [90.0, 95.0, 99.0]
   metric_separator = "."
   parse_data_dog_tags = false
   allowed_pending_messages = 100000
   percentile_limit = 100
[[outputs.opentelemetry]]
    service_address = "endpont-here"
    compression = "gzip"
```

Logs from Telegraf

```text
unrelated
```

System info

1.25.0

Docker

No response

Steps to reproduce

  1. Run this app https://github.com/devlibx/gox-metrics
  2. Setup Telegraf
  3. Send metrics to OpenTelemetry backend ...

Expected behavior

When delete_counters = true, the OpenTelemetry Counter should be sent with Delta Temporality, since Telegraf accumulates data for each interval and then resets the counters. For example, if we send 200 every interval, Telegraf outputs the following:

T0: 200 T1: 200 T2: 200

This is Delta Temporality; more info: https://opentelemetry.io/docs/reference/specification/metrics/supplementary-guidelines/#aggregation-temporality

When delete_counters = false, the OpenTelemetry Counter should be sent with Cumulative Temporality, since Telegraf keeps accumulating the data and reports a cumulative counter. For example:

T0: 200 T1: 400 T2: 600
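
A minimal sketch contrasting the two modes, assuming the constant 200-per-interval example above (illustration only, not Telegraf code):

```go
package main

import "fmt"

func main() {
	// delete_counters = true: the counter resets every flush, so each point is a delta.
	deltas := []int64{200, 200, 200}

	// delete_counters = false: the counter keeps accumulating, so each point is a running total.
	var running int64
	for i, d := range deltas {
		running += d
		fmt.Printf("T%d: delta=%d cumulative=%d\n", i, d, running)
	}
}
```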

Actual behavior

We always send Temporality as Cumulative in this code:

https://github.com/influxdata/influxdb-observability/blob/28e34bc69b46ddaab673aa784e67e467f08902e8/influx2otel/metrics.go#L158-L161
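
In rough pdata terms, the linked lines amount to something like this (a paraphrase to illustrate the issue, not the actual influx2otel source):

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

func main() {
	// Paraphrase (assumption): the conversion builds the Sum like this,
	// marking it cumulative regardless of how the input produced the value.
	m := pmetric.NewMetric()
	sum := m.SetEmptySum()
	sum.SetAggregationTemporality(pmetric.AggregationTemporalityCumulative)

	fmt.Println(sum.AggregationTemporality()) // always "Cumulative"
}
```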

Additional info

No response

povilasv commented 1 year ago

The solution to this problem could be adding a special tag to the metrics: for example, when delete_counters=true we could add temporality=delta, and when false, temporality=cumulative, similar to metric_type=counter. Then add some autodetection logic in the library.
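
A minimal sketch of what that could look like inside the statsd input, using the existing Accumulator API and the tag values suggested above (hypothetical, not an actual Telegraf change):

```go
package statsd_sketch

import "github.com/influxdata/telegraf"

// Hypothetical sketch: tag each counter with its temporality, derived from the
// delete_counters setting, so the conversion library could auto-detect it.
func emitCounter(acc telegraf.Accumulator, name string, value int64, deleteCounters bool) {
	tags := map[string]string{"metric_type": "counter"}
	if deleteCounters {
		tags["temporality"] = "delta" // counters are reset after every flush
	} else {
		tags["temporality"] = "cumulative" // counters keep accumulating
	}
	acc.AddCounter(name, map[string]interface{}{"value": value}, tags)
}
```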

WDYT @jacobmarble ?

jacobmarble commented 1 year ago

I think this is a good idea. As I understand it (please correct me):

The OpenTelemetry metrics data model for Sum, Histogram, and ExponentialHistogram requires an "aggregation temporality" attribute.

The Telegraf statsd input plugin more-or-less coerces the statsd data model to the Prometheus data model. Neither statsd nor Prometheus signals "aggregation temporality", but statsd is compatible with "delta" and Prometheus is compatible with "cumulative".

The Telegraf statsd input plugin has the option delete_counters to signal whether Telegraf should accumulate sum and histogram values (making them compatible with Prometheus) or pass them through.

The influx2otel metrics conversion doesn't have a way to detect aggregation temporality; in fact, it is hard-coded to signal cumulative, because it assumes the Prometheus data model.

Certain line protocol inputs, such as statsd-via-Telegraf, are capable of generating "delta aggregation temporality", so we should add support somehow, without breaking too many things.


How to fix?

For context, otel2influx drops delta metrics in the Prometheus output schemas. That could be similarly adjusted, separately.


@povilasv please share your thoughts on the above.

povilasv commented 1 year ago

Hey @jacobmarble, sorry for the late response. Everything you wrote makes sense and is correct.

I have a proof of concept that works if you build telegraf with these two PRs:

Could you take a look at them?

Currently I add start_time as the tag, but maybe instead we should change the core Metric type?

Roughly, I am thinking Accumulators could allow you to pass multiple time.Time values, for example:

```go
func (ac *accumulator) AddGauge(
    measurement string,
    fields map[string]interface{},
    tags map[string]string,
    t ...time.Time,
)
```

So we can do AddGauge(..., metricTime, startTime)

and then change the metric types and interfaces to support SetStartTime(), etc. Somewhere here: https://github.com/influxdata/telegraf/blob/master/metric.go#L107
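
Roughly, the interface change could look like this (a hypothetical sketch; StartTime/SetStartTime are not part of today's telegraf.Metric):

```go
package metric_sketch

import "time"

// Hypothetical sketch of the proposed additions; the real telegraf.Metric
// interface today only carries a single timestamp via Time/SetTime.
type Metric interface {
	// ... existing methods (Name, Tags, Fields, ...) ...

	Time() time.Time
	SetTime(t time.Time)

	// Proposed: also carry the start of the aggregation window,
	// so outputs like OpenTelemetry can emit delta temporality correctly.
	StartTime() time.Time
	SetStartTime(t time.Time)
}
```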

@powersj / @jacobmarble / @srebhan what you all think?

I think updating the core models might be hard, as there are lots of downstream dependencies, databases storing data, etc. So let's align on the best approach here and decide what we should do.

powersj commented 1 year ago

I think updating the core models might be hard, as there are lots of downstream dependencies, databases storing data, etc.

agreed

Currently I add start_time as the tag, but maybe instead we should change the core Metric type?

maybe I missed a point, but why would you not use the tag?

povilasv commented 1 year ago

Because it might cause issues in downstream systems. For example, if you use the prometheus output with the statsd input, we will start generating metrics like these:

```text
counter{metric_type="counter",namespace="$NAMESPACE",nodename="$NODENAME",start_time="2023-04-20T11:27:00.000797804+03:00",temporality="delta",type="app"} 10
counter{metric_type="counter",namespace="$NAMESPACE",nodename="$NODENAME",start_time="2023-04-20T11:27:10.000731998+03:00",temporality="delta",type="app"} 10
```

And this creates a huge cardinality issue in Prometheus which will eventually make Prometheus OOM. More information: https://stackoverflow.com/questions/46373442/how-dangerous-are-high-cardinality-labels-in-prometheus

So what we basically want is to somehow pass the start_time, but in a way that's invisible to the output plugins that don't need it.

povilasv commented 1 year ago

Another note: the Prometheus output is currently wrong when used with the statsd input, as it doesn't convert delta to cumulative. So maybe folks don't really use this combination in practice?
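
For reference, a conversion like that would have to keep running totals per series; a rough sketch of what it would involve (this does not exist in Telegraf today, per the note above):

```go
package prometheus_sketch

// Sketch only: a Prometheus-style output fed delta counters would need to
// keep a running total per series and expose that instead of the raw delta.
type cumulator struct {
	totals map[string]float64 // keyed by series name plus sorted labels
}

func newCumulator() *cumulator {
	return &cumulator{totals: map[string]float64{}}
}

func (c *cumulator) observe(seriesKey string, delta float64) float64 {
	c.totals[seriesKey] += delta
	return c.totals[seriesKey] // value to expose to Prometheus
}
```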

Hipska commented 1 year ago

Also note that you can filter out the tags you don't want in an output config.

powersj commented 1 year ago

And this creates a huge cardinality issue in Prometheus which will eventually make Prometheus OOM.

I did not realize this was an issue for Prometheus similar to influx

So what we basically want is to somehow pass the start_time, but in a way that's invisible to the output plugins that don't need it.

Instead of at the output plugins, is this something that could be handled at the parser level? If the parser sees the tag that Jacob suggested, it goes looking for the corresponding start date field.

povilasv commented 1 year ago

Instead of at the output plugins, is this something that could be handled at the parser level? If the parser sees the tag that Jacob suggested, it goes looking for the corresponding start date field.

I don't think I follow what you mean. Could you elaborate? Or point to relevant code?

My POC currently computes the start time by tracking the time between Gather calls - https://github.com/influxdata/telegraf/pull/13087/files#diff-bacab159d6514066d37326f61ab09f62bbe376e1421b9e37b8daacd6df14807aR160
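
The idea is roughly the following (a simplified sketch of the approach, not the PR's actual code):

```go
package statsd_sketch

import "time"

// Simplified sketch: the start time of each delta window is the time of the
// previous Gather; the very first window starts at "now".
type gatherClock struct {
	lastGather time.Time
}

func (g *gatherClock) window(now time.Time) (start, end time.Time) {
	start = g.lastGather
	if start.IsZero() {
		start = now // first Gather, no previous window yet
	}
	g.lastGather = now
	return start, now
}
```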

powersj commented 1 year ago

Step 1: get the data - as @jacobmarble mentioned, let's add a config option to the statsd plugin that will set a field (otel.start_time_unix_nano) and a tag (otel.aggregation_temporality). This is a modification to your PR in #13087.

This gives us an opt-in method of adding this data to avoid breaking existing users and provides us a way to collect it and know if we have it.

Step 2: now that you have the data, update the opentelemetry output and/or the prometheus serializer with a config option that, when set, looks at a specific tag to set the start time.

I think this is what @jacobmarble was suggesting in his previous comment.
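
If it helps, the consuming side (step 2) might then look roughly like this in pdata terms, using the tag and field names proposed above (a hedged sketch, not shipped code):

```go
package otel_sketch

import (
	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

// Hedged sketch of step 2: honor the opt-in tag/field when present,
// otherwise keep today's cumulative default.
func applyTemporality(sum pmetric.Sum, dp pmetric.NumberDataPoint,
	tags map[string]string, fields map[string]interface{}) {

	temporality := pmetric.AggregationTemporalityCumulative
	if tags["otel.aggregation_temporality"] == "delta" {
		temporality = pmetric.AggregationTemporalityDelta
	}
	sum.SetAggregationTemporality(temporality)

	if ns, ok := fields["otel.start_time_unix_nano"].(int64); ok {
		dp.SetStartTimestamp(pcommon.Timestamp(ns))
	}
}
```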

povilasv commented 1 year ago

Thanks for explaining @powersj :bow: . I've opened a PR for the 1st part, https://github.com/influxdata/telegraf/pull/13087, tested together with https://github.com/influxdata/influxdb-observability/pull/160, and it seems to work :)

povilasv commented 1 year ago

Status update: two PRs got merged, https://github.com/influxdata/telegraf/pull/13087 and https://github.com/influxdata/influxdb-observability/pull/160

now we need:

povilasv commented 1 year ago

Ok, at the moment we cannot update Telegraf to use the new libraries because the OTel Go SDK v1.15.1 doesn't build for BSD systems. See https://app.circleci.com/pipelines/github/influxdata/telegraf/16298/workflows/01fa2768-354f-407f-8222-8c79914fef32/jobs/254646

The upstream bug was fixed by this PR: https://github.com/open-telemetry/opentelemetry-go/pull/4077

We need to wait until a new version of the OTel Go SDK is released, then update the dependency in the otel2influx library, and then update Telegraf.

powersj commented 1 year ago

@povilasv it looks like the BSD fix was released a couple days ago in v1.16.0/v0.39.0. What needs to happen next?

povilasv commented 1 year ago

Thanks for the ping.

Working on the PR to otel2influx library -> https://github.com/influxdata/influxdb-observability/pull/244