open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[processor/deltatocumulative] enhancements for slow-moving ("sparse") counters #36485

Open sirianni opened 19 hours ago

sirianni commented 19 hours ago

We are using the deltatocumulative processor in production but are facing issues with slow-moving (or "sparse") counters.

In the current implementation, the processor only emits a cumulative datapoint when an upstream delta is received. If a counter has not been incremented for several reporting intervals, no datapoint is emitted downstream for those intervening time windows. Contrast this with a true cumulative counter in Prometheus, where an unchanged value is sampled on each successive scrape and emitted downstream.

Current behavior

Time:       |----|----|----|----|----|----|----|----|----|----|----
Delta:      .    5    .    .    10   .    .    .    7    .    .
Cumulative: .    5              15                  22
Emitted:    .    ●              ●                   ●
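
To make the timeline above concrete, here is a minimal Go sketch of emit-on-receipt accumulation. It is not the processor's actual code; the `point` type and `emitOnReceipt` function are hypothetical names used only for illustration, and the interval indices are read off the diagram.

```go
package main

import "fmt"

// point is a hypothetical (interval, value) pair; it is not a collector type.
type point struct {
	interval int
	value    float64
}

// emitOnReceipt folds each delta into a running total and emits a cumulative
// point only for intervals where a delta actually arrived.
func emitOnReceipt(deltas []point) []point {
	var total float64
	out := make([]point, 0, len(deltas))
	for _, d := range deltas {
		total += d.value
		out = append(out, point{interval: d.interval, value: total})
	}
	return out
}

func main() {
	// Deltas from the diagram: 5, 10 and 7 arriving at intervals 1, 4 and 8.
	deltas := []point{{1, 5}, {4, 10}, {8, 7}}
	for _, p := range emitOnReceipt(deltas) {
		fmt.Printf("interval %d: cumulative %v\n", p.interval, p.value)
	}
}
```

Only three points come out for eleven intervals, which is the gap described here.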

This behavior causes issues when using rate() and increase() in PromQL, since there is often no previous datapoint within the standard 5-minute lookback window to compare against.
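
A rough illustration of why the gaps hurt, as a Go sketch. This is a deliberately simplified stand-in for increase(): it just takes last minus first inside the window and ignores the extrapolation and counter-reset handling that Prometheus performs. The timestamps, the 5-minute window and the `increaseOver` name are assumptions made up for the example.

```go
package main

import "fmt"

// sample is a hypothetical cumulative sample with a timestamp in minutes.
type sample struct {
	minute int
	value  float64
}

// increaseOver returns last-minus-first for the samples in (end-window, end].
// It reports false when fewer than two samples fall inside the window, which
// is the situation a sparse counter runs into. Samples must be time-ordered.
func increaseOver(samples []sample, end, window int) (float64, bool) {
	var first, last sample
	n := 0
	for _, s := range samples {
		if s.minute <= end-window || s.minute > end {
			continue
		}
		if n == 0 {
			first = s
		}
		last = s
		n++
	}
	if n < 2 {
		return 0, false
	}
	return last.value - first.value, true
}

func main() {
	// Sparse: a point exists only when a delta arrived.
	sparse := []sample{{1, 5}, {10, 15}, {25, 22}}
	// Dense: the same counter re-emitted every minute around minute 10.
	dense := []sample{{6, 5}, {7, 5}, {8, 5}, {9, 5}, {10, 15}}

	_, ok := increaseOver(sparse, 10, 5)
	fmt.Println("sparse has enough samples:", ok)

	inc, ok := increaseOver(dense, 10, 5)
	fmt.Println("dense has enough samples:", ok, "increase:", inc)
}
```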

This could be addressed by an alternative implementation in which the cumulative datapoints are flushed periodically from a background thread on a fixed interval. This would continue to emit cumulative counters that have not been incremented recently but are not yet stale.

Desired behavior

Time:       |----|----|----|----|----|----|----|----|----|----|----
Delta:      .    5    .    .    10   .    .    .    7    .    .
Cumulative: .    5    5    5    15   15   15   15   22   22   22
Emitted:    ●    ●    ●    ●    ●    ●    ●    ●    ●    ●    ●

Stale timeseries would still be expired much like the current implementation.

There are several ways to incorporate this into the existing codebase:

  1. Enhancement to deltatocumulativeprocessor
  2. Enhancement to intervalprocessor
  3. New processor (deltatocumulativeasyncprocessor?)

The implementation can get fairly tricky because you'd need to retain the resource/scope attributes, etc. from the original metric.
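
One way to picture the bookkeeping, sketched in Go with plain maps rather than the collector's pdata types. The `identity`, `state` and `store` names are hypothetical. The idea is that each stream keeps a copy of its resource, scope and datapoint attributes, so a later flush can rebuild the full metric without the original request.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// identity captures everything needed to re-emit a datapoint later.
type identity struct {
	resource     map[string]string
	scopeName    string
	scopeVersion string
	metric       string
	attrs        map[string]string
}

// writeAttrs appends the attributes in sorted key order so the key is stable.
func writeAttrs(b *strings.Builder, m map[string]string) {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Fprintf(b, "%q=%q,", k, m[k])
	}
}

// key builds a deterministic string key from all identity components.
func (id identity) key() string {
	var b strings.Builder
	writeAttrs(&b, id.resource)
	fmt.Fprintf(&b, "|%q@%q|%q|", id.scopeName, id.scopeVersion, id.metric)
	writeAttrs(&b, id.attrs)
	return b.String()
}

// state pairs the retained identity with the running cumulative total.
type state struct {
	id    identity
	total float64
}

type store map[string]*state

// add folds a delta into the stream matching id, creating it on first use.
func (s store) add(id identity, delta float64) *state {
	k := id.key()
	st, ok := s[k]
	if !ok {
		st = &state{id: id}
		s[k] = st
	}
	st.total += delta
	return st
}

func main() {
	s := store{}
	a := identity{
		resource:  map[string]string{"service.name": "checkout"},
		scopeName: "example/scope",
		metric:    "requests",
		attrs:     map[string]string{"code": "200"},
	}
	s.add(a, 5)
	st := s.add(a, 10)
	fmt.Println(st.id.resource["service.name"], st.id.metric, st.total)
}
```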

Example configuration

deltatocumulative:
  mode: async
  flush_unchanged: true # default
  flush_interval: 1m
  max_stale: 1h

Telemetry data types supported

Metrics

Code Owner(s)

@RichieSams @sh0rez

Sponsor (optional)

@RichieSams @sh0rez

Additional context

CNCF Slack Thread

sh0rez commented 19 hours ago

hi! thanks for opening this issue!

I have a lot of ideas for this and will write them down when I find the time.

In the meantime, wdyt about removing the "new component" from this? I'm fairly sure we can find a place within deltatocumulative / interval for the needed functionality, especially because it sounds rather common and likely happens for a lot of users.

New component issues are afaict concrete proposals and we are not quite there yet :) Let's have a general discussion and see where we get

sirianni commented 17 hours ago

Done. I agree, but created it that way since it was suggested to use the "proposal" template. Can you please update the labels as needed? I don't have access to add/remove labels.