open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
2.85k stars · 2.23k forks

Proposal: Metric Aggregation Processor #4968

Open huyan0 opened 4 years ago

huyan0 commented 4 years ago

Metric Aggregation Processor Proposal

Objectives

The objective of the metric aggregation processor is to provide OpenTelemetry (OT) Collector users the ability to use exporters to send cumulative metrics to backend systems like Prometheus.

Background

Currently, the Prometheus exporter for the OT Collector is not functioning as expected. The OTLP exporter in the SDK is pass-through, and the Collector currently has no aggregation functionality, so the Collector receives delta aggregations and instantaneous metric events and exports them directly, without converting them to cumulative values (#1255). In the Go SDK, aggregation of delta values is performed by the Accumulator and the Integrator (SDK Processor), but there is no similar component in the Collector. Any cumulative exporter implemented in the future will encounter the same problem (a proposed exporter: #1150). A processor that maintains metric (time series) state and applies incoming deltas to cumulative values solves this problem.

Requirements

The processor should convert instantaneous and delta OTLP metrics to cumulative OTLP metrics. The following table proposes a mapping from instantaneous and delta metric kinds to cumulative metric kinds, based on OTel-Proto PR open-telemetry/opentelemetry-collector#168. In the table, red entries are invalid type/kind combinations; white entries correspond to OTLP metrics that are already cumulative (or gauge) and should pass through the processor; green entries correspond to OTLP metrics that the processor should aggregate.

[image: type/kind mapping table]

*Grouping scalar indicates the most recent value to occur in a collection interval. **Discussions around PR open-telemetry/opentelemetry-collector#168 are still ongoing. This table reflects the current status and will need to be updated when that status changes.

For green entries, the processor should maintain the cumulative value as state. Metric state should be identified by an OTLP metric descriptor and a label set. For each metric of a stateful kind, the processor should aggregate cumulative values for every data point by labels.
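As a rough sketch of the state the requirement above implies, delta-to-cumulative accumulation can be keyed by metric name plus the sorted label set. The types and names below are illustrative only, not the Collector's actual internal API:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey identifies one time series: metric name plus its sorted label set.
type seriesKey string

func makeKey(name string, labels map[string]string) seriesKey {
	pairs := make([]string, 0, len(labels))
	for k, v := range labels {
		pairs = append(pairs, k+"="+v)
	}
	sort.Strings(pairs) // sort so label ordering doesn't change identity
	return seriesKey(name + "|" + strings.Join(pairs, ","))
}

// Accumulator folds incoming delta data points into cumulative values,
// one running total per time series.
type Accumulator struct {
	state map[seriesKey]float64
}

func NewAccumulator() *Accumulator {
	return &Accumulator{state: map[seriesKey]float64{}}
}

// AddDelta applies a delta data point and returns the new cumulative value.
func (a *Accumulator) AddDelta(name string, labels map[string]string, delta float64) float64 {
	k := makeKey(name, labels)
	a.state[k] += delta
	return a.state[k]
}

func main() {
	acc := NewAccumulator()
	labels := map[string]string{"service.name": "checkout"}
	acc.AddDelta("http.requests", labels, 3)
	fmt.Println(acc.AddDelta("http.requests", labels, 2)) // prints 5
}
```

A real implementation would also need to handle start timestamps, out-of-order or restarted deltas, and periodic collection, per the Design Idea below.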

Design Idea

The metric aggregation processor should be similar to the SDK Controller; it should contain:

The metric aggregation processor should accept concurrent updates of the metric state by taking in OTLP metrics passed from the pipeline, and perform collection periodically to send cumulative metrics to exporters, acting as a SDK push Controller. The following diagram illustrates the workflow.

[image: aggregation processor workflow diagram]

cc @alolita @danielbang907 @huyan0 @jmacd

tigrannajaryan commented 3 years ago

We have the metricstransform processor, which does some aggregation. Moving this to after GA in case there is more work that we need to do in this area.

nabam commented 3 years ago

I'm happy to help with implementing that functionality. Is someone already working on it? Should such a processor belong in the core repository or the contrib repository? CC: @huyan0 @tigrannajaryan

alolita commented 3 years ago

Hi @nabam We (@alolita @amanbrar1999 @JasonXZLiu) are working on this. You're welcome to code review and provide feedback when we submit a PR.

nabam commented 3 years ago

@alolita that's great! It's going to be very useful for streaming telemetry from AWS Lambdas. Do you have any time estimates for that? I want to understand whether we have to build a temporary custom solution while you are working on it.

amanbrar1999 commented 3 years ago

I believe that the use cases for this issue have since been solved by changes to the metrics proto, such as the addition of cumulative aggregation temporality. I could be wrong, does anyone know any specific use cases for this processor? @tigrannajaryan @bogdandrutu @nabam

For example I believe this is a use case that has since been resolved: https://github.com/open-telemetry/opentelemetry-collector/issues/1541

In the last comment of this linked issue it is mentioned that temporality translation exists as of OTLP v0.5, which I believe is what this processor is intended for

nabam commented 3 years ago

@amanbrar1999 one of the use cases would be collecting metrics from short-lived jobs such as AWS Lambdas. Exporting uniquely tagged cumulative metrics for every Lambda instance would cause high cardinality in the time series they produce.

amitsaha commented 3 years ago

Having something like this will make https://github.com/open-telemetry/opentelemetry-python/issues/93 possible. Without it, Python WSGI applications using gunicorn, uwsgi, and the like will suffer from not being able to use the Prometheus exporter directly. In fact, the lack of a statsd exporter (https://github.com/open-telemetry/opentelemetry-specification/issues/374) makes Prometheus unusable with OpenTelemetry for this category of applications.

pragmaticivan commented 2 years ago

Howdy! @huyan0 @alolita, was there any progress towards implementing that proposal?

I'm currently validating a standalone (multiple workloads) OTel Collector for single-page app/client/StatsD (metric) collection, but found the path forward was not very clear.

I have taken a look at:

Currently investigating this approach https://github.com/open-telemetry/opentelemetry-js/pull/2118 although that will get affected by cardinality.

andyzhangdialpad commented 2 years ago

Thanks for this amazing proposal! @alolita was there any progress on the implementation? We are considering contributing to this. Don't want to reinvent the wheel of course.

chenluo1234 commented 2 years ago

@huyan0 @alolita @amanbrar1999 @JasonXZLiu Any progress on this issue?

> I believe that the use cases for this issue have since been solved by changes to the metrics proto, such as the addition of cumulative aggregation temporality. I could be wrong, does anyone know any specific use cases for this processor?

One use case we are having is using statsd receiver and prometheusremotewriteexporter together. Given that statsd receiver produces delta metrics, we will need this processor to convert them into cumulative metrics before handing over to prometheusremotewriteexporter, which only accepts cumulative metrics.
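To illustrate the gap in that use case, a minimal config might look like the sketch below. The receiver and exporter are real contrib components (endpoints are placeholders); the processor slot is the hypothetical piece this proposal describes:

```yaml
receivers:
  statsd:
    endpoint: "0.0.0.0:8125"

exporters:
  prometheusremotewrite:
    endpoint: "https://prometheus.example.com/api/v1/write"  # placeholder

service:
  pipelines:
    metrics:
      receivers: [statsd]
      # The statsd receiver emits delta metrics, but prometheusremotewrite
      # only accepts cumulative ones. The proposed aggregation processor
      # would sit here, e.g.:
      # processors: [metricaggregation]
      exporters: [prometheusremotewrite]
```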

danielgblanco commented 1 year ago

Looking through some of the Metrics Data Model use cases in the specification, I understand any mention of "Collector re-aggregates [e.g. 10s OTLP delta temporality exports] into 60 second resolution" would require a processor like the one mentioned in this proposal that can handle state, or a non-OTLP exporter that can aggregate delta sums into cumulative temporality like the prometheusexporter does (although support for delta histograms is pending on https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/9006). As far as I can tell, this wouldn't be possible to achieve in a single pipeline with an OTLP receiver and exporter.

We (Skyscanner) are currently PoC'ing a solution to generate http.*.duration and rpc.*.duration metrics from Envoy spans to collect Istio metrics without Mixer (or Prometheus), which we would like to export in OTLP histograms with delta temporality. We use the spanmetricsprocessor for this, and to avoid exporting a huge number of data points (as the processor is stateless) we currently use cumulative temporality to a prometheusexporter to aggregate histograms and a separate pipeline to scrape that endpoint, use the cumulativetodeltaprocessor to turn them into delta histograms, and then export them via OTLP.
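The two-pipeline workaround described above could be wired up roughly as follows. This is a sketch under my own assumptions (placeholder endpoints, simplified component options), not Skyscanner's actual config:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
  # Second pipeline scrapes the prometheus exporter below.
  prometheus/spanmetrics:
    config:
      scrape_configs:
        - job_name: 'spanmetrics'
          scrape_interval: 15s
          static_configs:
            - targets: ['0.0.0.0:8889']

processors:
  spanmetrics:
    metrics_exporter: prometheus  # emits cumulative metrics to that exporter
  cumulativetodelta:

exporters:
  prometheus:
    endpoint: '0.0.0.0:8889'
  otlp:
    endpoint: 'backend.example.com:4317'  # placeholder

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [spanmetrics]
      exporters: [otlp]
    metrics/delta:
      receivers: [prometheus/spanmetrics]
      processors: [cumulativetodelta]
      exporters: [otlp]
```

A stateful aggregation processor would collapse this into a single metrics pipeline.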

Having something like this would make our pipelines much simpler. Is somebody already working on this? I've also read there may be plans on the transformprocessor to handle aggregations? Any comments/insight would be much appreciated. Thanks!

mplachter commented 1 year ago

A little bump for an update here.

Something like this would reduce the complexity of the OTel config when used for cron jobs and Lambda functions. Yes, this could be done with metricstransform by aggregating based on a label, but that requires extra implementation on the dev side or a more complex OTel config.

danielgblanco commented 1 year ago

The transformprocessor cannot aggregate metrics on consistent intervals in its current form though, as it's stateless. It can only aggregate metrics handled as part of a single batch, not across batches (i.e. exports from different clients).

mplachter commented 1 year ago

@alolita @amanbrar1999 @JasonXZLiu any update on this?

ishworg commented 1 year ago

Bump. Folks, is there any guidance and/or timeline on this proposal, whether accepted or rejected? It has been more than two years now. Kindly advise what operators should expect in the future.

AlexeiZenin commented 1 year ago

+1

MovieStoreGuy commented 1 year ago

I can't speak for the maintainers, however, I myself see the benefit of a dedicated processor to perform aggregations as a means of reducing complexity of other processors to reimplement this. I don't think this is strictly related to prom but I see it would greatly benefit from it.

rbizos commented 1 year ago

This feature would be highly beneficial to my team as well, we would be happy to give a hand in the implementation

kovrus commented 1 year ago

> I can't speak for the maintainers, however, I myself see the benefit of a dedicated processor to perform aggregations as a means of reducing complexity of other processors to reimplement this. I don't think this is strictly related to prom but I see it would greatly benefit from it.

I second that! I would be happy to sponsor it.

atoulme commented 1 year ago

@kovrus setting this issue as "Accepted Component" with your sponsorship, thanks!

jpettit commented 1 year ago

+1 <3

bitomaxsp commented 1 year ago

@kovrus Hi Ruslan. What does it mean that you are sponsoring? Are you doing all the implementation work, or are you responsible for finding someone who will work on it?

kovrus commented 1 year ago

Hi @bitomaxsp, please take a look at the first point of the Adding new components guide. It provides a good explanation of sponsorship.

MovieStoreGuy commented 1 year ago

Any way that we can help get this started?

It was brought up in the APAC end user group as something that was needed.

devurandom commented 1 year ago

Would the proposed solution to this issue allow aggregation of metrics data points in user-defined intervals?

We gather metrics using the OpenTelemetry Java Agent, batch them using the OpenTelemetry Collector and export them to Grafana Cloud. Grafana Cloud plans have a maximum included "data points per minute" (DPM, cf. https://grafana.com/docs/grafana-cloud/billing-and-usage/active-series-and-dpm/#data-points-per-minute-dpm) which we would like to not exceed. If the OpenTelemetry Collector would support aggregating each time series in intervals of 60s that would be a great help.

Related: https://github.com/open-telemetry/opentelemetry-java/issues/5393

github-actions[bot] commented 12 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

bitomaxsp commented 12 months ago

There is also a bigger issue when it comes to accumulating intervals from "lambdas". That, I think, is not covered by the OTel SDK at all, IIRC. It would be nice to come up with some guides on how to deal with that and include it in this processor as well.

yuri-rs commented 5 months ago

I've achieved my goal of Metric Aggregation by chaining several processors in the following order: batch -> groupbyattrs -> transform/TruncateTime -> metricstransform/aggregate_labels

Although the process might seem extensive, it effectively gets the job done. This approach could be useful for anyone tackling a similar challenge.

axaxs commented 5 months ago

> I've achieved my goal of Metric Aggregation by chaining several processors in the following order: batch -> groupbyattrs -> transform/TruncateTime -> metricstransform/aggregate_labels
>
> Although the process might seem extensive, it effectively gets the job done. This approach could be useful for anyone tackling a similar challenge.

Any chance you could expand on your settings? Did you use delta or cumulative counters? Our use case is multiple short-lived processes, all writing the same metrics and attributes but with different counts. Our expectation is for OTel to treat these as cumulative and be able to export to Prometheus, but we've hit a bit of a block, as others have stated.

diranged commented 2 months ago

Agreed - @yuri-rs would you mind posting your config?

yuri-rs commented 2 months ago

> Agreed - @yuri-rs would you mind posting your config?

Sure @diranged @axaxs. I did Collector aggregation for delta metrics. The example below aggregates count connector counters (deltas) into 5s buckets (from 1s individual measurements) and aggregates multiple series into one, keeping only the service.name label:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 1s
          static_configs:
            - targets: ['0.0.0.0:8888']

exporters:
  logging:
    verbosity: detailed

processors:
  groupbyattrs:
  batch:
    timeout: 5s
  resource/personal_metrics:
    attributes:
      - action: upsert
        value: "local"
        key: service.name
  transform/count_metric:
    metric_statements:
      - context: resource
        statements:
          - keep_keys(attributes, ["service.name"])
  metricstransform/count_metric:
    transforms:
      - include: "otlp.*"
        match_type: regexp
        action: update
        operations:
          - action: aggregate_labels
            aggregation_type: sum
            label_set: []
  transform/TruncateTime:
    metric_statements:
      - context: datapoint
        statements:
          - set(time, TruncateTime(time, Duration("5s")))

connectors:
  count:
    metrics:
      otlp.collector.metric.count:
    datapoints:
      otlp.collector.metric.data_point.count:

service:
  pipelines:
    metrics/internal:
      receivers: [prometheus]
      processors: [resource/personal_metrics]
      exporters: [count]
    metrics/count:
      receivers: [count]
      processors: [transform/count_metric, batch, groupbyattrs, transform/TruncateTime, metricstransform/count_metric]
      exporters: [logging]

  telemetry:
    logs:
      level: debug

robincw-gr commented 1 month ago

Hey @yuri-rs, thanks for sharing, that is great. What kind of scale have you tested this at, e.g. active series cardinality and rate of data, and what were the memory/CPU requirements like?