open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector

Some metric names cannot be matched with regex #34376

Closed · wilstdu closed this issue 2 months ago

wilstdu commented 2 months ago

Component(s)

receiver/prometheus

What happened?

Description

An AWS EKS cluster has the OpenTelemetry Collector deployed as a DaemonSet and uses the TargetAllocator to discover metrics endpoints from ServiceMonitors.

Shortened list of metrics I'm trying to ingest:

# HELP tekton_pipelines_controller_pipelinerun_duration_seconds The pipelinerun execution time in seconds
# TYPE tekton_pipelines_controller_pipelinerun_duration_seconds histogram
tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{namespace="tekton-verification",pipeline="tekton-verification",status="success",le="43200"} 1
tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{namespace="tekton-verification",pipeline="tekton-verification",status="success",le="86400"} 1
tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{namespace="tekton-verification",pipeline="tekton-verification",status="success",le="+Inf"} 1
tekton_pipelines_controller_pipelinerun_duration_seconds_sum{namespace="tekton-verification",pipeline="tekton-verification",status="success"} 13.087762487
tekton_pipelines_controller_pipelinerun_duration_seconds_count{namespace="tekton-verification",pipeline="tekton-verification",status="success"} 1

# HELP tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds The pipelinerun's taskrun execution time in seconds
# TYPE tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds histogram
tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket{namespace="tekton-verification",pipeline="tekton-verification",status="success",task="anonymous",le="43200"} 1
tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket{namespace="tekton-verification",pipeline="tekton-verification",status="success",task="anonymous",le="86400"} 1
tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket{namespace="tekton-verification",pipeline="tekton-verification",status="success",task="anonymous",le="+Inf"} 1
tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_sum{namespace="tekton-verification",pipeline="tekton-verification",status="success",task="anonymous"} 13.06821713
tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_count{namespace="tekton-verification",pipeline="tekton-verification",status="success",task="anonymous"} 1

ServiceMonitor configuration used to whitelist specific metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-pipelines-controller
spec:
  endpoints:
    - honorLabels: true
      metricRelabelings:
        - action: keep
          regex: (tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_sum|tekton_pipelines_controller_pipelinerun_duration_seconds_sum)
          sourceLabels:
            - __name__
        - action: replace
          replacement: 'true'
          targetLabel: cx_ingest
      path: /metrics
      port: http-metrics
      scheme: http
  namespaceSelector:
    matchNames:
      - tekton-pipelines
  selector:
    matchLabels:
      app: tekton-pipelines-controller

In the OTel agent configuration there are no additional filters - it's just a direct passthrough of the scrape_configs created by the TargetAllocator.
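
For reference, the TargetAllocator translates the ServiceMonitor metricRelabelings into ordinary Prometheus metric_relabel_configs, so the scrape job handed to the receiver should look roughly like this (an illustrative sketch, not the actual generated config; the job name in particular is assumed):

scrape_configs:
  - job_name: serviceMonitor/tekton-pipelines/tekton-pipelines-controller/0
    metrics_path: /metrics
    scheme: http
    metric_relabel_configs:
      # keep only series whose __name__ matches the regex
      - action: keep
        source_labels: [__name__]
        regex: (tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_sum|tekton_pipelines_controller_pipelinerun_duration_seconds_sum)
      # stamp the surviving series so the filter processor can match on cx_ingest
      - action: replace
        replacement: 'true'
        target_label: cx_ingest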

Expected Result

Both the tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_sum and tekton_pipelines_controller_pipelinerun_duration_seconds_sum metrics are ingested, and all other metrics are discarded.

Actual Result

- tekton_pipelines_controller_pipelinerun_duration_seconds_sum - ingested
- tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_sum - not ingested
- with the regex changed to tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds(.*) - then tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_sum, _bucket, and _count are all ingested

When trying to use the wildcard to allow ingesting a bit more and then additionally dropping the _bucket and _count metrics, this also doesn't work.
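
Roughly, the wildcard-plus-drop combination looked like this (a reconstruction of what was tried, not the exact manifest):

metricRelabelings:
  - action: keep
    regex: tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds(.*)
    sourceLabels:
      - __name__
  # then drop the bucket and count series again
  - action: drop
    regex: .*(_bucket|_count)
    sourceLabels:
      - __name__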

It may be related to the length of the metric name, since adding a wildcard at the end allows the metric to be ingested. An additional observation: none of the metrics with names longer than tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_sum can be whitelisted without a wildcard at the end.

Collector version

otel/opentelemetry-collector-contrib:0.96.0

Environment information

Environment

Cloud, AWS EKS, DaemonSet

OpenTelemetry Collector configuration

!!!This is a shortened version of the actual config, with only the relevant parts!!!

exporters:
  coralogix:
  debug: {}
  logging: {}
extensions:
processors:
  attributes/service_monitors:
    actions:
    - action: delete
      key: cx_ingest
  filter/reducer:
    metrics:
      include:
        expressions:
        - Label("cx_ingest") == "true"
        - MetricName == "system.cpu.time"
        - MetricName == "system.memory.usage"
        - MetricName == "system.disk.io"
        - MetricName == "system.network.io"
        - MetricName == "k8s.pod.cpu.time"
        - MetricName == "k8s.pod.cpu.utilization"
        - MetricName == "k8s.pod.network.io"
        - MetricName == "k8s.pod.memory.usage"
        - MetricName == "k8s.node.cpu.utilization"
        - MetricName == "container.cpu.utilization"
        - MetricName == "container.cpu.time"
        - MetricName == "k8s.node.network.io"
        - MetricName == "k8s.node.filesystem.available"
        - MetricName == "container.memory.usage"
        match_type: expr
receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: opentelemetry-collector
        scrape_interval: 30s
        static_configs:
        - targets:
          - ${MY_POD_IP}:8888
    target_allocator:
      collector_id: ${MY_POD_NAME}
      endpoint: http://coralogix-opentelemetry-targetallocator
      interval: 30s
service:
  extensions:
  pipelines:
    metrics:
      exporters:
      - coralogix
      processors:
      - filter/reducer
      - attributes/service_monitors
      receivers:
      - prometheus
  telemetry:
    logs:
      encoding: json
      level: 'warn'
    metrics:
      address: ${MY_POD_IP}:8888
    resource:
    - service.instance.id: null
    - service.name: null

Log output

No response

Additional context

No response

github-actions[bot] commented 2 months ago

Pinging code owners:

dashpole commented 2 months ago

First, note that you have "keep" for the action, so the other series should be discarded, and the ones that match the regex will be kept.

These are histogram metrics, so the resulting metric should be named tekton_pipelines_controller_pipelinerun_duration_seconds or tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds.

If you only keep the _sum series, the collector may drop your histogram entirely, as it won't be a valid Histogram. I haven't tested it, but you might get strange behavior doing this.

It shouldn't have anything to do with the length of the regex or the length of the metric name.
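
A keep rule that retains the complete set of histogram series for both metrics could look like this (a minimal sketch, assuming the goal is to keep the whole histograms rather than only the _sum series):

metricRelabelings:
  # keep _bucket, _sum, and _count for both histograms
  - action: keep
    regex: tekton_pipelines_controller_pipelinerun(_taskrun)?_duration_seconds(_bucket|_sum|_count)
    sourceLabels:
      - __name__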

wilstdu commented 2 months ago

@dashpole thank you for the explanation.

I figured out what was different in my regex between the two histogram metrics: for tekton_pipelines_controller_pipelinerun_duration_seconds I had whitelisted both _sum and _count. When I tried the same for tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds, it worked.

So to sum it up: to ingest a histogram metric you need to whitelist both the _sum and _count series, otherwise the metric will be rejected.

For the case I was trying to resolve this helped, because the _bucket series was the problem: it produced a lot of data that was not used.
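
For completeness, the working metricRelabelings ended up along these lines (a sketch based on the description above, not the exact manifest):

metricRelabelings:
  # keep only _sum and _count for both histograms; _bucket stays excluded
  - action: keep
    regex: (tekton_pipelines_controller_pipelinerun_duration_seconds|tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds)(_sum|_count)
    sourceLabels:
      - __name__
  - action: replace
    replacement: 'true'
    targetLabel: cx_ingest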