open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
2.95k stars 2.29k forks source link

[processor/metricstransform] Aggregation skipped when label_set is empty but not nil #34430

Open tiffanny29631 opened 1 month ago

tiffanny29631 commented 1 month ago

Component(s)

processor/metricstransform

What happened?

Description

We are upgrading from 0.102.0 to 0.106.0 for CVE fixes and encountered a behavior change in the mentioned processor. With the configuration we have the expectation is no label remains for the metric after aggregation operation, but now all labels fall through and get exported.

When looking at the code previous filter skips if label_set is nil and refactored code skips if length is 0.

I did not find the change of this behavior in the release note, was it intentional? If yes, would omitting this field work? I've tried explicitly setting label_set to nil and the filter is still not working as expected.

Steps to Reproduce

Install otelconribcol 0.102.0 with our ConfigMap, everything works fine;

Upgrade to 0.106.0, the metric that has labe_set:[] starts to getting rejected by our back end as unrecognized labels received.

Expected Result

With label_set:[] all labels associated with a metric are filtered out.

Actual Result

With label_set:[] all labels associated with a metric remain.

Collector version

0.106.0

Environment information

Environment

GKE cluster 1.29.6-gke.1254000

OpenTelemetry Collector configuration

apiVersion: v1
data:
  otel-collector-config.yaml: |-
    receivers:
      opencensus:
    exporters:
      prometheus:
        endpoint: :8675
        namespace: config_sync
        resource_to_telemetry_conversion:
          enabled: true
      googlecloud:
        metric:
          prefix: "custom.googleapis.com/opencensus/config_sync/"
          # The exporter would always fail at sending metric descriptor. Skipping
          # creation of metric descriptors until the error from upstream is resolved
          # The metric streaming data is not affected
          # https://github.com/GoogleCloudPlatform/opentelemetry-operations-go/issues/529
          skip_create_descriptor: true
          # resource_filters looks for metric resource attributes by prefix and converts
          # them into custom metric labels, so they become visible and can be accessed
          # under the GroupBy dropdown list in Cloud Monitoring
          resource_filters:
            - prefix: "cloud.account.id"
            - prefix: "cloud.availability.zone"
            - prefix: "cloud.platform"
            - prefix: "cloud.provider"
            - prefix: "k8s.pod.ip"
            - prefix: "k8s.pod.namespace"
            - prefix: "k8s.pod.uid"
            - prefix: "k8s.container.name"
            - prefix: "host.id"
            - prefix: "host.name"
            - prefix: "k8s.deployment.name"
            - prefix: "k8s.node.name"
        sending_queue:
          enabled: false
      googlecloud/kubernetes:
        metric:
          prefix: "kubernetes.io/internal/addons/config_sync/"
          # skip_create_descriptor: Metrics start with 'kubernetes.io/' have already
          # got descriptors defined internally. Skip sending dupeicated metric
          # descriptors here to prevent errors or conflicts.
          skip_create_descriptor: true
          # instrumentation_library_labels: Otel Collector by default attaches
          # 'instrumentation_version' and 'instrumentation_source' labels that are
          # not specified in our Cloud Monarch definitions, thus skipping them here
          instrumentation_library_labels: false
          # create_service_timeseries: This is a recommended configuration for
          # 'service metrics' starts with 'kubernetes.io/' prefix. It uses
          # CreateTimeSeries API and has its own quotas, so that custom metric write
          # will not break this ingestion pipeline
          create_service_timeseries: true
          service_resource_labels: false
        sending_queue:
          enabled: false
    processors:
      batch:
      # resourcedetection: This processor is needed to correctly mirror resource
      # labels from OpenCensus to OpenTelemetry. We also want to keep this same
      # processor in Otel Agent configuration as the resource labels are added from
      # there
      resourcedetection:
        detectors: [env, gcp]
      filter/cloudmonitoring:
        metrics:
          include:
            # Use strict match type to ensure metrics like 'kustomize_resource_count'
            # is excluded
            match_type: strict
            metric_names:
              - reconciler_errors
              - apply_duration_seconds
              - reconcile_duration_seconds
              - rg_reconcile_duration_seconds
              - last_sync_timestamp
              - pipeline_error_observed
              - declared_resources
              - apply_operations_total
              - resource_fights_total
              - internal_errors_total
              - kcc_resource_count
              - resource_count
              - ready_resource_count
              - cluster_scoped_resource_count
              - resource_ns_count
              - api_duration_seconds
      # Aggregate some metrics sent to Cloud Monitoring to remove high-cardinality labels (e.g. "commit")
      metricstransform/cloudmonitoring:
        transforms:
          - include: last_sync_timestamp
            action: update
            operations:
              - action: aggregate_labels
                label_set:
                  - configsync.sync.kind
                  - configsync.sync.name
                  - configsync.sync.namespace
                  - status
                aggregation_type: max
          - include: declared_resources
            action: update
            operations:
              - action: aggregate_labels
                label_set:
                  - configsync.sync.kind
                  - configsync.sync.name
                  - configsync.sync.namespace
                aggregation_type: max
          - include: apply_duration_seconds
            action: update
            operations:
              - action: aggregate_labels
                label_set:
                  - configsync.sync.kind
                  - configsync.sync.name
                  - configsync.sync.namespace
                  - status
                aggregation_type: max
      filter/kubernetes:
        metrics:
          include:
            match_type: regexp
            metric_names:
              - kustomize.*
              - api_duration_seconds
              - reconciler_errors
              - pipeline_error_observed
              - reconcile_duration_seconds
              - rg_reconcile_duration_seconds
              - parser_duration_seconds
              - declared_resources
              - apply_operations_total
              - apply_duration_seconds
              - resource_fights_total
              - remediate_duration_seconds
              - resource_conflicts_total
              - internal_errors_total
              - kcc_resource_count
              - last_sync_timestamp
      # Transform the metrics so that their names and labels are aligned with definition in go/config-sync-monarch-metrics
      metricstransform/kubernetes:
        transforms:
          - include: api_duration_seconds
            action: update
            operations:
              - action: aggregate_labels
                # label_set is the allowlist of labels to keep after aggregation
                label_set: [status, operation]
                aggregation_type: max
          - include: declared_resources
            action: update
            new_name: current_declared_resources
            operations:
              - action: aggregate_labels
                label_set: []
                aggregation_type: max
          - include: kcc_resource_count
            action: update
            operations:
              - action: aggregate_labels
                label_set: [resourcegroup]
                aggregation_type: max
          - include: reconciler_errors
            action: update
            new_name: last_reconciler_errors
            operations:
              - action: aggregate_labels
                label_set: [component, errorclass]
                aggregation_type: max
          - include: reconcile_duration_seconds
            action: update
            operations:
              - action: aggregate_labels
                label_set: [status]
                aggregation_type: max
          - include: rg_reconcile_duration_seconds
            action: update
            operations:
              - action: aggregate_labels
                label_set: [stallreason]
                aggregation_type: max
          - include: last_sync_timestamp
            action: update
            operations:
              - action: aggregate_labels
                label_set: [status]
                aggregation_type: max
          - include: parser_duration_seconds
            action: update
            operations:
              - action: aggregate_labels
                label_set: [status, source, trigger]
                aggregation_type: max
          - include: pipeline_error_observed
            action: update
            new_name: last_pipeline_error_observed
            operations:
              - action: aggregate_labels
                label_set: [name, component, reconciler]
                aggregation_type: max
          - include: apply_operations_total
            action: update
            new_name: apply_operations_count
            operations:
              - action: aggregate_labels
                label_set: [controller, operation, status]
                aggregation_type: max
          - include: apply_duration_seconds
            action: update
            operations:
              - action: aggregate_labels
                label_set: [status]
                aggregation_type: max
          - include: resource_fights_total
            action: update
            new_name: resource_fights_count
            operations:
              - action: aggregate_labels
                label_set: [name, component, reconciler]
                aggregation_type: max
          - include: resource_conflicts_total
            action: update
            new_name: resource_conflicts_count
            operations:
              - action: aggregate_labels
                label_set: []
                aggregation_type: max
          - include: internal_errors_total
            action: update
            new_name: internal_errors_count
            operations:
              - action: aggregate_labels
                label_set: []
                aggregation_type: max
          - include: remediate_duration_seconds
            action: update
            operations:
              - action: aggregate_labels
                label_set: [status]
                aggregation_type: max
          - include: kustomize_field_count
            action: update
            operations:
              - action: aggregate_labels
                label_set: [field_name]
                aggregation_type: max
          - include: kustomize_deprecating_field_count
            action: update
            operations:
              - action: aggregate_labels
                label_set: [deprecating_field]
                aggregation_type: max
          - include: kustomize_simplification_adoption_count
            action: update
            operations:
              - action: aggregate_labels
                label_set: [simplification_field]
                aggregation_type: max
          - include: kustomize_builtin_transformers
            action: update
            operations:
              - action: aggregate_labels
                label_set: [k8s_metadata_transformer]
                aggregation_type: max
          - include: kustomize_helm_inflator_count
            action: update
            operations:
              - action: aggregate_labels
                label_set: [helm_inflator]
                aggregation_type: max
          - include: kustomize_base_count
            action: update
            operations:
              - action: aggregate_labels
                label_set: [base_source]
                aggregation_type: max
          - include: kustomize_patch_count
            action: update
            operations:
              - action: aggregate_labels
                label_set: [patch_field]
                aggregation_type: max
          - include: kustomize_ordered_top_tier_metrics
            action: update
            operations:
              - action: aggregate_labels
                label_set: [top_tier_field]
                aggregation_type: max
          - include: kustomize_resource_count
            action: update
            operations:
              - action: aggregate_labels
                label_set: []
                aggregation_type: max
          - include: kustomize_build_latency
            action: update
            operations:
              - action: aggregate_labels
                label_set: []
                aggregation_type: max
    extensions:
      health_check:
    service:
      extensions: [health_check]
      pipelines:
        metrics/cloudmonitoring:
          receivers: [opencensus]
          processors: [batch, filter/cloudmonitoring, metricstransform/cloudmonitoring, resourcedetection]
          exporters: [googlecloud]
        metrics/prometheus:
          receivers: [opencensus]
          processors: [batch]
          exporters: [prometheus]
        metrics/kubernetes:
          receivers: [opencensus]
          processors: [batch, filter/kubernetes, metricstransform/kubernetes, resourcedetection]
          exporters: [googlecloud/kubernetes]
kind: ConfigMap
metadata:
  labels:
    app: opentelemetry
    component: otel-collector
    configmanagement.gke.io/arch: csmr
    configmanagement.gke.io/system: "true"
  name: otel-collector-googlecloud
  namespace: config-management-monitoring

Log output

No response

Additional context

No response

github-actions[bot] commented 1 month ago

Pinging code owners:

main-- commented 1 month ago

I'm observing the same problem. I was wondering why the Honeycomb-provided config was reporting 640 distinct CPU events, and it turns out that all the aggregations are just being ignored. I resolved the problem by downgrading to Version 0.105.0 - now those 640 events are aggregated down to a single one. I'm guessing that #33655 introduced this bug, seeing as it replaced the event aggregation functions with a completely different code path.

tiffanny29631 commented 1 month ago

A potential hack: Latest observation if v0.106 is using a no_op_label instead of label_set:[] or omitting the field can bypass the issue.

Looking at https://github.com/open-telemetry/opentelemetry-collector/blob/main/pdata/pcommon/slice.go#L125 if a given element in label_set does not exist in the original slice, f would still return true indicating the label is to be removed but the re-slice would ignore as the element cannot be found, meanwhile other labels could get aggregated out as designed.