open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io

Memory-leak related to the resourcetotelemetry codepath? #33383

Open · diranged opened this issue 3 months ago

diranged commented 3 months ago

Component(s)

exporter/prometheusremotewrite, pkg/resourcetotelemetry

What happened?

Description

We're troubleshooting an issue where a single otel-collector-... pod out of a group starts turning away metrics because the memory_limiter is tripped. In this setup, dozens or hundreds of clients push OTLP metrics (Prometheus-collected, but sent over the OTLP gRPC exporter) to multiple otel-collector-... pods. The metrics are routed with the loadbalancing exporter using routing_key: resource.

The behavior we see is that one collector suddenly starts running out of memory and being limited, while the other collectors use half as much memory or less to process the same number of events. Here are graphs of the ingestion and the success rate:

[screenshot: receiver ingestion rate and Percentage of Metrics Accepted by Receiver]

The two dips in the Percentage of Metrics Accepted by Receiver graph correspond to different pods in a StatefulSet. Here's the graph of actual memory usage of these three pods:

[screenshot: memory usage of the three collector pods]

In the first dip, from 8:30AM to 9:30AM, I manually restarted the pod to recover it. It's fine now... but a few hours later, a different one of the pods becomes overloaded. Grabbing a heap dump from the pprof endpoint on a "good" and a "bad" pod shows some stark differences:

Bad Pod Pprof: otel-collector-metrics-processor-collector-0.pb.gz [heap profile screenshot]

Good Pod Pprof: otel-collector-metrics-processor-collector-1.pb.gz [heap profile screenshot]
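
For reference, this is roughly how the profiles were grabbed: a minimal Go sketch that pulls the heap profile from the pprof extension on :1777 (the pod hostname and output path are placeholders), equivalent to a simple curl against /debug/pprof/heap:

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// The pprof extension serves the standard net/http/pprof handlers, so the
	// heap profile can be fetched from the :1777 endpoint configured below and
	// then inspected with `go tool pprof heap.pb.gz`.
	resp, err := http.Get("http://otel-collector-metrics-processor-collector-0:1777/debug/pprof/heap")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pb.gz")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	n, err := io.Copy(out, resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("wrote %d bytes\n", n)
}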

I should note that this isn't remotely our largest environment: these pods are handling ~12-15k datapoints/sec, while pods in our larger environments handle ~20k datapoints/sec... so this doesn't feel like a fundamental scale issue. All pods are sized the same across all of our environments.

Collector version

v0.101.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        max_recv_msg_size_mib: 128
        tls:
          ca_file: /tls/ca.crt
          cert_file: /tls/tls.crt
          client_ca_file: /tls/ca.crt
          key_file: /tls/tls.key
exporters:
  debug:
    sampling_initial: 15
    sampling_thereafter: 60
  debug/verbose:
    sampling_initial: 15
    sampling_thereafter: 60
    verbosity: detailed
  prometheusremotewrite/amp:
    add_metric_suffixes: true
    auth:
      authenticator: sigv4auth
    endpoint: https://...api/v1/remote_write
    max_batch_size_bytes: "1000000"
    remote_write_queue:
      num_consumers: 5
      queue_size: 50000
    resource_to_telemetry_conversion:
      enabled: true
    retry_on_failure:
      enabled: true
      initial_interval: 200ms
      max_elapsed_time: 60s
      max_interval: 5s
    send_metadata: false
    target_info:
      enabled: false
    timeout: 90s
  prometheusremotewrite/central:
    add_metric_suffixes: true
    endpoint: https://..../api/v1/remote_write
    max_batch_size_bytes: "1000000"
    remote_write_queue:
      num_consumers: 5
      queue_size: 50000
    resource_to_telemetry_conversion:
      enabled: true
    retry_on_failure:
      enabled: true
      initial_interval: 200ms
      max_elapsed_time: 60s
      max_interval: 5s
    send_metadata: false
    target_info:
      enabled: false
    timeout: 90s
    tls:
      ca_file: /tls/ca.crt
      cert_file: /tls/tls.crt
      insecure_skip_verify: true
      key_file: /tls/tls.key
  prometheusremotewrite/staging:
    add_metric_suffixes: true
    endpoint: https://.../api/v1/remote_write
    max_batch_size_bytes: "1000000"
    remote_write_queue:
      num_consumers: 5
      queue_size: 50000
    resource_to_telemetry_conversion:
      enabled: true
    retry_on_failure:
      enabled: true
      initial_interval: 200ms
      max_elapsed_time: 60s
      max_interval: 5s
    send_metadata: false
    target_info:
      enabled: false
    timeout: 90s
    tls:
      ca_file: /tls/ca.crt
      cert_file: /tls/tls.crt
      insecure_skip_verify: true
      key_file: /tls/tls.key
processors:
  attributes/common:
    actions:
      - action: upsert
        key: k8s.cluster.name
        value: ...
  batch/otlp:
    send_batch_max_size: 10000
  batch/prometheus:
    send_batch_max_size: 12384
    send_batch_size: 8192
    timeout: 15s
  filter/drop_unknown_source:
    error_mode: ignore
    metrics:
      exclude:
        match_type: regexp
        metric_names: .*
        resource_attributes:
          - key: _meta.source.type
            value: unknown
  filter/find_unknown_source:
    error_mode: ignore
    metrics:
      include:
        match_type: regexp
        metric_names: .*
        resource_attributes:
          - key: _meta.source.type
            value: unknown
  filter/only_prometheus_metrics:
    error_mode: ignore
    metrics:
      include:
        match_type: regexp
        resource_attributes:
          - key: _meta.source.type
            value: prometheus
  k8sattributes:
    extract:
      labels:
        - from: pod
          key: app.kubernetes.io/name
          tag_name: app.kubernetes.io/name
        - from: pod
          key: app.kubernetes.io/instance
          tag_name: app.kubernetes.io/instance
        - from: pod
          key: app.kubernetes.io/component
          tag_name: app.kubernetes.io/component
        - from: pod
          key: app.kubernetes.io/part-of
          tag_name: app.kubernetes.io/part-of
        - from: pod
          key: app.kubernetes.io/managed-by
          tag_name: app.kubernetes.io/managed-by
      metadata:
        - container.id
        - container.image.name
        - container.image.tag
        - k8s.container.name
        - k8s.cronjob.name
        - k8s.daemonset.name
        - k8s.deployment.name
        - k8s.job.name
        - k8s.namespace.name
        - k8s.node.name
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.replicaset.name
        - k8s.statefulset.name
    passthrough: false
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
      - sources:
          - from: resource_attribute
            name: k8s.pod.uid
      - sources:
          - from: connection
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 10
  transform/clean_metadata:
    log_statements:
      - context: resource
        statements:
          - delete_matching_keys(attributes, "^_meta.*")
    metric_statements:
      - context: resource
        statements:
          - delete_matching_keys(attributes, "^_meta.*")
    trace_statements:
      - context: resource
        statements:
          - delete_matching_keys(attributes, "^_meta.*")
  transform/default_source:
    log_statements:
      - context: resource
        statements:
          - set(attributes["_meta.source.type"], "unknown") where attributes["_meta.source.type"] == nil
          - set(attributes["_meta.source.name"], "unknown") where attributes["_meta.source.name"] == nil
    metric_statements:
      - context: resource
        statements:
          - set(attributes["_meta.source.type"], "unknown") where attributes["_meta.source.type"] == nil
          - set(attributes["_meta.source.name"], "unknown") where attributes["_meta.source.name"] == nil
    trace_statements:
      - context: resource
        statements:
          - set(attributes["_meta.source.type"], "unknown") where attributes["_meta.source.type"] == nil
          - set(attributes["_meta.source.name"], "unknown") where attributes["_meta.source.name"] == nil
  transform/prometheus_label_clean:
    metric_statements:
      - context: resource
        statements:
          - delete_matching_keys(attributes, "^datadog.*")
          - delete_matching_keys(attributes, "^host.cpu.*")
          - delete_matching_keys(attributes, "^host.image.id")
          - delete_matching_keys(attributes, "^host.type")
          - replace_all_patterns(attributes, "key", "^(endpoint|http\\.scheme|net\\.host\\.name|net\\.host\\.port)", "scrape.$$1")
      - context: datapoint
        statements:
          - replace_all_patterns(attributes, "key", "^(endpoint)", "scrape.$$1")
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: :1777
  sigv4auth:
    region: eu-west-1
service:
  extensions:
    - health_check
    - pprof
    - sigv4auth
  telemetry:
    logs:
      level: info
    metrics:
      level: detailed
  pipelines:
    metrics/prometheus:
      exporters:
        - prometheusremotewrite/amp
        - prometheusremotewrite/central
      processors:
        - memory_limiter
        - transform/default_source
        - filter/only_prometheus_metrics
        - transform/prometheus_label_clean
        - transform/clean_metadata
        - attributes/common
        - batch/prometheus
      receivers:
        - otlp

Log output

No response

Additional context

No response

github-actions[bot] commented 3 months ago

Pinging code owners:

diranged commented 3 months ago

Attaching the relevant memory_limiter logs... Explore-logs-2024-06-04 13_14_33.txt

mx-psi commented 3 months ago

Could there be some metrics that have a very large number of resource attributes? We end up allocating extra memory of size roughly $\textrm{number of metrics} \times \textrm{avg number of resource attributes}$, so if the number of metrics is not too big, maybe the number of resource attributes explains this.

I am a bit skeptical of this being an issue in pkg/resourcetotelemetry at first. There may be room for improvement, but the logic there is pretty simple: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/e7cf56040c1bf33599f149594d9beea6dcc4cb8e/pkg/resourcetotelemetry/resource_to_telemetry.go#L108-L112

and it looks like it allocates exactly what it needs.
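
For illustration, here is a minimal, self-contained sketch of the kind of copy that conversion performs, using the pdata pcommon API as vendored around v0.101.0; the attribute keys and values are just placeholders, not the exact code at the link:

package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pcommon"
)

// joinAttributeMaps sketches the per-datapoint copy: each resource attribute is
// written into the datapoint's own attribute map, so the extra memory grows
// with (number of datapoints) x (number of resource attributes).
func joinAttributeMaps(from, to pcommon.Map) {
	from.Range(func(k string, v pcommon.Value) bool {
		v.CopyTo(to.PutEmpty(k))
		return true
	})
}

func main() {
	resourceAttrs := pcommon.NewMap()
	resourceAttrs.PutStr("k8s.pod.name", "example-pod") // placeholder values
	resourceAttrs.PutStr("k8s.namespace.name", "example-ns")

	datapointAttrs := pcommon.NewMap()
	datapointAttrs.PutStr("endpoint", "http-envoy-prom")

	joinAttributeMaps(resourceAttrs, datapointAttrs)
	fmt.Println(datapointAttrs.AsRaw()) // the datapoint now carries copies of the resource attributes
}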

diranged commented 3 months ago

@mx-psi, thanks for the quick response. We do add a decent number of resource attributes to every metric. At this time, we're only processing metrics collected by the prometheusreceiver. Each of these metrics gets a bunch of standard labels applied by the k8sattributesprocessor:

app_env="xxx",
app_group="xxx",
app_kubernetes_io_instance="xxx",
app_kubernetes_io_managed_by="Helm",
app_kubernetes_io_name="xx",
cloud_account_id="xxx",
cloud_availability_zone="xxx",
cloud_platform="aws_eks",
cloud_provider="aws",
cloud_region="eu-west-1",
cluster="eu1",
component="proxy",
container="istio-proxy",
container_id="xxx",
container_image_name="xxx/istio/proxyv2",
container_image_tag="1.20.6",
host_arch="amd64",
host_id="i-xxx",
instance="xxx:15090",
job="istio-system/envoy-stats-monitor-raw",
k8s_cluster_name="eu1",
k8s_container_name="istio-proxy",
k8s_deployment_name="xxx",
k8s_namespace_name="xxx",
k8s_node_name="xxxeu-west-1.compute.internal",
k8s_node_uid="54084d95-ecc4-406b-8ac7-11c9a9a6bf57",
k8s_pod_name="xxx-6bzl6",
k8s_pod_uid="ca256889-747f-493d-b262-c8c2ae728f5e",
k8s_replicaset_name="xxx",
namespace="xxx",
node_name="xxx.eu-west-1.compute.internal",
os_type="linux",
otel="true",
pod="xxx-6bzl6",
scrape_endpoint="http-envoy-prom",
scrape_http_scheme="http",
scrape_net_host_name="100.64.179.192",
scrape_net_host_port="15090",
service_instance_id="100.64.179.192:15090",

That said... what this feels like is some kind of issue where the GC is unable to clean up the data when the memory_limiter is tripped. So not an ongoing memory leak, but perhaps just a stuck pointer or something that prevents the data from being collected in some situations?
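
For a rough sense of scale, applying the formula above to our numbers; every size here is an assumption, not a measurement:

package main

import "fmt"

func main() {
	// Back-of-the-envelope only; the sizes below are assumptions.
	const (
		datapointsPerSec  = 15000 // upper end of the reported per-pod ingest rate
		resourceAttrCount = 40    // roughly the label set listed above
		bytesPerAttrPair  = 64    // assumed average key+value size once copied
	)
	extraBytesPerSec := datapointsPerSec * resourceAttrCount * bytesPerAttrPair
	fmt.Printf("~%d MB/s of short-lived attribute copies for the GC to reclaim\n",
		extraBytesPerSec/1000000)
}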

philchia commented 2 months ago

How about we just process the resource attributes in prometheusremotewrite.FromMetrics instead of pkg/resourcetotelemetry?

github-actions[bot] commented 1 week ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.