open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

possible memory leak: config with hostmetrics, kubeletstats, prometheus receivers + transform/k8sattributes processors #36351

Open gord1anknot opened 4 days ago

gord1anknot commented 4 days ago

Component(s)

processor/transform

What happened?

Description

Hello! My organization runs a Helm deployment of the OpenTelemetry Collector, and we are seeing what I would describe as a memory leak in one particular DaemonSet tasked with ingesting Prometheus, kubelet, and host metrics from its node. We have worked around the issue by periodically restarting this workload.

The memory usage builds very gradually; it takes about two weeks, at which point CPU usage maxes out in a constant loop of garbage collection. At that point, metrics are refused due to this contention.

On August 2nd, we split the configuration into two DaemonSets so that log forwarding would be isolated from metrics once the collector reaches this condition. The log-forwarding configuration does not have this problem.

We observed this issue both before upgrading from 0.92.0 to 0.107.0 and after rolling back to 0.92.0, confirming that the memory issue is unrelated to the upgrade.

I suspect, but cannot confirm, that this issue comes out of our use of the transform processor, which is why I labeled the component that way. I suspect it because we greatly expanded our usage of that processor around July 13th, and the chart below shows memory rising to a problem level faster after that date.

Please see the chart below, going back to May 1st, for a visual on memory usage of our OpenTelemetry workloads. The cluster-reciever is a singleton pod for k8s cluster metrics and some high-memory scrapes, logs-agent is the split-off logs configuration, and collector is a gateway; none of these exhibit the issue.

PromQL query for the chart below:

```promql
max by (k8s_container_name, k8s_workload_name) (
  (
    max by (k8s_workload_name, k8s_container_name, k8s_pod_name) (
      container_memory_usage{env="prod", k8s_cluster_name=~"prod", k8s_namespace_name=~"opentelemetry-collector", k8s_workload_name=~".*"}
    )
  )
  /
  (
    (
      max by (k8s_workload_name, k8s_container_name, k8s_pod_name) (
        k8s_container_memory_limit{env="prod", k8s_cluster_name=~"prod", k8s_namespace_name=~"opentelemetry-collector", k8s_workload_name=~".*"}
      )
    ) != 0
  )
)
```

(chart: memory usage of the OpenTelemetry workloads since May 1st)

Steps to Reproduce

We are able to reproduce this issue in lower environments; however, since the issue takes at least 14 days to show up, we cannot iterate very quickly. Please find the complete configuration for the metrics-agent DaemonSet below.

Details

(Identical to the full configuration listed under "OpenTelemetry Collector configuration" below.)

I noticed that other memory-leak issues usually require the reporter to post a heap pprof, so I added the pprof extension to our lower environments. Please find a heap dump from the oldest pod so instrumented (12 days old); unfortunately, it's not churning garbage collection yet, though it's getting close.
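For anyone wanting to capture the same profiles, this is roughly the extension configuration involved (a sketch; the port is the contrib pprof extension's documented default, and the `${env:MY_POD_IP}` binding mirrors the other extensions in our config):

```yaml
extensions:
  # pprof extension from opentelemetry-collector-contrib; serves the
  # standard Go net/http/pprof endpoints for heap and CPU profiles.
  pprof:
    endpoint: ${env:MY_POD_IP}:1777

service:
  extensions:
  - health_check
  - zpages
  - pprof
```

Profiles can then be pulled from a pod with, for example, `go tool pprof http://<pod-ip>:1777/debug/pprof/heap`.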

Unfortunately, I'm running out of time to look into this issue, and I don't have enough Go experience to understand what I'm looking at in the heap dump. As a workaround, we have implemented an automatic restart every Monday; I'm hoping you can help.
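For reference, the weekly restart workaround can be expressed as a Kubernetes CronJob along these lines (a sketch; the DaemonSet, namespace, image, and service account names are illustrative, and the service account needs RBAC permission to get and patch DaemonSets):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-metrics-agent
  namespace: opentelemetry-collector
spec:
  schedule: "0 6 * * 1"   # every Monday at 06:00
  jobTemplate:
    spec:
      template:
        spec:
          # service account with get/patch on apps/daemonsets
          serviceAccountName: daemonset-restarter
          restartPolicy: Never
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - kubectl
            - rollout
            - restart
            - daemonset/metrics-agent
```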

Thank you so very much!

pprof.otelcol-contrib.samples.cpu.003.pb.gz pprof.otelcol-contrib.alloc_objects.alloc_space.inuse_objects.inuse_space.012.pb.gz

Expected Result

Garbage collection fully reclaims memory from routine operations

Actual Result

Garbage collection never reclaims some portion of overall memory consumption.

Collector version

v0.92.0

Environment information

Environment

OS: GKE / ContainerOS
Compiler (if manually compiled): n/a; using the public Docker image

OpenTelemetry Collector configuration

exporters:
  debug: {}
  logging: {}
  otlphttp:
    endpoint: << EXAMPLE >>
  splunk_hec/platform_logs:
    disable_compression: true
    endpoint: << EXAMPLE >>
    idle_conn_timeout: 10s
    index: kubernetes-logging
    profiling_data_enabled: false
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 300s
      max_interval: 30s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
    source: kubernetes
    splunk_app_name: otel-collector-agent
    splunk_app_version: 0.78.0
    timeout: 10s
    tls:
      insecure_skip_verify: true
    token: ${env:SPLUNK_HEC_TOKEN}
extensions:
  health_check:
    endpoint: ${env:MY_POD_IP}:13133
  zpages: {}
processors:
  batch: {}
  filter/logs:
    logs:
      exclude:
        match_type: strict
        resource_attributes:
        - key: splunk.com/exclude
          value: "true"
  filter/metrics:
    error_mode: ignore
    metrics:
      datapoint:
      - resource.attributes["k8s.namespace.name"] == "cluster-overprovisioner"
      - resource.attributes["k8s.namespace.name"] == "cluster-scaler"
      - resource.attributes["k8s.namespace.name"] == "kube-system"
      - resource.attributes["k8s.container.name"] == "pause"
      - resource.attributes["k8s.container.name"] == "wait"
      metric:
      - name == "dapr_http_client_roundtrip_latency"
      - name == "dapr_component_pubsub_ingress_latencies"
      - name == "kubernetes.daemon_set.current_scheduled"
      - name == "kubernetes.daemon_set.misscheduled"
      - name == "kubernetes.daemon_set.updated"
      - name == "kubernetes.deployment.updated"
      - name == "kubernetes.job.parallelism"
      - name == "kubernetes.namespace_phase"
      - name == "kubernetes.stateful_set.updated"
      - IsMatch(name, "kubernetes.replica_set.*")
      - IsMatch(name, "kubernetes.replication_controller.*")
      - IsMatch(name, "kubernetes.resource_quota.*")
      - IsMatch(name, "openshift.*")
  k8sattributes:
    extract:
      annotations:
      - from: pod
        key: splunk.com/sourcetype
      - from: namespace
        key: splunk.com/exclude
        tag_name: splunk.com/exclude
      - from: pod
        key: splunk.com/exclude
        tag_name: splunk.com/exclude
      - from: namespace
        key: splunk.com/index
        tag_name: com.splunk.index
      - from: pod
        key: splunk.com/index
        tag_name: com.splunk.index
      - from: pod
        key: examplecompany.net/env
        tag_name: env
      - from: pod
        key: examplecompany.net/role
        tag_name: role
      - from: pod
        key: examplecompany.net/service
        tag_name: service
      - from: pod
        key: examplecompany.net/app
        tag_name: app
      - from: pod
        key: examplecompany.net/version
        tag_name: version
      - from: pod
        key: examplecompany.net/canary
        tag_name: canary
      labels:
      - from: pod
        key: examplecompany.net/env
        tag_name: env
      - from: pod
        key: examplecompany.net/role
        tag_name: role
      - from: pod
        key: examplecompany.net/service
        tag_name: service
      - from: pod
        key: examplecompany.net/app
        tag_name: app
      - from: pod
        key: examplecompany.net/version
        tag_name: version
      - from: pod
        key: examplecompany.net/canary
        tag_name: canary
      metadata:
      - k8s.cronjob.name
      - k8s.daemonset.name
      - k8s.deployment.name
      - k8s.replicaset.name
      - k8s.statefulset.name
      - k8s.job.name
      - k8s.namespace.name
      - k8s.node.name
      - k8s.pod.name
      - k8s.pod.uid
      - container.id
      - container.image.name
      - container.image.tag
    filter:
      node_from_env_var: K8S_NODE_NAME
    passthrough: false
    pod_association:
    - sources:
      - from: resource_attribute
        name: k8s.pod.uid
    - sources:
      - from: resource_attribute
        name: k8s.pod.ip
    - sources:
      - from: resource_attribute
        name: ip
    - sources:
      - from: connection
    - sources:
      - from: resource_attribute
        name: host.name
  k8sattributes/prometheus:
    extract:
      metadata:
      - k8s.cronjob.name
      - k8s.daemonset.name
      - k8s.deployment.name
      - k8s.replicaset.name
      - k8s.statefulset.name
      - k8s.job.name
      - k8s.namespace.name
      - k8s.node.name
      - k8s.pod.name
      - k8s.pod.uid
      - container.id
    filter:
      node_from_env_var: K8S_NODE_NAME
    passthrough: false
    pod_association:
    - sources:
      - from: resource_attribute
        name: k8s.pod.uid
    - sources:
      - from: resource_attribute
        name: k8s.pod.ip
    - sources:
      - from: resource_attribute
        name: ip
    - sources:
      - from: connection
    - sources:
      - from: resource_attribute
        name: host.name
  memory_limiter:
    check_interval: 2s
    limit_percentage: 80
    spike_limit_percentage: 25
  resource:
    attributes:
    - action: insert
      key: k8s.node.name
      value: ${env:K8S_NODE_NAME}
    - action: upsert
      key: k8s.cluster.name
      value: prod
  resource/add_agent_k8s:
    attributes:
    - action: insert
      key: k8s.pod.name
      value: ${env:K8S_POD_NAME}
    - action: insert
      key: k8s.pod.uid
      value: ${env:K8S_POD_UID}
    - action: insert
      key: k8s.namespace.name
      value: ${env:K8S_NAMESPACE}
  resource/add_environment:
    attributes:
    - action: insert
      key: env
      value: prod
  resource/chrono-instance-id:
    attributes:
    - action: insert
      from_attribute: container.id
      key: service.instance.id
    - action: insert
      from_attribute: k8s.pod.uid
      key: service.instance.id
    - action: insert
      from_attribute: host.name
      key: service.instance.id
  resource/logs:
    attributes:
    - action: upsert
      from_attribute: k8s.pod.annotations.splunk.com/sourcetype
      key: com.splunk.sourcetype
    - action: upsert
      from_attribute: k8s.pod.annotations.examplecompany.net/role
      key: role
    - action: upsert
      from_attribute: k8s.pod.annotations.examplecompany.net/service
      key: service
    - action: delete
      key: k8s.pod.annotations.splunk.com/sourcetype
    - action: delete
      key: splunk.com/exclude
  resourcedetection:
    detectors:
    - env
    - gcp
    - system
    override: true
    timeout: 10s
  transform/add_service_and_role:
    error_mode: ignore
    metric_statements:
    - context: datapoint
      statements:
      - set(attributes["service"], resource.attributes["service.name"]) where attributes["service"]
        == nil
      - set(attributes["service"], resource.attributes["service"]) where attributes["service"]
        == nil
      - set(attributes["role"], resource.attributes["role"]) where attributes["role"]
        == nil
      - set(attributes["k8s.pod.name"], resource.attributes["k8s.pod.name"]) where
        resource.attributes["k8s.pod.name"] != nil
      - set(attributes["k8s.node.name"], resource.attributes["k8s.node.name"]) where
        resource.attributes["k8s.node.name"] != nil
      - set(attributes["k8s.cluster.name"], resource.attributes["k8s.cluster.name"])
        where resource.attributes["k8s.cluster.name"] != nil
      - set(attributes["k8s.container.name"], resource.attributes["k8s.container.name"])
        where resource.attributes["k8s.container.name"] != nil
      - set(attributes["k8s.workload.name"], resource.attributes["k8s.deployment.name"])
        where resource.attributes["k8s.deployment.name"] != nil
      - set(attributes["k8s.workload.name"], resource.attributes["k8s.daemonset.name"])
        where resource.attributes["k8s.daemonset.name"] != nil
      - set(attributes["k8s.workload.name"], resource.attributes["k8s.statefulset.name"])
        where resource.attributes["k8s.statefulset.name"] != nil
      - set(attributes["k8s.workload.name"], resource.attributes["k8s.cronjob.name"])
        where resource.attributes["k8s.cronjob.name"] != nil
      - set(attributes["k8s.workload.name"], resource.attributes["k8s.replicaset.name"])
        where resource.attributes["k8s.replicaset.name"] != nil and resource.attributes["k8s.deployment.name"]
        == nil
      - |
        set(attributes["k8s.workload.name"], resource.attributes["k8s.job.name"]) where resource.attributes["k8s.job.name"] != nil and resource.attributes["k8s.cronjob.name"] == nil
      - set(attributes["k8s.workload.kind"], "deployment") where resource.attributes["k8s.deployment.name"]
        != nil
      - set(attributes["k8s.workload.kind"], "daemonset") where resource.attributes["k8s.daemonset.name"]
        != nil
      - set(attributes["k8s.workload.kind"], "statefulset") where resource.attributes["k8s.statefulset.name"]
        != nil
      - set(attributes["k8s.workload.kind"], "cronjob") where resource.attributes["k8s.cronjob.name"]
        != nil
      - |
        set(attributes["k8s.workload.kind"], "replicaset") where resource.attributes["k8s.replicaset.name"] != nil and resource.attributes["k8s.deployment.name"] == nil
      - |
        set(attributes["k8s.workload.kind"], "job") where resource.attributes["k8s.job.name"] != nil and resource.attributes["k8s.cronjob.name"] == nil
      - set(attributes["k8s.namespace.name"], resource.attributes["k8s.namespace.name"])
        where resource.attributes["k8s.namespace.name"] != nil
      - set(attributes["k8s.workload.name"], attributes["app"]) where attributes["app"]
        != nil and attributes["k8s.workload.kind"] == "replicaset"
      - set(attributes["k8s.workload.name"], resource.attributes["app"]) where resource.attributes["app"]
        != nil and attributes["k8s.workload.kind"] == "replicaset"
      - set(attributes["k8s.workload.name"], Concat([attributes["app"],attributes["role"]],
        "-")) where attributes["app"] != nil and attributes["role"] != nil and attributes["k8s.workload.kind"]
        == "replicaset"
      - set(attributes["k8s.workload.name"], Concat([resource.attributes["app"],resource.attributes["role"]],
        "-")) where resource.attributes["app"] != nil and resource.attributes["role"]
        != nil and attributes["k8s.workload.kind"] == "replicaset"
  transform/sum_histograms:
    error_mode: ignore
    metric_statements:
    - context: metric
      statements:
      - extract_sum_metric(true) where name == "dapr_http_client_roundtrip_latency"
      - extract_sum_metric(true) where name == "dapr_component_pubsub_ingress_latencies"
receivers:
  hostmetrics:
    collection_interval: 60s
    root_path: /hostfs
    scrapers:
      cpu: null
      disk: null
      filesystem: null
      load: null
      memory: null
      network: null
      paging: null
      processes: null
  jaeger:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:14250
      thrift_compact:
        endpoint: ${env:MY_POD_IP}:6831
      thrift_http:
        endpoint: ${env:MY_POD_IP}:14268
  kubeletstats:
    auth_type: serviceAccount
    collection_interval: 60s
    endpoint: ${env:K8S_NODE_IP}:10250
    extra_metadata_labels:
    - container.id
    metric_groups:
    - container
    - node
    - pod
    metrics:
      k8s.container.cpu_limit_utilization:
        enabled: true
      k8s.container.cpu_request_utilization:
        enabled: true
      k8s.container.memory_limit_utilization:
        enabled: true
      k8s.container.memory_request_utilization:
        enabled: true
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
      - enable_http2: true
        follow_redirects: true
        job_name: external-dns
        kubernetes_sd_configs:
        - namespaces:
            names:
            - external-dns
          role: pod
        metrics_path: /metrics
        relabel_configs:
        - action: keep
          regex: external-dns
          source_labels:
          - __meta_kubernetes_pod_label_app_kubernetes_io_name
        - action: keep
          regex: ${env:K8S_NODE_NAME}
          source_labels:
          - __meta_kubernetes_pod_node_name
        scheme: http
        scrape_interval: 1m
        scrape_timeout: 10s
      - enable_http2: true
        follow_redirects: true
        job_name: daprd
        kubernetes_sd_configs:
        - role: pod
        metrics_path: /metrics
        relabel_configs:
        - action: keep
          regex: "true"
          source_labels:
          - __meta_kubernetes_pod_annotation_dapr_io_enable_metrics
        - action: keep
          regex: dapr-metrics
          source_labels:
          - __meta_kubernetes_pod_container_port_name
        - action: keep
          regex: ${env:K8S_NODE_NAME}
          source_labels:
          - __meta_kubernetes_pod_node_name
        scheme: http
        scrape_interval: 1m
        scrape_timeout: 30s
      - enable_http2: true
        follow_redirects: true
        job_name: envoy
        kubernetes_sd_configs:
        - role: pod
        metrics_path: /stats/prometheus
        relabel_configs:
        - action: keep
          regex: envoy
          source_labels:
          - __meta_kubernetes_pod_label_app_kubernetes_io_name
        - action: keep
          regex: envoy
          source_labels:
          - __meta_kubernetes_pod_container_name
        - action: keep
          regex: http-admin
          source_labels:
          - __meta_kubernetes_pod_container_port_name
        - action: keep
          regex: ${env:K8S_NODE_NAME}
          source_labels:
          - __meta_kubernetes_pod_node_name
        scheme: http
        scrape_interval: 1m
        scrape_timeout: 30s
      - enable_http2: true
        follow_redirects: true
        job_name: custom-metrics
        kubernetes_sd_configs:
        - namespaces:
            names:
            - custom-metrics
          role: pod
        metrics_path: /metrics
        relabel_configs:
        - action: keep
          regex: example
          source_labels:
          - __meta_kubernetes_pod_label_app_kubernetes_io_name
        - action: keep
          regex: ${env:K8S_NODE_NAME}
          source_labels:
          - __meta_kubernetes_pod_node_name
        - action: labelmap
          regex: __meta_kubernetes_pod_annotation_examplecompany_net_(.+)
        scheme: http
        scrape_interval: 1m
        scrape_timeout: 30s
      - enable_http2: true
        follow_redirects: true
        job_name: keda-operator
        kubernetes_sd_configs:
        - namespaces:
            names:
            - keda
          role: pod
        metrics_path: /metrics
        relabel_configs:
        - action: keep
          regex: keda-operator
          source_labels:
          - __meta_kubernetes_pod_label_app_kubernetes_io_name
        - action: keep
          regex: ${env:K8S_NODE_NAME}
          source_labels:
          - __meta_kubernetes_pod_node_name
        scheme: http
        scrape_interval: 1m
        scrape_timeout: 30s
      - enable_http2: true
        follow_redirects: true
        job_name: scrape-annotations
        kubernetes_sd_configs:
        - role: pod
        metrics_path: /metrics
        relabel_configs:
        - action: drop
          regex: (kube-system|nginx-ingress-internal|nginx-ingress-external)
          source_labels:
          - __meta_kubernetes_namespace
        - action: keep
          regex: ${env:K8S_NODE_NAME}
          source_labels:
          - __meta_kubernetes_pod_node_name
        - action: keep
          regex: "true"
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_scrape
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_path
          target_label: __metrics_path__
        - action: replace
          regex: ^([^:]+)(?::\d+)?;(\d+)$
          replacement: $1:$2
          source_labels:
          - __address__
          - __meta_kubernetes_pod_annotation_prometheus_io_port
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_annotation_examplecompany_net_(.+)
        scheme: http
        scrape_interval: 1m
        scrape_timeout: 30s
      - enable_http2: true
        follow_redirects: true
        job_name: signalfx-scrape-annotations
        kubernetes_sd_configs:
        - role: pod
        metrics_path: /metrics
        relabel_configs:
        - action: drop
          regex: kube-system
          source_labels:
          - __meta_kubernetes_namespace
        - action: keep
          regex: ${env:K8S_NODE_NAME}
          source_labels:
          - __meta_kubernetes_pod_node_name
        - action: keep
          regex: prometheus-exporter
          source_labels:
          - __meta_kubernetes_pod_annotation_agent_signalfx_com_monitorType_http
        - action: keep
          regex: "80"
          source_labels:
          - __meta_kubernetes_pod_container_port_number
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_agent_signalfx_com_config_http_metricPath
          target_label: __metrics_path__
        - action: labelmap
          regex: __meta_kubernetes_pod_annotation_examplecompany_net_(.+)
        scheme: http
        scrape_interval: 1m
        scrape_timeout: 30s
  prometheus/agent:
    config:
      scrape_configs:
      - job_name: metrics-agent
        scrape_interval: 1m
        static_configs:
        - targets:
          - ${env:MY_POD_IP}:8888
  zipkin:
    endpoint: ${env:MY_POD_IP}:9411
service:
  extensions:
  - health_check
  - zpages
  pipelines:
    logs:
      exporters:
      - splunk_hec/platform_logs
      processors:
      - memory_limiter
      - k8sattributes
      - filter/logs
      - resource/logs
      - resource
      - resource/add_environment
      - resourcedetection
      - batch
      receivers:
      - otlp
    metrics:
      exporters:
      - otlphttp
      processors:
      - memory_limiter
      - k8sattributes
      - filter/metrics
      - resource/add_environment
      - resourcedetection
      - resource/chrono-instance-id
      - resource
      - transform/add_service_and_role
      - batch
      receivers:
      - hostmetrics
      - kubeletstats
      - otlp
    metrics/agent:
      exporters:
      - otlphttp
      processors:
      - memory_limiter
      - resource/add_agent_k8s
      - resourcedetection
      - resource
      - batch
      receivers:
      - prometheus/agent
    metrics/prometheus:
      exporters:
      - otlphttp
      processors:
      - memory_limiter
      - transform/sum_histograms
      - k8sattributes/prometheus
      - filter/metrics
      - resource/add_environment
      - resourcedetection
      - resource/chrono-instance-id
      - resource
      - transform/add_service_and_role
      - batch
      receivers:
      - prometheus
    traces:
      exporters:
      - otlphttp
      processors:
      - k8sattributes
      - resource/add_environment
      - resourcedetection
      - resource
      - batch
      receivers:
      - otlp
      - jaeger
      - zipkin
  telemetry:
    logs:
      encoding: json
    metrics:
      address: ${env:MY_POD_IP}:8888

Log output

2024-07-08T14:16:37.042Z    error    scrape/scrape.go:1351    Scrape commit failed    {"kind": "receiver", "name": "prometheus/nginx", "data_type": "metrics", "scrape_pool": "ingress-nginx", "target": "http://10.103.4.10:10254/metrics", "error": "data refused due to high memory usage"}
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1
    github.com/prometheus/prometheus@v0.48.1/scrape/scrape.go:1351
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
    github.com/prometheus/prometheus@v0.48.1/scrape/scrape.go:1429
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
    github.com/prometheus/prometheus@v0.48.1/scrape/scrape.go:1306
2024-07-08T14:16:38.438Z    error    scrape/scrape.go:1351    Scrape commit failed    {"kind": "receiver", "name": "prometheus/stackdriver_exporter", "data_type": "metrics", "scrape_pool": "stackdriver-exporter", "target": "http://10.103.65.232:9255/metrics", "error": "data refused due to high memory usage"}
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1
    github.com/prometheus/prometheus@v0.48.1/scrape/scrape.go:1351
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
    github.com/prometheus/prometheus@v0.48.1/scrape/scrape.go:1429
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
    github.com/prometheus/prometheus@v0.48.1/scrape/scrape.go:1306
2024-07-08T14:16:40.026Z    info    memorylimiterprocessor@v0.92.0/memorylimiter.go:280    Memory usage back within limits. Resuming normal operation.    {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics/nginx", "cur_mem_mib": 768}
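The "data refused" errors and the "Memory usage back within limits" message are the memory_limiter processor cycling between its hard limit and recovery. For reference, a minimal sketch of the kind of memory_limiter configuration that produces this behavior (the limit values below are illustrative, not our actual settings):

```yaml
processors:
  memory_limiter:
    # how often the processor checks current memory usage
    check_interval: 1s
    # hard limit: above this, incoming data is refused until usage drops
    limit_mib: 800
    # soft limit is limit_mib - spike_limit_mib; crossing it forces a GC
    spike_limit_mib: 160
```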

Additional context

Although the metrics-agent is configured to receive logs, metrics, and traces over OTLP, it does not do so in practice at this time. None of our services emit OTLP metrics to the metrics-agent; they send only to the gateway deployment, which does not have this issue. On the metrics-agent, the OTLP ports aren't even exposed. It collects metric signals using the hostmetrics, kubeletstats, and prometheus receivers only.
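For anyone wanting to inspect the heap in place, a sketch of enabling the pprof extension alongside the extensions already in the config (assumes the contrib distribution, which bundles pprofextension; localhost:1777 is the extension's default endpoint):

```yaml
extensions:
  pprof:
    endpoint: localhost:1777
service:
  extensions:
  - health_check
  - zpages
  - pprof
```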

github-actions[bot] commented 4 days ago

Pinging code owners:

TylerHelmuth commented 2 days ago

@gord1anknot can you post pictures of the profile graphs? Something like https://pprof.me/ is an easy way.

gord1anknot commented 2 days ago

Certainly. A flame graph felt like it would be less useful, so I made this graph of in-use memory space: (image)

Here is in-use objects: (image)

This specific pod I pulled the pprof from hasn't reached maximum memory (yet), but it's at 90% and climbing.
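A sketch of how profiles like these can be captured and compared, assuming the pprof extension is enabled on its default endpoint and using a placeholder pod name (this requires a live cluster, so treat it as an ops recipe rather than a runnable script):

```shell
# forward the collector's pprof port from the affected pod (pod name is a placeholder)
kubectl port-forward pod/metrics-agent-xxxxx 1777:1777 &

# snapshot the heap now, and again after memory has grown
curl -s http://localhost:1777/debug/pprof/heap -o heap-before.pb.gz
sleep 3600  # wait while the leak accumulates
curl -s http://localhost:1777/debug/pprof/heap -o heap-after.pb.gz

# diff the two snapshots to see which allocation sites grow over time
go tool pprof -inuse_space -base heap-before.pb.gz heap-after.pb.gz
```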