open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

AWS CloudWatch logs for Container Insights contain no CPU usage metrics when setting collection_interval to more than 300s #36109

Open oleksandr-san opened 3 weeks ago

oleksandr-san commented 3 weeks ago

Component(s)

receiver/awscontainerinsight

What happened?

Description

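We tried increasing the collection_interval parameter of the receivers.awscontainerinsightreceiver component to reduce AWS CloudWatch costs. With the interval set above 300s, the log events sent to CloudWatch no longer contain CPU usage metrics.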

I traced this to the TTL of the map used to store metric deltas: when the collection interval is longer than 5 minutes, delta calculation breaks because the previous values are removed from the map before the next values arrive.

Increasing cleanInterval to 15 minutes works around the problem.

Steps to Reproduce

  1. Create any EKS cluster
  2. Install OTEL to collect AWS Container Insights
  3. Set receivers.awscontainerinsightreceiver.collection_interval to 600s
  4. Restart the daemonset
  5. Wait for 15-20 minutes

Expected Result

Log events in CloudWatch contain CPU usage metrics

Actual Result

Log events in CloudWatch do not contain CPU usage metrics

Collector version

0.41.1

Environment information

Environment

OS: (e.g., "Ubuntu 20.04") Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

extensions:
  health_check:

receivers:
  awscontainerinsightreceiver:
    collection_interval: 600s

processors:
  batch/metrics:
    timeout: 60s

exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    log_stream_name: '{NodeName}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources, kubernetes]
    metric_declarations:
      # cluster metrics
      - dimensions: [[ClusterName]]
        metric_name_selectors:
          - cluster_node_count
          - cluster_failed_node_count

service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      processors: [batch/metrics]
      exporters: [awsemf]

  extensions: [health_check]

Log output

No response

Additional context

Log event with collection_interval == 600s:

{
    "AutoScalingGroupName": "eks-agent-ng-arm64-4ac815a7-3a71-20b4-a604-aa35acfabcd4",
    "ClusterName": "cluster-with-agent",
    "InstanceId": "i-019f99ea685e48c83",
    "InstanceType": "t4g.medium",
    "Namespace": "kube-system",
    "NodeName": "ip-172-31-28-91.eu-north-1.compute.internal",
    "PodName": "aws-node",
    "Sources": [
        "cadvisor",
        "pod",
        "calculated"
    ],
    "Timestamp": "1730302312567",
    "Type": "Container",
    "Version": "0",
    "container_memory_cache": 106377216,
    "container_memory_failcnt": 0,
    "container_memory_mapped_file": 811008,
    "container_memory_max_usage": 160075776,
    "container_memory_rss": 28655616,
    "container_memory_swap": 0,
    "container_memory_usage": 136433664,
    "container_memory_utilization": 1.1755803143695827,
    "container_memory_working_set": 47341568,
    "container_status": "Running",
    "kubernetes": {
        "container_name": "aws-node",
        "containerd": {
            "container_id": "aabb7c4bea02cfe72371bb5a36bbcd23eff478078c6e920b77e1e9e0ade591b9"
        },
        "host": "ip-172-31-28-91.eu-north-1.compute.internal",
        "labels": {
            "app.kubernetes.io/instance": "aws-vpc-cni",
            "app.kubernetes.io/name": "aws-node",
            "controller-revision-hash": "588469c5c6",
            "k8s-app": "aws-node",
            "pod-template-generation": "2"
        },
        "namespace_name": "kube-system",
        "pod_id": "c3476737-e9d4-44cb-a20f-dcb812ac9091",
        "pod_name": "aws-node-wghkn",
        "pod_owners": [
            {
                "owner_kind": "DaemonSet",
                "owner_name": "aws-node"
            }
        ]
    },
    "number_of_container_restarts": 0
}

Log event with the default configuration:

{
    "AutoScalingGroupName": "eks-agent-ng-1ac79c42-2aa5-ff45-0c1e-b03d703c0d47",
    "ClusterName": "cluster-with-agent",
    "InstanceId": "i-0becbf3535f001cb4",
    "InstanceType": "t3.medium",
    "Namespace": "kube-system",
    "NodeName": "ip-172-31-25-41.eu-north-1.compute.internal",
    "PodName": "aws-node",
    "Sources": [
        "cadvisor",
        "pod",
        "calculated"
    ],
    "Timestamp": "1730371819323",
    "Type": "Container",
    "Version": "0",
    "container_cpu_request": 25,
    "container_cpu_usage_system": 1.3264307613654849,
    "container_cpu_usage_total": 2.9252373450029627,
    "container_cpu_usage_user": 1.393591812573864,
    "container_cpu_utilization": 0.14626186725014814,
    "container_memory_cache": 24600576,
    "container_memory_failcnt": 0,
    "container_memory_hierarchical_pgfault": 267.61999880258816,
    "container_memory_hierarchical_pgmajfault": 0,
    "container_memory_mapped_file": 270336,
    "container_memory_max_usage": 56954880,
    "container_memory_pgfault": 267.61999880258816,
    "container_memory_pgmajfault": 0,
    "container_memory_rss": 26337280,
    "container_memory_swap": 0,
    "container_memory_usage": 52269056,
    "container_memory_utilization": 1.1655047122298874,
    "container_memory_working_set": 47063040,
    "container_status": "Running",
    "kubernetes": {
        "container_name": "aws-node",
        "containerd": {
            "container_id": "b038c0f909602224fa9e1b1351379ff2dc48d0de3e96f720ed80316ada28aca2"
        },
        "host": "ip-172-31-25-41.eu-north-1.compute.internal",
        "labels": {
            "app.kubernetes.io/instance": "aws-vpc-cni",
            "app.kubernetes.io/name": "aws-node",
            "controller-revision-hash": "588469c5c6",
            "k8s-app": "aws-node",
            "pod-template-generation": "2"
        },
        "namespace_name": "kube-system",
        "pod_id": "5e453328-d24c-45d8-9451-7274248cd447",
        "pod_name": "aws-node-wt85g",
        "pod_owners": [
            {
                "owner_kind": "DaemonSet",
                "owner_name": "aws-node"
            }
        ]
    },
    "number_of_container_restarts": 0
}

github-actions[bot] commented 3 weeks ago

Pinging code owners: