open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector

Target Allocator throwing "Node name is missing from the spec. Restarting watch routine" with high CPU/Memory #32747

Closed: diranged closed this issue 5 months ago

diranged commented 5 months ago

Component(s)

cmd/otelcontribcol

What happened?

Description

We are looking into using OTel to replace our current Prometheus scraping-based system. The plan is to run OTel Collectors as a DaemonSet across the cluster and use a TargetAllocator in per-node mode to pick up all of the existing ServiceMonitor/PodMonitor objects and hand out the scrape configs and endpoints.

We had this running on a test cluster with ~8 nodes and it worked fine. We saw the TargetAllocator use ~128Mi of memory and virtually zero CPU, and the configurations it handed out seemed correct. However, as soon as we spun this up on a "small" but "real" cluster (~15 nodes, a few workloads), the TargetAllocator pods went into a painful loop and started using a ton of CPU and memory:

[Screenshots: CPU and memory usage graphs for the TargetAllocator pods]

When we look at the logs, the pods are in a loop spewing thousands of lines over and over again like this:

{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}

All of our clusters are generally configured the same: different workloads, but the same kinds of controllers, Kubernetes versions, node OSes, and so on.

What can I look for to better troubleshoot what might be wrong here?
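One concrete thing to check is whether any collector pods are currently unscheduled, since an unscheduled (Pending) pod is exactly a pod with no spec.nodeName. Below is a minimal client-go sketch of that check; the kubeconfig path and the "observability" namespace are placeholder assumptions, not values from this issue:

```go
// Sketch: list pods in the collector namespace that have no spec.nodeName yet.
// In per-node allocation the target allocator keys its work off that field, so
// any collector pod stuck in Pending is worth finding.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (assumption: run from a workstation).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// "observability" is a placeholder; use the namespace the collector runs in.
	pods, err := cs.CoreV1().Pods("observability").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range pods.Items {
		if p.Spec.NodeName == "" {
			// These are the pods the watcher would see without a node assignment.
			fmt.Printf("%s\tphase=%s\tnodeName=<empty>\n", p.Name, p.Status.Phase)
		}
	}
}
```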

Steps to Reproduce

Expected Result

We obviously don't expect the TargetAllocator pods to have this loop or be using those kinds of resources on a small cluster.

Collector version

0.98.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector-agent
spec:
  args:
    feature-gates: +processor.resourcedetection.hostCPUSteppingAsString
  config: |-
    exporters:
      debug:
        sampling_initial: 15
        sampling_thereafter: 60
      debug/verbose:
        sampling_initial: 15
        sampling_thereafter: 60
        verbosity: detailed
      otlp/metrics:
        endpoint: 'otel-collector-metrics-collector:4317'
        tls:
          ca_file: /tls/ca.crt
          cert_file: /tls/tls.crt
          key_file: /tls/tls.key
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      pprof:
        endpoint: :1777
    processors:
      k8sattributes: ...
    receivers:
      hostmetrics:
        collection_interval: 10s
        root_path: /hostfs
        scrapers:
          cpu:
            metrics:
              system.cpu.frequency:
                enabled: true
              system.cpu.logical.count:
                enabled: true
              system.cpu.physical.count:
                enabled: true
              system.cpu.utilization:
                enabled: true
          disk: {}
          filesystem:
            exclude_fs_types:
              fs_types:
                - autofs
                - binfmt_misc
                - bpf
                - cgroup2
                - configfs
                - debugfs
                - devpts
                - fusectl
                - hugetlbfs
                - iso9660
                - mqueue
                - nsfs
                - proc
                - procfs
                - pstore
                - rpc_pipefs
                - securityfs
                - selinuxfs
                - squashfs
                - sysfs
                - tracefs
              match_type: strict
            exclude_mount_points:
              match_type: regexp
              mount_points:
                - /dev/*
                - /proc/*
                - /sys/*
                - /run/k3s/containerd/*
                - /var/lib/docker/*
                - /var/lib/kubelet/*
                - /snap/*
            metrics:
              system.filesystem.utilization:
                enabled: true
          load: {}
          memory:
            metrics:
              system.memory.utilization:
                enabled: true
          network:
            metrics:
              system.network.conntrack.count:
                enabled: true
              system.network.conntrack.max:
                enabled: true
          paging:
            metrics:
              system.paging.utilization:
                enabled: true
          processes: {}
      kubeletstats:
        auth_type: serviceAccount
        collection_interval: 15s
        endpoint: https://${env:KUBE_NODE_NAME}:10250
        insecure_skip_verify: true
      prometheus:
        config:
          scrape_configs:
            - job_name: otel-collector
        report_extra_scrape_metrics: true
    service:
      extensions:
        - health_check
        - pprof
      pipelines: ...
      telemetry:
        logs:
          level: 'info'
  deploymentUpdateStrategy: {}
  env:
    - name: KUBE_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: OTEL_RESOURCE_ATTRIBUTES
      value: node.name=$(KUBE_NODE_NAME)
  image: otel/opentelemetry-collector-contrib:0.98.0
  ingress:
    route: {}
  livenessProbe:
    failureThreshold: 3
    initialDelaySeconds: 60
    periodSeconds: 30
  managementState: managed
  mode: daemonset
  nodeSelector:
    kubernetes.io/os: linux
  observability:
    metrics:
      enableMetrics: true
  podDisruptionBudget:
    maxUnavailable: 1
  podSecurityContext:
    runAsGroup: 0
    runAsUser: 0
  priorityClassName: otel-collector
  replicas: 1
  resources:
    limits:
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 192Mi
  securityContext:
    capabilities:
      add:
        - SYS_PTRACE
  shareProcessNamespace: true
  targetAllocator:
    allocationStrategy: per-node
    enabled: true
    filterStrategy: relabel-config
    image: otel/target-allocator:0.98.0
    observability:
      metrics: {}
    prometheusCR:
      enabled: true
      scrapeInterval: 1m0s
    replicas: 2
    resources:
      limits:
        memory: 4Gi
      requests:
        cpu: 100m
        memory: 2Gi
  tolerations:
    - effect: NoSchedule
      operator: Exists
    - effect: NoExecute
      operator: Exists
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 10%
    type: RollingUpdate
  upgradeStrategy: automatic
  volumeMounts:
    - mountPath: /hostfs
      mountPropagation: HostToContainer
      name: hostfs
      readOnly: true
    - mountPath: /etc/passwd
      name: etc-passwd
      readOnly: true
    - mountPath: /hostfs/var/cache
      name: host-var-cache
    - mountPath: /hostfs/run
      name: host-run
    - mountPath: /tls/ca.crt
      name: tls-ca
      readOnly: true
      subPath: ca.crt
    - mountPath: /tls/tls.key
      name: tls
      readOnly: true
      subPath: tls.key
    - mountPath: /tls/tls.crt
      name: tls
      readOnly: true
      subPath: tls.crt
  volumes:
    - hostPath:
        path: /
      name: hostfs
    - hostPath:
        path: /etc/passwd
      name: etc-passwd
    - hostPath:
        path: /run
        type: DirectoryOrCreate
      name: host-run
    - hostPath:
        path: /var/cache
        type: DirectoryOrCreate
      name: host-var-cache
    - name: tls-ca
      secret:
        defaultMode: 420
        items:
          - key: ca.crt
            path: ca.crt
        secretName: otel-collector-issuer
    - name: tls
      secret:
        defaultMode: 420
        items:
          - key: tls.crt
            path: tls.crt
          - key: tls.key
            path: tls.key
        secretName: otel-collector-agent
```

Log output

```shell
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Starting the Target Allocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Starting server..."}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Waiting for caches to sync for namespace"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Caches are synced for namespace"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Waiting for caches to sync for servicemonitors"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Caches are synced for servicemonitors"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Waiting for caches to sync for podmonitors"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Caches are synced for podmonitors"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
...
```

Additional context

_No response_
jaronoff97 commented 5 months ago

We should probably transfer this issue to the operator group cc @TylerHelmuth

Have you tried using a different allocation strategy for now? I'm wondering if this is related to the node strategy @matej-g

diranged commented 5 months ago

@jaronoff97, first: darn, I totally meant to put this into the operator repo ... I will move the issue and close this out. Second, we can't use any allocation strategy other than per-node, because our entire goal is to run the collectors as a DaemonSet and have them collect metrics only from their local nodes.

diranged commented 5 months ago

New issue opened at https://github.com/open-telemetry/opentelemetry-operator/issues/2916; closing this one out.