open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector
Apache License 2.0

Target Allocator throwing "Node name is missing from the spec. Restarting watch routine" with high CPU/Memory #2916

Closed: diranged closed this issue 5 months ago

diranged commented 5 months ago

Component(s)

target allocator

What happened?

(moved from https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/32747, where I opened this in the wrong place)

Description

We are looking into using OTel to replace our current Prometheus scraping-based system. The plan is to run OTel Collectors as a DaemonSet across the cluster and use a TargetAllocator in per-node mode to pick up all the existing ServiceMonitor/PodMonitor objects and hand out the scrape configs and endpoints.

We had this running on a test cluster with ~8 nodes and it worked fine: the TargetAllocator used ~128Mi of memory and virtually zero CPU, and the configurations it handed out seemed correct. However, as soon as we spun this up on a "small" but "real" cluster (~15 nodes, a few workloads), the target allocator pods went into a painful loop and used a ton of CPU and memory:

[Screenshots: target allocator CPU and memory usage]

When we look at the logs, the pods are in a loop spewing thousands of lines over and over again like this:

{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}

All of our clusters are generally configured the same: different workloads, but the same kinds of controllers, Kubernetes versions, node OSes, etc.

What can I look for to better troubleshoot what might be wrong here?

Steps to Reproduce

Expected Result

We obviously don't expect the TargetAllocator pods to have this loop or be using those kinds of resources on a small cluster.

Kubernetes Version

1.28

Operator version

0.98.0

Collector version

0.98.0

Environment information

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

Log output

```shell
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Starting the Target Allocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Starting server..."}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Waiting for caches to sync for namespace"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Caches are synced for namespace"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Waiting for caches to sync for servicemonitors"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Caches are synced for servicemonitors"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Waiting for caches to sync for podmonitors"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Caches are synced for podmonitors"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
...
```

Additional context

OpenTelemetry Collector configuration

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector-agent
spec:
  args:
    feature-gates: +processor.resourcedetection.hostCPUSteppingAsString
  config: |-
    exporters:
      debug:
        sampling_initial: 15
        sampling_thereafter: 60
      debug/verbose:
        sampling_initial: 15
        sampling_thereafter: 60
        verbosity: detailed
      otlp/metrics:
        endpoint: 'otel-collector-metrics-collector:4317'
        tls:
          ca_file: /tls/ca.crt
          cert_file: /tls/tls.crt
          key_file: /tls/tls.key
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      pprof:
        endpoint: :1777
    processors:
      k8sattributes: ...
    receivers:
      hostmetrics:
        collection_interval: 10s
        root_path: /hostfs
        scrapers:
          cpu:
            metrics:
              system.cpu.frequency:
                enabled: true
              system.cpu.logical.count:
                enabled: true
              system.cpu.physical.count:
                enabled: true
              system.cpu.utilization:
                enabled: true
          disk: {}
          filesystem:
            exclude_fs_types:
              fs_types:
                - autofs
                - binfmt_misc
                - bpf
                - cgroup2
                - configfs
                - debugfs
                - devpts
                - fusectl
                - hugetlbfs
                - iso9660
                - mqueue
                - nsfs
                - proc
                - procfs
                - pstore
                - rpc_pipefs
                - securityfs
                - selinuxfs
                - squashfs
                - sysfs
                - tracefs
              match_type: strict
            exclude_mount_points:
              match_type: regexp
              mount_points:
                - /dev/*
                - /proc/*
                - /sys/*
                - /run/k3s/containerd/*
                - /var/lib/docker/*
                - /var/lib/kubelet/*
                - /snap/*
            metrics:
              system.filesystem.utilization:
                enabled: true
          load: {}
          memory:
            metrics:
              system.memory.utilization:
                enabled: true
          network:
            metrics:
              system.network.conntrack.count:
                enabled: true
              system.network.conntrack.max:
                enabled: true
          paging:
            metrics:
              system.paging.utilization:
                enabled: true
          processes: {}
      kubeletstats:
        auth_type: serviceAccount
        collection_interval: 15s
        endpoint: https://${env:KUBE_NODE_NAME}:10250
        insecure_skip_verify: true
      prometheus:
        config:
          scrape_configs:
            - job_name: otel-collector
        report_extra_scrape_metrics: true
    service:
      extensions:
        - health_check
        - pprof
      pipelines: ...
      telemetry:
        logs:
          level: 'info'
  deploymentUpdateStrategy: {}
  env:
    - name: KUBE_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: OTEL_RESOURCE_ATTRIBUTES
      value: node.name=$(KUBE_NODE_NAME)
  image: otel/opentelemetry-collector-contrib:0.98.0
  ingress:
    route: {}
  livenessProbe:
    failureThreshold: 3
    initialDelaySeconds: 60
    periodSeconds: 30
  managementState: managed
  mode: daemonset
  nodeSelector:
    kubernetes.io/os: linux
  observability:
    metrics:
      enableMetrics: true
  podDisruptionBudget:
    maxUnavailable: 1
  podSecurityContext:
    runAsGroup: 0
    runAsUser: 0
  priorityClassName: otel-collector
  replicas: 1
  resources:
    limits:
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 192Mi
  securityContext:
    capabilities:
      add:
        - SYS_PTRACE
  shareProcessNamespace: true
  targetAllocator:
    allocationStrategy: per-node
    enabled: true
    filterStrategy: relabel-config
    image: otel/target-allocator:0.98.0
    observability:
      metrics: {}
    prometheusCR:
      enabled: true
      scrapeInterval: 1m0s
    replicas: 2
    resources:
      limits:
        memory: 4Gi
      requests:
        cpu: 100m
        memory: 2Gi
  tolerations:
    - effect: NoSchedule
      operator: Exists
    - effect: NoExecute
      operator: Exists
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 10%
    type: RollingUpdate
  upgradeStrategy: automatic
  volumeMounts:
    - mountPath: /hostfs
      mountPropagation: HostToContainer
      name: hostfs
      readOnly: true
    - mountPath: /etc/passwd
      name: etc-passwd
      readOnly: true
    - mountPath: /hostfs/var/cache
      name: host-var-cache
    - mountPath: /hostfs/run
      name: host-run
    - mountPath: /tls/ca.crt
      name: tls-ca
      readOnly: true
      subPath: ca.crt
    - mountPath: /tls/tls.key
      name: tls
      readOnly: true
      subPath: tls.key
    - mountPath: /tls/tls.crt
      name: tls
      readOnly: true
      subPath: tls.crt
  volumes:
    - hostPath:
        path: /
      name: hostfs
    - hostPath:
        path: /etc/passwd
      name: etc-passwd
    - hostPath:
        path: /run
        type: DirectoryOrCreate
      name: host-run
    - hostPath:
        path: /var/cache
        type: DirectoryOrCreate
      name: host-var-cache
    - name: tls-ca
      secret:
        defaultMode: 420
        items:
          - key: ca.crt
            path: ca.crt
        secretName: otel-collector-issuer
    - name: tls
      secret:
        defaultMode: 420
        items:
          - key: tls.crt
            path: tls.crt
          - key: tls.key
            path: tls.key
        secretName: otel-collector-agent
```
jaronoff97 commented 5 months ago

Thanks for moving this over :D Per my comment, I'm wondering if there's some type of leak in the node strategy causing this high usage. This is certainly MUCH higher than I would anticipate. Are you able to take a profile to share? If not, I can attempt to repro, but my backlog is pretty huge right now.

jaronoff97 commented 5 months ago

my bet is that restarting the watch routine is causing a ton of memory churn...

diranged commented 5 months ago

I am happy to get a profile if you can tell me how to do that.

diranged commented 5 months ago

@jaronoff97, just looking at the code that throws the error, I don't understand what its point is... every cluster is going to have Pods where .spec.nodeName isn't filled in (pods that are Pending, for example). It feels like the operator is giving up and "trying again" when it sees that. Am I reading this wrong?

https://github.com/open-telemetry/opentelemetry-operator/blob/48bc207fa520874491b599815e8d6eb463e92314/cmd/otel-allocator/collector/collector.go#L134-L137
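For illustration, here is a minimal, self-contained Go sketch of the failure mode being described; this is not the operator's actual code, and the names (`watchPods`, `watchOnce`) are made up. The point is just that if the routine bails out and gets restarted every time it sees a pod without `spec.nodeName`, but the previous watch is never stopped, every restart leaves a goroutine (and its watch) alive.

```go
// Minimal sketch of the suspected failure mode (illustrative only, not the
// operator's code): every "restart" of the watch routine abandons the previous
// watch without stopping it, so goroutines pile up.
package main

import (
	"fmt"
	"runtime"
	"time"
)

// watchPods stands in for a Kubernetes pod watch: it streams events until stop is closed.
func watchPods(stop <-chan struct{}) <-chan string {
	events := make(chan string)
	go func() {
		defer close(events)
		for {
			select {
			case <-stop:
				return
			case events <- "pod-with-empty-nodeName": // e.g. a Pending pod
			}
		}
	}()
	return events
}

// watchOnce mimics "Node name is missing from the spec. Restarting watch routine":
// it gives up on the first pod without a node name and returns so the caller can
// "restart" it. The bug: stop is never closed, so the watchPods goroutine stays
// blocked on its next send forever.
func watchOnce() {
	stop := make(chan struct{}) // never closed before returning
	for ev := range watchPods(stop) {
		if ev == "pod-with-empty-nodeName" {
			return // restart requested; the previous watch is leaked
		}
	}
}

func main() {
	before := runtime.NumGoroutine()
	for i := 0; i < 10000; i++ { // thousands of restarts, as in the log spam
		watchOnce()
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Printf("goroutines: %d before, %d after\n", before, runtime.NumGoroutine())
}
```

If something like this is happening, a restart loop running thousands of times per second would be consistent with the CPU and memory growth shown above.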

jaronoff97 commented 5 months ago

I don't think we should be restarting the watch routine as that feels unnecessary, but also we can't allocate a target for a pod that we know cannot be scraped yet.

You can follow the steps under the "Collecting and Analyzing Profiles" header in this doc, where the port is whatever port your TA pod has exposed.
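If it helps, here is a small Go sketch for grabbing those profiles over HTTP, assuming the target allocator serves Go's standard /debug/pprof endpoints on the port you have exposed (port-forward to the TA pod first); the base URL, port, and file names below are assumptions for illustration, not values confirmed in this thread.

```go
// Small helper sketch for pulling pprof profiles from the target allocator.
// Assumption: the TA serves Go's standard /debug/pprof endpoints on the port
// you have exposed, reachable here as localhost:8080 after a port-forward.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// fetch downloads one profile URL and writes the response body to outFile.
func fetch(url, outFile string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("GET %s: %s", url, resp.Status)
	}
	f, err := os.Create(outFile)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	base := "http://localhost:8080/debug/pprof" // assumed port-forwarded TA port
	profiles := map[string]string{
		base + "/heap":               "heap.pb.gz",      // in-use memory
		base + "/profile?seconds=30": "cpu.pb.gz",       // 30s CPU profile
		base + "/goroutine":          "goroutine.pb.gz", // useful if goroutines are leaking
	}
	for url, out := range profiles {
		if err := fetch(url, out); err != nil {
			fmt.Fprintln(os.Stderr, "failed:", err)
			continue
		}
		fmt.Printf("wrote %s (inspect with: go tool pprof %s)\n", out, out)
	}
}
```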

diranged commented 5 months ago

For others' info: I have sent @jaronoff97 CPU and memory profiles on Slack.

swiatekm commented 5 months ago

> my bet is that restarting the watch routine is causing a ton of memory churn...

I don't think we ever close this watcher, actually. So if it keeps getting restarted for some reason, we get a memory/goroutine leak.

https://github.com/open-telemetry/opentelemetry-operator/pull/2528 should fix this by using standard informers.
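For context, "using standard informers" roughly means letting client-go's shared informer own the list/watch lifecycle, which re-lists and re-watches internally, instead of hand-rolling a watch and restarting it. Below is a hedged sketch of that pattern; the namespace, label selector, and resync period are illustrative placeholders, not the operator's exact values.

```go
// Rough sketch of the informer-based approach (illustrative names and values):
// a shared informer owns the list/watch lifecycle and re-establishes watches
// itself, so there is nothing to restart by hand when a pod has no node yet.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Local kubeconfig for the sketch; the allocator itself would use in-cluster config.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Namespace and label selector are placeholders for "the collector pods".
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		30*time.Second,
		informers.WithNamespace("observability"),
		informers.WithTweakListOptions(func(o *metav1.ListOptions) {
			o.LabelSelector = "app.kubernetes.io/component=opentelemetry-collector"
		}),
	)

	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			if pod.Spec.NodeName == "" {
				// Pending pod: just skip it. The informer delivers an update
				// once the pod is scheduled; nothing needs to be restarted.
				return
			}
			fmt.Printf("collector %s is on node %s\n", pod.Name, pod.Spec.NodeName)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop) // re-lists and watch restarts are handled internally
	cache.WaitForCacheSync(stop, podInformer.HasSynced)
	time.Sleep(time.Minute) // observe events for a bit, then exit
}
```

A real handler would also register UpdateFunc/DeleteFunc; the key design point is that Pending pods are simply ignored until the informer reports them as scheduled, with no manual watch teardown or restart.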

jaronoff97 commented 5 months ago

Merged ^^ I'm hoping that helps out! Please let me know if it doesn't.

diranged commented 5 months ago

@jaronoff97 Awesome! Do you guys push a 'main' or 'latest' build to Docker? I can try that out and let you know if it helps.

jaronoff97 commented 5 months ago

yep, we do push a main :)

diranged commented 5 months ago

Ok - Good news / Bad news...

Good news: I am not seeing the problem anymore with the new release... it seems to work.
Bad news: Whatever state our cluster or pods were in that was triggering the bug seems to be resolved now; I ran the old release (0.98.0) first and it also did not exhibit the bug.

I'm doing some cluster rolls to see if I can trigger the behavior... otherwise I'll just have to test by moving to larger clusters.

diranged commented 5 months ago

Ok... more good news: I was able to go back and reproduce the problem with the old code just by rolling the cluster (which puts pods into a Pending state). Rolling to the newest codebase immediately fixes it.
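For anyone else trying to reproduce this, the triggering condition appears to be pods that have not been scheduled yet (empty spec.nodeName), which a cluster roll produces in bulk. A small client-go sketch for spotting them during a roll; the "observability" namespace is an illustrative placeholder.

```go
// Sketch for confirming the triggering condition during a cluster roll:
// pods whose spec.nodeName is still empty (i.e. not yet bound to a node).
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// spec.nodeName is a supported pod field selector; an empty value matches
	// pods that have not been scheduled onto a node.
	pods, err := client.CoreV1().Pods("observability").List(context.Background(),
		metav1.ListOptions{FieldSelector: "spec.nodeName="})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("unscheduled: %s/%s (phase %s)\n", p.Namespace, p.Name, p.Status.Phase)
	}
	fmt.Printf("%d pod(s) without spec.nodeName\n", len(pods.Items))
}
```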