thanks for moving this over :D per my comment, I'm wondering if there's some type of leak in the node strategy causing this high usage. This is certainly MUCH higher than I would anticipate. Are you able to take a profile to share? If not, I can attempt to repro, but my backlog is pretty huge rn.
my bet is that restarting the watch routine is causing a ton of memory churn...
I am happy to get a profile, if you can tell me how to do that.
@jaronoff97,
Just looking at the code that throws the error, I don't understand what its point is... every cluster is going to have Pods where .spec.nodeName isn't filled in, pods that are Pending, for example. It feels like the operator is giving up and "trying again" when it sees that. Am I reading this wrong?
I don't think we should be restarting the watch routine as that feels unnecessary, but also we can't allocate a target for a pod that we know cannot be scraped yet.
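As a minimal sketch of what that means in practice (hypothetical code, not the operator's actual implementation; the function name is made up): a per-node strategy can simply skip pods whose spec.nodeName is empty and pick them up again on a later update event once they are scheduled, rather than treating the condition as an error worth restarting anything for.

```go
// Hypothetical sketch, not the operator's real code: skip pods that have
// not been scheduled yet instead of treating them as errors.
package allocsketch

import corev1 "k8s.io/api/core/v1"

// nodeForPod returns the node a pod's targets belong to, and false when the
// pod is still Pending (spec.nodeName is empty) and cannot be scraped yet.
// A later update event will carry the assigned node name.
func nodeForPod(pod *corev1.Pod) (string, bool) {
	if pod.Spec.NodeName == "" {
		return "", false
	}
	return pod.Spec.NodeName, true
}
```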
You can follow the steps under the "Collecting and Analyzing Profiles" header in this doc, where the port is whatever port your TA pod has exposed.
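If it helps, here is a rough, self-contained sketch of pulling the profiles once the pod's port has been forwarded locally. It assumes the standard Go net/http/pprof paths under /debug/pprof as described in the linked doc, and uses localhost:8080 purely as a placeholder port; the saved files can then be inspected with `go tool pprof`.

```go
// Rough sketch: download heap and CPU profiles from a target allocator whose
// pprof port has been forwarded to localhost:8080 (placeholder port).
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

// save fetches a URL and writes the response body to a local file.
func save(url, file string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	out, err := os.Create(file)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, resp.Body)
	return err
}

func main() {
	base := "http://localhost:8080/debug/pprof"
	// Heap (memory) profile, then a 30-second CPU profile.
	if err := save(base+"/heap", "heap.pprof"); err != nil {
		log.Fatal(err)
	}
	if err := save(base+"/profile?seconds=30", "cpu.pprof"); err != nil {
		log.Fatal(err)
	}
}
```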
For others' info: I have sent @jaronoff97 CPU and memory profiles on Slack.
> my bet is that restarting the watch routine is causing a ton of memory churn...
I don't think we ever close this watcher, actually. So if it keeps getting restarted for some reason, we get a memory/goroutine leak.
https://github.com/open-telemetry/opentelemetry-operator/pull/2528 should fix this by using standard informers.
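To make the leak mechanism concrete, here is an illustrative sketch (not the operator's or the PR's actual code; the client and handler names are placeholders) of a raw watch that gets restarted without ever being stopped, next to the shared-informer pattern client-go provides:

```go
// Illustrative sketch only; not the operator's actual code. Shows why a
// restarted raw watch can leak goroutines, and the informer-based pattern
// that avoids the problem.
package watchsketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// leakyWatch restarts the watch on every error or channel close. If the old
// watch is never Stop()ed, each restart can leave a goroutine (and its
// buffers) behind, which matches the memory churn described above.
func leakyWatch(ctx context.Context, client kubernetes.Interface) {
	for {
		w, err := client.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{})
		if err != nil {
			continue // restart immediately; no backoff, no cleanup
		}
		for range w.ResultChan() {
			// handle events; on any failure we fall out and restart
			// without ever calling w.Stop()
		}
	}
}

// informerBased uses a standard shared informer, which handles reconnects,
// resyncs, and caching internally, so nothing has to be restarted by hand.
func informerBased(ctx context.Context, client kubernetes.Interface) {
	factory := informers.NewSharedInformerFactory(client, 5*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { /* allocate targets */ },
		UpdateFunc: func(oldObj, newObj interface{}) { /* re-allocate */ },
		DeleteFunc: func(obj interface{}) { /* drop targets */ },
	})
	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())
}
```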
merged ^^ I'm hoping that helps out! Please let me know if it doesn't
@jaronoff97 Awesome! Do you guys push a 'main' or 'latest' build to Docker? I can try that out and let you know if it helps.
yep, we do push a main :)
Ok - Good news / Bad news...
Good News: I am not seeing the problem anymore with the new release... it seems to work.
Bad News: Whatever state our cluster or pods were in that was triggering the bug seems to be resolved now - I ran the old release (0.98.0) first and it also did not exhibit the bug.
I'm doing some cluster rolls to see if I can trigger the behavior... otherwise I'll just have to test by moving to larger clusters.
Ok... more good news: I was able to go back and reproduce the problem with the old code just by rolling the cluster (which puts pods into a Pending state). Rolling to the newest codebase immediately fixes it.
Component(s)
target allocator
What happened?
(moved from https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/32747, where I opened this in the wrong place)
Description
We are looking into using OTEL to replace our current Prometheus "scraping"-based system. The desire is to run OTEL Collectors in a DaemonSet across the cluster, and use a TargetAllocator in per-node mode to pick up all the existing ServiceMonitor/PodMonitor objects and pass out the configs and endpoints.

We had this running on a test cluster with ~8 nodes and it worked fine. We saw the TargetAllocator use ~128Mi of memory and virtually zero CPU, and the configurations it passed out seemed correct. However, as soon as we spun this up on a "small" but "real" cluster (~15 nodes, a few workloads), we see the targetallocator pods go into a painful loop and use a ton of CPU and memory.

When we look at the logs, the pods are in a loop spewing thousands of lines over and over again like this:
All of our clusters are generally configured the same: different workloads, but the same kinds of controllers, Kubernetes versions, node OSes, etc.
What can I look for to better troubleshoot what might be wrong here?
Steps to Reproduce