**Closed** — diranged closed this issue 5 months ago
We should probably transfer this issue to the operator group cc @TylerHelmuth
Have you tried using a different allocation strategy for now? I'm wondering if this is related to the node strategy @matej-g
@jaronoff97,
First - darn, I totally meant to put this into the operator repo ... I will move the issue and close this out. Second, we can't use any allocation strategy other than `per-node`,
because our entire goal is to run the collectors as a DaemonSet and have each one collect metrics only from its local node.
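For context, a minimal sketch of the kind of `OpenTelemetryCollector` resource this setup implies (field names follow the opentelemetry-operator CRD; the resource name and the receiver/exporter details are illustrative placeholders, not our exact config):

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: node-collector            # placeholder name
spec:
  mode: daemonset                 # one collector pod per node
  targetAllocator:
    enabled: true
    allocationStrategy: per-node  # each collector only receives targets on its own node
    prometheusCR:
      enabled: true               # discover existing ServiceMonitor/PodMonitor objects
  config:
    receivers:
      prometheus:
        config:
          scrape_configs: []      # filled in by the target allocator at runtime
    exporters:
      debug: {}
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [debug]
```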
New issue opened at https://github.com/open-telemetry/opentelemetry-operator/issues/2916; closing this one out.
Component(s)
cmd/otelcontribcol
What happened?
Description
We are looking into using OTEL to replace our current Prometheus "scraping"-based system. The desire is to run OTEL Collectors in a DaemonSet across the cluster, and use a TargetAllocator in `per-node` mode to pick up all the existing ServiceMonitor/PodMonitor objects and pass out the configs and endpoints.

We had this running on a test cluster with ~8 nodes and it worked fine. We saw the TargetAllocator use ~128Mi of memory and virtually zero CPU, and the configurations it passed out seemed correct. However, as soon as we spun this up on a "small" but "real" cluster (~15 nodes, a few workloads), we see the `targetallocator` pods go into a painful loop and use a ton of CPU and memory.

When we look at the logs, the pods are in a loop spewing thousands of lines over and over again like this:
All of our clusters are generally configured the same - different workloads, but the same kinds of controllers, Kubernetes versions, node OSes, etc.
What can I look for to better troubleshoot what might be wrong here?
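One way to start digging in (a sketch; the namespace, Service name, and port below are placeholders for whatever your deployment uses): the target allocator exposes HTTP endpoints such as `/jobs` and `/scrape_configs` that show what it has discovered and what it is handing out, which can be compared against the looping log output.

```shell
# Port-forward to the target allocator Service (names/ports are placeholders; adjust).
kubectl -n monitoring port-forward svc/otel-targetallocator 8080:80 &

# List the scrape jobs the allocator has discovered from ServiceMonitors/PodMonitors.
curl -s http://localhost:8080/jobs

# Dump the full scrape configs being distributed to the per-node collectors.
curl -s http://localhost:8080/scrape_configs

# Watch allocator resource usage and restarts while the loop is happening.
kubectl -n monitoring top pod -l app.kubernetes.io/component=opentelemetry-targetallocator
kubectl -n monitoring get pods -l app.kubernetes.io/component=opentelemetry-targetallocator -w
```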
Steps to Reproduce