alecrajeev opened this issue 8 months ago
Thanks for the proposal! What you propose here broadly makes sense.
IIRC, Loki/Mimir have (had?) a flag to specify a wait time before participating in workload distribution.
I'm worried about a scenario where the min cluster size will never be reached; in that case it might make sense to have a second flag to control wait time:
--cluster.min-cluster-size # Minimum cluster size before participating in workload distribution
--cluster.participation-wait-time # Minimum time to wait before participating in workload distribution
Here, the idea is that you could set neither, one, or both to get different effects. If both are set, whichever condition is satisfied first activates distribution for that node, so --cluster.min-cluster-size=5 --cluster.participation-wait-time=5m
will wait until either 5 minutes have passed or there are 5 members in the cluster.
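As a minimal sketch of that interaction (purely illustrative Go; none of these names are real Alloy APIs), the readiness check could combine the two flags like this:

```go
package clusterwait

import "time"

// readyToParticipate is a hypothetical helper showing how the two proposed
// flags could combine: a node starts taking work once either condition is
// met, and immediately if neither flag is set.
func readyToParticipate(start time.Time, peers, minClusterSize int, waitTime time.Duration) bool {
	if minClusterSize > 0 && peers >= minClusterSize {
		return true // --cluster.min-cluster-size reached
	}
	if waitTime > 0 && time.Since(start) >= waitTime {
		return true // --cluster.participation-wait-time elapsed
	}
	return minClusterSize == 0 && waitTime == 0
}
```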
I really like your idea. I hadn't thought about the case when there are not enough members in the cluster, so I think adding those two flags would be a good improvement. Also thanks for transferring this to the new alloy repo.
It would be nice to be able to do the same thing for prometheus.receive_http. We have a situation where edge Alloy instances use a central cluster of Alloy instances as a metrics pipeline: prometheus.remote_write on the edge sends to prometheus.receive_http in the cluster, and whenever the cluster has to restart, the first pod to come back up gets annihilated by remote_write requests the second its API comes up. Having some mechanism to only bring that service up once the cluster has established quorum would be handy.
This causes us quite a lot of grief, with agents crash looping and never actually managing to form a cluster, because they try to scrape thousands of targets when they start up rather than joining the cluster, waiting for the shard allocation, and then starting to scrape. Even just --delay-initial-scrape=60s or something would suffice.
Edit: it also seems reasonable to try to hard-cap the amount of memory consumed at any time to prevent OOMs? That would also give you a metric signal saying that scrape intervals cannot be honoured and more replicas need to be added?
@GroovyCarrot regarding your second point, there's GOMEMLIMIT support from v1.1.0.
As a general question about this proposal: what should happen while workload distribution is not yet allowed due to these newly introduced conditions? Should all components be disabled and not running, or should only the components that support clustering be waiting / not doing any work?
It seems that the most important thing is the observable behaviour here: metrics shouldn't be collected until the cluster is "ready" (minimum cluster size or timeout waiting for a minimum cluster size). That makes me feel like it's OK if these components are technically running as long as they're not doing any work.
The easiest way for us to implement that would be to make Lookup fail or return no nodes while the cluster isn't "ready," since Lookup is used for work distribution. If we did this, we'd have to update component logic that uses Lookup, since most callers treat an error or zero owners as meaning the key is owned by the local node.
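A minimal sketch of that readiness-gated Lookup, using stand-in types and signatures rather than the real Alloy/ckit interfaces:

```go
package clusterutil

import (
	"errors"
	"sync/atomic"
)

// ErrNotReady is returned while the cluster hasn't met the readiness conditions.
var ErrNotReady = errors.New("cluster not ready for workload distribution")

// Peer and Sharder are stand-ins for whatever the real Lookup works with.
type Peer struct{ Name string }

type Sharder interface {
	Lookup(key uint64) ([]Peer, error)
}

// GatedSharder wraps a Sharder and fails lookups until MarkReady is called,
// so callers can treat "not ready" as "owned by nobody" rather than the
// current default of "owned by self".
type GatedSharder struct {
	inner Sharder
	ready atomic.Bool
}

func NewGatedSharder(inner Sharder) *GatedSharder {
	return &GatedSharder{inner: inner}
}

// MarkReady is called once the minimum cluster size or the wait time is reached.
func (g *GatedSharder) MarkReady() { g.ready.Store(true) }

func (g *GatedSharder) Lookup(key uint64) ([]Peer, error) {
	if !g.ready.Load() {
		return nil, ErrNotReady
	}
	return g.inner.Lookup(key)
}
```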
Earlier today in a call I suggested that we may be able to use ckit's concept of "viewer" nodes (read-only nodes that participate in gossip but not in workload distribution) to implement this behaviour, where a node only transitions to "participant" once the cluster is ready. However, since gossip is eventually consistent, there's still a chance one node can briefly see itself as the only participant and over-assign work to itself. We might be able to address this edge case with ckit-level support for cluster readiness.
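The viewer-to-participant idea could look roughly like the following; clusterNode and its methods are assumptions for illustration, not the actual ckit API:

```go
package clusterwait

import (
	"context"
	"time"
)

// clusterNode is a stand-in for the pieces of a ckit-style node used here.
type clusterNode interface {
	JoinAsViewer(ctx context.Context) error      // gossip only, no workload
	PeerCount() int                              // number of peers currently known
	BecomeParticipant(ctx context.Context) error // start accepting work
}

// waitThenParticipate joins the gossip ring as a read-only viewer and only
// promotes the node to a participant once either the minimum cluster size is
// observed or the wait time has elapsed.
func waitThenParticipate(ctx context.Context, n clusterNode, minSize int, maxWait time.Duration) error {
	if err := n.JoinAsViewer(ctx); err != nil {
		return err
	}
	deadline := time.NewTimer(maxWait)
	defer deadline.Stop()
	tick := time.NewTicker(time.Second)
	defer tick.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-deadline.C:
			// Wait time elapsed without reaching minSize; participate anyway.
			return n.BecomeParticipant(ctx)
		case <-tick.C:
			if n.PeerCount() >= minSize {
				return n.BecomeParticipant(ctx)
			}
		}
	}
}
```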
It seems like changing the behaviour of Lookup would be the easiest and least error-prone way of implementing this. Disabling components while the cluster isn't ready should work too, but is probably going to be harder to implement.
@thampiotr we had set GOMEMLIMIT, but because Alloy uses informers to load the cluster state, there's a certain amount of base memory that every agent needs, and it grows with the size of the cluster. We found that setting GOMEMLIMIT does help a lot, but ultimately we just had to give the agents more base memory.
We do still sometimes have the issue of clusters that are overcommitted on the number of targets they have to scrape, but if we scale the deployment down to 0 and then back up to more replicas than necessary, they tend to all start up simultaneously and form a cluster before they get overwhelmed, and the autoscaling can then bring the number down gradually.
As I mentioned, it would be ideal if, when an instance is unable to meet the cluster's demands with its resource allocation, there were some kind of signal that could be scaled on, rather than having to deal with the OOMs and crash looping.
Background
In clusters with several agent pods and many potential targets, it can take a few minutes on startup for all of the pods to join and become healthy. I noticed that when a pod that is intended to be part of a larger cluster starts up, it takes a few minutes to join. During that time it will briefly attempt to scrape a large number of targets and run into OOM errors.
This is more apparent in a statefulset with OrderedReady enabled, where the first pod will have all potential targets. Then, as additional pods are spun up, the scrape targets are distributed evenly among them. We can know ahead of time that a cluster will have at least, say, 5 pods.
What I have found as a workaround is to launch the pods with a config that has several discovery components to find all the potential targets, but no scrape component. Then, once all 5 pods are running and part of the cluster, I update the config to add the scrape component.
Also, when the statefulset is changed from OrderedReady to Parallel, all the pods launch at the same time. But what I found is that there is an initial burst of 409 conflict errors from thanos-receive because the pods have not joined the cluster yet. Once they have joined, this goes away.
Proposal
I propose a new flag to grafana-agent run which specifies the minimum number of agents that must be in the cluster before any metrics are sent.
This will lead to a delay in sending metrics when first starting up, because it takes a few minutes for a large cluster to become healthy and have all of its agents register themselves. However, this opt-in feature can help with stability: it can prevent the first agent pod in a statefulset from running into an OOM error and dramatically cut down the number of 409 conflict errors from thanos-receive.
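For illustration only, combining the existing --cluster.enabled flag with the flag name suggested earlier in the thread, an invocation might look like grafana-agent run config.river --cluster.enabled=true --cluster.min-cluster-size=5; the final flag name and semantics are of course up for discussion.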