grafana / agent

Vendor-neutral programmable observability pipelines.
https://grafana.com/docs/agent/
Apache License 2.0

Grafana agent operator replication doesn't work in case of metrics #3562

Open amanpruthi opened 1 year ago

amanpruthi commented 1 year ago

Hello team, please help with the issue below.

Summary: Grafana Agent replication doesn't work for metrics.

Description: I made a change at the Kubernetes level that caused the Grafana Agent pods to restart. When the first replica went down, we saw metrics loss: the second replica did not pick up the work. To test this, I added a node selector to the Grafana Agent so that one pod would stay down for a longer period, and I made sure the target node group did not have enough memory. Replica-1 went down while replica-0 kept running, yet in Grafana we still saw a gap in metrics because replica-0 was not scraping them.

I verified this with the query below, which returns the jobs that were present 12 hours ago but are missing now (I also verified the same with `absent` queries):

```
(sum by(job) (scrape_samples_scraped offset 12h)) unless (sum by(job) (scrape_samples_scraped))
```

Steps to reproduce:

  1. Spin up Grafana Agent with replication 2 and scrape some metrics.
  2. Bring down one pod, then run the query above against Grafana. You will see that some metrics are missing even though the other replica is running.

Grafana Agent image: grafana/agent:v0.30.2. Note: we are using the Grafana Agent Operator and all its CRDs (MetricsInstance, etc.).
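For readers unfamiliar with the query: PromQL's `unless` is essentially a set difference on the matching labels, here `job` — it keeps the series that existed 12 hours ago but have no counterpart now. A minimal sketch of that semantics in Python (the job names are hypothetical):

```python
# Sketch of PromQL's `unless` semantics for the missing-jobs query:
# jobs present at (now - 12h) that have no matching `job` label now.

def missing_jobs(jobs_12h_ago: set[str], jobs_now: set[str]) -> set[str]:
    """Jobs that were being scraped 12 hours ago but are not scraped now."""
    return jobs_12h_ago - jobs_now

# Hypothetical example: one replica went down, so its targets stopped
# being scraped and `my-app` disappears from the current series.
before = {"kube-state-metrics", "node-exporter", "my-app"}
after = {"kube-state-metrics", "node-exporter"}
print(missing_jobs(before, after))  # {'my-app'}
```

With working HA failover, this set should stay empty when a single replica goes down.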

dgonzalezruiz commented 1 year ago

Hey @amanpruthi! It is unclear to me whether your grafana-agent instances that ship metrics are actually setting the `__replica__` label, which the HA tracker needs in order to fail over to the other replica. You can check whether this is the case via the `/distributor/ha_tracker` endpoint on the distributors (by port-forwarding, or through an ingress if you have access to one).
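For example, assuming a Mimir distributor service named `mimir-distributor` in namespace `mimir` listening on port 8080 (all three are assumptions; adjust to your deployment), the status page can be reached like this:

```shell
# Port-forward the distributor (service name, namespace, and port are assumptions).
kubectl -n mimir port-forward svc/mimir-distributor 8080:8080 &

# The HA tracker status page lists, per cluster label, the currently
# elected replica and when a sample from it was last seen.
curl -s http://localhost:8080/distributor/ha_tracker
```

If your agent replicas do not appear there at all, the `cluster`/`__replica__` external labels are likely not being sent.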

We hit a similar problem, but in our case, even after enabling HA tracking properly, the fact that a single cluster label is used for all the shards turns all HA shards into one (when what you actually want is as many HA pairs as there are shards, so that there is no metric loss / data gap).

We are considering using a different, meta-level cluster label (say `__cluster__`, to be dropped after ingestion) that we can use to identify each of the remote-write HA pairs we want to establish (for the cases where a cluster needs more than one shard in order to scale).
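As a sketch of what that could look like on the writer side, using plain Prometheus-style external labels (label values are hypothetical, and `__cluster__` is the assumed meta-label from above, not an established convention):

```yaml
# Plain Prometheus / Agent-style config sketch, not the operator CRD schema.
# Each shard pair carries its own meta cluster value, used only for HA dedup.
global:
  external_labels:
    cluster: prod                 # normal cluster label, kept for querying
    __cluster__: prod-shard-0     # per-shard HA pair identity (assumed name)
    __replica__: replica-0        # distinct per replica within the pair
```

The backend's HA tracker would then be pointed at `__cluster__` instead of `cluster`, so each shard pair elects its own leader.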

paul-bormans commented 8 months ago

@dgonzalezruiz honestly I believe HA isn't supported right now by Grafana Agent. What you would need is support for running multiple replicas (even when the deployment is set to DaemonSet), with each replica setting a replica label so the backend (Cortex/Mimir) can properly handle HA.

When clustering is enabled, it only divides the load within a single set of replicas...

What we would need is support similar to the Prometheus Operator fields prometheusExternalLabelName and replicaExternalLabelName:


```
FIELD: prometheusExternalLabelName <string>

DESCRIPTION:
    Name of Prometheus external label used to denote the Prometheus instance
    name. The external label will _not_ be added when the field is set to the
    empty string (`""`).
    Default: "prometheus"

FIELD: replicaExternalLabelName <string>

DESCRIPTION:
    Name of Prometheus external label used to denote the replica name. The
    external label will _not_ be added when the field is set to the empty string
    (`""`).
    Default: "prometheus_replica"
```
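For reference, on the Prometheus Operator side these fields live on the `Prometheus` CRD spec; a minimal sketch of how they are set (the resource name and values are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s                                  # assumed name
spec:
  replicas: 2
  # External label identifying this Prometheus instance.
  prometheusExternalLabelName: prometheus
  # External label distinguishing the two replicas; the backend's
  # HA tracker deduplicates on it.
  replicaExternalLabelName: prometheus_replica
```

Something equivalent for the Agent Operator would let each agent replica announce itself to the HA tracker automatically.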
dgonzalezruiz commented 8 months ago

For what it is worth, I ended up solving this by changing the cluster label used by the remote Cortex/Mimir cluster to another "meta-value": a hidden label that I also have the distributors remove before ingestion, which I named `__cluster__` (keeping the existing, normal cluster label for metric topology/querying).

Then I set each Grafana Agent shard to use its shard value in that HA-pair identity; this allowed every HA pair set up as a remote writer in my org (a replica pair of shards, or a normal single Prometheus) to be received as a unique pair, hence allowing failover to the other replica's metrics in the normal HA manner.

I would say that unless replication for Grafana Agent shards is set up this way, it is effectively useless for HA purposes. Hope this helps someone.
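For anyone wiring up the backend side of this scheme: in Grafana Mimir the HA tracker is configured roughly as below. The label names mirror the scheme described above and are assumptions; check the exact option names against your Mimir/Cortex version's documentation.

```yaml
# Mimir config sketch (option names per recent Mimir docs; verify for your version).
limits:
  accept_ha_samples: true
  ha_cluster_label: __cluster__   # per-shard HA pair identity (assumed label)
  ha_replica_label: __replica__   # elected on, then dropped at ingestion
distributor:
  ha_tracker:
    enable_ha_tracker: true
    kvstore:
      store: etcd                 # shared KV store for the replica election
```

With this in place, each `__cluster__` value elects its own leader replica, so losing one agent replica in a pair should cause the tracker to fail over rather than drop samples.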
