aerfio opened this issue 7 months ago
Update: I've minimized my config to this:
logging {
  format = "json"
  level = "info"
}

prometheus.operator.servicemonitors "services" {
  forward_to = [otelcol.receiver.prometheus.default.receiver]
  clustering {
    enabled = false
  }
}

otelcol.receiver.prometheus "default" {
  output {
    metrics = [otelcol.exporter.prometheus.prom.input]
  }
}

otelcol.exporter.prometheus "prom" {
  forward_to = [prometheus.remote_write.prom.receiver]
}

prometheus.remote_write "prom" {
  endpoint {
    url = "http://kps-prometheus.monitoring:9090/api/v1/write"
  }
}
and the following grafana-agent values.yaml:
fullnameOverride: graf-ag
crds:
  create: false
controller:
  type: deployment
agent:
  clustering:
    enabled: false
  configMap:
    create: false
    name: grafana-agent-config # configmap created manually with config mentioned before
    key: config.river
  resources:
    limits:
      cpu: "1"
      memory: 3Gi
    requests:
      cpu: "1"
      memory: 3Gi
serviceMonitor:
  enabled: true
And I still caught this bug:
The pods did not restart and they do not exceed their k8s limits:
This was with the newest grafana-agent version, of course:
Update: I've minified the grafana-agent config even further, which fixed the situation; the metric values look OK:
logging {
  format = "json"
  level = "info"
}

prometheus.operator.servicemonitors "services" {
  forward_to = [prometheus.remote_write.prom.receiver]
  clustering {
    enabled = false
  }
}

prometheus.remote_write "prom" {
  endpoint {
    url = "http://kps-prometheus.monitoring:9090/api/v1/write"
  }
}
In comparison to the config from my previous comment, I've removed the otelcol.receiver.prometheus and otelcol.exporter.prometheus components.
EDIT: a bit more context on why the previous example had a pipeline with otelcol.receiver.prometheus -> otelcol.exporter.prometheus, which seemed redundant: in the production setup we deploy Grafana Agent into a tenant cluster, scrape metrics using prometheus.operator.servicemonitors, gather logs and traces, turn all 3 into OTLP format, and send it to the management cluster, in which we have another Grafana Agent instance that turns logs and metrics back into their "native" formats and sends them to Loki and Mimir; traces are sent without any conversion. That's why I left otelcol.receiver.prometheus and otelcol.exporter.prometheus in the config in my previous comment. prometheus.operator.servicemonitors + otelcol.receiver.prometheus were already there, and otelcol.exporter.prometheus + prometheus.remote_write were added so I could push metrics to the local Prometheus instance. I just realised today that having both of those components doesn't make any sense. I'll retry our tests in the following days to make sure that the River config from this post is not buggy, but for now it seems fine.
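For reference, a minimal sketch of the production-style pipeline described above (tenant-cluster scrape converted to OTLP and shipped to the management cluster) could look like the following; the component labels and the management-cluster endpoint are placeholders, not taken from the original setup:
// Hypothetical sketch only: scrape ServiceMonitors, convert the samples to
// OTLP, and export them to a collector in the management cluster.
prometheus.operator.servicemonitors "services" {
  forward_to = [otelcol.receiver.prometheus.default.receiver]
}

otelcol.receiver.prometheus "default" {
  output {
    metrics = [otelcol.exporter.otlp.management.input]
  }
}

otelcol.exporter.otlp "management" {
  client {
    // Placeholder address for the management-cluster OTLP gRPC endpoint.
    endpoint = "management-cluster-gateway:4317"
  }
}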
Hi there :wave:
On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025.
To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)
It's also happening to us, but we're using a different configuration. In our case we're running Grafana Agent Flow in an EC2 ASG.
The service runs normally, without spikes and with no performance issues (or at least that's what we can observe), but then it suddenly starts to spike with nonsense values. We have been, and still are, trying to debug what could be causing this, but we don't understand what the issue might be. I should also mention that we have different AWS accounts with different Grafana Agent instances running there (2 on EKS, 6 on EC2), and this is the only environment where we're hitting this issue...
We're running this configuration:
logging {
  level = "info"
  format = "logfmt"
  write_to = [loki.write.loki.receiver]
}

prometheus.exporter.unix "default" {
  include_exporter_metrics = true
  disable_collectors = ["mdadm"]
}

discovery.relabel "relabel" {
  targets = discovery.consul.consul.targets

  rule {
    action = "drop"
    regex = "blabla"
    source_labels = ["type"]
  }

  rule {
    action = "replace"
    source_labels = ["__meta_consul_service_metadata_path"]
    regex = "(.+)"
    target_label = "__metrics_path__"
  }

  rule {
    action = "replace"
    source_labels = ["__meta_consul_service_metadata_type"]
    regex = "(.+)"
    target_label = "type"
  }

  rule {
    action = "replace"
    source_labels = ["__meta_consul_dc"]
    regex = "(.+)"
    target_label = "datacenter"
  }

  rule {
    action = "replace"
    source_labels = ["__meta_consul_node"]
    regex = "(.+)"
    target_label = "nodename"
  }

  rule {
    action = "replace"
    source_labels = ["__meta_consul_service"]
    regex = "(.+)"
    target_label = "service"
  }

  rule {
    action = "replace"
    source_labels = ["__meta_consul_metadata_type"]
    regex = "(.+)"
    target_label = "type"
  }

  rule {
    action = "replace"
    source_labels = ["__meta_consul_metadata_version"]
    regex = "(.+)"
    target_label = "version"
  }

  rule {
    action = "replace"
    source_labels = ["__meta_consul_service_metadata_service"]
    regex = "(.+)"
    target_label = "svc"
  }

  rule {
    action = "drop"
    regex = "blabla"
    source_labels = ["__name__"]
  }
}

prometheus.remote_write "mimir" {
  external_labels = {
    sender = "grafana-agent",
  }

  wal {
    truncate_frequency = "15m"
    min_keepalive_time = "3m"
    max_keepalive_time = "1h"
  }

  endpoint {
    url = "https://mimir-gateway/api/v1/push"

    basic_auth {
      username = "user"
      password = "password"
    }

    queue_config {
      sample_age_limit = "300s"
      max_shards = 20
      capacity = 4000
      max_samples_per_send = 2000
    }

    write_relabel_config {
      action = "keep"
      source_labels = ["__name__"]
      regex = ".+:.+|_blablabla_.+|"
    }
  }
}

discovery.consul "consul" {
  server = "127.0.0.1:8500"
  tags = ["monitoring"]
}

prometheus.scrape "from_consul" {
  clustering {
    enabled = true
  }
  targets = discovery.relabel.relabel.output
  forward_to = [prometheus.remote_write.mimir.receiver]
  job_name = "monitoring"
  scrape_interval = "30s"
}

loki.write "loki" {
  endpoint {
    url = "https://loki-gateway/loki/api/v1/push"

    basic_auth {
      username = "user"
      password = "password"
    }
  }
  external_labels = {
    cluster = "cluster",
  }
}
One example of what we are getting:
This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!
It's also happening to us; we're using Grafana Alloy v1.1 (previously Grafana Agent Flow).
We scrape Kubernetes metrics and send them directly to Mimir. I'm not sure whether the abnormal values were caused by Alloy or by Mimir, though. The abnormal behavior seems to happen when our Mimir cluster has an outage and disappears when the Mimir cluster recovers, but we can't get rid of it.
We have the same issue with the following metrics flow:
Prometheus Server remote-write receive →
→ Prometheus Server remote_write send →
→ prometheus.receive_http →
→ otelcol.receiver.prometheus →
→ otelcol.processor.transform →
→ otelcol.exporter.otlphttp →
→ NewRelic OTLP API metrics endpoint
prometheus.receive_http "api" {
http {
listen_address = "0.0.0.0"
listen_port = 9999
}
forward_to = [otelcol.receiver.prometheus.default.receiver]
}
otelcol.receiver.prometheus "default" {
output {
metrics = [otelcol.processor.transform.default.input]
}
}
otelcol.processor.transform "default" {
metric_statements {
context = "resource"
statements = [
`delete_key(attributes, "service.instance.id")`,
`delete_key(attributes, "service.name")`,
`delete_key(attributes, "net.host.port")`,
`delete_key(attributes, "net.host.name")`,
]
}
metric_statements {
context = "datapoint"
statements = [
`set(attributes["tags.LabelWithDotPrefix"], attributes["LabelWithoutDotPrefix"])`,
`delete_key(attributes, "LabelWithoutDotPrefix")`,
]
}
output {
metrics = [otelcol.exporter.otlphttp.newrelic.input]
}
}
otelcol.exporter.otlphttp "newrelic" {
client {
endpoint = "https://otlp.nr-data.net"
headers = {
"api-key" = "*******",
}
}
}
Metric values seem to be fine for the gauge type, but for counters they're huge (e.g. node_cpu_seconds_total).
Metric values are also fine if we send them directly to the NewRelic metrics API with Prometheus Server remote_write.
> Metric values are also fine if we send them directly to the NewRelic metrics API with Prometheus Server remote_write.
If you send the metrics with prometheus.remote_write in Alloy in parallel to the otel pipeline, do you get the high values for both?
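For reference, a minimal sketch of such a parallel fan-out, reusing the prometheus.receive_http entry point from the config above; the "direct" label and the remote_write URL are placeholders:
// Hypothetical sketch only: fan the same incoming samples out to the OTel
// pipeline and to a plain Prometheus remote_write for comparison.
prometheus.receive_http "api" {
  http {
    listen_address = "0.0.0.0"
    listen_port = 9999
  }
  forward_to = [
    otelcol.receiver.prometheus.default.receiver,
    prometheus.remote_write.direct.receiver,
  ]
}

prometheus.remote_write "direct" {
  endpoint {
    url = "https://example-prometheus/api/v1/write" // placeholder URL
  }
}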
@wildum thanks for the suggestion. I checked that - metrics are fine this way.
@wildum metrics are also fine if I send them this way, converting to OTLP and then back to Prometheus format:
Prometheus Server remote-write receive →
→ Prometheus Server remote_write send →
→ prometheus.receive_http →
→ otelcol.receiver.prometheus →
→ otelcol.processor.transform →
→ otelcol.exporter.prometheus →
→ prometheus.remote_write →
→ NewRelic metrics API endpoint
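A minimal sketch of the tail of that flow, where the OTLP stream is converted back to Prometheus format and pushed via remote_write; the component labels and the NewRelic remote-write URL are placeholders:
// Hypothetical sketch only: convert OTLP metrics back to Prometheus samples
// and push them with remote_write instead of otelcol.exporter.otlphttp.
otelcol.exporter.prometheus "back_to_prom" {
  forward_to = [prometheus.remote_write.newrelic.receiver]
}

prometheus.remote_write "newrelic" {
  endpoint {
    url = "https://example-newrelic-remote-write-endpoint" // placeholder URL
  }
}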
The otelcol.processor.transform live debug shows that the node_cpu_seconds_total metric has the type GAUGE:
Metric #9
Descriptor:
-> Name: node_cpu_seconds_total
-> Description:
-> Unit:
-> DataType: Gauge
While node_exporter reports that it's a COUNTER:
# TYPE node_cpu_seconds_total counter
From this document, I understand that this conversion is expected, but the value I'm getting isn't even close to what I get when I apply rate() to the counter metric.
What's wrong?
As in the title, metrics scraped by Grafana Agent sometimes have gigantic values. This issue happens to various metrics coming from various components, implemented mostly in Go and Erlang, and it happens on various k8s clusters.
At $work we deploy grafana-agent as one of the first steps of k8s cluster creation to scrape metrics and send logs and traces to central monitoring. Recently, we've observed several of our metrics exhibiting bizarre values, like here:
Where the "typical" range for this value is more or less [0, 10^6]. We've also seen "gaps" in metrics right after this spike, as shown above.
The minimal reproduction setup that's failing looks as follows:
Initial "production" setup where this issue was observed involved 1 more grafana-agent, which acted as grpc server that collected OTLP signals from N clusters, which then sent it to mimir.
To ensure that the source of the problem is in grafana-agent (or our config), I've deployed kube-prometheus-stack to the affected cluster with the following values.yaml:
I have not observed those issues in Grafana from kube-prom-stack. I've also checked those values directly on the pods - they're also fine at the source. This issue IS, however, visible if I disable kube-prom-stack ServiceMonitor scraping and push those metrics using Grafana Agent to the /api/v1/write Prometheus endpoint. This change is visible in the provided config. Also, this issue is not happening on only one particular cluster; we've created a dashboard to roughly showcase when it happens for any cluster in our company:
Another dashboard:
Generally speaking, those "gigantic" values are usually near 10^17, close to the max int64 value, which is ~9.223372036854776e+18 (I'm not sure whether it's connected, just pointing it out).
Grafana Agent did not restart at all and there were no logs connected to this issue. All components are also healthy (this screenshot contains more components than the minimal repro, but when I tested that it was also all green):
The provided Grafana Agent config has been heavily inspired by https://github.com/grafana/agent-modules/blob/main/modules/k8s_pods/module.river.
Steps to reproduce
Using the provided config may reproduce this issue.
System information
uname -a on the node: Linux XYZ 5.15.0-101-generic #111-Ubuntu SMP Tue Mar 5 20:16:58 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Software version
Grafana Agent v0.40.3
Configuration
Logs
No response