grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

Metrics scraped by Grafana Agent sometimes exhibit abnormally large values #188

Open aerfio opened 7 months ago

aerfio commented 7 months ago

What's wrong?

As in the title, metrics scraped by Grafana Agent sometimes have gigantic values. This issue affects various metrics coming from various components, implemented mostly in Go and Erlang, and it happens on various k8s clusters.

At $work we deploy grafana-agent as one of the first steps of k8s cluster creation to scrape metrics and send logs and traces to central monitoring. Recently, we've observed several of our metrics exhibiting bizarre values, like here:

[screenshot]

The "typical" range for this value is more or less [0, 10^6]. We've also seen "gaps" in the metrics right after such a spike, as shown above.

The minimal reproduction setup that's failing looks as follows:

flowchart TD
    A[Pod] -->|scraped by| B[Grafana Agent]
    B --> C[Prometheus]

Initial "production" setup where this issue was observed involved 1 more grafana-agent, which acted as grpc server that collected OTLP signals from N clusters, which then sent it to mimir.

To make sure that the source of the problem is grafana-agent (or our config), I've deployed kube-prometheus-stack to the affected cluster with the following values.yaml:

defaultRules:
  create: false
grafana:
  enabled: true
  forceDeployDashboards: false
crds:
  enabled: false
kubeStateMetrics:
  enabled: false
nodeExporter:
  enabled: false
windowsMonitoring:
  enabled: false
alertmanager:
  enabled: false
coreDns:
  enabled: false
kubeEtcd:
  enabled: false
kubeProxy:
  enabled: false
kubeScheduler:
  enabled: false
kubelet:
  enabled: false
kubeControllerManager:
  enabled: false
kubeApiServer:
  enabled: false
prometheus:
  prometheusSpec:
    nodeSelector:
      kubernetes.io/hostname: redacted # I tried 3 different nodes to be sure
    enableAdminAPI: true
    enableRemoteWriteReceiver: true
    retention: 7d
    scrapeInterval: 30s
    logLevel: warn
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
    probeSelectorNilUsesHelmValues: false
    scrapeConfigSelectorNilUsesHelmValues: false
    storageSpec:
      disableMountSubPath: true
      volumeClaimTemplate:
        spec:
          storageClassName: redacted
          resources:
            requests:
              storage: 30Gi

I have not observed those issues in Grafana from kube-prom-stack. I've also checked those values directly on the pods - they're fine at the source too. The issue IS, however, visible if I disable the kube-prom-stack ServiceMonitor scraping and push those metrics using Grafana Agent to the Prometheus /api/v1/write endpoint. This change is reflected in the provided config.

Also, this issue is not limited to one particular cluster; we've created a dashboard to roughly show when it happens for any cluster in our company:

[screenshot]

Another dashboard:

[screenshot]

Generally speaking, those "gigantic" values are usually around 10^17, i.e. within a couple of orders of magnitude of the max int64 value, ~9.223372036854776e+18 (I'm not sure whether that's connected, just pointing it out).

Grafana Agent did not restart at all, and there were no logs connected to this issue. All components are also healthy (this screenshot contains more components than the minimal repro, but when I tested that setup everything was also all green):

[screenshot]

The provided Grafana Agent config has been heavily inspired by https://github.com/grafana/agent-modules/blob/main/modules/k8s_pods/module.river.

Steps to reproduce

Using the provided config might reproduce this issue.

System information

uname -a on the node: Linux XYZ 5.15.0-101-generic #111-Ubuntu SMP Tue Mar 5 20:16:58 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Software version

Grafana Agent v0.40.3

Configuration

logging {
  format = "json"
  level  = "info"
}
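// Discovers ServiceMonitor objects in the cluster and scrapes their targets, applying the relabel rules below.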
prometheus.operator.servicemonitors "services" {
    forward_to = [otelcol.receiver.prometheus.default.receiver]
    clustering {
        enabled = true
    }
    rule {
        target_label = "k8s_cluster_name"
        replacement = "XYZ"
    }
    rule {
        source_labels = [
            "__meta_kubernetes_pod_label_app_kubernetes_io_instance",
            "__meta_kubernetes_pod_label_app_kubernetes_io_name",
        ]
        target_label = "__helm_name__"
        separator    = "-"
        regex        = "(.+-.+)"
    }
    rule {
        // Try to identify a service name to eventually form the job label. We'll
        // prefer the first of the below labels, in descending order.
        source_labels = [
            "__meta_kubernetes_pod_label_k8s_app",
            "__meta_kubernetes_pod_label_app",
            "__meta_kubernetes_pod_label_name",
            "__helm_name__",
            "__meta_kubernetes_pod_controller_name",
            "__meta_kubernetes_pod_name",
        ]
        target_label = "__service__"
        // Our in-memory string will be something like A;B;C;D;E;F, where any of the
        // letters could be replaced with a label value or be empty if the label
        // value did not exist.
        //
        // We want to match for the very first sequence of non-semicolon characters
        // which is either prefaced by zero or more semicolons, and is followed by
        // zero or more semicolons before the rest of the string.
        //
        // This is a very annoying way of being able to do conditionals, and
        // ideally we can use River expressions in the future to make this much
        // less bizarre.
        regex = ";*([^;]+);*.*"
    }
    rule {
        source_labels = ["__meta_kubernetes_pod_node_name"]
        target_label  = "__host__"
    }
    rule {
        source_labels = [
            "__meta_kubernetes_namespace",
            "__service__",
        ]
        target_label = "job"
        separator    = "/"
    }
    rule {
        source_labels = ["__meta_kubernetes_namespace"]
        target_label  = "namespace"
    }
    rule {
        source_labels = ["__meta_kubernetes_pod_name"]
        target_label  = "pod"
    }
    rule {
        source_labels = ["__meta_kubernetes_pod_container_name"]
        target_label  = "container"
    }
    rule {
        source_labels = ["__meta_kubernetes_pod_label_app"]
        target_label  = "app"
    }
    rule {
        source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_name"]
        target_label  = "name"
    }
}
module.git "k8s_api" {
  repository = "https://github.com/grafana/agent-modules.git"
  revision   = "65de0463c41f9027608132ec11c2f7ed411eb107"
  path       = "modules/k8s_api/module.river"
  arguments {
    forward_metrics_to = [otelcol.receiver.prometheus.default.receiver]
  }
}
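// Converts the scraped Prometheus samples into OTLP metrics for the otelcol pipeline below.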
otelcol.receiver.prometheus "default" {
    output {
        metrics = [otelcol.processor.k8sattributes.default.input]
    }
}
otelcol.processor.k8sattributes "default" {
    extract {
        label {
            from      = "pod"
            key_regex = "(.*)/(.*)"
            tag_name  = "$1.$2"
        }
        metadata = [
            "k8s.pod.name",
            "k8s.pod.uid",
            "k8s.deployment.name",
            "k8s.node.name",
            "k8s.namespace.name",
            "k8s.pod.start_time",
            "k8s.replicaset.name",
            "k8s.replicaset.uid",
            "k8s.daemonset.name",
            "k8s.daemonset.uid",
            "k8s.job.name",
            "k8s.job.uid",
            "k8s.cronjob.name",
            "k8s.statefulset.name",
            "k8s.statefulset.uid",
            "k8s.container.name",
            "container.image.name",
            "container.image.tag",
            "container.id",
        ]
    }
    output {
        traces  = [otelcol.processor.transform.add_cluster_name.input]
        logs    = [otelcol.processor.transform.add_cluster_name.input]
        metrics = [otelcol.processor.transform.add_cluster_name.input]
    }
}

otelcol.processor.transform "add_cluster_name" {
  error_mode = "ignore"
  metric_statements {
      context = "datapoint"
      statements = [
        "set(attributes[\"k8s.cluster.name\"], \"XYZ\")",
      ]
  }
  trace_statements {
      context = "resource"
      statements = [
        "set(attributes[\"k8s.cluster.name\"], \"XYZ\")",
      ]
  }
  log_statements {
      context = "resource"
      statements = [
        "set(attributes[\"k8s.cluster.name\"], \"XYZ\")",
      ]
  }
  output {
      metrics = [otelcol.processor.batch.default.input]
      traces  = [otelcol.processor.batch.default.input]
      logs    = [otelcol.processor.batch.default.input]
  }
}
otelcol.processor.batch "default" {
    timeout = "5s"
    send_batch_size = 4096
    send_batch_max_size = 8192
    output {
        metrics = [otelcol.exporter.otlp.default.input,otelcol.exporter.prometheus.prom.input]
        logs    = [otelcol.exporter.otlp.default.input]
        traces  = [otelcol.exporter.otlp.default.input]
    }
}
otelcol.exporter.otlp "default" {
    client {
        endpoint = "**redacted**"
        auth = otelcol.auth.basic.default.handler
    }
}
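// Converts OTLP metrics back into Prometheus format and hands them to remote_write (completing a Prometheus -> OTLP -> Prometheus round trip).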
otelcol.exporter.prometheus "prom" {
    forward_to = [prometheus.remote_write.prom.receiver]
}
prometheus.remote_write "prom" {
    endpoint {
        url = "http://kps-prometheus.monitoring:9090/api/v1/write"
    }
}
otelcol.auth.basic "default" {
    username = env("CTA_BASIC_AUTH_USERNAME")
    password = env("CTA_BASIC_AUTH_PASSWORD")
}

Logs

No response

aerfio commented 7 months ago

Update: I've minimized my config to this:

logging {
    format = "json"
    level  = "info"
}
prometheus.operator.servicemonitors "services" {
    forward_to = [otelcol.receiver.prometheus.default.receiver]
    clustering {
        enabled = false
    }
}
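// Note: the Prometheus -> OTLP -> Prometheus conversion round trip (receiver + exporter below) is still present in this minimized config.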
otelcol.receiver.prometheus "default" {
    output {
        metrics = [otelcol.exporter.prometheus.prom.input]
    }
}
otelcol.exporter.prometheus "prom" {
    forward_to = [prometheus.remote_write.prom.receiver]
}
prometheus.remote_write "prom" {
    endpoint {
        url = "http://kps-prometheus.monitoring:9090/api/v1/write"
    }
}

and the following grafana-agent values.yaml:

fullnameOverride: graf-ag
crds:
  create: false
controller:
  type: deployment
agent:
  clustering:
    enabled: false
  configMap:
    create: false
    name: grafana-agent-config # configmap created manually with config mentioned before
    key: config.river
  resources:
    limits:
      cpu: "1"
      memory: 3Gi
    requests:
      cpu: "1"
      memory: 3Gi
serviceMonitor:
  enabled: true

And I still caught this bug:

[screenshot]

Pods did not restart and they do not exceed k8s limits:

[screenshot]

Newest grafana-agent version of course:

[screenshot]
aerfio commented 7 months ago

Update: I've minimized the grafana-agent config even further, which fixed the situation; metric values look OK:

logging {
  format = "json"
  level  = "info"
}
prometheus.operator.servicemonitors "services" {
    forward_to = [prometheus.remote_write.prom.receiver]
    clustering {
        enabled = false
    }
}
prometheus.remote_write "prom" {
    endpoint {
        url = "http://kps-prometheus.monitoring:9090/api/v1/write"
    }
}

Compared to the config from my previous comment, I've removed:

  1. otelcol.receiver.prometheus
  2. otelcol.exporter.prometheus

EDIT: a bit more context on why the previous example had a pipeline with otelcol.receiver.prometheus -> otelcol.exporter.prometheus, which seemed redundant. In our production setup we deploy Grafana Agent into each tenant cluster, scrape metrics using prometheus.operator.servicemonitors, gather logs and traces, turn all three into OTLP format, and send them to a management cluster. There, another Grafana Agent instance turns logs and metrics back into their "native" formats and sends them to Loki and Mimir; traces are sent on without any conversion. That's why I left otelcol.receiver.prometheus and otelcol.exporter.prometheus in the config in my previous comment: prometheus.operator.servicemonitors + otelcol.receiver.prometheus was already there, and otelcol.exporter.prometheus + prometheus.remote_write were added so I could push metrics to the local Prometheus instance. I only realised today that having both of those components doesn't make any sense. I'll retry our tests in the following days to make sure the river config from this post really isn't buggy, but for now it seems fine.
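
To make that setup concrete, here's a rough sketch of what the management-cluster side could look like (traces omitted for brevity; component labels and endpoint URLs are illustrative placeholders, not our actual config):

// Receives the OTLP traffic sent by the tenant-cluster agents over gRPC.
otelcol.receiver.otlp "tenants" {
    grpc {}
    output {
        metrics = [otelcol.exporter.prometheus.to_mimir.input]
        logs    = [otelcol.exporter.loki.to_loki.input]
    }
}
// Converts OTLP metrics back into Prometheus format for Mimir.
otelcol.exporter.prometheus "to_mimir" {
    forward_to = [prometheus.remote_write.mimir.receiver]
}
prometheus.remote_write "mimir" {
    endpoint {
        url = "https://mimir.example/api/v1/push"
    }
}
// Converts OTLP logs into Loki entries.
otelcol.exporter.loki "to_loki" {
    forward_to = [loki.write.loki.receiver]
}
loki.write "loki" {
    endpoint {
        url = "https://loki.example/loki/api/v1/push"
    }
}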

rfratto commented 7 months ago

Hi there :wave:

On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025.

To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)

andres-garcia-webbeds commented 7 months ago

It's also happening to us, but we're using a different configuration. In our case we're running Grafana Agent Flow in an EC2 ASG.

The service runs normally, without spikes and with no performance issues (or at least none that we can observe), but suddenly it starts to spike with nonsense values. We have been, and still are, trying to debug what could be causing this, but we don't understand what the issue might be. I should also mention that we have different AWS accounts with several Grafana Agent instances running there (2 on EKS, 6 on EC2), and this is the only environment where we're hitting this issue...

We're running this configuration:

logging {
  level = "info"
  format = "logfmt"
  write_to = [loki.write.loki.receiver]
}

prometheus.exporter.unix "default" {
  include_exporter_metrics = true
  disable_collectors       = ["mdadm"]
}

discovery.relabel "relabel" {
  targets = discovery.consul.consul.targets

  rule {
    action        = "drop"
    regex         = "blabla"
    source_labels = ["type"]
  }

  rule {
    action        = "replace"
    source_labels = ["__meta_consul_service_metadata_path"]
    regex         = "(.+)"
    target_label  = "__metrics_path__"
  }
  rule {
    action        = "replace"
    source_labels = ["__meta_consul_service_metadata_type"]
    regex         = "(.+)"
    target_label  = "type"
  }
  rule {
    action        = "replace"
    source_labels = ["__meta_consul_dc"]
    regex         = "(.+)"
    target_label  = "datacenter"
  }
  rule {
    action        = "replace"
    source_labels = ["__meta_consul_node"]
    regex         = "(.+)"
    target_label  = "nodename"
  }
  rule {
    action        = "replace"
    source_labels = ["__meta_consul_service"]
    regex         = "(.+)"
    target_label  = "service"
  }
  rule {
    action        = "replace"
    source_labels = ["__meta_consul_metadata_type"]
    regex         = "(.+)"
    target_label  = "type"
  }
  rule {
    action        = "replace"
    source_labels = ["__meta_consul_metadata_version"]
    regex         = "(.+)"
    target_label  = "version"
  }
  rule {
    action        = "replace"
    source_labels = ["__meta_consul_service_metadata_service"]
    regex         = "(.+)"
    target_label  = "svc"
  }
  rule {
    action        = "drop"
    regex         = "blabla"
    source_labels = ["__name__"]
  }
}

prometheus.remote_write "mimir" {

        external_labels = {
          sender = "grafana-agent",
        }

        wal {
          truncate_frequency = "15m"
          min_keepalive_time = "3m"
          max_keepalive_time = "1h"
        }

        endpoint {
                url = "https://mimir-gateway/api/v1/push"

                basic_auth {
            username = "user"
            password = "password"
                }

                queue_config {
                    sample_age_limit = "300s"
                    max_shards = 20
                    capacity = 4000
                    max_samples_per_send = 2000
                }

                write_relabel_config {
                        action = "keep"
                        source_labels = ["__name__"]

                        regex = ".+:.+|_blablabla_.+|"

                }
        }
}

discovery.consul "consul" {
        server = "127.0.0.1:8500"
        tags = ["monitoring"]
}

prometheus.scrape "from_consul" {
        clustering {
            enabled = true
        }
        targets = discovery.relabel.relabel.output
        forward_to = [prometheus.remote_write.mimir.receiver]
        job_name = "monitoring"
        scrape_interval = "30s"
}

loki.write "loki" {
  endpoint {
    url = "https://loki-gateway/loki/api/v1/push"

    basic_auth {
            username = "user"
            password = "password"
    }

  }
  external_labels = {
    cluster = "cluster",
  }
}

One example of what we are getting:

[screenshot]

github-actions[bot] commented 6 months ago

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it. If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue. The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity. Thank you for your contributions!

taind772 commented 5 months ago

It's also happening to us. We're using Grafana Alloy v1.1 (previously Grafana Agent Flow).

We scrape Kubernetes metrics and send them directly to Mimir. I'm not sure whether the abnormal values were caused by Alloy or Mimir, though. The abnormal behavior seems to happen when our Mimir cluster has an outage and disappears when the Mimir cluster recovers, but we can't get rid of it.

dansimov04012022 commented 3 months ago

We have the same issue with the following metrics flow:

Prometheus Server remote-write receive → Prometheus Server remote_write send → prometheus.receive_http → otelcol.receiver.prometheus → otelcol.processor.transform → otelcol.exporter.otlphttp → NewRelic OTLP API metrics endpoint

prometheus.receive_http "api" {
  http {
    listen_address = "0.0.0.0"
    listen_port = 9999
  }
  forward_to = [otelcol.receiver.prometheus.default.receiver]
}

otelcol.receiver.prometheus "default" {
  output {
    metrics = [otelcol.processor.transform.default.input]
  }
}

otelcol.processor.transform "default" {
  metric_statements {
    context = "resource"
    statements = [
      `delete_key(attributes, "service.instance.id")`,
      `delete_key(attributes, "service.name")`,
      `delete_key(attributes, "net.host.port")`,
      `delete_key(attributes, "net.host.name")`,
    ]
  }
  metric_statements {
    context = "datapoint"
    statements = [
      `set(attributes["tags.LabelWithDotPrefix"], attributes["LabelWithoutDotPrefix"])`,
      `delete_key(attributes, "LabelWithoutDotPrefix")`,
    ]
  }
  output {
    metrics = [otelcol.exporter.otlphttp.newrelic.input]
  }
}

otelcol.exporter.otlphttp "newrelic" {
  client {
    endpoint = "https://otlp.nr-data.net"
    headers = {
      "api-key" = "*******",
    }
  }
}

Metric values seem to be fine for the gauge type, but for counters they're huge (e.g. node_cpu_seconds_total).

Metric values are also fine if we send them directly to the NewRelic metrics API with Prometheus Server remote_write.

wildum commented 3 months ago

Metric values are also fine if we send them directly to the NewRelic metrics API with Prometheus Server remote_write.

If you send the metrics with prometheus.remote_write with Alloy in parallel to the otel pipeline, do you get the high values for both?
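
For example, a minimal sketch of what that parallel setup could look like, assuming the prometheus.receive_http entry point from the config above (the extra remote_write endpoint is a placeholder):

prometheus.receive_http "api" {
  http {
    listen_address = "0.0.0.0"
    listen_port = 9999
  }
  // Fan the same samples out to both the existing OTLP pipeline and a plain remote_write path.
  forward_to = [
    otelcol.receiver.prometheus.default.receiver,
    prometheus.remote_write.compare.receiver
  ]
}

prometheus.remote_write "compare" {
  endpoint {
    url = "https://example-prometheus/api/v1/write" // placeholder comparison target
  }
}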

dansimov04012022 commented 3 months ago

@wildum thanks for the suggestion. I checked that - metrics are fine this way.

dansimov04012022 commented 3 months ago

@wildum metrics are also fine if I send them this way, converting to OTLP and then back to Prometheus format:

Prometheus Server remote-write receive → Prometheus Server remote_write send → prometheus.receive_http → otelcol.receiver.prometheus → otelcol.processor.transform → otelcol.exporter.prometheus → prometheus.remote_write → NewRelic metrics API endpoint
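
In Alloy terms, the tail end of that variant looks roughly like this, with the metrics output of otelcol.processor.transform pointed at the new exporter (the remote_write URL is a placeholder for the NewRelic endpoint):

otelcol.exporter.prometheus "newrelic" {
  forward_to = [prometheus.remote_write.newrelic.receiver]
}

prometheus.remote_write "newrelic" {
  endpoint {
    url = "https://newrelic-remote-write-endpoint" // placeholder
  }
}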

dansimov04012022 commented 3 months ago

The otelcol.processor.transform live debug shows that the node_cpu_seconds_total metric has the type GAUGE:

Metric #9
Descriptor:
     -> Name: node_cpu_seconds_total
     -> Description: 
     -> Unit: 
     -> DataType: Gauge

While node_exporter reports that it's a COUNTER:

# TYPE node_cpu_seconds_total counter

From this document, I understand that this is expected, but the value I'm getting isn't even close to what I get when I apply rate() to the counter metric.