grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

Alloy DaemonSet stopped working properly with `was collected before with the same name and label values` #1179

Open alex-berger opened 1 week ago

alex-berger commented 1 week ago

What's wrong?

A few days ago, some of my Alloy DaemonSet Pods suddenly stopped working properly, returning server errors (500) when their /metrics endpoint was accessed. The DaemonSet was still running, with no container crashes or restarts (the readinessProbe was fine), but the affected Pods were clearly no longer working correctly.

This appears to be caused by errors of the kind `was collected before with the same name and label values`. I have not observed this before, and the problem disappeared after restarting the Pods, so I suspect there is some kind of race-condition bug in Alloy. Here is an example of the output from the /metrics endpoint:

An error has occurred while serving metrics:

15 error(s) occurred:
* collected metric "workload_log_lines_total" { label:{name:"app" value:"loki"} label:{name:"component" value:"gateway"} label:{name:"component_id" value:"loki.process.pod_logs"} label:{name:"component_path" value:"/"} label:{name:"container" value:"nginx"} label:{name:"controller" value:"loki-gateway-58bdc5cfc9"} label:{name:"controller_kind" value:"ReplicaSet"} label:{name:"flags" value:"F"} label:{name:"host_ip" value:"10.176.45.209"} label:{name:"instance" value:"loki-system-loki-distributed"} label:{name:"ip" value:"10.176.44.205"} label:{name:"job" value:"loki-system/loki-gateway-58bdc5cfc9-xt256"} label:{name:"namespace" value:"loki-system"} label:{name:"node_name" value:"ip-10-176-45-209.eu-central-1.compute.internal"} label:{name:"pod" value:"loki-gateway-58bdc5cfc9-xt256"} label:{name:"stream" value:"stderr"} counter:{value:1682 created_timestamp:{seconds:1719579313 nanos:848869559}}} was collected before with the same name and label values
* collected metric "workload_log_lines_total" { label:{name:"app" value:"cilium-agent"} label:{name:"component_id" value:"loki.process.pod_logs"} label:{name:"component_path" value:"/"} label:{name:"container" value:"cilium-agent"} label:{name:"controller" value:"cilium"} label:{name:"controller_kind" value:"DaemonSet"} label:{name:"flags" value:"F"} label:{name:"host_ip" value:"10.176.45.209"} label:{name:"ip" value:"10.176.45.209"} label:{name:"job" value:"kube-system/cilium-9gcgw"} label:{name:"namespace" value:"kube-system"} label:{name:"node_name" value:"ip-10-176-45-209.eu-central-1.compute.internal"} label:{name:"part_of" value:"cilium"} label:{name:"pod" value:"cilium-9gcgw"} label:{name:"stream" value:"stderr"} counter:{value:104 created_timestamp:{seconds:1719579314 nanos:850782828}}} was collected before with the same name and label values
* collected metric "workload_log_lines_total" { label:{name:"app" value:"datadog"} label:{name:"component" value:"agent"} label:{name:"component_id" value:"loki.process.pod_logs"} label:{name:"component_path" value:"/"} label:{name:"container" value:"agent"} label:{name:"controller" value:"datadog"} label:{name:"controller_kind" value:"DaemonSet"} label:{name:"flags" value:"F"} label:{name:"host_ip" value:"10.176.45.209"} label:{name:"instance" value:"datadog"} label:{name:"ip" value:"10.176.32.131"} label:{name:"job" value:"kube-system/datadog-7wj5j"} label:{name:"namespace" value:"kube-system"} label:{name:"node_name" value:"ip-10-176-45-209.eu-central-1.compute.internal"} label:{name:"pod" value:"datadog-7wj5j"} label:{name:"stream" value:"stdout"} counter:{value:582 created_timestamp:{seconds:1719579319 nanos:108932331}}} was collected before with the same name and label values
* collected metric "workload_log_lines_total" { label:{name:"app" value:"datadog"} label:{name:"component" value:"agent"} label:{name:"component_id" value:"loki.process.pod_logs"} label:{name:"component_path" value:"/"} label:{name:"container" value:"trace-agent"} label:{name:"controller" value:"datadog"} label:{name:"controller_kind" value:"DaemonSet"} label:{name:"flags" value:"F"} label:{name:"host_ip" value:"10.176.45.209"} label:{name:"instance" value:"datadog"} label:{name:"ip" value:"10.176.32.131"} label:{name:"job" value:"kube-system/datadog-7wj5j"} label:{name:"namespace" value:"kube-system"} label:{name:"node_name" value:"ip-10-176-45-209.eu-central-1.compute.internal"} label:{name:"pod" value:"datadog-7wj5j"} label:{name:"stream" value:"stdout"} counter:{value:55 created_timestamp:{seconds:1719579370 nanos:107981085}}} was collected before with the same name and label values
...

What immediately caught my attention is that the affected metrics all come from stage.metrics blocks (inside a loki.process component). See the configuration below for more details.
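
For quick reference, this is the shape of one of the affected counters, pared down from the loki.process component in the full configuration below; the prefix plus name yields the workload_log_lines_total metric seen in the errors above:

loki.process "pod_logs" {
  stage.metrics {
    metric.counter {
      name              = "log_lines_total"
      description       = "total number of log lines"
      prefix            = "workload_"
      match_all         = true
      action            = "inc"
      max_idle_duration = "1h"
    }
  }
  // ... remaining stages omitted, see the full configuration below ...
  forward_to = [loki.write.default.receiver]
}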

Steps to reproduce

This happens sporadically, and I have not yet figured out how to reproduce it reliably. That is also why I suspect this is some kind of race-condition bug.

System information

Linux 6.1.92, amd64 and arm64, Bottlerocket OS 1.20.3 (aws-k8s-1.29)

Software version

Grafana Alloy 1.2.0

Configuration

logging {
  level  = "warn"
  format = "json"
}

loki.write "default" {
  endpoint {
    url = "http://loki-gateway.loki-system.svc.cluster.local/loki/api/v1/push"
    remote_timeout = "5s"
    batch_wait = "10s"
  }
  wal {
    enabled = true
    max_segment_age = "15m"
  }
}

prometheus.remote_write "default" {
    endpoint {
      url = "http://cortex-nginx.cortex-system/api/v1/push"
      remote_timeout = "5s"
      // Disable exemplars as work around for https://github.com/grafana/alloy/issues/1117
      send_exemplars = false
      metadata_config {
        send = true
      }
      queue_config {
        capacity = 10000
        max_samples_per_send = 10000
        batch_send_deadline = "5s"
        min_backoff = "30ms"
        max_backoff = "5s"
        retry_on_http_429 = true
      }
    }
    wal {
      truncate_frequency = "15m"
      max_keepalive_time = "30m"
    }
}

prometheus.exporter.self "alloy" {}

prometheus.scrape "alloy" {
  targets    = prometheus.exporter.self.alloy.targets
  forward_to = [prometheus.relabel.alloy.receiver]
  honor_labels = true
  scrape_interval = "60s"
}

prometheus.relabel "alloy" {
  forward_to = [prometheus.remote_write.default.receiver]
  rule {
    source_labels = ["instance"]
    target_label = "node_name"
  }
  rule {
    replacement = env("POD_NAME")
    target_label = "pod"
  }
  rule {
    replacement = env("POD_NAMESPACE")
    target_label = "namespace"
  }
  rule {
    action = "labeldrop"
    regex = "instance"
  }
  rule {
    source_labels = ["instance"]
    replacement = "alloy-daemonset/" + env("POD_NAME")
    target_label = "job"
  }
}

discovery.kubelet "local_pods" {
  url = "https://" + env("HOSTNAME") + ":10250"
  bearer_token_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  tls_config {
    ca_file = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
    insecure_skip_verify = false
  }
}

discovery.relabel "pod_logs" {
  targets = discovery.kubelet.local_pods.targets
  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
  rule {
    source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_pod_name"]
    separator     = "/"
    target_label  = "job"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
    separator     = "/"
    action        = "replace"
    replacement   = "/var/log/pods/*$1/*.log"
    target_label  = "__path__"
  }
  rule {
    action = "replace"
    source_labels = ["__meta_kubernetes_pod_container_id"]
    regex = "^(\\w+):\\/\\/.+$"
    replacement = "$1"
    target_label = "tmp_container_runtime"
  }
  rule {
    action = "replace"
    source_labels = ["__meta_kubernetes_pod_ip"]
    target_label = "ip"
  }
  rule {
    action = "replace"
    source_labels = ["__meta_kubernetes_pod_host_ip"]
    target_label = "host_ip"
  }
  rule {
    action = "replace"
    source_labels = ["__meta_kubernetes_pod_node_name"]
    target_label = "node_name"
  }
  rule {
    action = "replace"
    regex = "(.+)"
    source_labels = ["__meta_kubernetes_pod_controller_kind"]
    target_label = "controller_kind"
  }
  rule {
    action = "replace"
    regex = "(.+)"
    source_labels = ["__meta_kubernetes_pod_controller_name"]
    target_label = "controller"
  }
  rule {
    action = "replace"
    regex = "^(.+?)(-[a-z0-9]+){0,2}$"
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label = "app"
  }
  rule {
    action = "replace"
    regex = "^(.+?)(-[a-z0-9]+){0,1}$"
    source_labels = ["__meta_kubernetes_pod_controller_name"]
    target_label = "app"
  }
  rule {
    action = "replace"
    regex = "(.+)"
    source_labels = ["__meta_kubernetes_pod_label_app"]
    target_label = "app"
  }
  rule {
    action = "replace"
    regex = "(.+)"
    source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_name"]
    target_label = "app"
  }
  rule {
    action = "replace"
    regex = "(.+)"
    source_labels = ["__meta_kubernetes_pod_label_component"]
    target_label = "component"
  }
  rule {
    action = "replace"
    regex = "(.+)"
    source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_component"]
    target_label = "component"
  }
  rule {
    action = "replace"
    regex = "(.+)"
    source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_version"]
    target_label = "version"
  }
  rule {
    action = "replace"
    regex = "(.+)"
    source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_part_of"]
    target_label = "part_of"
  }
  rule {
    action = "replace"
    regex = "(.+)"
    source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_instance"]
    target_label = "instance"
  }
}

local.file_match "pod_logs" {
  path_targets = discovery.relabel.pod_logs.output
}

loki.source.file "pod_logs" {
  targets    = local.file_match.pod_logs.targets
  forward_to = [loki.process.pod_logs.receiver]
}

// basic processing to parse the container format. You can add additional processing stages
// to match your application logs.
loki.process "pod_logs" {
  stage.match {
    selector = "{tmp_container_runtime=\"containerd\"}"
    // the cri processing stage extracts the following k/v pairs: log, stream, time, flags
    stage.cri {}
    // Set the extracted flags and stream values as labels
    stage.labels {
      values = {
        flags   = "",
        stream  = "",
      }
    }
  }

  // drop the temporary container runtime and filename labels as they are no longer needed
  stage.label_drop {
    values = ["tmp_container_runtime", "filename"]
  }

  // Remove the trailing newlines from JSON.
  // While trailing newlines are stripped by logcli, they would lead to intermittent empty lines for multiline logs.
  stage.replace {
    expression = "(\n)$"
    replace = ""
  }

  stage.multiline {
    firstline = "^\\x{200B}\\x{200D}\\x{200B}"
    max_wait_time = "3s"
  }

  stage.metrics {
    metric.counter {
        name        = "log_lines_total"
        description = "total number of log lines"
        prefix      = "workload_"
        match_all         = true
        action            = "inc"
        max_idle_duration = "1h"
    }
    metric.counter {
        name        = "raw_log_bytes_total"
        description = "total bytes of log lines before truncating long lines"
        prefix      = "workload_"
        match_all         = true
        count_entry_bytes = true
        action            = "add"
        max_idle_duration = "1h"
    }
  }

  /* //TODO enable this later
  // Truncate log lines longer than 256KiB
  stage.replace {
    expression = ".+"
    replace = "{{if gt (.Value | len) 262144 }}{{slice .Value 0 262144}}...(truncated after 262144 of {{(.Value | len)}} characters){{else}}{{.Value}}{{end}}"
  }
  */

  stage.metrics {
    metric.counter {
        name        = "truncated_log_bytes_total"
        description = "total bytes of log lines after tuncating long lines"
        prefix      = "workload_"
        match_all         = true
        count_entry_bytes = true
        action            = "add"
        max_idle_duration = "1h"
    }
  }

  forward_to = [loki.write.default.receiver]
}

Logs

<<empty log, no errors or warnings in the logs>>

alex-berger commented 1 week ago

Not 100% sure yet, but it looks like I can provoke this by updating the configuration (ConfigMap); the resulting config reload seems to trigger the behavior described above. In my case I simply change logging.level back and forth between info and warn to make sure the configuration actually changes. It is still racy, as not all Pods of the DaemonSet become nonfunctional, but some do.
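
For illustration, the change is just the level attribute of the existing logging block, flipped between the two values before re-applying the ConfigMap:

logging {
  // toggled between "info" and "warn" to force a config reload
  level  = "info"
  format = "json"
}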

alex-berger commented 1 week ago

As a workaround, we added a livenessProbe to the Alloy DaemonSet, which makes sure the alloy container is restarted if the /metrics endpoint becomes unhealthy (starts responding with 5XX status codes):

    containers:
    - name: alloy
      livenessProbe:
        httpGet:
          path: /metrics
          port: 12345
          scheme: HTTP
        initialDelaySeconds: 30
        timeoutSeconds: 2
        periodSeconds: 30
        successThreshold: 1
        failureThreshold: 3