grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

Clustering enabled, but Metrics still being double collected #1611

Open SahilHakimiUofT opened 1 month ago

SahilHakimiUofT commented 1 month ago

What's wrong?

Clustering is enabled, but metrics still appear to be collected more than once: Mimir rejects incoming samples with err-mimir-sample-out-of-order (see the logs below).

Steps to reproduce

Here is my helm configuration:

alloy:
  alloy:
    clustering:
      enabled: true
    configMap:
      content: |-
        logging {
          level = "info"
          format = "logfmt"
        }

        remote.kubernetes.secret "credentials" {
          namespace = "monitoring"
          name = "prom-basic-auth-secret"
        }

        discovery.kubernetes "alloyPodsDiscovery" {
          role = "pod"
        }

        discovery.relabel "replace_instance" {
          targets = discovery.kubernetes.alloyPodsDiscovery.targets
          rule {
            action        = "replace"
            source_labels = ["instance"]
            target_label  = "instance"
            replacement   = "alloy-cluster"
          }
        }

        prometheus.scrape "pods" {
          targets    = discovery.relabel.replace_instance.output
          forward_to = [prometheus.remote_write.mimir.receiver]
          scrape_interval = "60s"
          clustering {
            enabled = true
          }
        }

        prometheus.remote_write "mimir" {
          external_labels = {
            cluster = "<redacted>",
          }
          endpoint {
            url = "<redacted>"
            remote_timeout = "2m"
            queue_config {
              capacity = 2500
              max_samples_per_send = 500
              max_shards = 5
            }
            write_relabel_config {
              action = "drop"
              regex = "container_network.*|container_fs.*|container_blkio_device_usage_total|csi_operations_seconds_bucket|rest_client_request_duration_seconds_bucket|apiserver_request_duration_seconds_bucket|etcd_request_duration_seconds_bucket|storage_operation_duration_seconds_bucket|container_ulimits_soft"
              source_labels = ["__name__"]
            }
            basic_auth {
              username = nonsensitive(remote.kubernetes.secret.credentials.data["username"])
              password = remote.kubernetes.secret.credentials.data["password"]
            }
          }
        }

System information

No response

Software version

Grafana Alloy v1.3.1

Configuration

Same as the Helm configuration shown under "Steps to reproduce" above.

Logs

ts=2024-09-04T16:21:21.005501451Z level=error msg="non-recoverable error" component_path=/ component_id=prometheus.remote_write.mimir subcomponent=rw remote_name=1c21e0 url=<redacted> count=500 exemplarCount=0 err="server returned HTTP status 400 Bad Request: failed pushing to ingester: user=anonymous: the sample has been rejected because another sample with a more recent timestamp has already been ingested and out-of-order samples are not allowed (err-mimir-sample-out-of-order). The affected sample has timestamp 2024-09-04T16:21:19.73Z and is from series {__name__=\"response_latency_ms_bucket\", authz_kind=\"default\", authz_name=\"all-unauthenticated\", client_id=\"prometheus.linkerd-viz.serviceaccount.identity.linkerd.cluster.local\", cluster=\"<redacted>\", direction=\"inbound\", instance=\"alloy-cluster\", job=\"prometheus.scrape.pods\", le=\"300\", route_kind=\"default\", route_name=\"default\", srv_kind=\"default\", srv_name=\"all-unauthenticated\", status_code=\"200\", target_addr=\"<redacted>1\", target_ip=\"0.0.0.0\", target_port=\"4191\", tls=\"true\"}"

eric-engberg commented 6 days ago

I'm having the same issue, using a DaemonSet with clustering. I don't see any errors indicating that any Alloy instance is failing to send metrics, yet I still get out-of-order results. With clustering enabled I'd expect each target to be handled by a single Alloy instance, for that assignment never to change, and for a previously failed write to be retried before a new metric is sent. Though, like I said, I don't even see any failed writes.