grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

When using clustering, exporters may not work correctly due to `instance` label #1009

Open thampiotr opened 2 weeks ago

thampiotr commented 2 weeks ago

What's wrong?

Most Prometheus exporters set the instance label to the hostname where Alloy runs.

This subtly but significantly breaks the fundamental clustering assumption that all instances have the same configuration. The exporters implicitly inject the hostname as a label, and instances may have different hostnames. As a result, some targets are not scraped at all, while others are scraped more than once under different instance labels, which is unnecessary.

Steps to reproduce

  1. Run any exporter in clustered mode in a cluster of 2+ instances, each running on a different host. Set up scraping with clustering enabled and remote write to a metrics database.
  2. Observe that some targets are not scraped at all, while others are scraped multiple times with different instance labels.
  3. Observe in the UI that the instance label in the exporters' targets differs between instances, indicating different series.

The issue was discussed in this PR, but we decided to move the conversation here for better tracking and to provide a place to refer to for workarounds.

thampiotr commented 2 weeks ago

There is a workaround for now: set the instance label to a common value for all instances in the cluster, using a discovery.relabel component. For example, this component sets it to "alloy-cluster":

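discovery.relabel "replace_instance" {
  // Targets to rewrite; point this at your exporter's targets.
  targets = discovery.file.targets.targets

  // Overwrite the per-host instance label with one shared value so that
  // every cluster member sees identical targets.
  rule {
    action        = "replace"
    source_labels = ["instance"]
    target_label  = "instance"
    replacement   = "alloy-cluster"
  }
}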

You'd add the above component between your exporters and the prometheus.scrape component.
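
To show where the relabel step sits, here is a rough sketch of a full pipeline. It is an illustration, not a tested setup: the prometheus.exporter.redis component, the component labels, the Redis address, and the remote write URL are all placeholders I picked for the example, so substitute your own exporter and endpoint.

```alloy
// Hypothetical exporter for illustration; any exporter that sets the
// instance label to the local hostname is affected in the same way.
prometheus.exporter.redis "example" {
  redis_addr = "redis.example:6379" // placeholder address
}

// The workaround: give every cluster member the same instance label.
discovery.relabel "replace_instance" {
  targets = prometheus.exporter.redis.example.targets

  rule {
    action        = "replace"
    source_labels = ["instance"]
    target_label  = "instance"
    replacement   = "alloy-cluster"
  }
}

// Scrape the relabeled targets with clustering enabled.
prometheus.scrape "exporters" {
  targets    = discovery.relabel.replace_instance.output
  forward_to = [prometheus.remote_write.default.receiver]

  clustering {
    enabled = true
  }
}

prometheus.remote_write "default" {
  endpoint {
    url = "http://metrics-db.example/api/v1/push" // placeholder URL
  }
}
```

The key point is that prometheus.scrape reads from discovery.relabel.replace_instance.output rather than from the exporter directly, so the targets it distributes across the cluster no longer carry a per-host instance label.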

A longer-term fix can also be achieved via https://github.com/grafana/alloy/issues/399. Regardless, we should have good documentation to ensure users don't fall into this pit.