grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

"prometheus_remote_storage_samples_pending" is missing for prometheus.write.queue #2137

Open freak12techno opened 4 days ago

freak12techno commented 4 days ago

What's wrong?

We're using prometheus.write.queue to gather metrics persistently and send them to remote storage once the device is online (as in, has internet). We want an overview of whether Alloy is successfully sending all of the metrics, and a metric that shows how many samples have been collected but not yet sent to remote storage. Importantly, this shouldn't be a counter of how many samples were gathered and saved to the WAL, but an actual number (or at least an estimate) of samples that were saved to the WAL but not yet sent, so that it stays meaningful even when Alloy restarts.

This is crucial for us: we need to understand how many of the samples have been scraped but not yet sent, as we plan to use this as a base for alerting and observation.

The docs say that there's this metric: prometheus_remote_storage_samples_pending.

It seems to be what we want, but apparently it's not being returned by Alloy meta-monitoring (I cannot find it in Grafana, using the Prometheus this Alloy is writing to as a datasource).

Is it supposed to be there? If not, I suggest updating the docs to remove it and adding a metric that reports the number of samples that are stored but not yet sent (probably a gauge, since it can also decrease, unlike a counter). If yes, are there any additional steps needed to enable it? (If so, they should probably be documented.)

Steps to reproduce

  1. Run Alloy with a config similar to the one provided below.
  2. See no prometheus_remote_storage_samples_pending metric being written into Prometheus.

System information

No response

Software version

1.5.0

Configuration

logging {
  level  = "info"
  format = "logfmt"
}

// System metrics, from node_exporter
prometheus.exporter.unix "node_exporter" { }

// Alloy built-in metrics
prometheus.exporter.self "alloy" { }

// CAdvisor
prometheus.exporter.cadvisor "cadvisor" {
  docker_host = "unix:///var/run/docker.sock"
}

// Metrics scrape configuration
prometheus.scrape "node_exporter" {
  targets = array.concat(
    // Scraping node_exporter
    prometheus.exporter.unix.node_exporter.targets,
    // Scraping Alloy built-in metrics
    prometheus.exporter.self.alloy.targets,
    // Scraping CAdvisor metrics
    prometheus.exporter.cadvisor.cadvisor.targets,
  )

  scrape_interval = "60s"
  honor_labels    = true

  // Sending these scraped metrics to remote Prometheus via prometheus.write.queue.
  forward_to = [prometheus.write.queue.default.receiver]
}

prometheus.write.queue "default" {
  endpoint "default"{
    url = env("PROMETHEUS_HOST")
    bearer_token = env("PROMETHEUS_TOKEN")
  }

  // Keep 1 week of data, in case it wasn't sent.
  // More on WAL and its internals:
  // https://grafana.com/docs/alloy/latest/reference/components/prometheus/prometheus.remote_write/#wal-block
  ttl = "168h"

  persistence {
    batch_interval = "10s"
  }
}

Logs

freak12techno commented 4 days ago

@mattdurham I guess you're the person to ask about this? Can you chime in?

mattdurham commented 3 days ago

At the moment this is a documentation issue; I will submit a PR to remove that metric from the docs. Right now, comparing the incoming vs. outgoing timestamps is likely the best approach. Adding a more bespoke metric probably makes sense, but we would need to figure out how to keep it from being too chatty.
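
For reference, here is a rough sketch of that in-vs-out timestamp comparison as a PromQL query. It assumes the prometheus.write.queue self-metrics expose the newest-accepted and newest-sent sample timestamps as gauges named alloy_queue_series_serializer_incoming_timestamp_seconds and alloy_queue_series_network_timestamp_seconds; verify the exact names against what your Alloy version actually exports before relying on this.

# Approximate write lag, in seconds, for prometheus.write.queue:
# newest sample timestamp accepted into the WAL minus the newest sample
# timestamp successfully sent to the remote endpoint. Both gauges come
# from the same component, so their labels should line up for subtraction.
alloy_queue_series_serializer_incoming_timestamp_seconds
  - alloy_queue_series_network_timestamp_seconds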

freak12techno commented 3 days ago

Yeah, we'll use the approach you suggested for now, but ideally there should be something that reflects how many samples have not yet been sent to remote storage. So, just stating that this is a metric we want and need to be available (and we likely won't be the only ones who need it), it would be lovely if it were implemented at some point.

Also, I am not yet sure whether samples are always sent in order (as in, an older sample is always sent before a newer one). If not, I guess the timestamp-difference metric might report incorrect results.

mattdurham commented 3 days ago

The guarantee is that, for a given series, samples are sent in timestamp order; in practice the timestamps are close enough. Generally I would consider anything under a minute of lag as good, though you could likely shave that a bit lower depending on your use case.
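
One caveat with the incoming-minus-outgoing sketch above: if scraping stalls as well, both gauges stop advancing and the difference can still look healthy. A complementary check, using the same assumed metric name as above, is to compare the last successfully sent timestamp against wall-clock time and flag it once it exceeds the one-minute guideline:

# Seconds since the newest sample was successfully written to the remote
# endpoint; returns series only once nothing has been sent for over 60s.
# Pair with a "for:" duration in an alerting rule to avoid flapping.
time() - alloy_queue_series_network_timestamp_seconds > 60

Note that in an offline-buffering setup like the one above, this will also fire whenever the device has no connectivity, which may be exactly the signal you want to observe rather than page on.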

I concur that it would be good to have, though. If you want to submit a PR, it would be welcome at github.com/alloy/walqueue. I will be unavailable for a good chunk of the rest of the year, but if it's still outstanding when I get back, I will code it in.