default agent config causes metric cardinality explosion in Prometheus

ccmtaylor commented 2 years ago

We recently deployed a Vector agent based on the default configuration to a relatively busy Kubernetes cluster (~300 nodes, ~8000 pods). Some of the metrics carry tags with unbounded cardinality.

In particular, the file-based metrics (vector_files_added_total, vector_files_unwatched_total) have a file tag, causing their cardinality to reach millions of time series over a couple of days. This had a noticeable performance impact on the overall observability infrastructure (based on Prometheus/Thanos).

As a workaround, we're including the following remap transform in our customConfig:

  transforms:
    reduce_metrics:
      type: remap
      inputs: [internal_metrics]
      # The file tag contains the (never re-used) file names. This causes a
      # high cardinality of "file added" and "file removed" counters, each at a
      # value of 1. Remove the tag so that we count each of these events into
      # the overall metric.
      source: 'del(.tags.file)'
  sinks:
    prom_exporter:
      type: prometheus_exporter
      inputs: [reduce_metrics]
      address: 0.0.0.0:9090

Would it make sense to include this in the default configuration?

ccmtaylor commented 2 years ago

Note: some other metrics (e.g. vector_component_received_event_bytes_total, vector_checksum_errors_total, vector_component_received_events_total) include the name of the pod that produces the logs as a tag. This technically also causes unbounded cardinality, though at a much lower rate, which is fine for our use case.

tuananhnguyen-ct commented 2 years ago

No. This is being discussed upstream in Vector at https://github.com/vectordotdev/vector/issues/11995. Dropping the tags with a transform will not reduce the number of metrics that internal_metrics collects; it only reduces the data sent out at the sink. A fix in the chart itself won't help in this case.

spencergilbert commented 2 years ago

Sorry - this clearly slipped past my notice! @tuananhnguyen-ct is correct: dropping the tags (or using the tag_cardinality_limit transform) should protect your downstream Prometheus from cardinality issues. However, Vector will still be tracking these series internally, and we'll need to solve this in a more complete fashion.

I'll close this as a duplicate of https://github.com/vectordotdev/vector/issues/11995.
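
For readers who want to try the tag_cardinality_limit alternative mentioned above, a minimal sketch of how it might be wired into the same customConfig follows. The transform name limit_tag_cardinality and the value_limit of 500 are illustrative assumptions, not values taken from this thread; check the option names against the Vector version you run.

```yaml
transforms:
  # Hypothetical transform name; any unique identifier works.
  limit_tag_cardinality:
    type: tag_cardinality_limit
    inputs: [internal_metrics]
    # "exact" keeps the set of seen tag values in memory; "probabilistic"
    # trades accuracy for lower memory use.
    mode: exact
    # Assumed limit on distinct values accepted per tag key; tune as needed.
    value_limit: 500
    # Drop only the offending tag and keep the event, so counters still
    # aggregate into the untagged series.
    limit_exceeded_action: drop_tag
sinks:
  prom_exporter:
    type: prometheus_exporter
    inputs: [limit_tag_cardinality]
    address: 0.0.0.0:9090
```

As with the remap workaround, this only limits what reaches the Prometheus exporter; it does not change what Vector tracks internally.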

ccmtaylor commented 2 years ago

Thanks for the replies and pointers! I’ve subscribed to the upstream issue.


vectordotdev / helm-charts

default agent config causes metric cardinality explosion in Prometheus #229