Closed: ccmtaylor closed this issue 2 years ago.
Note: some other metrics (e.g. vector_component_received_event_bytes_total, vector_checksum_errors_total, vector_component_received_events_total) include the name of the pod that produced the logs as a tag. This technically also causes unbounded cardinality, though at a much lower rate, which is fine for our use case.
No, this is already being discussed upstream in https://github.com/vectordotdev/vector/issues/11995. Dropping the tags (with a transform) will not reduce the number of metrics collected by internal_metrics; it only reduces the data sent out at the sink. A fix in the chart itself won't be able to help in this case.
Sorry - this clearly escaped my notice! @tuananhnguyen-ct is correct: dropping the tags (or using the tag_cardinality_limit transform) should protect your downstream Prometheus from cardinality issues. However, Vector will still be tracking these series internally, and we'll need to solve that in a more complete fashion.
I'll close this as a duplicate of https://github.com/vectordotdev/vector/issues/11995.
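For reference, a minimal sketch of the tag_cardinality_limit approach (the transform name and the internal_metrics input are illustrative, and value_limit should be tuned to your environment; as noted above, this protects the sink but not Vector's internal accounting):

```yaml
transforms:
  limit_tag_cardinality:
    # Caps the number of distinct values per tag key; once a tag key
    # exceeds the limit, that tag is dropped from subsequent events.
    type: tag_cardinality_limit
    inputs: [internal_metrics]
    mode: exact                    # "probabilistic" trades accuracy for memory
    value_limit: 500
    limit_exceeded_action: drop_tag
```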
Thanks for the replies and pointers! I’ve subscribed to the upstream issue.
We recently deployed a Vector agent based on the default configuration to a relatively busy Kubernetes cluster (~300 nodes, ~8000 pods). Some of its internal metrics have unbounded cardinality on some of their tags.
In particular, the file-based metrics (vector_files_added_total, vector_files_unwatched_total) have a file tag, causing their cardinality to reach millions of time series over a couple of days. This had a noticeable performance impact on the overall observability infrastructure (based on Prometheus/Thanos).

As a workaround, we're including the following remap transform in our customConfig. This is a minimal sketch of it (the transform name is illustrative, and the internal_metrics input follows the chart defaults):
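```yaml
transforms:
  drop_file_tag:
    # Strip the unbounded per-file tag from Vector's own metrics so each
    # watched log file no longer creates a new time series downstream.
    type: remap
    inputs: [internal_metrics]
    source: |-
      # Metric events expose their tags under .tags in VRL.
      del(.tags.file)
```

The metrics sink's inputs then point at drop_file_tag instead of internal_metrics directly.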
Would it make sense to include this in the default configuration?