thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Cardinality explosion prevention feature. #1511

Closed improbable-ludwik closed 4 years ago

improbable-ludwik commented 5 years ago

AS A Thanos operator I WANT Thanos to actively prevent metric cardinality explosions SO THAT the system is robust to such situations.

Some ideas for how this could be done:

- Label value version: Thanos could implement this by respecting configuration that states how many distinct label values are allowed per label. As the number of values for a label moves above this limit, new values would get counted against a placeholder/catch-all label value.
- Metric name/label name version: for example, names would be rewritten (shortened) once their rightmost part's cardinality got too high, beyond a given character length.
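A minimal sketch of how the label value idea could work, assuming a hypothetical in-process limiter (the type names and the `other` placeholder are made up for illustration; this is not an existing Thanos API):

```go
// Hypothetical label value limiter: once a label has `limit` distinct values,
// unseen values are rewritten to a catch-all placeholder.
package main

import "fmt"

const placeholder = "other" // catch-all value once the limit is reached

type valueLimiter struct {
	limit int
	seen  map[string]map[string]struct{} // label name -> set of observed values
}

func newValueLimiter(limit int) *valueLimiter {
	return &valueLimiter{limit: limit, seen: map[string]map[string]struct{}{}}
}

// rewrite returns the label value to ingest for the given label name/value
// pair: known values pass through, new values are accepted until the limit,
// and anything beyond that is collapsed into the placeholder.
func (l *valueLimiter) rewrite(name, value string) string {
	vals, ok := l.seen[name]
	if !ok {
		vals = map[string]struct{}{}
		l.seen[name] = vals
	}
	if _, known := vals[value]; known {
		return value
	}
	if len(vals) >= l.limit {
		return placeholder
	}
	vals[value] = struct{}{}
	return value
}

func main() {
	l := newValueLimiter(2)
	for _, v := range []string{"a", "b", "c", "a"} {
		fmt.Printf("user_id=%q -> %q\n", v, l.rewrite("user_id", v))
	}
	// user_id: "a" -> "a", "b" -> "b", "c" -> "other", "a" -> "a"
}
```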

bwplotka commented 5 years ago

Thanks!

I guess you mean ingestion configuration, right? So during metric collection?

This means that this feature request should rather be directed to the https://github.com/prometheus/prometheus GitHub issues or, even better, the mailing list: https://groups.google.com/forum/#!forum/prometheus-users. Anyway, it's definitely something worth discussing on the Prometheus project side, so I would start this topic there first (if no discussion like this has been started before).

improbable-ludwik commented 5 years ago

I agree that Prometheus could be a place to do this. However, with Prometheus only at the leaves of the collection pipeline (not seeing the full picture of a cardinality explosion across collection leaves) and becoming less and less relevant (with storage periods going to zero), I think Thanos could be well positioned to pick up this feature. This could be a store feature, a pre-processor in front of the store (a sidecar to the Prometheus sidecar?), or even a compactor feature (re-writing metrics and reducing their cardinality in long-term storage).

Crucially, a global view of metrics/labels would be beneficial to make appropriate determinations about their cardinality, which is something a single Prometheus instance should probably not be concerned about. I could see this being a gradual process, with more and more cardinality removal as metrics age: immediate reduction/protection in Prometheus, then further reduction when placed into storage, then more still as time passes at compaction time.
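To illustrate the gradual, age-based reduction idea, here is a hypothetical sketch of a policy that drops more high-cardinality labels the older the data gets (e.g. at compaction time). The tiers and label names are invented for illustration only:

```go
// Hypothetical age-based cardinality reduction policy: the older the data,
// the more labels get dropped (e.g. by the compactor rewriting blocks).
package main

import (
	"fmt"
	"time"
)

// dropPolicy says which labels to strip once data is older than a given age.
type dropPolicy struct {
	olderThan  time.Duration
	dropLabels []string
}

// labelsToDrop returns the union of labels to strip for data of the given age.
func labelsToDrop(age time.Duration, policies []dropPolicy) []string {
	var drop []string
	for _, p := range policies {
		if age >= p.olderThan {
			drop = append(drop, p.dropLabels...)
		}
	}
	return drop
}

func main() {
	policies := []dropPolicy{
		{olderThan: 24 * time.Hour, dropLabels: []string{"pod"}},          // drop per-pod detail after a day
		{olderThan: 7 * 24 * time.Hour, dropLabels: []string{"instance"}}, // keep only aggregates after a week
	}
	fmt.Println(labelsToDrop(3*24*time.Hour, policies))  // [pod]
	fmt.Println(labelsToDrop(10*24*time.Hour, policies)) // [pod instance]
}
```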

bwplotka commented 5 years ago

Cool, detection is quite easy, and indeed a global component has wider knowledge. Also note that Store is not necessarily global either; it can be scoped to just one bucket.

However, prevention and mitigation are tricky. How do you know what to do?

Anyway, it's definitely a good start to discuss this, but the prevention bit is quite uncertain. It's similar to rate limiting on remote write (streaming), so we might handle it in a similar way. But again, doing this on the leaves (shutting down problematic targets etc.) would be nice, so that's why I am suggesting talking with Prometheus users/devs, potentially on the mailing list. There might be plenty of ideas for this already (:

brancz commented 5 years ago

An explosion would always be local to a target, no? There are metrics in Prometheus now that track series added and marked stale per scrape config (or per target? I don't remember); those can be used to create an alert to know when it is happening. No matter how we spin it, there will always need to be a cleanup task afterwards, so in terms of Thanos the way I see it is just that we need to be able to delete metrics/time series.
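For the detection side described above, a rough sketch of querying Prometheus for targets that recently added many new series, assuming the per-target `scrape_series_added` metric and the Go client for the Prometheus HTTP API; the address, threshold, and query are arbitrary examples, not an agreed-on Thanos feature:

```go
// Rough detection sketch: ask Prometheus which targets added many new series
// recently, using the per-target scrape_series_added metric.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"}) // example address
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Targets that added more than 1000 new series over the last 5 minutes
	// (the threshold is an arbitrary example).
	query := `sum by (job, instance) (sum_over_time(scrape_series_added[5m])) > 1000`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	if vec, ok := result.(model.Vector); ok {
		for _, s := range vec {
			fmt.Printf("possible cardinality explosion at %s: %v new series\n", s.Metric, s.Value)
		}
	}
}
```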

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.