thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Cardinality explosion prevention feature. #1511

Closed improbable-ludwik closed 4 years ago

improbable-ludwik commented 5 years ago

AS A Thanos operator I WANT Thanos to actively prevent metric cardinality explosions SO THAT the system is robust to such situations.

Some ideas for how this could be done:

- Label value version: Thanos could implement this by respecting configuration that states how many distinct label values are allowed per label. As the number of values for a label moves above this limit, new values would get counted against a placeholder/catch-all label value.
- Metric name/label name version: for example, names would be rewritten (shortened) once their rightmost part's cardinality got too high, beyond a given character length.
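A minimal sketch of how the label value idea could work, assuming a hypothetical in-process limiter (the type names and the `other` placeholder are made up for illustration; this is not an existing Thanos API):

```go
// Hypothetical label value limiter: once a label has `limit` distinct values,
// unseen values are rewritten to a catch-all placeholder.
package main

import "fmt"

const placeholder = "other" // catch-all value once the limit is reached

type valueLimiter struct {
	limit int
	seen  map[string]map[string]struct{} // label name -> set of observed values
}

func newValueLimiter(limit int) *valueLimiter {
	return &valueLimiter{limit: limit, seen: map[string]map[string]struct{}{}}
}

// rewrite returns the label value to ingest for the given label name/value
// pair: known values pass through, new values are accepted until the limit,
// and anything beyond that is collapsed into the placeholder.
func (l *valueLimiter) rewrite(name, value string) string {
	vals, ok := l.seen[name]
	if !ok {
		vals = map[string]struct{}{}
		l.seen[name] = vals
	}
	if _, known := vals[value]; known {
		return value
	}
	if len(vals) >= l.limit {
		return placeholder
	}
	vals[value] = struct{}{}
	return value
}

func main() {
	l := newValueLimiter(2)
	for _, v := range []string{"a", "b", "c", "a"} {
		fmt.Printf("user_id=%q -> %q\n", v, l.rewrite("user_id", v))
	}
	// user_id: "a" -> "a", "b" -> "b", "c" -> "other", "a" -> "a"
}
```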

bwplotka commented 5 years ago

Thanks!

I guess you mean ingestion configuration, right? So during metric collection?

This means that this feature request should rather be directed to the https://github.com/prometheus/prometheus GitHub issues or, even better, the mailing list: https://groups.google.com/forum/#!forum/prometheus-users. Anyway, it's definitely something worth discussing on the Prometheus project side, so I would start this topic there first (if no discussion like this has been started before).

improbable-ludwik commented 5 years ago

I agree that Prometheus could be a place to do this. However, with Prometheus only at the leaves of the collection pipeline (not seeing the full picture of a cardinality explosion across collection leaves) and becoming less and less relevant (with storage periods going to zero), I think Thanos could be well positioned to pick up this feature. This could be a store feature, a pre-processor in front of the store (a sidecar to the Prometheus sidecar?), or even a compactor feature (re-writing metrics and reducing their cardinality in long-term storage).

Crucially, a global view of metrics/labels would be beneficial to make appropriate determinations about their cardinality, which is something a single Prometheus instance should probably not be concerned about. I could see this being a gradual process, with more and more cardinality removal as metrics age: immediate reduction/protection in Prometheus, then further reduction when placed into storage, then more still as time passes at compaction time.
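To illustrate the gradual, age-based reduction idea, here is a hypothetical sketch of a policy that drops more high-cardinality labels the older the data gets (e.g. at compaction time). The tiers and label names are invented for illustration only:

```go
// Hypothetical age-based cardinality reduction policy: the older the data,
// the more labels get dropped (e.g. by the compactor rewriting blocks).
package main

import (
	"fmt"
	"time"
)

// dropPolicy says which labels to strip once data is older than a given age.
type dropPolicy struct {
	olderThan  time.Duration
	dropLabels []string
}

// labelsToDrop returns the union of labels to strip for data of the given age.
func labelsToDrop(age time.Duration, policies []dropPolicy) []string {
	var drop []string
	for _, p := range policies {
		if age >= p.olderThan {
			drop = append(drop, p.dropLabels...)
		}
	}
	return drop
}

func main() {
	policies := []dropPolicy{
		{olderThan: 24 * time.Hour, dropLabels: []string{"pod"}},          // drop per-pod detail after a day
		{olderThan: 7 * 24 * time.Hour, dropLabels: []string{"instance"}}, // keep only aggregates after a week
	}
	fmt.Println(labelsToDrop(3*24*time.Hour, policies))  // [pod]
	fmt.Println(labelsToDrop(10*24*time.Hour, policies)) // [pod instance]
}
```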

bwplotka commented 5 years ago

Cool, detection is quite easy, and indeed a global component has wider knowledge. Also note that Store is not necessarily global either; it can be scoped to just one bucket.

However, prevention and mitigation are tricky. How do you know what to do?

Anyway, it's definitely a good start to discuss this, but the prevention bit is quite uncertain. It's similar to rate limiting on remote write (streaming), so we might handle it in a similar way. But again, doing this on the leaves (shutting down problematic targets etc.) would be nice, so that's why I am suggesting talking with Prometheus users/devs, potentially on the mailing list. There might be plenty of ideas for this already (:

brancz commented 5 years ago

An explosion would always be local to a target, no? There are metrics in Prometheus now that track series added and marked stale per scrape config (or per target? I don't remember); those can be used to create an alert to know when it is happening. No matter how we spin it, there will always need to be a cleanup task afterwards, so in terms of Thanos the way I see it is just that we need to be able to delete metrics/time series.
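For the detection side described above, a rough sketch of querying Prometheus for targets that recently added many new series, assuming the per-target `scrape_series_added` metric and the Go client for the Prometheus HTTP API; the address, threshold, and query are arbitrary examples, not an agreed-on Thanos feature:

```go
// Rough detection sketch: ask Prometheus which targets added many new series
// recently, using the per-target scrape_series_added metric.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"}) // example address
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Targets that added more than 1000 new series over the last 5 minutes
	// (the threshold is an arbitrary example).
	query := `sum by (job, instance) (sum_over_time(scrape_series_added[5m])) > 1000`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	if vec, ok := result.(model.Vector); ok {
		for _, s := range vec {
			fmt.Printf("possible cardinality explosion at %s: %v new series\n", s.Metric, s.Value)
		}
	}
}
```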

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.