Closed: RayHuangCN closed this issue 3 years ago
Cool, this is quite an amazing user story. (:
We tried something similar with @domgreen at some point, and it is sometimes hard to debug where the scrape target is, etc. But for the right use case, this might be extremely helpful.
This issue will get lost, so I would love to see this in the form of a blog post we can link somewhere on our website, OR we could add a new page similar to https://thanos.io/tip/operating/reverse-proxy.md/ but for e.g. Use Cases? :thinking:
FWIW, Prometheus Operator uses a pretty similar strategy, except it generates a discovery config that distributes the targets via hashmod sharding of the `__address__` label and assigns a shard to each Prometheus.
I highly recommend using hashmod sharding, because if you use the coordinator to assign targets, you have no starting point for troubleshooting when something is not discovered, which makes this essentially an unpredictable distributed monitoring system. As a plus, that way you don't need any coordinator at all, except for distributing the configuration.
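The hashmod idea can be sketched in a few lines. This is a simplified illustration, not Prometheus's actual implementation: Prometheus's `hashmod` relabel action also MD5-hashes the label value and takes it modulo the shard count, but folds the digest into an integer slightly differently. The target addresses below are made up for the example.

```python
import hashlib

def hashmod_shard(address: str, modulus: int) -> int:
    """Deterministically map a target's __address__ label to a shard number.

    Sketch of hashmod sharding: MD5 the label value, reduce to an integer,
    take it modulo the number of shards. Every shard can compute this
    independently, so no coordinator is needed to agree on the assignment.
    """
    digest = hashlib.md5(address.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % modulus

# Each Prometheus shard applies the same pure function to every discovered
# target and keeps only the targets whose shard number matches its own.
targets = ["10.0.0.1:9100", "10.0.0.2:9100", "10.0.0.3:8080"]
num_shards = 4
my_shard = 0
my_targets = [t for t in targets if hashmod_shard(t, num_shards) == my_shard]
```

Because the function is deterministic, you can always reproduce which shard should own a missing target, which is what makes troubleshooting tractable.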
@brancz Thanks for your opinion. The Prometheus Operator solution is cool when targets have a similar series scale.
But it may lead to load imbalance if targets differ hugely in series count. For example, when the cluster size is large, kube-state-metrics may have more than 3,000,000 series; we use kube-state-metrics sharding and run 4 or more kube-state-metrics shards. If we use hashmod by `__address__`, several of these kube-state-metrics shards may be distributed to the same Prometheus shard and cause an OOM. But if we use the Coordinator to explore the scale of each target first, and then distribute targets according to their real series count, the load on the Prometheus shards is controllable.
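The load-aware distribution described above can be sketched as a greedy assignment: sort targets by series count and always place the next target on the currently least-loaded shard. This is only an illustration of the idea, not Kvass's actual algorithm, and the target names and series counts are made up.

```python
import heapq

def assign_by_series(targets: dict[str, int], num_shards: int) -> list[list[str]]:
    """Greedy load-aware assignment: place each target, heaviest first,
    on the shard that currently has the fewest total series."""
    shards: list[list[str]] = [[] for _ in range(num_shards)]
    # Min-heap of (total series on shard, shard index).
    heap = [(0, i) for i in range(num_shards)]
    heapq.heapify(heap)
    for target, series in sorted(targets.items(), key=lambda kv: -kv[1]):
        load, idx = heapq.heappop(heap)
        shards[idx].append(target)
        heapq.heappush(heap, (load + series, idx))
    return shards

# Four heavy kube-state-metrics shards land on four different Prometheus
# shards, instead of possibly colliding on one as hashmod might by chance.
targets = {"ksm-0": 750_000, "ksm-1": 750_000, "ksm-2": 750_000,
           "ksm-3": 750_000, "node-1": 3_000, "node-2": 3_000}
placement = assign_by_series(targets, num_shards=4)
```

The trade-off brancz points out still applies: this assignment depends on coordinator state, so you cannot recompute it from the target address alone.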
On the other hand, every shard has to do service discovery if we use hashmod only, and this can waste a lot of memory, especially when the cluster is large. Having only the Coordinator do service discovery saves 50% of the memory usage of the Prometheus shards in our case.
If any shard (all of its replicas) goes down, the Coordinator can reassign its targets to a healthy shard immediately.
The Coordinator knows everything about service discovery and has the correct result for /api/v1/targets; it also knows the distribution of all targets. We will add some APIs and expose metrics for debugging in the next release.
@bwplotka Thanks! I think adding a page similar to https://thanos.io/tip/operating/reverse-proxy.md/ is a good idea. Could you please tell me where to add it? (-:
I would add a page `use-cases.md` in the operating menu. WDYT? (:
That sounds good! Is there anything I can help with?
Of course! You are more than welcome to create a PR with such a markdown page, with your use case stated as one of them. Then the team & community can add more items, for both basic and more advanced use cases :hugs:
Ok.
@bwplotka The PR was created (-:
Thanos is an awesome project. We have used it to monitor our Kubernetes clusters for years. Recently, we shared our solution for large-cluster monitoring with anyone who needs it.
Kvass is a Prometheus horizontal auto-scaling solution. It uses a Sidecar to generate, for every Prometheus shard, a special config file that contains only the subset of targets assigned by the Coordinator. We use Thanos to get a global data view.
We have already been using Thanos + Kvass to monitor Kubernetes clusters of the following size for months, with just one Prometheus config file as usual, no federation needed and no hashmod needed.