Jaware closed this issue 6 years ago
Clustering is explicitly an anti-feature of Prometheus server's TSDB. Clustering creates a network dependency for alerting, which is unacceptable from our perspective.
For clustering, consider looking into https://github.com/weaveworks/cortex.
thanks
Yup. Distributed systems are an interest of mine. I would be interested in understanding your scaling needs for monitoring.
What if one Prometheus instance dies? With the current design, you need to start it up manually, but with a clustering design, a new leader or backup would take over the work automatically.
That is already covered by the current HA design. You run two or more parallel instances with identical configuration; there is no need for a leader, as they all send alerts to the Alertmanager mesh. There are also several options for graph query failover.
None of this requires manual action.
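To illustrate the query-failover point, here is a minimal client-side sketch. It assumes two replicas at the hypothetical addresses `http://prom-a:9090` and `http://prom-b:9090`, and it uses the standard `/api/v1/query` HTTP endpoint. The function names are made up for the example. In practice a load balancer or a Grafana data source setup is a more common way to get the same effect.

```python
import json
import urllib.parse
import urllib.request


def http_get_json(url, timeout=5.0):
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def query_with_failover(replicas, promql, fetch=http_get_json):
    """Try each replica in order and return the first successful result.

    `replicas` is a list of base URLs (hypothetical in this example).
    `fetch` is injectable so the logic can be exercised without a network.
    """
    params = urllib.parse.urlencode({"query": promql})
    errors = []
    for base in replicas:
        url = f"{base.rstrip('/')}/api/v1/query?{params}"
        try:
            body = fetch(url)
        except (OSError, ValueError) as exc:
            # Network failure or undecodable body: move on to the next replica.
            errors.append((base, str(exc)))
            continue
        if body.get("status") == "success":
            return body["data"]
        errors.append((base, body.get("error", "unknown error")))
    raise RuntimeError(f"all replicas failed: {errors}")
```

Calling `query_with_failover(["http://prom-a:9090", "http://prom-b:9090"], "up")` would return data from the first replica that answers. Because the replicas scrape independently, their samples can differ slightly, which is usually acceptable for dashboards.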
Really? I cannot find any docs. When will this feature be ready? In the final release?
This has basically always been a feature. The alertmanager was originally a SPoF, but that was fixed in alertmanager 0.5.0 last year.
https://prometheus.io/docs/introduction/faq/#can-prometheus-be-made-highly-available
https://www.robustperception.io/high-availability-prometheus-alerting-and-notification
https://coreos.com/operators/prometheus/docs/latest/high-availability.html
e,the prometheus is not automatically...
TSDB does not seem to be HA at the moment. One thing I guess we could do is scrape the same metrics from two Prometheus instances on different machines, but they would get out of sync very soon. Another option is to keep copying the latest updated chunks and WAL files to an object store such as S3, or some other object storage/DB, and restore them when disaster strikes (this is what we are planning to do for HA and recovery at my company). But it would be great to have a feature for replicating the DB in real time (the way Redis HA works, for example), keeping one instance as master and the others as followers, and promoting a follower to master when the current master goes down.
I am very new to Prometheus and tsdb and I may be missing a lot though.
@nipuntalukdar With the Prometheus monitoring design, perfect sync of the TSDB is unnecessary. There is no need to keep any kind of hot spare or real-time sync. You simply run more than one instance simultaneously, and they all send alerts to the Alertmanager, which handles the de-duplication and routing.
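A toy sketch of why duplicate alerts from identical replicas collapse: an alert is identified by its label set, so two replicas that evaluate the same rule and attach the same labels produce the same identity. This is not Alertmanager's actual implementation (which also handles grouping, resolution, and state sharing between mesh peers). The class and function names are hypothetical.

```python
def fingerprint(labels):
    """Identify an alert by its label set, independent of key order."""
    return tuple(sorted(labels.items()))


class Deduplicator:
    """Remembers which alert identities have already been seen."""

    def __init__(self):
        self._seen = {}

    def receive(self, alert):
        """Return True if the alert is new, False if it is a duplicate."""
        key = fingerprint(alert["labels"])
        if key in self._seen:
            return False
        self._seen[key] = alert
        return True


dedup = Deduplicator()
# The same alert as sent by two replicas with identical configuration.
from_replica_a = {"labels": {"alertname": "InstanceDown", "instance": "web-1"}}
from_replica_b = {"labels": {"instance": "web-1", "alertname": "InstanceDown"}}
first = dedup.receive(from_replica_a)   # new identity
second = dedup.receive(from_replica_b)  # same identity, treated as duplicate
```

The sketch assumes both replicas attach exactly the same labels. If each replica added its own distinguishing label to alerts, the identities would differ and the alerts would not collapse.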
Standard backups can be made with snapshots, as detailed in this talk from PromCon.
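As one possible way to trigger such a snapshot, Prometheus 2.x exposes a snapshot endpoint in its TSDB admin API, which is only available when the server is started with `--web.enable-admin-api`. The sketch below assumes a server at a base URL you supply. The helper names are hypothetical, and this is not necessarily the exact procedure described in the talk.

```python
import json
import urllib.request


def snapshot_request(base_url, skip_head=False):
    """Build the POST request for the TSDB snapshot admin endpoint."""
    url = f"{base_url.rstrip('/')}/api/v1/admin/tsdb/snapshot"
    if skip_head:
        # Optionally skip data that is only in the head block.
        url += "?skip_head=true"
    return urllib.request.Request(url, method="POST")


def take_snapshot(base_url, skip_head=False, timeout=30.0):
    """Trigger a snapshot and return the snapshot directory name."""
    req = snapshot_request(base_url, skip_head)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # The snapshot is written under snapshots/<name> in the data directory.
    return body["data"]["name"]
```

The resulting directory under `snapshots/` in the Prometheus data directory can then be copied to backup storage with ordinary tools.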
Thanks @SuperQ
It seems the new storage engine still does not support cluster mode, so I have to use a combination rather than a cluster to support very large programs (one platform where all the data needs to be saved together). I know you have worked hard to improve performance, but clustering is a must in distributed systems. I hope cluster mode will be supported in the future 🙃