prometheus-junkyard / tsdb

The Prometheus time series database layer.
Apache License 2.0

cluster mode #132

Closed Jaware closed 6 years ago

Jaware commented 6 years ago

It seems the new storage engine still does not support cluster mode, so I have to use combination rather than clustering to support very large programs (one platform where all the data needs to be saved together). I know you have worked hard to improve performance, but clustering is a must in distributed systems. I hope cluster mode will be supported in the future 🙃

SuperQ commented 6 years ago

Clustering is explicitly an anti-feature of Prometheus server's TSDB. Clustering creates a network dependency for alerting, which is unacceptable from our perspective.

For clustering, consider looking into https://github.com/weaveworks/cortex.

Jaware commented 6 years ago

thanks

SuperQ commented 6 years ago

Yup. Distributed systems are an interest of mine. I would be interested in understanding your scaling needs for monitoring.

Jaware commented 6 years ago

What if one Prometheus instance dies? With the current design, you need to start it up manually, but with a clustering design, the new leader or a backup would take over the work automatically.

SuperQ commented 6 years ago

That is already covered by the current HA design. When you run two or more parallel instances with identical configuration, there is no need for a leader, as they all send alerts to the alertmanager mesh. There are also several options for graph query failover.

None of this requires manual action.

Jaware commented 6 years ago

Really? I cannot find any docs. When will this feature be ready? In the final release?

SuperQ commented 6 years ago

This has basically always been a feature. The alertmanager was originally a SPoF, but that was fixed in alertmanager 0.5.0 last year.

https://prometheus.io/docs/introduction/faq/#can-prometheus-be-made-highly-available
https://www.robustperception.io/high-availability-prometheus-alerting-and-notification
https://coreos.com/operators/prometheus/docs/latest/high-availability.html

Jaware commented 6 years ago

Eh, Prometheus is not automatically...

nipuntalukdar commented 6 years ago

tsdb does not seem to be HA at this moment. One thing I guess we could do is scrape the same metrics from two Prometheus instances on different machines, but that will put them out of sync very soon. Another option is to keep copying the latest updated chunks and the WAL file to an object store such as S3, or some other object storage or DB, and restore them when disaster strikes (this is what we are planning to do for HA and recovery at my company). But it would be great to have a feature for replicating the DB in real time (the same way Redis HA works, for example), keeping one instance as master and the others as followers, and promoting a follower to master when the current master goes down.

I am very new to Prometheus and tsdb and I may be missing a lot though.

SuperQ commented 6 years ago

@nipuntalukdar With the Prometheus monitoring design, perfect sync of the TSDB is unnecessary. There is no need to keep any kind of hot spare or real-time sync. You simply run more than one instance simultaneously, and they all send alerts to the alertmanager, which handles the de-duplication and routing.
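To make the de-duplication idea concrete, here is a toy model in Python. It is only a sketch, not how the alertmanager is actually implemented: `ToyAlertmanager`, `fingerprint`, and the replica names are hypothetical, and it assumes both replicas attach exactly the same label set to an alert (in a real setup, any per-replica external label would have to be dropped before alerts are sent).

```python
from dataclasses import dataclass, field


def fingerprint(labels: dict) -> tuple:
    """Identity of an alert in this toy model: its sorted label set (hypothetical helper)."""
    return tuple(sorted(labels.items()))


@dataclass
class ToyAlertmanager:
    """Simplified stand-in for an alert receiver; not the real alertmanager."""
    seen: dict = field(default_factory=dict)           # fingerprint -> set of reporting replicas
    notifications: list = field(default_factory=list)  # alerts that actually produced a notification

    def receive(self, replica: str, labels: dict) -> bool:
        """Record an alert and notify only the first time this label set is seen."""
        fp = fingerprint(labels)
        first_time = fp not in self.seen
        self.seen.setdefault(fp, set()).add(replica)
        if first_time:
            self.notifications.append(labels)
        return first_time


# Two identically configured replicas fire the same alert independently.
am = ToyAlertmanager()
alert = {"alertname": "InstanceDown", "instance": "db1:9100", "severity": "page"}
am.receive("prometheus-a", dict(alert))
am.receive("prometheus-b", dict(alert))
```

Because both replicas produce the same label set, the second copy is recognized as a duplicate and only one notification goes out. If either replica dies, the other one still delivers the alert, which is why no leader election is needed.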

Standard backups can be made with snapshots, as detailed in this talk from PromCon.

https://youtu.be/15uc8oTMgPY
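As a rough sketch of what triggering a snapshot can look like, the Python below calls the TSDB snapshot endpoint of the Prometheus HTTP admin API. It assumes a Prometheus 2.x server started with `--web.enable-admin-api`; the helper names `snapshot_url` and `take_snapshot` and the base URL are made up for illustration, and this is not necessarily the exact workflow shown in the talk.

```python
import json
import urllib.request


def snapshot_url(base_url: str, skip_head: bool = False) -> str:
    """Build the admin API URL for a TSDB snapshot (hypothetical helper)."""
    url = base_url.rstrip("/") + "/api/v1/admin/tsdb/snapshot"
    if skip_head:
        # Skip data that is only in the head block (not yet compacted to disk).
        url += "?skip_head=true"
    return url


def take_snapshot(base_url: str, skip_head: bool = False) -> str:
    """POST to the snapshot endpoint and return the snapshot directory name."""
    req = urllib.request.Request(snapshot_url(base_url, skip_head), method="POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    # The snapshot is written under the server's data directory, in snapshots/<name>.
    return body["data"]["name"]
```

Calling `take_snapshot("http://localhost:9090")` against a running server (the address is an assumption) returns the snapshot name; the resulting directory can then be copied to backup storage with ordinary file tools.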

nipuntalukdar commented 6 years ago

Thanks @SuperQ