poseidon / typhoon

Minimal and free Kubernetes distribution with Terraform
https://typhoon.psdn.io/
MIT License

control plane monitoring/alerting by default #87

Closed: xiang90 closed this issue 6 years ago

xiang90 commented 6 years ago

As Prometheus is shipped by default, control plane monitoring should be set up by default, and documented.

dghubble commented 6 years ago

Prometheus 2.0 is available in the optional addon manifests, deployed via `kubectl apply -f addons/prometheus -R`. There is a tutorial at https://typhoon.psdn.io/addons/prometheus/ describing targets, alert rules, exporters, and Grafana visualizations. Reasonable alerting rules are included by default and are checked on all platforms. Have you gone through that tutorial?
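For illustration, the kind of rule included looks roughly like this in the Prometheus 2.0 rule-file format (a minimal sketch; the alert name, selector, and threshold are placeholders rather than the exact rules shipped in addons/prometheus):

```yaml
# Sketch of a Prometheus 2.0 alerting rule; names and thresholds are
# illustrative placeholders, not the addon's exact rules.
groups:
  - name: kubernetes
    rules:
      - alert: KubeletDown
        expr: absent(up{job="kubelet"} == 1)
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "No healthy kubelet targets have been scraped for 5 minutes."
```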

xiang90 commented 6 years ago

Yes. I looked at the tutorial.

> describing targets, alert rules,

However, it does not say anything about the rules file you linked above. Also, this config file (https://github.com/poseidon/typhoon/blob/master/addons/prometheus/config.yaml) should probably be linked too?

etcd also has pre-defined FQDNs in the Typhoon setup. Why not provide a default set of targets (defaulting to 3 or 5)? Or at least call it out so people know the importance of monitoring it. Based on my operational experience with Kubernetes, I would always prefer to monitor etcd first and make sure the data store is reliable.
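For illustration, a default set of etcd targets could be expressed as a static scrape job like the sketch below (hostnames, port, and certificate paths are hypothetical placeholders, not Typhoon's actual naming):

```yaml
# Hypothetical static scrape job for a 3-member etcd cluster.
# etcd's metrics endpoint typically requires client TLS when served
# on the client port, so certificate paths are shown as placeholders.
scrape_configs:
  - job_name: etcd
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/secrets/etcd/ca.crt
      cert_file: /etc/prometheus/secrets/etcd/client.crt
      key_file: /etc/prometheus/secrets/etcd/client.key
    static_configs:
      - targets:
          - etcd0.example.com:2379
          - etcd1.example.com:2379
          - etcd2.example.com:2379
```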

I even think the monitoring/alerting addon should be promoted to a required component if Typhoon is intended to provide sane configurations for production clusters.

dghubble commented 6 years ago

If you have specific changes you'd like to see made to the alerting rules, would you mind opening a PR here and to prometheus-operator? The addon/prometheus alerts are borrowed straight from there, but modified to not rely on the operator. I'm all ears if you've got better rules in mind. cc @brancz

If you check the Prometheus config, Prometheus supports lots of service, node, and blackbox discovery mechanisms. Please don't hardcode or pre-define rules for fixed FQDNs.
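For example, a discovery-based scrape job (a minimal sketch of Prometheus's kubernetes_sd_configs mechanism, not the addon's exact config) avoids pinning any hostnames:

```yaml
# Sketch: discover nodes via the Kubernetes API instead of listing
# fixed FQDNs. Paths assume Prometheus runs in-cluster with a
# service account.
scrape_configs:
  - job_name: kubernetes-nodes
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      # Copy Kubernetes node labels onto the scraped series.
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
```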

I respect that, coming from etcd, you want users' experiences with etcd to be the best possible. However, neither Prometheus nor any alternative has any business being a required component. Prometheus is not essential to the integrity of the control plane, not everyone chooses Prometheus to collect metrics (resource usage is still an issue in some environments), and not everyone chooses to collect metrics at all (that's just the reality).

Moreover, making Prometheus required wouldn't actually achieve your goal. Alerts aren't going anywhere unless you deploy an Alertmanager and have it hooked up to Slack or PagerDuty. The addons aren't about trying to validate that it's all done right. The addons establish a nice separation between what is truly essential and what is an optional "addon" app. They give you some handy things to install after you create the cluster. Use 'em if you want, don't if you don't. Heck, even CLUO (which auto-updates the OS) is considered an optional addon.

xiang90 commented 6 years ago

> Moreover, making Prometheus required wouldn't actually achieve your goal. Alerts aren't going anywhere unless you deploy an Alertmanager and have it hooked up to Slack or PagerDuty. The addons aren't about trying to validate that it's all done right.

Right. That is why I want not only monitoring but also alerting enabled by default for production usage.

> The addons establish a nice separation between what is truly essential and what is an optional "addon" app. They give you some handy things to install after you create the cluster. Use 'em if you want, don't if you don't. Heck, even CLUO (which auto-updates the OS) is considered an optional addon.

CLUO is OS-specific and has its own risks, so I guess it is a little bit different.

Monitoring/alerting is definitely something we should promote for a reliable k8s deployment. But I guess you have concerns about other possible monitoring solutions: maybe people want their own Prometheus configurations or even other monitoring stacks.

brancz commented 6 years ago

FWIW, since Typhoon is self-hosted, it just takes running the hack/ scripts to deploy the kube-prometheus stack, including a standard set of alerting rules and dashboards.

https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus

@dghubble can you elaborate on this?

> The addon/prometheus alerts are borrowed straight from there, but modified to not rely on the operator. I'm all ears if you've got better rules in mind.

Specifically the "to not rely on the operator" part. Is this because the prometheus-operator is lacking features? If so, please report them so we can make sure to work on them and provide users with everything they need (we are aware that there are missing features, but have few reports on which ones people are missing, so we have a hard time prioritizing). If there are no missing features, then I'm incredibly sad, as we are putting a lot of work into the prometheus-operator and there are many features which are important for a real production environment. The entire Prometheus upstream team stands behind the prometheus-operator as the recommended way to run Prometheus on Kubernetes.

dghubble commented 6 years ago

When deploying the kube-prometheus manifests directly, several (~6, I believe) alert rules don't correspond to how auto-discovered metrics are named (see the config). Minor tweaks have been made to those names, and the DeadManSwitch has been removed.
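As a hypothetical illustration of that kind of tweak (the rule and job names below are made up for the example, not the actual diffs):

```yaml
# An upstream kube-prometheus rule might select a job label such as
#   expr: absent(up{job="apiserver"} == 1)
# while a plain scrape config auto-discovers the job under a different
# name, so the selector is adjusted to match:
groups:
  - name: kubernetes-apiserver
    rules:
      - alert: ApiserverDown
        expr: absent(up{job="kubernetes-apiservers"} == 1)
        for: 5m
```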

Overall, Prometheus itself has a reasonable config file format and auto-discovery mechanisms (beyond just Kubernetes) that work well enough to be recommended here. I've chosen to recommend it be deployed that way rather than deal with the extra complexity, indirection, and Kubernetes-specific aspects of the operator. The operator is opaque compared with plain old Kubernetes manifests, and I've also taken the stance that apps (such as Prometheus) should be cautious about building a dependency on Kubernetes itself. That's a matter of taste and choosing the right solutions for our infrastructure, rather than specific missing features.

I don't have any comments on what the Prometheus team chooses to prioritize, and I look forward to the team's continued work on prometheus-operator and the benefits back to Prometheus itself.

At the end of the day, Typhoon addons are just well-chosen examples of useful applications one can run on clusters; they're not core to the distribution. Folks can deploy some addons, all addons, none of them, etc. As you mentioned, nothing precludes users from deploying prometheus-operator on Typhoon. It should work just as well as it does on any Kubernetes cluster. If things evolve, the operator could potentially become part of addons.

dghubble commented 6 years ago

Thanks for the comments, guys. This issue spans the gamut of etcd alerting rules, why Prometheus is an addon, and why not prometheus-operator. Again, if there are specific alerting rule changes you'd like to see for etcd, let's have an issue or PR to that effect. Otherwise, this issue isn't tracking anything concrete.

dghubble commented 6 years ago

I've distilled this into a concrete action item here: https://github.com/poseidon/typhoon/issues/114

The goal is to have etcd scrapes (and thus alerts and dashboards) work for anyone choosing to `kubectl apply` the Prometheus addon.